Agentic Reliability
Most software engineering assumes determinism. The same input produces the same output. The same bug reproduces the same way. Agentic systems break that assumption — and most teams haven't caught up.
In traditional software, a bug is deterministic. In agentic systems, the same input can fail differently every time.
When you introduce a component that reasons — that interprets, generates, and makes judgment calls — the entire reliability model changes. The system can be running correctly and producing wrong results simultaneously. Not because something is broken, but because part of the system is probabilistic by design.
This isn't a quality problem you solve with better testing. It's a category problem that needs a different kind of engineering — one that most teams building with AI haven't developed yet.
We've been building agentic systems long enough to start cataloguing what goes wrong. Here are some of the patterns we keep seeing.
The instructions to the model are two hundred lines of “be careful to…” and “make sure you don't…” — a pile of patches masquerading as architecture. Every edge case adds another paragraph. It works until it doesn't, and when it doesn't, nobody knows which paragraph to blame. The prompt isn't engineered. It's accreted.
The pipeline produces good results, but no one can test each stage independently. When something goes wrong, you can't tell whether the failure came from the retrieval, the prompt, or the model itself. There are no evals, no isolation, no observability. The only test is running the whole thing end-to-end and hoping the output looks right.
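What the alternative looks like, in miniature: each stage wrapped in deterministic code with its own fixture and its own assertions, so a failure points at a stage instead of at the whole pipeline. The stage names and data below are illustrative assumptions, not a real system.

```python
# Hypothetical pipeline stages, tested in isolation rather than end-to-end.
# Stage names and fixtures are illustrative, not a real API.

def retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    """Deterministic retrieval stage: return ids of docs mentioning the query."""
    return [doc_id for doc_id, text in corpus.items()
            if query.lower() in text.lower()]

def build_prompt(query: str, doc_ids: list[str], corpus: dict[str, str]) -> str:
    """Deterministic prompt-assembly stage; the model call lives elsewhere."""
    context = "\n".join(corpus[d] for d in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Each stage gets its own tiny eval against a fixed fixture.
CORPUS = {"a": "Refunds take 5 days.", "b": "Shipping is free over $50."}

retrieved = retrieve("refunds", CORPUS)
assert retrieved == ["a"]                # retrieval verified with no model call

prompt = build_prompt("How long do refunds take?", retrieved, CORPUS)
assert "Refunds take 5 days." in prompt  # prompt assembly verified independently
```

Nothing here is sophisticated, and that is the point: when every deterministic stage is pinned down by its own assertions, an end-to-end failure can only live in the stages that reason.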
Every model is treated as interchangeable. But the boundary between deterministic code you control and probabilistic model output you don't is the most consequential architectural decision in the system — and most teams don't even think about where to draw it. A frontier model can handle ambiguity and open-ended reasoning. A small private model can execute narrow, well-defined tasks reliably. The engineering is knowing which parts of your pipeline need which — and designing the seam deliberately.
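A deliberate seam can be as simple as deterministic routing code that decides which tier handles which task, so the model never chooses its own scope. A minimal sketch, with hypothetical task names and model tiers:

```python
# Illustrative routing seam: deterministic code classifies the task and
# assigns a model tier. All names here are assumptions, not a real API.
from dataclasses import dataclass

@dataclass
class Route:
    model: str   # which tier handles the task
    reason: str

# Narrow, well-defined tasks a small private model can execute reliably.
NARROW_TASKS = {"extract_date", "classify_intent", "format_json"}

def route(task: str) -> Route:
    """The seam is drawn in code you control, not by the model."""
    if task in NARROW_TASKS:
        return Route(model="small-private", reason="narrow, well-defined task")
    return Route(model="frontier", reason="open-ended reasoning required")

assert route("extract_date").model == "small-private"
assert route("summarize_legal_brief").model == "frontier"
```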
And then there's the economics. In traditional software, serving one more user costs nearly nothing. In agentic systems, every request burns compute. The engineering challenge isn't just “does it work” — it's “does it work efficiently enough to be viable at scale.” Efficiency isn't optimization. It's existential.
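A back-of-envelope check makes the point concrete. Every number below is hypothetical, but the arithmetic is the viability question every agentic product has to answer:

```python
# Unit economics per request. All figures are purely illustrative
# assumptions, not real token prices or real revenue.
in_tokens, out_tokens = 3_000, 800    # tokens per request (assumed)
price_in, price_out = 3e-6, 15e-6     # dollars per token (assumed)
revenue_per_request = 0.05            # dollars (assumed)

cost = in_tokens * price_in + out_tokens * price_out
margin = revenue_per_request - cost

# Unlike traditional serving, this cost recurs on every single request,
# so the margin per request has to stay positive at scale.
assert margin > 0
```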
What this demands are the principles you already believe in — decomposition, testability, operational efficiency — applied to the one medium where they're hardest to enforce.
Code Akriti's approach to agentic systems is the same approach we take to everything: eliminate ambiguity systematically. Shrink the surface area where probabilistic behavior is allowed. Make everything around it deterministic, observable, and independently testable. Know exactly where the boundary is between what you control and what the model decides.
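Concretely, that boundary can be a single validation function: model output crosses into the deterministic world only after it passes a schema check, so malformed output fails loudly at the seam instead of corrupting downstream stages. The schema below is an illustrative assumption:

```python
# Sketch: confine probabilistic behavior to one boundary and validate
# deterministically at the crossing. The schema is a hypothetical example.
import json

REQUIRED_KEYS = {"action", "confidence"}
ALLOWED_ACTIONS = {"approve", "escalate", "reject"}

def parse_model_output(raw: str) -> dict:
    """Deterministic boundary check: bad model output fails here,
    not three stages downstream."""
    data = json.loads(raw)                # may raise; that's the point
    if not REQUIRED_KEYS <= data.keys():
        raise ValueError(f"missing keys: {REQUIRED_KEYS - data.keys()}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']}")
    return data

ok = parse_model_output('{"action": "approve", "confidence": 0.9}')
assert ok["action"] == "approve"
```

Everything past this function is ordinary, testable, observable software; the probabilistic surface area ends at the parse.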
This is reliability engineering — applied to systems where part of the machinery reasons instead of computes. The failure modes are novel. The economics are unfamiliar. But the engineering posture is the same one that has made complex systems trustworthy for decades: understand the failure modes, design controls around them, and verify that the controls hold.
Most teams building with AI are still improvising. We've moved past that. The failure modes are catalogued. The engineering controls are in place. And the discipline that makes traditional software precise now extends to the systems that need it most.
Back to Code Akriti →