Agents vs. chatbots: it is an architecture, not a feature

Sometime in the last eighteen months, every product page in our industry started saying "agent." Look under the hood of most of them and you find the same thing: a chat completion API, a long system prompt, and a function-calling schema. That is a chatbot with tools. It is genuinely useful — for answering questions. It does not complete workflows, and no amount of prompt engineering will change that, because the difference between a chatbot and an agent is not the model, the prompt, or the feature list. It is the architecture around the model.

This post describes the loop we run in production — a planner, an orchestrator, and a verifier, operating over mediated tools and durable state — and why each piece exists. None of it is exotic. All of it is necessary.

The chat loop, and why it cannot carry a workflow

A chatbot is a request-response system. Context goes in, text comes out, and the only state is the transcript itself. For conversation, that is fine — the transcript is the product. For a workflow, the transcript is a terrible database: it is append-only, it gets lossy the moment you summarize to fit a context window, it has no types, no transactions, and no way to distinguish "the model said it would do X" from "X actually happened."
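One way to picture the difference is a state record that keeps what the model proposed apart from what the runtime observed. This is a minimal sketch under stated assumptions: the names `Status`, `StepRecord`, `mark_executed`, and `unreconciled` are hypothetical illustrations, not a production schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "proposed"    # the model said it would do this
    SUCCEEDED = "succeeded"  # the runtime observed the side effect succeed
    FAILED = "failed"        # the runtime observed the side effect fail


@dataclass
class StepRecord:
    """Hypothetical typed record; a transcript has no equivalent field."""
    step_id: str
    action: str
    status: Status = Status.PROPOSED
    error: Optional[str] = None

    def mark_executed(self, ok: bool, error: Optional[str] = None) -> None:
        # Intended to be called by the runtime after the side effect has run,
        # never by the model.
        self.status = Status.SUCCEEDED if ok else Status.FAILED
        self.error = None if ok else error


def unreconciled(records: list[StepRecord]) -> list[str]:
    """Return ids of steps that were proposed but never confirmed either way."""
    return [r.step_id for r in records if r.status is Status.PROPOSED]
```

With a record like this, "said it would" and "did" are different values of a typed field rather than two sentences somewhere in a transcript.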

Run a fourteen-step due diligence workflow through a chat loop and you are asking one context window to simultaneously be the plan, the working memory, the tool log, the error handler, and the deliverable. Under length pressure the model starts dropping whichever of those it considers least salient. In our experience, it is usually the error handler.

The agent loop

We separate three roles, and we keep them separated on purpose — partly so each can be built and measured independently, and partly because no role should be allowed to grade its own work.

The loop: the planner emits a plan with success criteria, the orchestrator executes it over mediated tools and checkpointed state, and the verifier gates delivery. Failures route back as structured replan signals.

The planner turns a task into an explicit plan: a dependency graph of steps, each with typed inputs, expected outputs, and success criteria written down before execution starts. The orchestrator executes that graph — it owns durable state, mediates every tool call, and checkpoints after each step. The verifier scores finished artifacts against the success criteria and decides whether anything ships.

Task decomposition: the plan is an artifact

In a chat wrapper, the plan — to the extent one exists — is a paragraph of intentions somewhere in the transcript. In our runtime, the plan is a first-class data structure:

plan = planner.decompose(task)
# plan.steps: [{id, action, inputs, depends_on, success_criteria}]

while not plan.complete():
    for step in plan.ready_steps():          # dependencies satisfied
        result = orchestrator.execute(step)  # mediated tools, typed I/O
        state.checkpoint(step.id, result)

        # verify inside the loop, where `step` and its result are in scope
        verdict = verifier.check(result, step.success_criteria)
        if verdict.failed:
            plan = planner.replan(plan, verdict) # or escalate to a human
            break                                # plan changed: recompute ready steps

Making the plan explicit buys three things. It is inspectable — a reviewer or an auditor can see what the agent intended to do before it did anything. It is schedulable — independent branches of the graph can run concurrently, because the dependencies are data, not vibes. And it is recoverable — when step nine fails, the orchestrator knows exactly which downstream steps are invalidated and which results from steps one through eight are still good.
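The schedulable and recoverable properties both fall out of having dependencies as data. The sketch below is illustrative only: `Step`, `Plan`, `ready_steps`, and `invalidated_by` are assumed names, and the real plan structure carries more fields (actions, inputs, success criteria) than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    id: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Plan:
    steps: dict[str, Step]
    done: set[str] = field(default_factory=set)

    def ready_steps(self) -> list[str]:
        """Steps not yet done whose dependencies are all done."""
        return [
            s.id for s in self.steps.values()
            if s.id not in self.done and all(d in self.done for d in s.depends_on)
        ]

    def invalidated_by(self, failed_id: str) -> set[str]:
        """The failed step plus everything that transitively depends on it."""
        invalid = {failed_id}
        changed = True
        while changed:  # repeat until no new dependents are found
            changed = False
            for s in self.steps.values():
                if s.id not in invalid and any(d in invalid for d in s.depends_on):
                    invalid.add(s.id)
                    changed = True
        return invalid


# Hypothetical four-step plan: two independent branches feeding one report.
plan = Plan(steps={
    "fetch": Step("fetch"),
    "parse": Step("parse", ["fetch"]),
    "lint": Step("lint", ["fetch"]),
    "report": Step("report", ["parse", "lint"]),
})
plan.done = {"fetch"}
print(plan.ready_steps())            # parse and lint are both ready
print(plan.invalidated_by("parse"))  # parse and report; fetch and lint stay good
```

Once `fetch` is done, `parse` and `lint` can run concurrently. If `parse` later fails, only `parse` and `report` are invalidated, so the results of `fetch` and `lint` can be reused.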

Tool mediation: the model proposes, the runtime disposes

"The model can call functions" is not tool use; it is tool suggestion. In a wrapper, the model emits a function call and application code executes it more or less on faith. Mediation means the orchestrator sits between the model and every side effect:

Every tool is a typed contract. Arguments are validated against a schema before execution and results are validated after — a malformed tool result is an error, not more text in the transcript.
Tools are allowlisted per workflow. The research agent has no deploy credentials. The code review agent can open a PR but cannot merge one.
Timeouts, retry budgets, and rate limits belong to the orchestrator, not the model. The model never holds credentials at all — it asks, the runtime decides.
Every call is recorded as an audit event: tool name, duration, outcome. (Metadata only — content never persists; that is a separate architectural commitment.)
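The four rules above can be sketched as a small mediator that sits between a proposed call and its execution. Everything here is an assumption for illustration: the `Tool`, `Mediator`, and `ToolError` names are hypothetical, type checks with `isinstance` stand in for real schema validation, and timeouts, retry budgets, and rate limits are omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    arg_types: dict[str, type]   # input half of the contract
    result_type: type            # output half of the contract
    fn: Callable[..., Any]


class ToolError(Exception):
    pass


class Mediator:
    def __init__(self, tools: list[Tool], allowlist: set[str]):
        self.tools = {t.name: t for t in tools}
        self.allowlist = allowlist               # per-workflow allowlist
        self.audit: list[dict[str, Any]] = []    # metadata only, no content

    def call(self, name: str, args: dict[str, Any]) -> Any:
        start = time.monotonic()
        outcome = "error"
        try:
            if name not in self.allowlist or name not in self.tools:
                raise ToolError(f"tool {name!r} is not allowed in this workflow")
            tool = self.tools[name]
            # Validate arguments before execution.
            if set(args) != set(tool.arg_types):
                raise ToolError(f"{name}: unexpected or missing arguments")
            for key, expected in tool.arg_types.items():
                if not isinstance(args[key], expected):
                    raise ToolError(f"{name}: argument {key!r} must be {expected.__name__}")
            result = tool.fn(**args)
            # Validate the result after execution: malformed output is an error.
            if not isinstance(result, tool.result_type):
                raise ToolError(f"{name}: malformed result")
            outcome = "ok"
            return result
        finally:
            # Record tool name, duration, and outcome; never arguments or results.
            self.audit.append({
                "tool": name,
                "duration_s": time.monotonic() - start,
                "outcome": outcome,
            })
```

In this sketch a code review workflow would be constructed with `allowlist={"open_pr"}`. A proposed `merge_pr` call then fails before any side effect happens, and the refusal still lands in the audit log.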

Verification is a separate job

If you ask the model that produced an artifact "are you sure?" in the same context, you are mostly measuring agreeableness. Our verifier runs in a fresh context: it sees the artifact, the success criteria, and the original inputs — not the producer's chain of reasoning, which is precisely the thing that might have been wrong. It frequently runs on a different model than the producer, since we route steps across frontier models anyway.

Each artifact type has a rubric. A code review is checked for whether each finding refers to code that actually exists and behaves as described; a research memo is checked claim by claim against its cited sources. When verification fails, the failure routes back to the planner as structured data — "criterion 3 failed: claim cites a source that does not contain it" — not as "please try again," which is how wrappers handle it, and which converges on the model apologizing fluently while repeating the mistake.
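To show what "structured data, not 'please try again'" can look like, here is a deliberately crude sketch of a claim-by-claim check. It is an assumption-laden illustration: `Claim`, `Verdict`, and `verify_memo` are hypothetical names, and case-insensitive substring matching stands in for whatever a real rubric uses to decide that a source supports a claim.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    source_id: str
    quote: str   # the passage the memo says supports the claim


@dataclass
class Verdict:
    failures: list[dict] = field(default_factory=list)

    @property
    def failed(self) -> bool:
        return bool(self.failures)


def verify_memo(claims: list[Claim], sources: dict[str, str]) -> Verdict:
    """Check each claim against its cited source.

    The verifier receives only the artifact (claims) and the original inputs
    (sources); the producer's reasoning is not a parameter.
    """
    verdict = Verdict()
    for i, claim in enumerate(claims, start=1):
        source = sources.get(claim.source_id)
        if source is None:
            reason = "cited source is missing"
        elif claim.quote.lower() not in source.lower():
            reason = "claim cites a source that does not contain it"
        else:
            continue  # this claim passes
        verdict.failures.append({
            "criterion": i,
            "claim": claim.text,
            "source_id": claim.source_id,
            "reason": reason,
        })
    return verdict
```

Each entry in `failures` names the criterion, the claim, and the reason, which gives a planner something specific to replan around.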

Failure is the normal case

At workflow lengths, failure stops being an edge case and becomes arithmetic. A twelve-step workflow in which each step independently succeeds 95% of the time completes only about 54% of the time (0.95^12 ≈ 0.54) if any single failure kills the run. Nobody ships a product on those odds. Reliability comes from recovery, not from hoping every step works, and recovery requires a taxonomy:

Transient failures — timeouts, rate limits, a flaky upstream — are retried with backoff against a budget, invisibly.
Structural failures — a step's assumption turned out false, a source disappeared, an API changed shape — trigger a replan from the last good checkpoint. Steps one through eight do not re-run because step nine broke.
Semantic failures — low confidence, conflicting sources, an ambiguous requirement — escalate to a human, with the specific uncertainty named. An agent that cannot say "I am not sure" is not safe to delegate to.
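The taxonomy above maps naturally onto a dispatch function, and the arithmetic behind it fits in a few lines. This is a sketch under stated assumptions: `Failure`, `StepFailed`, `run_step`, and `completion_probability` are hypothetical names, sending an exhausted retry budget to replan is our illustrative choice, and the probability model treats every attempt as independent and every failure as retryable, which real workflows only approximate.

```python
import time
from enum import Enum
from typing import Callable


class Failure(Enum):
    TRANSIENT = "transient"    # timeouts, rate limits, flaky upstream
    STRUCTURAL = "structural"  # assumption false, source gone, API changed
    SEMANTIC = "semantic"      # low confidence, conflicts, ambiguity


class StepFailed(Exception):
    def __init__(self, kind: Failure, detail: str):
        super().__init__(detail)
        self.kind = kind
        self.detail = detail


def run_step(fn: Callable[[], str], retry_budget: int = 2,
             base_delay: float = 0.5) -> tuple[str, str]:
    """Return (action, payload) where action is 'ok', 'replan', or 'escalate'."""
    attempt = 0
    while True:
        try:
            return ("ok", fn())
        except StepFailed as err:
            if err.kind is Failure.TRANSIENT and attempt < retry_budget:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
                attempt += 1
                continue
            if err.kind is Failure.SEMANTIC:
                return ("escalate", err.detail)  # name the specific uncertainty
            return ("replan", err.detail)        # structural, or budget spent


def completion_probability(p_step: float, steps: int, retries: int = 0) -> float:
    """Chance an n-step run finishes, assuming independent attempts."""
    p_with_retries = 1 - (1 - p_step) ** (retries + 1)
    return p_with_retries ** steps


print(round(completion_probability(0.95, 12), 2))             # 0.54
print(round(completion_probability(0.95, 12, retries=2), 3))  # 0.999
```

Under those simplifying assumptions, two retries per step move a twelve-step run from about 54% to above 99%. That is the sense in which reliability comes from recovery.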

Why you cannot prompt your way into this

A bigger prompt gives the model more instructions. It does not give the system a transaction boundary. The wrapper failure modes are predictable because they are structural:

State lives in the context window, so it degrades under exactly the conditions — long, complex tasks — where you need it most. There is no atomicity: the model "decides" to do something, the side effect fails, and nothing reconciles intent with reality. Tool errors come back as text the model is free to misread or ignore. Verification is self-assessment, in the same context that produced the error. And cost compounds: a chat loop replays the entire transcript on every turn, while an orchestrator hands each step only the context that step needs.
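The cost point is easy to check with back-of-the-envelope arithmetic. The sketch below uses hypothetical numbers and a simplified model: the transcript grows by a fixed amount per turn, and there is no prompt caching, truncation, or summarization.

```python
def chat_loop_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Each turn resends every earlier turn: t + 2t + ... + n*t."""
    return tokens_per_turn * turns * (turns + 1) // 2


def orchestrated_input_tokens(steps: int, context_per_step: int) -> int:
    """Each step receives only its own scoped context."""
    return steps * context_per_step


# Hypothetical workload: 12 steps, 2,000 new tokens per chat turn,
# versus a scoped 4,000-token context handed to each orchestrated step.
print(chat_loop_input_tokens(12, 2_000))     # 156000
print(orchestrated_input_tokens(12, 4_000))  # 48000
```

The chat loop's input cost grows quadratically with the number of turns and the orchestrated version grows linearly, so the gap widens as workflows get longer.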

You can demo a wrapper convincingly. You cannot run one against forty PRs a week and keep your engineers' trust.

Questions that separate the two

If you are evaluating anything that calls itself an agent, five questions get to the truth quickly:

Show me the plan artifact for a real run — before execution, not a post-hoc summary.
What happens, mechanically, when step N fails? Which steps re-run?
Who executes tool calls, and what validates their inputs and outputs?
What gates delivery, and is it the same model and context that produced the work?
Can I see per-step latency and cost?

If the answer to most of these is "the model handles it," you are buying a chatbot. That may be fine — chatbots are useful. But know what you are buying.