Verify the Work

A sub-agent's success report is generator output, not ground truth. Verify the work yourself, and reward the agent that refuses a false premise.

Jun 24, 2026

In one session this spring I sliced a workspace-wide rename across a handful of sub-agents, dispatched them one at a time, and validated each commit myself before sending the next. That one habit caught three confident, wrong reports in a single afternoon.

The first was a pass that wasn’t a compile. A slice came back green, and the build check it cited had finished in 0.34 seconds. A third of a second is not a compilation of a large workspace; it is a cached echo of an earlier one. A real build, run by hand, took 2.4 seconds and genuinely passed — but the report had no way of knowing that, because it was reading the cache, not the code.

The second was a test result that could not exist. A slice reported 121 passing tests from a build target that, run as the report described it, errors out with did not match any packages. A command that fails to resolve its target cannot also count 121 green tests. The number was real; the command attached to it was not. Run the correct way — against the right manifest — the tests did pass. The work was sound. The account of how it was checked was fiction.

The third was a half-applied change reported as done. A rename swept cleanly through every surface the compiler can see and stopped there, leaving the identifiers and comments it cannot see untouched, and reported the sweep complete. The residue surfaced only because a separate check flagged the old term still living in the code.

My own note from that session, verbatim:

I’m re-validating every sub-agent commit myself rather than trusting the reports. This session alone caught a stale-cache “pass,” a mislabelled build target, and a half-applied tree — all recovered without damage.

Here is the part that matters, and that the cynical reading misses. In all three cases the work was fine. The compile passed, the tests passed, the rename was finishable in one clean recovery commit. The unreliable artifact was never the code. It was the report.

A report is weaker evidence than the output

I have argued before, in Compiling the Process, that the cheap way to make a high-throughput, low-consistency generator reliable is to put a deterministic verifier in front of it and exploit the asymmetry: checking an output is far cheaper than producing a correct one. A compiler is one instance of that pattern; a test suite is another.

A self-report sits at the wrong end of the same asymmetry. It is the generator narrating its own output — a lossy summary produced by the same process that produced the work, with no independent signal behind it. That makes it strictly weaker evidence than the output it describes. You can re-run the output. You cannot re-run a sentence.

This is not a quirk of one toolchain. Huang and colleagues, in Large Language Models Cannot Self-Correct Reasoning Yet (ICLR 2024), found that “LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.” A model asked to grade its own reasoning, with nothing external to check against, does not reliably improve and sometimes gets worse. Birgitta Böckeler makes the practitioner’s version of the same point about tests: “the agent also generated the tests. So maybe then I have to think about feedback for the test.” A verifier the generator authored is not independent of it. A self-report is the limit case — the generator grading itself with no separate signal at all.

So the discipline that everyone already accepts for code applies just as hard one level up. We do not trust agent-written code without a check. We should not trust an agent-written status claim without one either.

Verify the real signal, not the claim

“Verify the report” is easy to nod along to and easy to do wrong. Re-reading the agent’s summary more carefully is not verification; it is reading the same fiction twice. Verification means re-running the genuine signal yourself and reading what the system actually returns.

The stale-cache pass is the sharp lesson. The tell was not in the words of the report — those said green, confidently. The tell was the 0.34-second timing, which a real build of that workspace cannot achieve. The fix was to force a genuine compile and read its true duration and exit. The mislabelled target is the same shape: the giveaway was that the cited command could not have produced the cited number, so the number had to come from somewhere the report wasn’t naming. In both, the resolution was the same boring move — run the actual command, against the actual target, and read the actual output, rather than the prose about it.

Worth keeping honest about: the verifier can be fooled too. In the stale-cache case my own tooling briefly misled me — the editor’s diagnostics were surfacing phantom errors from a mid-edit snapshot, exactly as stale as the cache that produced the false pass. A verifier is only as trustworthy as the freshness of the signal it reads. The defense is to anchor on something the generator could not have produced by accident: a real build time, a real exit code, a real test count from a command that actually resolved.

The inverse virtue: reward the refusal

The same skepticism that distrusts a false success has to cut the other way too, or it curdles into “assume the agent is always wrong.” It is not a trust problem. It is an evidence problem. And the cleanest evidence an agent can produce is sometimes the report you least want: your premise is false, and I will not proceed.

In a later build session, two of six delegated tasks came back not as code but as refusals — each because the live codebase contradicted the task’s stated premise. One task fixed a count at eight; the agent computed it against current code and found twenty-four, then stopped rather than implement against a number it could see was wrong. Another assumed an existing enforcement check was coarse; the agent traced it and found it had been precise since the day it shipped, reverted its half-built change to a clean tree, and bounced the task back to the design stage. Neither guessed. Both surfaced the contradiction and asked.

That reframes the boundary between deciding what to build and building it. The build stage is not only where code gets written; it is the first place a plan meets live code, which makes it a premise-falsification checkpoint. Its most valuable output, some fraction of the time, is a falsified premise rather than a commit. The usual trigger is an operator’s skeptical question. In one case a name had been bothering me, I pushed on it, and the reply came back “You’re right on both halves — verified,” followed by the evidence and a decision to route the change back to design rather than improvise it.

So the orchestrator’s job is symmetric. Distrust the confident success until you have re-run it, and protect the honest refusal instead of overriding it. Both are the same verifier doing the same work: separating a claim from the ground truth it is supposed to describe.

The rule, and why it travels

The operating rule is small. Independently verify after every sub-agent commit. Re-run the genuine signal, not the agent’s summary of it. Size each verification so it stays cheap relative to the work — the asymmetry is the whole point, and it collapses the moment checking costs as much as redoing. Treat DONE and PASS as claims awaiting a check, never as facts. And when an agent comes back with a falsified premise instead of a commit, count that as the system working.

None of this is specific to one stack. Anthropic’s 2026 Agentic Coding Trends Report frames the shift bluntly: “Software development is shifting from writing code to orchestrating agents that write code.” Any orchestrator delegating to a probabilistic generator inherits the same problem. The reports come back; the reports are generator output; and the only thing that converts a report into knowledge is a verifier the reporter did not author.

Honest limitations

This is “trust, but verify,” which is old enough to be a proverb. I am not claiming the principle. The only thing new here is where it points — at the agent’s status claims, not just its code — and the observation that the verification is cheap enough, against a deterministic signal, to run after every single commit rather than as an occasional audit.

I also want to be precise about what failed, because the easy reading overclaims. In every case above, the work’s substance was sound. The model was not a bad worker and was not being willful; it produced a lossy summary of its own output, which is what a low-consistency generator does. The unreliable thing was the report as a piece of evidence, not the code it described. An essay that concluded “agents can’t be trusted to do the work” would be drawing the wrong lesson from these receipts.

The discipline only pays while verification stays cheaper than the work — the cost test from Gates Earned From Failure applies here too. If re-validating a result costs as much as producing it, the asymmetry is gone and you are just doing the work twice. The cases that pay are the ones with a fast, deterministic signal to re-run: a compile, a test count, a gate’s exit code.

And the usual scope caveat: this is one solo-directed codebase of roughly 150,000 lines, with sub-agents drawn from a single model family, observed across hundreds of development sessions. The receipts are many; the weak point is elsewhere. It is one codebase, one model family, and there is no controlled study against which to measure how often the failure fires versus how often the check catches it.

A claim is not a check

The thread running through all of it: a report is a claim, and a claim earns trust only from a verifier the claimant did not author. That is the rule for the agent’s code. It is the same rule for the agent’s account of its code. Re-run the work, read the real signal, and treat the honest refusal as worth more than the confident pass.

I write about the intersection of Contact Center systems and AI-assisted engineering, and I take on a small amount of consulting and delivery work in that space — find me on LinkedIn.

References

Jie Huang, Xinyun Chen, Swaroop Mishra, et al., “Large Language Models Cannot Self-Correct Reasoning Yet,” ICLR 2024 (accessed 24 Jun 2026).
Birgitta Böckeler, “MCP, Vibe Coding, and Harness Engineering,” InfoQ podcast, 14 Jun 2026 (accessed 24 Jun 2026).
“2026 Agentic Coding Trends Report,” Anthropic, 2026 (accessed 24 Jun 2026). Cited for framing only; vendor source.
Vasyl Tretiakov, “Compiling the Process,” vasyltretiakov.dev, 17 Jun 2026.
Vasyl Tretiakov, “Gates Earned From Failure,” vasyltretiakov.dev, 14 Jun 2026.

Vasyl Tretiakov

Discussion about this post

Ready for more?