Pigeons and variance.
One passing run is a coin flip. Two is a release gate. The difference between those two sentences is most of what turned the experiment into something I’d bet on.
Phase 1 of the experiment ended on iteration 10 with one passing run. Maya, the participant, reached goal_reached with a verdict score of 8. Across the entire phase 1 ledger that was the only natural success, and it came after the agent had shipped five hacks and three unrooted workarounds. I could have called the loop converged. Many systems get released on less.
I didn’t, because one pass is a coin flip.
Pigeons.
In 2015 a group at UC Davis trained pigeons to classify breast-tissue biopsy slides as benign or malignant. A single pigeon got around 50% on the easy slides and barely above chance on the hard ones. That’s the headline most people remember.
The other headline, which is the one I think about more, is what happened when they had a flock of four pigeons vote and took the majority. The flock landed somewhere north of 90% accuracy on the same slides. On some categories it beat actual radiologists. Same pigeons, same training, same slides — just averaged.
The signal didn’t get sharper; the noise got averaged out. Two independent rolls of a noisy process tell you something a single roll can’t. This is the entire theory of ensemble methods, of clinical trial replication, of why you don’t poll one voter.
Two of two.
Phase 2’s success criterion was two natural participants reaching a real verdict, in the same study, on the same build. Not one. Not eight of nine. Two of two.
The threshold is deliberately low on count and high on consistency. Two passes don’t cost much to run. What they cost the loop is permission to ship on a lucky single-participant run. A flake at one participant pulls the whole release down, so the agent has to fix what made the flake possible, not just shepherd one participant across the line. Five product-side changes later, two of two natural participants reached real verdicts on noemica. That’s the entire phase-2 result, and that’s the gate that now runs on every release as phase 7.
The deeper move is what counts as a “real verdict.” Reaching the end of the flow doesn’t count. Clicking launch doesn’t count. Even seeing a participant card doesn’t count by itself. What counts is the outer participant coming out of the run with an actual insight about the inner study’s target — the thing a real customer would have wanted out of noemica. The gate measures value extracted, not flow completed.
An 80/20 the agent didn’t see.
On iteration 11 the agent ran into a real participant who sat through the wait and then quit five minutes before the payoff. The natural participant’s verdict captured it cleanly:
“A brief interruption near the end left me back at square one with no results in hand. What I walked away with: a study in progress, two simulated participants exploring the sandbox, and a waiting screen. That’s infrastructure, not insight.”Maya, outer participant, iter 11
The agent diagnosed this correctly: the running page wasn’t surfacing meaningful progress, and the dominant copy at the top literally said “Close this tab.”The next iteration shipped a fix that did two things. The first was a copy change — right call. The second was making participant rows clickable as soon as any single participant’s verdict landed, instead of waiting for the full cohort. That second change reads like a clean usability fix and it’s the one I rolled back two iterations later.
The design rationale — which lived only in my head, never in the codebase — was that a noemica study’s value is distributed roughly 80/20. Any single participant’s verdict is worth maybe twenty percent of what the run is worth. The other eighty percent comes from reading the verdicts together: the patterns across participants, the contradictions, the synthesis the designer can produce once it has all the inputs in hand. Letting users drill into one participant before the cohort finishes feels generous. It also pulls them away before they get the part that matters.
The fix would have shown surface success in the metrics that the agent could see — did the participant interact, did they reach a screen. It would have failed quietly on the metric that doesn’t live anywhere in the codebase: did the user get the value. The agent didn’t see that metric because it was never written down. It was institutional knowledge, in the most precise sense of the term — load-bearing for the product, invisible to anyone who hadn’t built it.
That’s the strongest argument I have for keeping a human in the loop on design decisions while the agent does the implementation work. Some product principles are articulable only after they’ve been articulated. The agent will sand them off in service of a more legible metric every time, not because it’s careless but because it can only see what’s in front of it.
What the release gate is.
Phase 7 is phase 2’s configuration running on every noemica release. The same loop, the same brief, the same two-of-two requirement, against production noemica.io. If two natural participants finish a study and come out the other side with real insights into their target product, the release is good. If not, the release isn’t.
Tests pass means the code did what the spec said. The gate passes means a person could get the value the product promises. Those are different bars. Most releases pass the first long before they pass the second.
What the gate is calibrated for, in one line: not the user who entered the funnel, the user who came out with insights. There are products optimized for the first number. This isn’t one of them.