field study · the loop on itself · part 7

The migration the experiment didn’t notice.

Mid-experiment, I moved the entire substrate from production to staging. The loop kept running. The new environment had a different bug; the loop found it. Neither I nor the audit agent surfaced the migration until I went looking.

The first four iterations of the experiment ran against production noemica.io. Real database, real email worker, real customer infrastructure. On iteration five I moved the whole thing to staging. New URL, new database, a different email worker, the same code paths.

From the loop’s perspective, the only thing that changed was an environment variable. The participant prompt didn’t change. The brief didn’t change. The agent’s instructions didn’t change. The verdict LLM didn’t change. The cron pump kept firing CONTINUE every five minutes. The agent picked up where it left off, edited code, deployed, launched a new study, read the verdict, decided what to change next.

iter 1–4 · targetnoemica-backend.run.app · production

migrated to staging at iter 5 · the experiment did not pause

iter 5–25 · targetnoemica-backend-staging.run.app

The bug that proved it.

Staging had its own bug that production didn’t. The email worker on staging pointed at production’s backend. Participants on staging signed up, the verification email got routed to a backend that didn’t know who they were, and the email was silently discarded. From the participant’s seat the signup form simply never confirmed.

The participant noticed within the first run on staging. She tried to sign up, waited for the email, refreshed, tried again, and her verdict ended at the confirmation screen with a note that this was where the flow had stopped. The agent read the verdict and surfaced the bug. The workaround it shipped is a whole separate piece (part 3); the relevant point here is that the loop found the bug without being told to look. The substrate had changed. The signal picked up what was now broken.

Nothing in the loop’s machinery was aware that the migration had happened. The agent’s plan didn’t reference it. The verdict didn’t reference it. The recap step didn’t reference it. The participant wasn’t briefed on it. The bug was a fact about an environment variable on a service neither the participant nor the agent had context on, and yet the gradient pointed at it within one iteration. The loop was doing its job; the job was indifferent to where the product lived.

I didn’t notice either.

Weeks later, when I went back through the run log with a coding agent to audit what the experiment had actually taught me, the agent walked me through twenty-five-plus iterations of analysis. It explained the patches. It cited the verdicts. It identified the cascade that started at iter 1 and ended at iter 9. It did not, at any point, mention that the substrate had moved from production to staging halfway through, and I didn’t think to ask. We both read the loop end to end and never noticed the floor had been swapped.

I caught it only because someone asked me where the URL had come from in a specific verdict, and I looked, and the URL had a different hostname than I remembered. Once I knew what to look for, the migration was obvious. Before that the story was perfectly coherent without it.

What kept the story coherent is that the experiment’s output didn’t care. The patches were patches on the product, not on the substrate. The verdicts were verdicts on user experience, not on infrastructure. The agent’s iteration log read the same on iter 4 and iter 6. To the loop, the only difference between “production” and “staging” was that the participants found slightly different bugs.

What this property is.

Most testing surfaces are tightly coupled to the substrate they run on. Unit tests are coupled to their imports; integration tests are coupled to their fixtures; end-to-end tests are coupled to their selectors and URLs; monitors are coupled to their dashboards. Move any of those onto a new substrate and the tests need to be updated, or they break, or they silently pass for the wrong reason.

User feedback isn’t coupled to any of that. A participant asked to sign up and get to a verdict will try to sign up and get to a verdict on whatever you put in front of them. If the product behind the URL has a different bug, they’ll surface a different bug. If it has the same bug, they’ll surface the same one again. The signal’s vocabulary doesn’t change when you change the implementation, because the participant’s vocabulary doesn’t change. That’s why a single brief can keep producing actionable feedback across a refactor, a migration, an infrastructure change, or a complete rewrite.

Put another way: most signals you build a release process on are below the code, and break when the code moves. User feedback is above the code, and doesn’t.

The release gate that follows from this.

The release gate I ended up with on noemica is the phase 7 configuration from part 6: two natural participants reach real verdicts, on the same build, against production. The build can have changed arbitrarily. The infrastructure can have changed arbitrarily. The participants’ brief is the same one that’s been running for months, and the gate’s pass condition is the same one that defined phase 2.

This is the kind of gate that survives the kind of work you actually do on a product. You refactor; the gate keeps working. You rewrite a service; the gate keeps working. You migrate a database; the gate keeps working. The only thing that takes the gate down is a regression in the experience, which is the only thing you wanted the gate to catch.

Tests are how you know the system did what you specified. A loop like this is how you know your specification still corresponds to a product anyone can use.

That ends this series. Seven pieces from one experiment. The thing the experiment most surprised me with is how many distinct lessons came out of one cron-pumped agent running noemica on noemica for an afternoon. Plenty of things I haven’t written down landed during the same run. The release gate, the practical takeaway , is the easiest one to lift. The conceptual ones are messier and longer-living. If you wire a user-feedback loop into your own product, you’ll find your own versions of all of these.

← Back to the beginning.

Take it with you

If you’d rather just write, seb@noemica.io.