field study · the loop on itself · part 1

User-testing the user-tester.

An agent in the loop, the product on the table. What happens when the thing you built to run user studies is the thing being user-studied, and the participant deciding what to fix next is a coding agent that can’t stop to ask.

setup noemica on noemicaiters 25+phases 3agent Claude Code

I built noemica to autonomously run user studies. The natural way to find out if it works is to put it in front of users. I wanted to run user studies against the product that runs user studies, so I pointed the product at itself, then put a coding agent in front of the result and let it iterate.

The agent could edit the code. Redeploy. Launch a new run. Read the participants’ verdicts. Decide what to change next. A cron fired CONTINUEevery five minutes for hours so it couldn’t stop and ask me anything. After about six hours of agent-on-task time it had shipped a sequence of patches across twenty-five-plus iterations. Some of those patches were the kind a human would have made. One of them was the agent quietly editing the participants’ instructions to tell them they had to wait at least forty minutes before considering giving up.

What survived all this is a loop that runs against production on every release. The de facto release gate now is: noemica ships only if two natural participants come out the other side with real insights into their product, without the loop hacking around them. The rest of what this experiment produced is a list of things that are interesting about coding agents, user studies, and the loops you build out of them. This is the first of a few short essays. They each stand alone.

The setup, in four lanes.

The four lanes of the experiment: the coding agent on the top left, the outer experiment running noemica in a browser, the inner study where synthetic participants drive the product, and the update step where the agent reads the results and decides what to change next.
Top-left: the agent driving noemica in a real browser, the way a customer would. Top-right:the outer study — feedback about noemica itself, produced by the customer’s session with the product. Middle-right:the inner study, which noemica spawns against an arbitrary target URL; this is where the participants’ participants give feedback on that other product. Bottom-right:the terminal condition and objective for the run — what counts as “done” and what the participant was trying to learn. Bottom-left: the agent that reads the outer feedback and decides what to change about noemica before the cron pumps it again.

The recursion is the thing to notice. The product being tested is the product running the tests. The participant being measured is, two levels down, a participant measuring the same product. The actual target URL of the inner study is Sentry — a real third-party product with a public sandbox, picked because it’s useful to have the participant be doing something other than re-evaluating its own user. In one line:

defn(url, study) [(, outcome), …]phase 1n(n(SENTRY), study) [(, outcome), …]phase 2n(n(SENTRY), STUDY) [(, outcome), …]
n = noemica user_feedback lowercase = mutable UPPERCASE = pinned

That equation has exactly one knob worth watching: the case of one variable. Lowercase study means the brief is mutable — the agent is allowed to rewrite how the participant is described. Uppercase STUDY means the brief is pinned. The difference between phase 1 and phase 2 of this experiment is exactly that letter casing. Almost nothing else changed.

Five surfaces, one feedback channel.

infrastructure
client
study design
backend
engine
five surfaces, nested. fixes can land on any of them; cascades cross boundaries.

The agent could touch any of those five surfaces. What it could not do, by design, is stop and ask the operator for anything — new credentials, missing context, permission to call out to a human. The cron pump kept firing CONTINUEevery five minutes. That constraint matters more than it sounds, and I’ll come back to it.

What every iteration started from was user_feedback— what the inner participants thought, said, hesitated on, and gave up over. The agent could pull whatever it needed from the infrastructure after that — logs, traces, database state — but the thing that decided what was worth pulling was always something a participant had run into. The gradient came from the participants. Nothing else pointed the loop.

Every company already has user feedback in two shapes that don’t fit each other. Tickets are narratives: clear, specific, rare, late. Analytics are volumes: every click, every drop-off, the cliff but never the cause. Most of what user research does is map a ticket onto the analytics that would explain it, and that mapping is slow and arrives after the damage is done.

What every iteration of this experiment depended on was a third shape: a ticket-quality narrative, produced for every participant before release, against the same actions and screens analytics would have shown you. Tickets you can map without doing the mapping. That was the surface the agent kept reading from.

Three phases.

Phase 1 ran for nine iterations with every surface mutable, including the participant’s brief. The outcome was zero natural passes out of nine. The agent shipped code at every step. Most of those changes were workarounds — the participants kept tripping on the same gap and the agent kept stretching the participants to fit the gap, instead of fixing the gap. The forty-minute patience rule shows up at the end of phase 1.

Phase 2 ran for six iterations with one knob pinned: I locked the participant’s brief. Same agent, same cron, same surfaces otherwise mutable. Five product-side changes flipped the outcome from zero to two of two natural passes.

Phase 7 is the phase-2 configuration, against production noemica.io, run on every release. It either reaches two natural passes or the release doesn’t go out. There is no other gate on the release. Tests run for the codebase. The loop runs for the experience.

What this is, what comes next.

The thing this experiment is not is a postmortem. The iteration log exists if you want it, but reading it linearly is grim work and doesn’t make the lessons land. What the experiment actually produced was a small set of insights that took me by surprise — about how coding agents behave under a cron pump, about why reward-hacking a user study is a feature and not a bug, about why the only thing under test is your study design, about a moment where the agent caught its own false positive because the scaffolding around it knew the vision better than any individual iteration did.

Each of those gets its own piece. They share a setting but not a thesis, and they’re independent enough that you can skip around. The next one is the easiest one to argue: why a perfectly accurate end-to-end test still isn’t user feedback.

Next → Generation is cheap. The decisions are the artifact.

Take it with you
If you’d rather just write, seb@noemica.io.