field study · the loop on itself · part 2

Generation is cheap. The decisions are the artifact.

Where formal testing methods stop and user feedback begins. Formal tests prove the system did what the spec said. User feedback proves someone could figure out what to do. They aren’t the same thing, and the gap between them is wider than it sounds.

Two requests, in order. Notice how different they feel to do.

  1. Write a test for your product’s main flow.
  2. Write a test that would have caught the bug your users hit yesterday.

The first takes a tea break. Pick a level — unit, integration, end-to-end, whichever feels closest to the bug class. Open the page, click the buttons in order, assert the states. CI runs green. Coverage ticks up. Small dopamine hit.

The second is, in the formal sense, hard. To write that test you would have had to already know: what motivation a real user brought to the page, what they read into your copy, where they hesitated, which assumption your UI was quietly licensing, which moment of dissonance made them give up. None of that lives in the code. None of it can be inferred from the spec. You either knew it ahead of time, in which case the bug wouldn’t have shipped, or you didn’t, in which case there’s no test to write.

That gap doesn’t close by going up a level of formality. Types check that the values you pass match the shapes you declared. Lints check that the code follows the rules you wrote down. Unit tests check that functions return what you asserted. Integration tests check that components compose. End-to-end tests check that the whole flow does what the spec says it should. All of these are conformance checks. Every one of them is asking the same question, at different scales: did the system do what you told it to do?

User feedback is asking a different question entirely: could anyone, under their own motivation, figure out what they were supposed to do with what was on the screen? That question doesn’t reduce to spec conformance. The spec is what shipped the misread.

This isn’t hypothetical. While the experiment in part 1 was running, an iteration shipped a version of the running page whose dominant copy was literally “Close this tab.” A participant read it as an instruction and closed the tab. The page rendered exactly to its spec. No formal test would have flagged the copy, because the copy was the spec. What surfaced the bug was a person trying to use the product under their own motivation, deciding the page meant what it said, and acting on it. That’s the entire distinction in one frame.
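A minimal sketch of why the conformance check stays green in that story. The names `SPEC_COPY`, `render_running_page`, and `check_running_page_matches_spec` are hypothetical, invented for illustration; the experiment's actual page and test harness are not shown in this piece.

```python
# Hypothetical stand-ins: the real page and its tests are not shown in the article.
SPEC_COPY = "Close this tab."  # the dominant copy quoted above, treated as the spec


def render_running_page() -> str:
    """Pretend renderer: returns the dominant copy exactly as the spec dictates."""
    return SPEC_COPY


def check_running_page_matches_spec() -> bool:
    """Conformance check: did the page render what the spec said?"""
    return render_running_page() == SPEC_COPY


# The check compares the output to the spec and nothing else. It has no way
# to ask whether a reader will take the copy as an instruction and leave.
print(check_running_page_matches_spec())
```

The check can only fail if the page drifts from the spec. When the spec itself is the misread, there is nothing for it to compare against.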

The asymmetry, at every level.

Validating any formal test is cheap. You write the assertion, run it, see green or red, move on. Type, lint, unit, integration, e2e — doesn’t matter. Whatever the test was supposed to prove, the answer arrives in seconds.

Generating a useful test is asymmetrically expensive, and the asymmetry is the same at every level. Most of the cost isn’t in typing the code; it’s in noticing, ahead of time, which behaviors are worth asserting against. You can’t derive that from the spec, because the spec is what shipped the bug. You can’t infer it from telemetry, because telemetry tells you that users dropped off, not why. The problem isn’t hard because writing tests is hard. It’s hard because the search space (what a person under their own motivation might try with your product) is enormous, and humans don’t obligingly fan out across it for you on demand.

  validation    generation · mechanical    generation · useful
  seconds       minutes                    hard
The three things being conflated when people say “just write a test for it.” The first two are what formal testing does. The third is what user feedback is for.

A user study is the cheap shortcut around the hard column. You sidestep the generation problem by handing it to a person under their own motivation and watching what they do. You don’t have to anticipate what they’ll try. They just try things.

There’s a familiar shape to this. Linear problems you can often solve analytically: closed-form, faster than the simulation. Many non-linear ones you can’t; the only practical way to know the future state of three bodies under gravity, or of a fluid past a certain Reynolds number, is to integrate it forward. You don’t predict, you run. User behavior sits in that second class. There is no closed form for what a motivated person will do with your product. The cheapest known path to the answer is to put someone in front of it and watch them decide. A user study isn’t a workaround for a hard problem; it’s the problem’s only known shortcut.

The path is the residue.

The output of any user study, real participants or synthetic, is structurally a sequence of (action, observation) pairs plus a verdict. You can mechanically convert a participant’s path into a regression test. Many people do. Take the actions, replay them with Playwright, assert that the final state is whatever the participant ended up at, and you have a test that pins the recovered behavior in place.
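The conversion can be sketched in a few lines. This is an illustration under stated assumptions, not the experiment's tooling: `Path`, `TOY_APP`, `step`, and `to_regression_test` are hypothetical names, and a toy state machine stands in for the browser that Playwright would drive.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Path:
    """One participant session: (action, observation) pairs plus a verdict."""
    steps: List[Tuple[str, str]]
    verdict: str


# Toy product (illustrative only): maps (current_state, action) -> next_state.
TOY_APP: Dict[Tuple[str, str], str] = {
    ("product", "pick size"): "size_selected",
    ("size_selected", "add to cart"): "cart",
    ("cart", "check out"): "checkout",
}


def step(state: str, action: str) -> str:
    """Apply one action; unknown actions leave the state unchanged."""
    return TOY_APP.get((state, action), state)


def to_regression_test(
    path: Path, app_step: Callable[[str, str], str], start: str
) -> Callable[[], bool]:
    """Turn a recorded path into a check that replays its actions and
    compares the end state with the last thing the participant observed."""
    if not path.steps:
        raise ValueError("cannot build a test from an empty path")
    expected_final = path.steps[-1][1]

    def regression_test() -> bool:
        state = start
        for action, _observation in path.steps:
            state = app_step(state, action)
        return state == expected_final

    return regression_test


path = Path(
    steps=[("pick size", "size_selected"), ("add to cart", "cart")],
    verdict="gave up at the cart",
)
check = to_regression_test(path, step, "product")
print(check())
```

Note what the generated check never reads: `verdict` and the intermediate observations. Only the actions and the final state make it into the test.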

That conversion is the residue. The path is what’s left behind after the interesting work is done. The thing you actually paid for — what made the study worth running in the first place — is the cognition the participant did to arrive at the path: the half-second of hesitation in front of the size selector, the wrong read of the price strikethrough as “back in stock soon,” the moment they decided the copy was promising something the system couldn’t deliver and clicked away. Those are the decisions. They don’t survive the conversion to a test. They’re visible only at the moment they happened.

This is the trade. You wanted to find a bug that lives in the gap between your code and your user’s expectations — the gap formal tests can’t reach because formal tests are checking the wrong thing. A test written after the fact, even one mechanically derived from a participant’s path, can pin the behavior in place for next time. It does not surface the next gap. The next gap requires another participant willing to be confused on your behalf.

Synthetic participants don’t change the shape.

The participants in this experiment were synthetic — LLMs with a system prompt that gave them a background, a goal, and a disposition. The path they produce is structurally the same as a real participant’s path. The decisions inside the path are distributed differently — the LLM has different priors, different blind spots, will sometimes pattern-match the wrong metaphor — but the artifact is the same shape, and the same conversion to a regression test works on it.
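As a rough sketch of what such a system prompt might be assembled from: the three fields below are the ones named above, while the `Persona` class, the `system_prompt` helper, the example values, and the closing instruction line are all assumptions made for illustration. The experiment's actual prompts are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical container for the three fields the article names."""
    background: str
    goal: str
    disposition: str


def system_prompt(p: Persona) -> str:
    """Assemble a system prompt from a persona. The final instruction line
    is an assumed wording, not the experiment's."""
    return (
        f"Background: {p.background}\n"
        f"Goal: {p.goal}\n"
        f"Disposition: {p.disposition}\n"
        "Act as this person using the product. "
        "Before each action, say what you expect to happen."
    )


# Example values are invented for illustration.
persona = Persona(
    background="Occasional online shopper, not technical",
    goal="Buy a pair of shoes in the right size",
    disposition="Impatient; leaves if the page is confusing",
)
prompt = system_prompt(persona)
print(prompt)
```

Asking for the expectation before each action is one way to get the decision recorded next to the action it produced, which is the part a replayed path does not keep.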

The reason synthetic participants are valuable isn’t that they replace real users. They don’t. The reason is that the cost of generating a participant’s worth of cognition under uncertainty drops from “recruit, schedule, pay, observe” to roughly free. The decisions you couldn’t generate yourself are now generated for you, in volume, on demand, before the thing ships. The thing they catch is the thing you couldn’t have tested for because the relevant knowledge is posterior — it only exists once a participant has produced it.

If you build your loop around what your formal tests assert, you will never find what your users find. If you build it around what your participants did, you don’t have to anticipate users at all. They just show up and surface what was wrong, and you read the decisions, and you fix the gap.

The corollary, which is awkward to write down but true, is that the value of the participants is upstream of any output you can persist. The test you save off afterwards is downstream. If you confuse the two — if you treat the saved test as the asset and the participant session as a way to produce it — you optimize for regression coverage and lose the discovery channel that produced anything worth covering. Formal tests are great at pinning. They are not great at finding. The next piece is about a specific way coding agents make exactly this confusion.

Next → The path of least resistance.

If you’d rather just write, seb@noemica.io.