Shopping Agents Scored 76% On The Customer Who Doesn't Exist

Two arXiv papers measure the gap between simulator-graded LLM shopping agents and human-graded ones at roughly thirty points, and explain why retail pilots like Klarna's have been quietly walking back their automation claims.

The gap between simulator-graded shopping agents and human-graded ones is now measured, and it is larger than the field has been admitting. A January study and a new companion paper put real shoppers in front of LLM retail agents and found the benchmark had been overstating success by roughly thirty points. The pilots that have been quietly walking back their AI assistant numbers now have papers to cite, and a clean way to explain the shortfall.

Lost in Simulation ran τ-Bench retail tasks against shoppers in the US, India, Kenya, and Nigeria, holding GPT-4o fixed as the agent. Against real US participants, GPT-4o landed at 45.2%; the same agent had been scoring near 76% against Sonnet-4.5-simulated users. The agent was the same in both runs. The customer was not.
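The two headline figures can be read as an absolute gap or a relative one, and the two readings give different numbers. A minimal Python sketch of the arithmetic, using the 76% and 45.2% figures above; the 76% is only "near 76%", so the outputs are approximate, and the function name is illustrative:

```python
def success_gap(simulated: float, human: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative shortfall as a fraction)."""
    absolute_points = (simulated - human) * 100
    relative_shortfall = (simulated - human) / simulated
    return absolute_points, relative_shortfall


# Success rates as fractions: simulator-graded vs. graded by real US participants.
points, shortfall = success_gap(simulated=0.76, human=0.452)
print(f"{points:.1f} points absolute, {shortfall:.1%} relative")
```

On these inputs the gap is about 31 percentage points, or about 40% of the simulated score.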

The gap is not uniform. Simulated users underestimated agent success on the hardest tasks and overestimated it on moderate ones, where humans hit 39.0%. A further nineteen-percentage-point gap separated older AAVE speakers from older Standard American English speakers, and the simulator was worst calibrated for AAVE and Indian English. The simulator is not merely noisy. It is confidently grading a different population from the one the agent will meet. The error is structured, not random: it tracks task difficulty and dialect.

The companion paper, Beyond Cooperative Simulators, supplies the diagnosis. Existing user simulators are cooperative, homogeneous, and behaviourally thin. The new method, Persona Policies (PPol), evolves Python generators that produce users who are, in the abstract's words, “unclear, impatient, or reluctant to share information”. Annotators rated the new simulated users as human 80.4% of the time, roughly double the baseline. Training agents against the harder simulator lifted task success by seventeen percent. The two papers converge: the benchmark has been scoring a customer that does not exist.
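To make "unclear, impatient, or reluctant" concrete, here is a hand-written sketch of what a persona-driven user generator could look like. It is not the PPol implementation: the paper evolves its generators, whereas the `Persona` fields, the `simulate_user` function, and the reply templates below are all hypothetical names invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Persona:
    clarity: float     # probability a reply states the detail plainly
    patience: int      # number of agent questions tolerated before quitting
    disclosure: float  # probability the user agrees to share a detail


def simulate_user(persona, requested_fields, profile, seed=0):
    """Yield one simulated user reply per field the agent asks for."""
    rng = random.Random(seed)  # seeded so a test run is repeatable
    for turn, field in enumerate(requested_fields, start=1):
        if turn > persona.patience:
            # Impatient: abandons the conversation.
            yield "I don't have time for this. Forget it."
            return
        if rng.random() > persona.disclosure:
            # Reluctant: pushes back instead of answering.
            yield f"Why do you need my {field}?"
        elif rng.random() > persona.clarity:
            # Unclear: answers without giving the detail.
            yield "It's the usual one, you should have it on file."
        else:
            yield f"My {field} is {profile[field]}."


profile = {"order id": "A123", "postal code": "94110"}
difficult = Persona(clarity=0.5, patience=1, disclosure=0.5)
for reply in simulate_user(difficult, ["order id", "postal code"], profile):
    print(reply)
```

Setting all three traits to their cooperative extremes (`clarity=1.0`, `disclosure=1.0`, high `patience`) recovers the polite simulated customer that always answers on the first ask.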

Illustration: a nautilus holds two report cards in its tentacles, one labeled "Simulated Shopper" and graded 76%, the other labeled "Real Shopper" and graded 45%.

Retail has been making the same observation without writing it up. Klarna spent 2024 boasting that its assistant was doing the work of seven hundred agents; in May 2025 its CEO told Bloomberg the cuts had gone too far and the company was rebuilding human staffing. An Amazon Rufus study found that LLM digital twins aligned with human action patterns and yielded similar design feedback. That is a more optimistic result, though the study evaluated design-stage interactions rather than live task completion under realistic user pressure.

What the benchmark passes, the shopper does not.

The defence is the obvious one. A simulator does not have to be accurate, only useful for relative ranking; if Sonnet 4.5 beats Sonnet 3.7 in simulation and beats it in deployment, the absolute numbers are decoration. τ-Bench arrived in mid-2024 as the field's standard tool-use eval on exactly this premise: cheap, repeatable, good for sorting.
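The ranking defence is testable: score the same agents under both graders and check whether the two orderings agree. A minimal sketch using Kendall's tau, a rank-correlation statistic unrelated to the τ in τ-Bench. The scores below are made-up placeholders, not results from either paper:

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a: +1 when the two orderings agree, -1 when reversed."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        direction = (a[i] - a[j]) * (b[i] - b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical success rates for four agents (placeholder values only).
simulated_scores = [0.80, 0.72, 0.65, 0.60]
human_scores = [0.50, 0.55, 0.42, 0.35]
print(f"tau = {kendall_tau(simulated_scores, human_scores):.2f}")
```

A tau near 1 would support the defence. A low tau, or one that differs between user groups, would mean the simulator is not even sorting agents reliably.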

That defence dies in the Expected Calibration Error column. The Lost in Simulation authors found an ECE of 11.7 for Standard American English speakers and 20.3 for AAVE speakers; the simulator’s calibration error varies by demographic. The leaderboard’s distortion is differential, not uniform. Two agents tied on the leaderboard can land in opposite places once the shopper is not the customer the simulator was trained to imagine. The benchmark is selecting for agents optimised to a particular customer fiction.
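For readers who have not met the metric: Expected Calibration Error groups predictions into confidence bins and averages the gap between stated confidence and observed accuracy, weighted by bin size. A minimal sketch of the standard equal-width binned definition follows. The paper's exact binning, and how it derives a confidence from the simulator, are not given here, so treat the setup as an assumption:

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |mean confidence - accuracy|.

    confidences: predicted probabilities of success, each in [0, 1]
    outcomes:    observed results, 1 for success and 0 for failure
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Four predictions at 90% confidence, three of which came true.
print(round(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]), 2))
```

The function returns a fraction between 0 and 1, so multiply by 100 to compare with figures quoted on a percentage scale. Computing it separately per user group is what exposes a differential distortion rather than a uniform one.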

The price is paid where retailers do not yet measure. A pilot agent that hits its simulator targets and then underperforms by forty percent in relative terms in deployment (the drop from roughly 76% to 45%) is not failing by its own lights. It is meeting the only number anyone graded it against, and that number was calibrated against a polite customer who answers every question on the first ask. The fix is not bigger models; the bottleneck has moved off the agent and onto the persona used to grade it. Retailers serious about whether their shopping agent works should be running it against the rude shopper, the distracted one, and the one who refuses to type a postal code. PPol gives a way to construct that test cheaply, and Klarna’s reversal gives a way to explain the spend to the board. The papers now sit upstream of where the field has been failing: at the grader, not the agent.