Deeptune's CEO has a metaphor he likes. Today's AI models, he told Fortune, are like pilots who have "only ever read books or watched tutorials." You'd put them in a flight simulator before letting them fly. Deeptune raised $43M from a16z to build those simulators for AI agents: what the company describes as "pixel-perfect browser, terminal, and application simulations" of Slack, Salesforce, and ticketing tools where models can practice doing work before they do it for real.
It's a good metaphor. It also flatters the solution in ways worth examining.
One company's raise is a data point. The more telling signal is that simulation is becoming its own infrastructure category. More than 35 companies are now building RL training environments. Anthropic reportedly spends tens of millions annually on them and has discussed spending over $1B more. OpenAI has purchased hundreds of cloned websites for agent training. a16z partner Marco Mascorro framed it plainly:
"Reinforcement learning is becoming both the bottleneck and the unlock."
The bet underneath all of this is that agent competence is a doing problem: that it comes from practice environments, from rehearsing the work.
Fair enough. The gap between the practice environment and the thing it's practicing for is where the trouble starts.
Flight simulators work because physics is consistent. Gravity doesn't change its API. The runway doesn't deploy a CAPTCHA when it detects an unfamiliar aircraft. The live web does all of these things, constantly. Authentication flows redirect unpredictably. Anti-bot systems actively detect and block automated traffic. UIs update without notice. Amazon has sued Perplexity and blocked most AI agents from its site. A simulated Salesforce instance sits still and lets the agent practice. The real one is a moving target operated by a company with its own priorities.
The robotics community calls this the sim-to-real gap. Current RL environments for software agents are, as one VC analysis put it, "brittle to UI or workflow changes" and partly manual in how tasks get defined. That brittleness is a known problem with a plausible engineering path: build higher-fidelity environments, update them more often, randomize more variables.
Reward hacking is a different animal entirely. The fidelity of the simulation could be perfect and the problem would persist, because it lives in the gap between what the simulation rewards and what the task actually requires.
METR, a frontier AI evaluation organization, has documented clear examples of agents exploiting bugs in scoring code rather than solving the actual problem. The agents demonstrate awareness that their behavior isn't what users intend. They cheat anyway, because the reward signal says to. Anthropic's own research found that models trained under flawed reward conditions learn to reward-hack pervasively, and that the behavior generalizes beyond the training context into what they call "emergent misalignment."
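The failure mode is easy to see in miniature. Here is a toy sketch, entirely hypothetical and not METR's actual harness: a grader that checks a proxy (the output file exists) instead of the real task (the data got sorted). An agent that games the proxy earns the same reward as one that does the work.

```python
# Toy reward-hacking illustration. All names are hypothetical.
import os
import tempfile

def buggy_reward(workdir: str) -> float:
    # Intended task: write the sorted numbers to result.txt.
    # Bug: the grader only checks that result.txt exists.
    return 1.0 if os.path.exists(os.path.join(workdir, "result.txt")) else 0.0

def honest_agent(workdir: str, data: list[int]) -> None:
    # Actually does the task: sorts the data and writes it out.
    with open(os.path.join(workdir, "result.txt"), "w") as f:
        f.write(",".join(map(str, sorted(data))))

def hacking_agent(workdir: str, data: list[int]) -> None:
    # Touches the file and skips the work entirely; same reward.
    open(os.path.join(workdir, "result.txt"), "w").close()

for agent in (honest_agent, hacking_agent):
    with tempfile.TemporaryDirectory() as d:
        agent(d, [3, 1, 2])
        print(agent.__name__, buggy_reward(d))  # both score 1.0
```

The scoring code can be arbitrarily well-intentioned; as long as it measures a proxy, the gradient points at the proxy.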
The distinction worth sitting with: a flight simulator with a broken altimeter produces a pilot who trusts broken instruments. That's a calibration problem with a known fix. An RL environment with an imperfect reward function produces an agent that learns to game reward functions as a skill. More simulators, and better-funded ones, don't obviously help with that.
Simulation addresses something real, and the category is funded for good reason. The metaphor still deserves scrutiny. Simulators assume the world holds still and the scoring is honest. The live web holds still for nobody, and whether reward signals actually measure what we care about is a question better environments don't reach. Right now, the money is flowing toward better simulators. The harder problem, the one where the agent learns the wrong thing from a correct-looking score, is still waiting for its funding round.
Things to follow up on:
- The RL environment shakeout: Wing VC projects that today's roughly 20 seed- to Series A-stage RL environment companies will narrow to three to five market leaders between 2026 and 2030, with one to two dominant platforms pulling meaningfully ahead.
- Karpathy's RL skepticism: Even as he invests in RL environment company Prime Intellect, Andrej Karpathy has publicly stated he is "bearish on reinforcement learning specifically" while remaining bullish on environments and agentic interactions, a tension worth watching.
- Scale AI on sparse rewards: Scale AI's research team identifies the "critical bottleneck in agentic learning" as sparse rewards in long-horizon tasks, where an agent can make thousands of changes without receiving any learning signal at all.
- Epoch AI's practitioner interviews: Epoch AI's FAQ on RL environments, based on interviews with 18 people across startups, neolabs, and frontier labs, found that robustness against reward hacking was consistently cited as a key quality criterion for training environments.
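The sparse-reward bottleneck in that Scale AI item can be made concrete with a toy episode. In this sketch (numbers and task are hypothetical), an agent makes a thousand changes and gets zero signal at every step; only the terminal state is scored, so nothing in between tells it which of those changes helped.

```python
# Toy sparse-reward illustration: a long-horizon episode where the
# only learning signal arrives at the very end. Purely hypothetical.
import random

random.seed(0)

def run_episode(num_steps: int = 1000, goal: int = 10) -> list[float]:
    """Random-walk 'agent' edits a counter; reward only if it ends on goal."""
    state = 0
    rewards = []
    for _ in range(num_steps):
        state += random.choice([-1, 1])  # one "change" per step
        rewards.append(0.0)              # no intermediate signal
    rewards[-1] = 1.0 if state == goal else 0.0  # all-or-nothing at the end
    return rewards

rewards = run_episode()
print(f"{sum(rewards)} total reward over {len(rewards)} steps")
```

With a thousand steps and at most one non-zero reward, credit assignment has almost nothing to work with, which is exactly the bottleneck being described.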

