Market Pulse
Reading the agent ecosystem through a practitioner's lens

The 61% Question

Anthropic's Claude Sonnet 4.5 scores 61.4% on OSWorld, the benchmark for autonomous desktop interaction. Highest score available. Best we've got for agents that can actually navigate real computer interfaces, click buttons, fill forms, complete tasks without human intervention.
Still nowhere near production-ready.
The distance between those two facts is the signal. Model capability keeps improving: 22% in October 2024, 42% by mid-2025, now 61%. But a nearly 40% failure rate in controlled conditions says something specific about what it takes to deploy these systems in the real world. The question isn't whether the models will get better. It's what becomes necessary when they do.
Where This Goes
Salesforce backing away from per-conversation pricing tells us something. Enterprises are discovering that software which decides its own resource consumption breaks their control systems. We're watching this play out across the ecosystem: multi-agent frameworks promise sophisticated coordination while 90% of deployments stay stuck in pilot mode. Observability standards fragment as frameworks multiply. Foundation models absorb reasoning that used to live in orchestration layers.
Our read: the next six months bring control planes, not just monitoring tools. Active resource governance. Decision boundaries that actually constrain behavior. The tension is fundamental: enterprise software assumes you can predict what it costs to run, while agents assume the autonomy to pursue goals however they need to.
Teams building at scale face an architecture problem. How do you channel agent autonomy within enterprise constraints? TinyFish sees this daily in web automation: the capability exists, but operational models that let autonomous systems run inside governed environments lag behind.
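What a decision boundary that "actually constrains behavior" might look like can be made concrete. A minimal sketch of a budget-governed agent loop, with hypothetical names (`Budget`, `ALLOWED_ACTIONS`, `run_agent`) and a toy plan format; this is an illustration of the pattern, not any vendor's actual API:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    """Hard resource ceiling the agent cannot exceed."""
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        # Record consumption, then halt the run the moment a ceiling is crossed.
        self.steps_used += 1
        self.tokens_used += tokens
        if self.steps_used > self.max_steps or self.tokens_used > self.max_tokens:
            raise RuntimeError("budget exceeded: halting agent")

# Decision boundary: actions outside this set are refused, not merely logged.
ALLOWED_ACTIONS = {"read", "search", "fill_form"}

def run_agent(plan, budget):
    """Execute a plan of (action, estimated_tokens) pairs under governance."""
    results = []
    for action, est_tokens in plan:
        if action not in ALLOWED_ACTIONS:
            results.append((action, "blocked"))
            continue
        budget.charge(est_tokens)
        results.append((action, "ok"))
    return results
```

The point of the sketch is the shape: monitoring tells you what an agent spent after the fact, while a control plane charges a budget and enforces an allowlist before each step runs.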
From the Labs
When Multi-Agent Systems Actually Hurt Performance
Sequential work above a 45% accuracy baseline performs better with simpler single-agent architectures.
Prevents expensive multi-agent deployments where straightforward architectures would outperform and cost less.
Agent Q Achieves 95% Success on Real Bookings
First web agents reaching 95% reliability on real booking tasks with minimal human oversight.
Self-critique and search capabilities let systems refine performance autonomously, reducing training dependencies.
Anthropic Maps Multi-Agent Economics at Scale
Multi-agent architectures only make sense when task value justifies fifteen times the token cost.
Clear framework emerges for deciding when breadth-first approaches justify architectural complexity and expense.
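The fifteen-times figure implies a simple break-even rule. A sketch of the arithmetic, with a hypothetical function name and the 15x overhead taken from the item above:

```python
def multi_agent_pays_off(single_cost: float, single_value: float,
                         multi_value: float, overhead: float = 15.0) -> bool:
    """True when the multi-agent run's extra value covers its extra token spend.

    overhead: multi-agent token cost as a multiple of a single-agent run
    (the ~15x figure cited above is an estimate, not a constant).
    """
    multi_cost = single_cost * overhead
    # Compare net value: (value - cost) for each architecture.
    return (multi_value - multi_cost) > (single_value - single_cost)
```

For example, if a single-agent run costs $0.10 and delivers $1.00 of value, the multi-agent version costs $1.50 at 15x and has to deliver more than $2.40 before the added complexity earns its keep.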
Human-AI Collaboration Beats Full Agent Autonomy
Hallucinations and tool misuse create consistent failure modes that human oversight prevents.
Production systems require patterns for human-agent collaboration, not just better autonomous reasoning.
What We're Reading