
Market Pulse
Reading the agent ecosystem through a practitioner's lens

The 61% Question

Anthropic's Claude Sonnet 4.5 scores 61.4% on OSWorld, the benchmark for autonomous desktop interaction. Highest score available. Best we've got for agents that can actually navigate real computer interfaces, click buttons, fill forms, complete tasks without human intervention.
Still nowhere near production-ready.
The distance between those two facts is the signal. Model capability keeps improving: 22% in October 2024, 42% by mid-2025, now 61%. But a nearly 40% failure rate in controlled conditions means something specific about what it takes to deploy these systems in the real world. The question isn't whether the models will get better. It's what becomes necessary when they do.

Rina Takahashi
Rina Takahashi, 37, former marketplace operations engineer turned enterprise AI writer. Built and maintained web-facing automations at scale for travel and e-commerce platforms. Now writes about reliable web agents, observability, and production-grade AI infrastructure at TinyFish.
Where This Goes
Salesforce backing away from per-conversation pricing tells us something. Enterprises are discovering that software which decides its own resource consumption breaks their control systems. We're watching this play out across the ecosystem: multi-agent frameworks promise sophisticated coordination while 90% of deployments stay stuck in pilot mode. Observability standards fragment as frameworks multiply. Foundation models absorb reasoning that used to live in orchestration layers.
Our read: the next six months bring control planes, not just monitoring tools. Active resource governance. Decision boundaries that actually constrain behavior. The tension is fundamental. Enterprise software assumes you can predict what it costs to run. Agents assume the autonomy to pursue goals however they need to.
Teams building at scale face an architecture problem. How do you channel agent autonomy within enterprise constraints? TinyFish sees this daily in web automation: the capability exists, but operational models that let autonomous systems run inside governed environments lag behind.
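To make "control plane" concrete, here is a minimal sketch of what active resource governance could look like: a hard budget enforced in the agent's execution path rather than reported on a dashboard afterward. Everything in it, the RunBudget fields, the limits, the shape of step_fn, is a hypothetical illustration, not any vendor's API.

from dataclasses import dataclass

@dataclass
class RunBudget:
    """A hard ceiling the agent cannot exceed. All limits are illustrative."""
    max_tokens: int = 50_000
    max_tool_calls: int = 25
    tokens_used: int = 0
    tool_calls_used: int = 0

    def charge(self, tokens: int, tool_calls: int = 0) -> None:
        """Debit the budget; raise the moment a step breaches a ceiling."""
        if self.tokens_used + tokens > self.max_tokens:
            raise RuntimeError("token budget exceeded: halting run")
        if self.tool_calls_used + tool_calls > self.max_tool_calls:
            raise RuntimeError("tool-call budget exceeded: halting run")
        self.tokens_used += tokens
        self.tool_calls_used += tool_calls

def run_agent(task: str, budget: RunBudget, step_fn) -> str:
    """Drive an agent loop in which every step must clear the budget.

    step_fn stands in for whatever produces one agent step; assumed to
    return (output_or_empty_string, tokens_spent, tool_calls_made).
    The budget bounds the loop: accumulating spend eventually raises.
    """
    while True:
        output, tokens, calls = step_fn(task)
        budget.charge(tokens, tool_calls=calls)  # enforce, don't just log
        if output:
            return output

The mechanism is trivial; the shift is where it sits. A monitoring tool tells you the run overspent. A control plane refuses to let it.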
The AI agent market reached $5.4 billion in 2024, with 45.8% projected annual growth and 44% of billion-dollar enterprises moving past experimentation.
OpenTelemetry introduced semantic conventions for agents in 2024, but standards remain actively evolving across the CrewAI, AutoGen, and LangGraph frameworks (see the sketch below).
Microsoft charges $4 per hour and Intercom $0.99 per resolution, while Salesforce's conversation-based model faces enterprise pushback demanding seat-based predictability.
1-800Accountant operates 20+ agents; Aviva saved £60M with multi-agent liability assessment; UC San Diego reduced sepsis deaths by 17% using monitoring agents.
Enterprise platforms require $50K-$200K in professional services and 3-6 month deployment timelines, plus ongoing platform licenses and API dependencies.
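For teams wiring observability up today, the OpenTelemetry side looks roughly like the sketch below: one span per model call, tagged with the draft gen_ai.* attributes. Those attribute names come from the experimental GenAI semantic conventions and may still shift; call_model is a stand-in for a real client, not a library function.

from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

def call_model(prompt: str):
    """Stand-in for a real model client.

    Returns (text, input_tokens, output_tokens).
    """
    return f"echo: {prompt}", len(prompt.split()), 3

def traced_model_call(prompt: str) -> str:
    # Span name follows the draft convention: "{operation} {model}".
    with tracer.start_as_current_span("chat claude-sonnet-4-5") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "claude-sonnet-4-5")
        text, tokens_in, tokens_out = call_model(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", tokens_in)
        span.set_attribute("gen_ai.usage.output_tokens", tokens_out)
        return text

Without an SDK configured this runs as a no-op, which is the portability argument in miniature: the instrumentation survives a framework swap even when the backend changes.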
From the Labs
When Multi-Agent Systems Actually Hurt Performance
Sequential work above a 45% accuracy baseline performs better with simpler single-agent architectures.
Prevents expensive multi-agent deployments where straightforward architectures would outperform and cost less.
Agent Q Achieves 95% Success on Real Bookings
First web agents reaching 95% reliability on real booking tasks with minimal human oversight.
Self-critique and search capabilities let systems refine performance autonomously, reducing training dependencies.
Anthropic Maps Multi-Agent Economics at Scale
Multi-agent architectures only make sense when task value justifies fifteen times the token cost.
Clear framework emerges for deciding when breadth-first approaches justify architectural complexity and expense.
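To make the fifteen-times figure concrete, here is the back-of-the-envelope version. The price and token counts below are illustrative assumptions; only the multiplier comes from the research.

# Break-even sketch for the 15x token multiplier.
PRICE_PER_1K_TOKENS = 0.01    # assumed blended $/1K tokens
SINGLE_AGENT_TOKENS = 40_000  # assumed tokens for one single-agent run
MULTIPLIER = 15               # token overhead reported for multi-agent runs

single_cost = SINGLE_AGENT_TOKENS / 1000 * PRICE_PER_1K_TOKENS
multi_cost = single_cost * MULTIPLIER

print(f"single-agent run: ${single_cost:.2f}")  # $0.40
print(f"multi-agent run:  ${multi_cost:.2f}")   # $6.00

# Breadth-first pays off only if parallelism adds more than ~$5.60 of
# task value per run, or lifts success enough to avoid costlier retries
# and human escalation.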
Human-AI Collaboration Beats Full Agent Autonomy
Hallucinations and tool misuse create consistent failure modes that human oversight prevents.
Production systems require patterns for human-agent collaboration, not just better autonomous reasoning.
Quiet Tech That Compounds
The infrastructure layer is maturing. Not the stuff that makes headlines. The plumbing that determines whether your agent works when a customer hits it at 3am.
Standardized observability that prevents vendor lock-in. Caching layers that shave milliseconds off every request. Testing frameworks that catch failures before production. This is infrastructure that separates demos from systems you can build SLAs on.
The market chases model launches and flashy agent demos. Meanwhile, something more durable is being built. Here's what matters if you're trying to ship something that actually runs.
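The caching piece is the least glamorous and the easiest to sketch. Here is a toy version using only the standard library; a production layer would add input normalization, TTLs, and shared storage, and expensive_web_fetch is hypothetical.

from functools import lru_cache

def expensive_web_fetch(query: str) -> str:
    """Stand-in for a slow call an agent would otherwise repeat every run."""
    return f"result for {query}"

# Memoize deterministic sub-requests (tool lookups, retrieval hits) so
# repeated agent steps skip the network. Keyed on the exact query string;
# real layers normalize inputs and expire entries.
@lru_cache(maxsize=4096)
def cached_lookup(query: str) -> str:
    return expensive_web_fetch(query)

Milliseconds per request, but agents repeat the same sub-requests hundreds of times per run. That's where it compounds.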