Market Pulse
Reading the agent ecosystem through a practitioner's lens

When Agents Ask Permission

Your AI agent wants to access your banking portal. Chrome pauses, waiting for approval. Behind that single moment sits an elaborate architecture you never see: observer models monitoring behavior, consent mechanisms routing decisions, boundaries distinguishing where agents can learn from where they can act.
The pause feels like friction. But operating web agents across thousands of sites for enterprises has taught us what that friction actually represents. Some architectures make consent decisions visible; others trust agents to navigate freely through sensitive operations. The technical capability exists in both approaches. What differs is the invisible infrastructure work that determines whether organizations can delegate with confidence, or whether they're just watching demos that can't scale.
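
What does that boundary look like in code? A minimal sketch, assuming a simple action descriptor and a human-approval hook; the `Action` shape, the domain list, and the `request_approval` function are illustrative stand-ins, not any particular browser's API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    domain: str      # e.g. "bank.example.com"
    operation: str   # e.g. "read", "submit_form", "transfer_funds"

# Assumed policy: domains where the agent may observe ("learn")
# but must not act without an explicit human decision.
SENSITIVE_DOMAINS = {"bank.example.com", "payroll.example.com"}

def request_approval(action: Action) -> bool:
    """Stand-in for the UI pause: route the decision to a human."""
    answer = input(f"Allow {action.operation} on {action.domain}? [y/N] ")
    return answer.strip().lower() == "y"

def execute(action: Action) -> None:
    # The learn/act boundary: sensitive writes are gated,
    # reads and non-sensitive domains proceed autonomously.
    if action.domain in SENSITIVE_DOMAINS and action.operation != "read":
        if not request_approval(action):
            raise PermissionError(f"user declined {action.operation}")
    print(f"executing {action.operation} on {action.domain}")
```
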
Where This Goes
We're watching something shift in how teams architect their agent systems. The planning logic that used to live in orchestration code is migrating into foundation models themselves. Gemini 2.0 ships with "native tool use." OpenAI's o3 emphasizes reasoning baked into the model. Nvidia's Nemotron 3 optimizes specifically for agentic workflows.
Running millions of browser sessions daily, we see teams wrestling less with "how do I teach this model to plan?" and more with "how do I coordinate models that already plan?" The orchestration layer isn't disappearing. It's changing jobs. Less prompt engineering, more traffic control.
This matters because reliability questions transform. When reasoning lived in your code, you debugged your logic. When it lives in the model, you're evaluating whether the model's native planning matches your requirements. Different problem entirely. The next six months will separate teams who grasp this from teams still fighting the old battle.
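Here's a rough sketch of the "traffic control" framing, assuming a fleet of models that already plan natively; the model names, cost figures, and complexity threshold below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    plans_natively: bool    # planning baked into the model itself
    cost_per_call: float    # illustrative relative cost

# Hypothetical fleet: the orchestrator no longer writes the plan,
# it decides which planner gets the traffic.
FLEET = [
    Model("fast-planner", plans_natively=True, cost_per_call=1.0),
    Model("deep-planner", plans_natively=True, cost_per_call=5.0),
]

def route(task_complexity: float) -> Model:
    """Traffic control, not prompt engineering: send the task to the
    cheapest model whose native planning is likely adequate."""
    cheap, deep = sorted(FLEET, key=lambda m: m.cost_per_call)
    return deep if task_complexity > 0.7 else cheap  # assumed threshold
```
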
From the Labs
When Adding Agents Tanks Your Performance
You can finally predict when coordination helps versus when it just burns tokens.
Web navigation gains from decentralized coordination while tool-heavy workflows suffer under budget constraints.
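The paper's actual predictor isn't reproduced here, but the shape of the question is easy to sketch: one more agent pays off only when its marginal gain, priced in tokens, beats its coordination cost within the remaining budget. Every number below is a placeholder.

```python
def coordination_pays_off(
    marginal_gain: float,    # accuracy lift from adding one agent (0..1)
    tokens_per_agent: int,   # coordination overhead of that agent
    token_budget: int,       # tokens left for the task
    value_per_point: float,  # worth of one accuracy point, in tokens
) -> bool:
    """Crude utility test for adding an agent to the system."""
    if tokens_per_agent > token_budget:
        return False  # budget-constrained: coordination just burns tokens
    return marginal_gain * value_per_point > tokens_per_agent

# Web-navigation-like case: cheap coordination, real gains.
print(coordination_pays_off(0.08, 1_500, 50_000, 40_000))  # True
# Tool-heavy case under a tight budget: gains don't cover overhead.
print(coordination_pays_off(0.02, 6_000, 8_000, 40_000))   # False
```
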
The Math Behind Smaller Agent Models
Replace 40-70% of current LLM calls with specialized SLMs without losing performance.
The paper provides a six-phase algorithm for transforming LLM systems into cost-efficient SLM architectures.
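The six phases themselves live in the paper; the end state they aim for looks roughly like the router below, where narrow, repetitive calls go to a specialized SLM and everything else falls back to the LLM. The task classifier and both model stubs are hypothetical.

```python
NARROW_TASKS = ("extract", "classify", "format", "summarize")

def call_slm(task: str) -> str:
    return f"[slm] {task}"   # placeholder for a fine-tuned small model

def call_llm(task: str) -> str:
    return f"[llm] {task}"   # placeholder for the frontier-model fallback

def slm_can_handle(task: str) -> bool:
    """Stand-in for a learned task classifier; a real conversion would
    derive this from logged LLM calls, not keyword prefixes."""
    return task.split(":", 1)[0] in NARROW_TASKS

def dispatch(task: str) -> str:
    # Route the narrow calls (the 40-70% the paper targets) to the SLM.
    return call_slm(task) if slm_can_handle(task) else call_llm(task)

print(dispatch("classify: is this invoice overdue?"))   # -> [slm] ...
print(dispatch("plan a multi-step refund workflow"))    # -> [llm] ...
```
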
A Taxonomy for Agent Memory Systems
Memory enables long-horizon reasoning, and this framework helps you match architecture to use case.
Memory automation, RL integration, multimodal memory, and trustworthiness remain open research frontiers.
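One way to read "match architecture to use case" in code: a common interface with interchangeable designs behind it. A minimal sketch with two illustrative variants, a short-horizon scratchpad and a persistent store; the class names are ours, not the paper's.

```python
from abc import ABC, abstractmethod

class AgentMemory(ABC):
    @abstractmethod
    def write(self, item: str) -> None: ...
    @abstractmethod
    def recall(self, query: str, k: int = 3) -> list[str]: ...

class ScratchpadMemory(AgentMemory):
    """Short-horizon working memory: bounded, recency-based, per-task."""
    def __init__(self, capacity: int = 20):
        self.items: list[str] = []
        self.capacity = capacity
    def write(self, item: str) -> None:
        self.items = (self.items + [item])[-self.capacity:]
    def recall(self, query: str, k: int = 3) -> list[str]:
        return self.items[-k:]   # recency, not relevance

class PersistentMemory(AgentMemory):
    """Long-horizon store that survives across sessions; a naive substring
    match stands in here for embedding-based retrieval."""
    def __init__(self):
        self.items: list[str] = []
    def write(self, item: str) -> None:
        self.items.append(item)
    def recall(self, query: str, k: int = 3) -> list[str]:
        return [i for i in self.items if query.lower() in i.lower()][:k]
```
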
Network Structure Creates Agent Behavior
"Bridges" integrate information slowly while "Loners" show instability from weak signals.
Fewer connections reduce communication overhead, which matters for distributed web automation that depends on selective coordination.
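That overhead claim is easy to make concrete: channels grow quadratically in a full mesh but only linearly in sparse topologies. A quick count, independent of the paper's specific setups:

```python
def channels(n_agents: int, topology: str) -> int:
    """Communication channels to maintain under a given topology."""
    if topology == "full_mesh":
        return n_agents * (n_agents - 1) // 2   # everyone talks to everyone
    if topology == "star":
        return n_agents - 1                     # one hub, a Bridge-like role
    if topology == "ring":
        return n_agents                         # neighbors only
    raise ValueError(topology)

for n in (5, 20, 100):
    print(n, channels(n, "full_mesh"), channels(n, "star"), channels(n, "ring"))
# 5 -> 10 / 4 / 5;  20 -> 190 / 19 / 20;  100 -> 4950 / 99 / 100
```
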
What We're Reading





