The capability shift is real: entire codebases in a single inference pass, multi-year conversation histories, complete documentation sets held in memory. But holding everything creates a different class of problem.
At scale, the pattern becomes visible:
| Condition | Result | Pattern |
|---|---|---|
| 128K tokens | 77% accuracy | Strong performance |
| 1M tokens | 26.3% accuracy | Sudden collapse |
| ~130K tokens | Reliability failure | Claimed 200K windows break |
| Middle-of-context content | 20-25% accuracy variance | "Lost in the Middle" phenomenon |
Models claiming 200K tokens typically become unreliable around 130K: not a gradual decline, but sudden failure. Gemini 3 Pro scores 77% accuracy at 128,000 tokens but collapses to 26.3% at the full million-token mark. The "Lost in the Middle" phenomenon shows 20-25% accuracy variance based purely on where information sits. Beginning and end perform well. Middle content gets lost.
Structurally, this is the garbage collection problem. John McCarthy invented it in 1959 for Lisp; modern sub-millisecond collectors emerged approximately 65 years later. We're at the reference-counting stage for LLM context management.
Current agent memory systems (Mem0, Zep, Letta) all emerged in 2025-2026 as attempted solutions. Mem0 achieves 91% lower latency than full-context approaches but relies on simple heuristics: priority scoring and time-based decay. Zep's temporal knowledge graph takes several hours to build for a single agent. These are early implementations of ideas that took decades to mature in traditional computing.
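Those heuristics are simple enough to sketch. A minimal illustration of priority scoring with exponential time-based decay; this is not Mem0's actual code, and the function names and one-week half-life are assumptions:

```python
import math
import time

def memory_priority(base_score: float, created_at: float,
                    half_life_s: float = 7 * 24 * 3600) -> float:
    """Decayed priority: a static importance score multiplied by
    exponential time decay with a configurable half-life (seconds)."""
    age = time.time() - created_at
    return base_score * math.exp(-math.log(2) * age / half_life_s)

def evict(memories: list[dict], k: int) -> list[dict]:
    """Keep only the top-k memories by decayed priority."""
    ranked = sorted(memories,
                    key=lambda m: memory_priority(m["score"], m["created_at"]),
                    reverse=True)
    return ranked[:k]
```

With equal base scores, a fresh memory outranks a month-old one by a factor of roughly 2^(30/7); the half-life is the only knob, which is exactly why such heuristics struggle to capture task-specific relevance.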
For a 100-developer team, million-token approaches cost $240,000 annually versus $48,000 for optimized retrieval. KV cache memory grows linearly with sequence length: a 128K context window for Llama 3 70B requires roughly 40GB per user. When context costs scale linearly, "just hold everything" becomes economically impossible.
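The roughly-40GB figure can be reproduced from the cache's shape. A sketch, assuming Llama 3 70B's published configuration (80 layers, 8 grouped-query KV heads, head dimension 128) and fp16 cache entries:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-user KV cache size: one K and one V tensor per layer,
    each of shape (kv_heads, seq_len, head_dim)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3 70B at a 128K-token context, fp16:
gib = kv_cache_bytes(80, 8, 128, 128 * 1024) / 2**30
print(f"{gib:.0f} GiB")  # prints "40 GiB"
```

The linearity in `seq_len` is the whole story: doubling the context doubles the per-user memory bill, with no amortization across users.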
Cache invalidation—famously one of the hardest problems in computer science—surfaces differently here. When does remembered information stop being relevant? A competitive intelligence system holds six months of market data. Pricing strategies that worked during supply constraints persist in context, applied to current conditions where they no longer make sense. The system doesn't know the context is stale. It just knows it's there.
Production failures reveal contamination. The repricing algorithm generates recommendations that seem plausible until someone checks the underlying logic and finds it referencing outdated assumptions buried deep in the context. Research on the Agent Cognitive Compressor found that transcript replay keeps inflating context length; validated facts lose salience as the transcript grows, and repeated re-exposure to stale material enables drift.
The solutions we have—systems that achieve sub-second retrieval, that reduce token usage by 90%, that demonstrate measurable accuracy gains—are early-stage approaches to problems that took computer science decades to solve for traditional memory management. Million-token contexts create memory management challenges we're just beginning to understand. Complete memory obscures what matters.
Things to follow up on...
- Inference cost collapse: LLM inference pricing has declined roughly 10× annually since 2021, with equivalent performance costing $60 per million tokens in 2021 versus $0.06 in 2025, fundamentally changing the economics of context-heavy applications.
- Memory graph construction bottleneck: Zep's temporal knowledge graph architecture demonstrates the production challenges of advanced memory systems; graph construction takes several hours for a single agent despite offering 18.5% accuracy improvements over baseline retrieval.
- The effective context ceiling: research on Maximum Effective Context Window reveals that models claiming 200K tokens typically become unreliable around 130K, with sudden performance drops rather than gradual degradation; the pattern suggests fundamental architectural limits rather than implementation issues.
- Bounded memory control experiments: Agent Cognitive Compressor research shows that maintaining compact, structured, decision-critical variables yields bounded memory growth with substantially lower hallucination rates compared to transcript replay approaches that let context expand indefinitely.
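The bounded-versus-unbounded contrast is simple to sketch. A toy comparison of the two policies; the `DECISION_KEYS` schema is an assumption for illustration, not the paper's implementation:

```python
# Policy A: transcript replay -- context grows with every event.
transcript: list[str] = []

def remember_replay(event: str) -> int:
    transcript.append(event)
    return len(transcript)  # unbounded growth

# Policy B: bounded compressor -- only decision-critical variables
# persist, each overwriting its previous value, so memory size is
# fixed by the schema regardless of how many events occur.
state: dict[str, str] = {}
DECISION_KEYS = {"goal", "budget", "deadline", "last_error"}  # assumed schema

def remember_compressed(key: str, value: str) -> int:
    if key in DECISION_KEYS:
        state[key] = value  # overwrite, don't append
    return len(state)       # bounded by len(DECISION_KEYS)
```

After a thousand events, policy A holds a thousand entries while policy B holds at most four; the trade-off is that anything outside the schema is forgotten, which is exactly the compression bet.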

