Context windows fill up. Not eventually—predictably. An agent conversation runs long enough, and you hit the wall where the model can't hold everything anymore. When context overflows, you compress it, store it externally, or structure it as relationships.
Compression: Losing What You Can't Keep
When context fills up, something has to give. Compression takes the conversation history and condenses it. Factory's structured summarization breaks conversations into explicit sections: session intent, file modifications, decisions made, next steps. The structure acts like a checklist, forcing the summarizer to preserve specific categories rather than making arbitrary choices about what matters.
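Factory's exact schema isn't public; a minimal sketch of the section-checklist idea, with illustrative section names, might look like:

```python
# Sketch of section-based compression: the summarizer must fill every
# category, so details like file paths are less likely to be dropped.
# Section names are illustrative, not Factory's actual schema.

SECTIONS = ["session_intent", "file_modifications", "decisions_made", "next_steps"]

def build_summary_prompt(history: str) -> str:
    """Build a summarization prompt that demands one entry per section."""
    checklist = "\n".join(f"- {s}: <fill in, or 'none'>" for s in SECTIONS)
    return (
        "Condense the conversation below. You MUST fill every section; "
        "preserve exact file paths and identifiers verbatim.\n\n"
        f"{checklist}\n\nConversation:\n{history}"
    )

prompt = build_summary_prompt("user: refactor auth.py to use token rotation ...")
```

The checklist acts as the forcing function: a generic "summarize this" prompt lets the model decide what matters, while the fixed sections make omissions visible.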
| Approach | Quality Score | Artifact Tracking |
|---|---|---|
| Structured summarization | 3.70 / 5.0 | 2.19-2.45 / 5.0 |
| Generic compression | 3.35 / 5.0 | 2.19-2.45 / 5.0 |
Both approaches struggle with artifact tracking. The system remembers the conversation happened and roughly what was decided. Specific details—which file was modified, what approach was already tried—those get lost in compression.
From an information-theoretic perspective, a file path carries little entropy, yet the agent still needs it to keep working. Compression discards exactly these details, which means you need a way to retrieve them later.
Retrieval: Finding What You Stored
Storing information externally means retrieval becomes your problem. The straightforward approach treats memory as search: convert conversations to embeddings, store them in a vector database, search semantically when you need context. Mem0 achieves 66.9% accuracy with 0.20-second median search latency.
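Mem0's internals aren't reproduced here; a toy sketch of the store-then-search pattern, with a word-count vector standing in for a learned embedding, looks like:

```python
import math
from collections import Counter

# Toy semantic-memory store. Real systems use learned embeddings and an
# ANN index; a bag-of-words vector stands in here for illustration.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    def __init__(self):
        self.items: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]

mem = VectorMemory()
mem.store("user likes coffee from the corner shop")
mem.store("deploy script lives in scripts/deploy.sh")
print(mem.search("what coffee does the user like"))
```

Note what the store returns: the single most similar text chunk. Nothing in the structure records that the coffee preference, the shop, and the morning-routine mention are related facts, which is the limitation the next paragraph describes.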
Vector databases handle semantic similarity well. Explicit relationships between facts disappear into the embedding space. The system knows a user likes coffee. The preference for a specific shop, the order placed last Tuesday, the mention during a morning routine discussion—these connections vanish.
When your agent needs to reason about how facts connect, you need infrastructure that preserves connections explicitly.
Graph Storage: Preserving Connections
Graph databases preserve connections. Zep stores memory as a temporal knowledge graph, tracking how facts change over time with explicit validity intervals on every edge. When new information conflicts with existing knowledge, the system uses temporal metadata to update or invalidate outdated facts without discarding them.
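A minimal sketch of the validity-interval idea Zep describes, with illustrative field names and example data (the class and its API are hypothetical, not Zep's):

```python
from dataclasses import dataclass
from typing import Optional

# Each fact edge carries a validity interval; a conflicting update closes
# the old interval instead of deleting the fact, so history is preserved.

@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: int               # e.g. a timestamp or turn number
    valid_to: Optional[int] = None  # None means still valid

class TemporalGraph:
    def __init__(self):
        self.edges: list[Edge] = []

    def assert_fact(self, subject: str, predicate: str, obj: str, at: int) -> None:
        # Invalidate any currently-valid edge for the same (subject, predicate).
        for e in self.edges:
            if e.subject == subject and e.predicate == predicate and e.valid_to is None:
                e.valid_to = at
        self.edges.append(Edge(subject, predicate, obj, at))

    def current(self, subject: str, predicate: str) -> Optional[str]:
        for e in self.edges:
            if e.subject == subject and e.predicate == predicate and e.valid_to is None:
                return e.obj
        return None

g = TemporalGraph()
g.assert_fact("user", "favorite_shop", "Blue Bottle", at=1)
g.assert_fact("user", "favorite_shop", "Corner Cafe", at=5)
print(g.current("user", "favorite_shop"))  # the older fact remains, closed at t=5
```

The outdated edge is still queryable with its interval, which is what enables "what did the user prefer as of turn 3?" style reasoning that a plain key-value overwrite destroys.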
Graph query performance can degrade with deeply nested relationship traversals at scale. You're maintaining more infrastructure, handling more failure modes, making more decisions about what relationships to track and how to structure them. Practitioners note that "most straightforward agent use cases do not require" this level of sophistication.
Each architectural component you add increases operational overhead. Preserving relationships means running a graph database alongside your agent infrastructure.
Token Economics Shape Decisions
Token costs create pressure that architectural diagrams don't capture. GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens, so output tokens cost 4× as much as input tokens. For a typical chatbot generating twice as much output as input, real costs run 9× higher than the advertised input price.
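The 9× figure follows directly from the list prices and the 2:1 output-to-input ratio:

```python
# Worked example: GPT-4o list prices with twice as much output as input.

PRICE_IN = 2.50 / 1_000_000    # $ per input token
PRICE_OUT = 10.00 / 1_000_000  # $ per output token

def cost(input_tokens: int, output_ratio: float = 2.0) -> float:
    """Total request cost given an output-to-input token ratio."""
    output_tokens = input_tokens * output_ratio
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

advertised = 1_000_000 * PRICE_IN   # $2.50: the headline input price
actual = cost(1_000_000)            # $2.50 input + $20.00 output = $22.50
print(round(actual / advertised, 6))  # -> 9.0
```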
| Approach | Accuracy | Median Latency | Token Usage |
|---|---|---|---|
| Full-context baseline | 72.9% | 9.87s | 100% (baseline) |
| Mem0 (vector store) | 66.9% | 1.44s | 10% of baseline |
Mem0 reduces latency and token usage dramatically. You're now running a vector database, managing embeddings, and accepting that some context won't be retrieved when you need it.
One startup spending $3,000 monthly on GPT-4 discovered they could run the same workload on GPT-4o Mini for $150—a 95% cost reduction. That optimization only works if your use case tolerates a smaller model. When it doesn't, you're choosing between full-context approaches that burn tokens on every turn or memory systems that add infrastructure complexity.
Measuring tokens per request misses the full picture. Tokens per task includes the cost of re-fetching information when compression loses critical details or retrieval misses relevant context.
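Under illustrative numbers (a hypothetical 100k-token full-context turn versus a 10k-token memory-augmented turn; none of these figures come from the benchmarks above), the two accounting views diverge like this:

```python
# "Tokens per task" charges retries and re-fetches to the task,
# not the individual request. All numbers here are illustrative.

def tokens_per_task(tokens_per_request: int, requests: int) -> int:
    return tokens_per_request * requests

full_context = tokens_per_task(100_000, requests=1)   # looks expensive per request
with_memory = tokens_per_task(10_000, requests=1)     # looks 10x cheaper...

# ...until a retrieval miss forces two extra turns to re-fetch context.
with_memory_after_miss = tokens_per_task(10_000, requests=3)

print(full_context, with_memory, with_memory_after_miss)
```

Per-request accounting reports the 10× saving every time; per-task accounting shows the saving shrinking toward 3× whenever retrieval misses force rework.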
Matching Constraints to Choices
Which constraint you're optimizing for determines which trade-offs you can tolerate:
- Vector stores deliver sub-second latency at scale when retrieval speed matters most and you can accept occasional relationship gaps.
- Graph databases excel at multi-hop reasoning when preserving how facts connect over time justifies implementation complexity.
- Structured summarization outperforms generic approaches when token costs force compression and you can handle information loss.
Context windows fill up. The decisions that follow depend on what you're willing to give up.
Things to follow up on...
- Hybrid architecture patterns: The MAGMA framework combines relation graphs with vector databases through a dual-stream mechanism, achieving 0.650 judge scores in temporal reasoning tasks compared to 0.422-0.649 for vector-only approaches.
- Observational memory approaches: Mastra's continuously evolving text representation achieves 84.2% accuracy with GPT-4o and 94.9% with GPT-5-mini on LongMemEval benchmarks while compressing contexts up to 40× without traditional retrieval calls.
- Testing memory-dependent behavior: Only 52% of developers currently use performance testing for agents, but Ramp's "crawl, walk, run" evaluation strategy turns every user-reported failure into a regression test case for their policy agent handling 65%+ of expense approvals autonomously.
- Context window effectiveness gaps: Research shows the Maximum Effective Context Window differs drastically from advertised capacity, with models falling short by over 99% and some failing with as few as 100 tokens despite claiming million-token windows.

