Pali Vega is not a real person, though her problems certainly are. She's a composite, assembled from practitioner forums, operational postmortems, and the specific weariness of people who've had to explain, more than once, why a dashboard full of green lights means almost nothing. Her title is Agent Operations Manager. Her actual job is keeping production agents working long after the team that built them has moved on to something shinier.
Before agent ops, she was a QA engineer. The manual kind, back when that meant something. She says the skills transferred perfectly. We talked on a Wednesday afternoon, which she described as "the only day nothing has catastrophically broken yet this week, so I'm basically waiting."
You came from QA. How did you end up doing this?
Pali: QA got automated. Everyone knows that story. What nobody mentions is that I'm doing the exact same job now. Looking at output, asking "is this right, or does it just look right?" Only now the thing producing the output argues with me. Politely. With citations.
I got pulled into agent ops because a support routing agent started misrouting tickets about six weeks after deployment. Not dramatically. Two percentage points. Nobody noticed because the agent was completing every ticket, green across the board. I noticed because I was still reviewing escalations from the old queue out of habit. Turned out a product update had shifted the mix of incoming queries, and the agent started sending billing questions to the technical team.[1] The model hadn't changed. The world had.
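For the curious: the kind of check that catches this is a distribution comparison on the agent's own routing decisions, not on its completion rate. A minimal sketch in Python; the categories and the threshold are illustrative, not from Pali's stack.

```python
from collections import Counter

def intent_mix(labels):
    """Collapse a batch of predicted intents into a distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(baseline, current):
    """Total variation distance between two categorical distributions."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)

# Baseline: the mix observed shortly after deployment.
baseline = intent_mix(["billing", "technical", "billing", "account", "technical"])
# Current: this week's tickets, after a product update shifted the mix.
current = intent_mix(["technical", "technical", "billing", "technical", "account"])

# Completion rate stays at 100% through all of this. The mix is the signal.
if total_variation(baseline, current) > 0.02:
    print("intent mix drifted; review routing before the metrics catch up")
```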
Walk me through a failure you had to dig out recently.
Pali: Three weeks ago. We have an agent monitoring competitor pricing across about forty product pages. Task completion: 100%. Every run finishes. Gorgeous logs.
But I'm scanning the weekly summary and the prices are... calm. Suspiciously calm. Normally there's variance. Someone runs a sale, someone adjusts for a holiday. This was flat. Same numbers, tiny fluctuations, nothing interesting.
So I pull the traces. The agent navigates to each page, extracts data, returns structured output. No errors. No retries. Perfect. And that's what got me. No retries usually means the page loaded clean and the selectors hit first pass. But two of these sites redesigned their product pages in January.[2] New layout, new components. The old selectors shouldn't have worked at all.
What happened: the agent was pulling data from a cached version of the page. Some CDN edge case where a request flagged as bot traffic got served stale content. The prices were real. They were just three weeks old. And the agent had absolutely no way to know that, because nothing errored. It got data. It returned data. Job done.
I spent four hours finding that. The fix took twenty minutes.
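Her twenty-minute fix is easy to approximate: check how old the response actually is before trusting it. A hedged sketch, assuming the monitoring code already uses requests; the day-old threshold and the URL are illustrative.

```python
import email.utils
import time

import requests  # assumed to be in the stack already

def content_age_seconds(url):
    """Estimate how old the returned content is, via CDN cache headers.
    A response can be a clean 200 and still be weeks-old cache."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Age: seconds the response has sat in a cache, when the CDN reports it.
    age = int(resp.headers.get("Age", 0))
    # Date: when the response was originally generated; caches preserve it.
    date_hdr = resp.headers.get("Date")
    if date_hdr:
        generated = email.utils.parsedate_to_datetime(date_hdr).timestamp()
        age = max(age, int(time.time() - generated))
    return age

# Pricing data older than a day is suspect, whatever the status code says.
if content_age_seconds("https://example.com/competitor/product") > 86400:
    raise RuntimeError("stale content: the request succeeded, the data did not")
```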
What's "prompt gardening"?
Pali: It's what I call the thing nobody wants to admit is a full-time job.
You deploy an agent with a prompt. Works great. Then a ticket comes in: weird answer on an edge case. You add a clarifying sentence. Two weeks later, different edge case, you tweak an example. A month after that, product asks the agent to handle a slightly adjacent task, so you adjust the instructions.
None of these changes are wrong individually. But six months in, the prompt is this archaeological layer cake of patches, and nobody remembers why paragraph three says "Do not include shipping estimates unless explicitly asked." Was that a bug fix? A product decision? Someone's personal preference from a Tuesday afternoon?
And the compound math is merciless. Twenty steps in a workflow, each 95% reliable, your end-to-end success rate is 36%.[3] A small prompt edit that drops one step from 95% to 93% won't show up in isolation. It cascades. And without version control on the prompts, which most teams simply don't have,[4] you can't even reconstruct what changed when the numbers start sliding.
I keep a changelog. In a spreadsheet. It is not glamorous.
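The arithmetic deserves a beat. Run the numbers she's describing and the invisibility of the slide becomes obvious; the values below are the ones from her example.

```python
steps = 20
baseline = 0.95 ** steps                   # every step at 95% reliability
degraded = 0.95 ** (steps - 1) * 0.93      # one step quietly slips to 93%

print(f"all steps at 95%:       {baseline:.1%}")   # ~35.8%, the 36% figure
print(f"one step drops to 93%:  {degraded:.1%}")   # ~35.1%
# A two-point drop on a single step costs roughly 0.7 points end to end:
# too small to trip a per-step alert, and without a changelog, untraceable.
```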
What do you check that no tool checks for you?
Pali: Tone.
I'm completely serious. When an upstream model provider pushes a quiet update, on their own schedule, without telling anyone,[5] the first thing I notice isn't accuracy. It's the character of the output shifting. More verbose. Slightly different cadence. Uses "certainly" instead of "sure." You know when someone you see every day gets a haircut and you can tell something changed but can't immediately say what? That.
It's a real signal. It means the model weights shifted, which means every prompt tuned to the old behavior is now slightly off. No alert fires for "the agent sounds different." But I notice, and then I go check whether the task metrics moved too. Usually they have. Just not enough to cross a threshold yet.
The LangChain survey found 89% of organizations have observability for their agents, but only 52% have evaluations.[6] That gap is my entire job. I can see everything. I can prove almost nothing.
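Her "haircut" test can be roughed out in code, too. This is a hypothetical heuristic, not a tool she named: fingerprint a few style markers per batch of outputs and flag when they move against a baseline.

```python
from collections import Counter

MARKERS = ("certainly", "sure", "however")  # illustrative style markers

def style_fingerprint(outputs):
    """Per-batch rates of a few marker words, plus average output length."""
    words = [w.lower().strip(".,!?") for text in outputs for w in text.split()]
    total = max(len(words), 1)
    counts = Counter(words)
    fp = {m: counts[m] / total for m in MARKERS}
    fp["avg_words"] = total / max(len(outputs), 1)
    return fp

def sounds_different(baseline, current, tol=0.5):
    """True if any tracked rate moved more than tol, relative to baseline."""
    return any(
        b > 0 and abs(current.get(k, 0.0) - b) / b > tol
        for k, b in baseline.items()
    )
```

None of this proves drift; it tells you when to go pull the task metrics, which is exactly the workflow she describes.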
What happens when you go on vacation?
Pali: Last August I took a week off. Came back to three agents that had degraded. One was returning partial data. One had shifted its classification behavior after a model update on Thursday. One was hitting a rate limit and retrying itself into a loop that looked, from the outside, like normal latency.
All three dashboards were green the entire time.
My manager asked if I'd had a nice trip.
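The rate-limit loop from that August is the most fixable of the three. A bounded retry, sketched minimally; RateLimitError here stands in for whatever the client library actually raises on HTTP 429.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever your API client raises on HTTP 429."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Exponential backoff with jitter, and, critically, a cap.
    An uncapped retry loop on a rate limit looks like ordinary latency
    from the outside. That is how it stays green for a week."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    # Surfacing the failure is the point: a loud error beats a quiet loop.
    raise RuntimeError(f"rate-limited {max_attempts} times; escalate to a human")
```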
Footnotes
1. Intent distribution shift after product changes is a documented failure pattern in production agent systems. See Sama, "Model Maintenance: Monitoring, Drift, and Continuous Improvement," Feb 2026. https://www.sama.com/blog/model-maintenance-guide
2. Hard-coded selectors breaking after site redesigns is one of the most common web agent failure modes. Practitioners on the Latenode community forums have documented this cycle extensively. https://community.latenode.com/t/why-do-my-browser-automation-scripts-need-a-complete-rewrite-after-every-site-redesign/60820
3. The compound reliability calculation (0.95^20 ≈ 0.36) is well-established in multi-step workflow analysis. See Zylos AI Research, "AI Agent Context Compression," Feb 2026. https://zylos.ai/research/2026-02-28-ai-agent-context-compression-strategies
4. Preventing prompt drift requires version control and continuous monitoring. See Maxim AI, "A Comprehensive Guide to Preventing AI Agent Drift Over Time," Nov 2025. https://www.getmaxim.ai/articles/a-comprehensive-guide-to-preventing-ai-agent-drift-over-time/
5. AI agent drift from model updates, data distribution changes, and prompt variations is a documented production challenge. See Maxim AI, "Preventing AI Agent Drift," Nov 2025. https://www.getmaxim.ai/articles/a-comprehensive-guide-to-preventing-ai-agent-drift-over-time/
6. LangChain, "State of AI Agents" survey, Nov–Dec 2025 (n = 1,300+). The observability-evaluation gap (89% vs. 52%) indicates most teams can monitor agent behavior but cannot systematically validate correctness.
