Tomás Ferreira-Wax is VP of Engineering at a Series D HR tech company with roughly 500 employees. The kind of company whose name you'd recognize if you've ever been onboarded at a mid-size firm and thought, "huh, this is smoother than expected." Six months ago, his team shipped agent capabilities into their core product: an AI system that orchestrates the dozens of small, tedious steps involved in bringing a new employee into an organization. Document collection, system provisioning, scheduling, compliance checks. The stuff that, when it breaks, a real human notices on their first day at a new job.
We should note: Tomás is a composite, a hypothetical character built from observable data, practitioner accounts, and production realities documented across multiple 2025–2026 industry studies. He is not a real person. But the problems he describes are.
We spoke over video. He was in a conference room with a whiteboard behind him covered in what appeared to be a flowchart with several branches ending in question marks.
Six months in. How's it going?
Tomás: You know that phase of a home renovation where the contractor says "we're 90% done" and then you live in that last 10% for longer than the first 90% took? That's where we are. We shipped. The launch blog post was great. My mom shared it on Facebook. And then the next Monday, the support queue was... educational.
Educational how?
Tomás: Nobody warns you about this. When you build deterministic software, a bug is a bug. You reproduce it, you fix it, you write a test, you move on. When you build agent-driven workflows, failures don't arrive as bugs. They arrive as vibes. A customer says, "the agent skipped a step." You go look at the logs and the agent didn't skip a step. It decided, based on context, that the step wasn't necessary. And it was wrong. But it was wrong in a way that felt like a judgment call, not a crash.[1]
The first month, we treated these like edge cases. By month three, we understood: the edge cases are the product.
What's the most common failure mode?
Tomás: Confident wrongness. The agent never says "I'm not sure about this." It just... does the thing. Slightly wrong. And because onboarding involves real humans on their first day at a new job, "slightly wrong" lands differently than it does in data reconciliation. Someone shows up and their laptop isn't provisioned because the agent decided — decided! — that the IT ticket had already been handled based on an ambiguous status field.
The Cleanlab study keeps haunting me: even among the small number of teams with agents actually live in production, most still can't reliably tell when their agents are right, wrong, or uncertain.[2] I read that and thought, okay, at least we're not uniquely bad at this.
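What catching that failure mode could look like is concrete enough to sketch. Below is a minimal, hypothetical uncertainty gate: every agent action carries an explicit confidence score, and anything under a floor gets escalated to a human queue instead of executing. The names (`ProposedAction`, `execute_or_escalate`, the 0.85 floor) are invented for illustration, not Tomás's actual system.

```python
# Minimal sketch of an uncertainty gate; hypothetical names throughout.
# Assumes the agent (or a verifier model) can attach a confidence score
# to each proposed action before it executes.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    name: str          # e.g. "skip_it_ticket"
    rationale: str     # the agent's stated reason for the action
    confidence: float  # 0.0-1.0, self-reported or from a verifier

CONFIDENCE_FLOOR = 0.85  # below this, a human reviews instead of the agent acting

def execute_or_escalate(action: ProposedAction) -> str:
    """Act only when the agent is confident; otherwise queue for review."""
    if action.confidence < CONFIDENCE_FLOOR:
        return f"ESCALATED to human review: {action.name} ({action.rationale})"
    return f"EXECUTED: {action.name}"

# The ambiguous-status-field failure from the interview, made explicit:
print(execute_or_escalate(ProposedAction(
    name="skip_it_ticket",
    rationale="status field reads 'resolved'; assuming laptop already provisioned",
    confidence=0.62,  # ambiguous evidence should score low, not trigger action
)))
```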
You have observability tooling, though?
Tomás: Oh, we have beautiful observability. We can trace every step, every tool call, every reasoning chain. Ninety-four percent of production teams have some form of observability in place.[3] We are proudly in that ninety-four percent.
What we don't have is systematic evaluation. We can see what the agent did. We cannot systematically answer whether it did it well enough. Only about half of teams have formal evals, even though almost everyone has observability.[3]
We're watching the movie. We just haven't agreed on what a good movie looks like.
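The distinction Tomás is drawing, between watching and judging, is concrete enough to sketch. Below is a minimal eval harness with invented names (`OnboardingCase`, `run_agent`): a fixed case set, an explicit definition of "good enough," and a single pass rate you can track across model versions.

```python
# A minimal eval harness sketch; hypothetical names, but the shape is the point.
# Observability tells you what the agent did; this defines "did it well enough".
from dataclasses import dataclass
from typing import Callable

@dataclass
class OnboardingCase:
    name: str
    inputs: dict              # the scenario fed to the agent
    required_steps: set[str]  # steps that must appear in the trace

def run_agent(inputs: dict) -> set[str]:
    """Stand-in for the real agent; returns the set of steps it executed."""
    raise NotImplementedError  # wire this to your orchestration layer

def evaluate(cases: list[OnboardingCase], agent: Callable[[dict], set[str]]) -> float:
    passed = 0
    for case in cases:
        executed = agent(case.inputs)
        missing = case.required_steps - executed
        if missing:
            print(f"FAIL {case.name}: skipped {sorted(missing)}")
        else:
            passed += 1
    return passed / len(cases)

# pass_rate = evaluate(case_suite, run_agent)
# Track that number per model version; a drop is a regression, even with no "bug".
```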
That sounds like a lot of manual review.
Tomás: There's a Snowflake engineering lead who reportedly spends 20 to 30 hours a week interacting with AI agents.[4] When I read that, I didn't think "that's too much." I thought "that sounds about right." My senior engineers have shifted from writing code to orchestrating and validating. The CIO.com framing is that the engineer of 2026 spends less time writing foundational code and more time designing architecture and rigorously validating output.[5]
That's accurate. But "rigorously validating output" is a euphemism for reading logs at 11 PM because a customer's new hire didn't get their benefits enrollment email.
Is your team burning out?
Tomás: [long pause]
There's research from Harvard Business Review, cited in Fortune, showing that employees equipped with AI don't just work faster. They take on broader scope, extend into longer hours, and experience increased cognitive load from managing and correcting AI outputs.[6] I forwarded it to my leadership team with the subject line "this is us, please read."
The promise was: the agent handles the routine work, your team focuses on strategic stuff. The reality is: the agent handles the routine work mostly, and your team spends their strategic time cleaning up the mostly.
The gap between "mostly" and "reliably" is where all the human labor lives now. The tool isn't bad. The gap is just expensive in ways nobody modeled.
What about the underlying infrastructure? Is the stack stable?
Tomás: [laughs]
One practitioner in the Cleanlab study described moving from LangChain to Azure in two months, then considering moving back.[2] I felt seen. We've rebuilt our orchestration layer twice. The AI stack shifts beneath you faster than you can standardize.[2] Integration with existing systems is still the top challenge for 46% of enterprise leaders, according to Anthropic's survey.[7] For us it's not even close. Integration is the job. Everything else is a subplot.
Maya Mikhailov has this line I taped to my monitor:
"AI makes it dramatically easier to write software. It does not make it easier to run enterprise software."8
That's the whole season of television I'm living through, compressed into one sentence.
So what does success look like from where you're sitting?
Tomás: Invisibility. I want the agent to be so boring that nobody talks about it. The highest-ROI deployments right now are document processing, compliance checks, data reconciliation. The stuff nobody writes blog posts about.[2] That's what we're aiming for. Not "wow, AI!" but "huh, onboarding just... works now."
We're not there yet. But we're closer than we were three months ago. And three months ago we were closer than at launch. The trajectory is right. The slope is just gentler than anyone's pitch deck suggested.
Any advice for a VP Engineering about to ship their first agent feature?
Tomás: Budget for the phase after launch. Seriously. The launch is maybe 30% of the work. And get your governance into code, not slide decks.[9] Policies that live in documents don't constrain agent behavior at runtime. They constrain nothing. They're decoration.
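A sketch of what "governance in code" might mean in practice. The policy set and function name here are hypothetical; the point is that the rule executes in the request path, before the tool call fires, where a slide deck cannot.

```python
# Governance-as-code sketch: a policy check that runs in the request path.
# The action names and policy set are hypothetical illustrations.

FORBIDDEN_WITHOUT_HUMAN = {"delete_employee_record", "change_payroll_account"}

def enforce_policy(action: str, actor: str, human_approved: bool) -> None:
    """Raises before the tool call executes; a document-only policy cannot."""
    if action in FORBIDDEN_WITHOUT_HUMAN and not human_approved:
        raise PermissionError(f"{actor} attempted '{action}' without human approval")

enforce_policy("schedule_orientation", actor="onboarding-agent", human_approved=False)  # passes
# enforce_policy("change_payroll_account", "onboarding-agent", False)  # raises PermissionError
```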
Also, about your regression testing framework: understand that you cannot deterministically test a non-deterministic system. Every model update is a potential regression with no test suite to catch it.[10] You need continuous monitoring infrastructure from day one, not as a nice-to-have after launch.
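Because exact-output assertions don't survive model updates, that monitoring tends to look less like unit tests and more like thresholds over a rolling window. A minimal sketch, with an invented `page_oncall` hook standing in for a real paging integration:

```python
# Threshold monitoring sketch for a non-deterministic system.
# You can't assert exact outputs, but you can alert when the pass rate drifts.
from collections import deque

WINDOW = 200        # most recent eval results to consider
ALERT_BELOW = 0.95  # pass-rate floor; tune to your own tolerance

results: deque[bool] = deque(maxlen=WINDOW)

def page_oncall(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real paging integration

def record(passed: bool) -> None:
    """Feed in each eval result; alert when the rolling pass rate drifts down."""
    results.append(passed)
    if len(results) == WINDOW:
        rate = sum(results) / WINDOW
        if rate < ALERT_BELOW:
            page_oncall(f"agent pass rate {rate:.1%} over last {WINDOW} runs")
```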
And maybe don't let your mom share the launch blog post until month six. Just in case.
Footnotes
1. LangChain, State of Agent Engineering, Nov–Dec 2025. https://www.langchain.com/state-of-agent-engineering
2. Cleanlab, AI Agents in Production 2025, n=1,837. https://cleanlab.ai/ai-agents-in-production-2025/
3. LangChain, State of Agent Engineering, Nov–Dec 2025. Production teams: 94% observability adoption vs. 52% formal evals. https://www.langchain.com/state-of-agent-engineering
4. Al Jazeera, March 8, 2026, reporting on Snowflake engineering operations.
5. CIO.com, February 2026, on the evolving role of the engineer.
6. Harvard Business Review research, cited in Fortune, February 10, 2026. https://fortune.com/2026/02/10/ai-agents-anthropic-openai-arent-killing-saas-salesforce-servicenow-microsoft-workday-cant-sleep-easy/
7. Anthropic, 2026 State of AI Agents Report, 500+ technical leaders surveyed.
8. Maya Mikhailov, CEO of SAVVI AI, quoted in CIO.com, March 2026. https://www.cio.com/article/4136820/ai-agent-platforms-could-push-down-saas-license-costs-report-argues.html
9. Yi Zhou, "2025 Overpromised AI Agents. 2026 Demands Agentic Engineering," January 2, 2026.
10. Maxim AI, Ensuring AI Agent Reliability in Production, November 5, 2025. https://www.getmaxim.ai/articles/ensuring-ai-agent-reliability-in-production/
