Every enterprise leader has been asked to approve an agentic AI pilot this quarter. Most of the ones that get approved will not ship. Three questions, asked in the first 30 minutes, separate the agent that runs in production from the agent that becomes a slide.
There is a specific kind of meeting happening inside every enterprise right now. A VP walks into a room with a deck. The deck proposes an agentic AI pilot — a system that will handle tier-one support, or reconcile invoices, or triage security alerts without human oversight. The business case is plausible. The vendor has been shortlisted. The budget is already soft-allocated. The ask is a signature.
We sit in these rooms, often on the side of the person being asked to sign. And we have learned that three questions, asked in the first 30 minutes, reliably predict whether the pilot will ever reach production. None of them is technical. All of them are the sort of thing a chief of staff would ask if the chief of staff had watched five of these projects fail.
This is the one that kills pilots. Not accuracy. Not latency. Not model selection. What happens when the agent produces a confident, articulate, wrong answer — and what does the system do next?
The honest answer in most decks is a version of “we will monitor it and iterate.” That is not an answer. That is a hope. A production-grade agent has three things written down before a single line of code ships: a catalog of known failure modes, an escalation path for when the agent is confidently wrong, and a rollback procedure a human can execute in minutes.
If the team proposing the pilot cannot answer this in 30 minutes without hedging, the pilot is not ready. Not because the technology is not there. Because the operating discipline around the technology is not there. That is a different, harder problem to fix, and it does not fix itself by picking a better model.
A system you cannot audit is a system you cannot ship.
This question exposes a common mis-specification. Teams propose AI agents for tasks no one has measured. The assumption is that a human does the task, does it fine, and the AI will do it faster or cheaper. When we push on the assumption — actually, how often does a human get this right? — the answer is usually one of three things: nobody has measured it, someone offers a guess with no data behind it, or the number exists and is worse than anyone in the room assumed.
Without that baseline, any deployment is unfalsifiable. You cannot tell success from failure, and neither can the board. You are then running on vibes — and vibes do not survive contact with a difficult Q3.
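One way to make the baseline concrete is to write the definition of success as code before the pilot starts. A minimal sketch, with illustrative names and thresholds (nothing here is prescribed by any particular framework):

```python
# Hypothetical sketch: define "working" as a falsifiable check against
# a measured human baseline, written down before the pilot ships.
# Field names, sample sizes, and thresholds are illustrative.

def error_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks done incorrectly. True = error."""
    if not outcomes:
        raise ValueError("no outcomes measured -- there is no baseline")
    return sum(outcomes) / len(outcomes)

def pilot_is_working(human_errors: list[bool], agent_errors: list[bool],
                     min_samples: int = 200) -> bool:
    """Falsifiable success: the agent's measured error rate beats the
    human baseline on a non-trivial sample of real tasks."""
    if len(human_errors) < min_samples or len(agent_errors) < min_samples:
        return False  # not enough data to tell success from failure
    return error_rate(agent_errors) < error_rate(human_errors)

# Example: a 4% agent error rate against a 6% human baseline passes.
human = [True] * 12 + [False] * 188   # 6% baseline over 200 tasks
agent = [True] * 8 + [False] * 192    # 4% over 200 tasks
print(pilot_is_working(human, agent))  # prints True
```

The point of the `min_samples` guard is the unfalsifiability problem above: with no measured sample, the function refuses to declare success.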
Most discussion of agent safety is abstract. This question makes it concrete. If the agent does the wrong thing — not catastrophically wrong, just plausibly wrong — what is the worst outcome in the next 24 hours?
We score every proposed agent on a three-tier scale: tier one, where an error is internal, visible, and reversible within a day; tier two, where an error touches customers or money but can be recovered; tier three, where an error is external, contractual, or effectively irreversible.
The rule we apply: no agent ships past tier one until it has been operating at tier one for at least 90 days with a measured error rate below the human baseline. No agent moves to tier three until it has been at tier two for a minimum of two quarters. This sounds slow. It is slow. It is also the difference between an agent that is still running next year and an agent that was quietly retired after an incident nobody wants to write about.
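The gating rule is simple enough to encode as a check that runs before any promotion. A sketch with hypothetical names, assuming "two quarters" means roughly 180 days:

```python
# Hypothetical sketch of the tier-promotion gate described above.
# The record structure and field names are illustrative.
from dataclasses import dataclass

@dataclass
class AgentRecord:
    tier: int              # current operating tier (1-3)
    days_at_tier: int      # continuous days at this tier
    error_rate: float      # measured error rate at this tier
    human_baseline: float  # measured human error rate for the task

def may_promote(rec: AgentRecord) -> bool:
    """An agent moves up one tier only after proving itself: 90 days
    at tier one with an error rate below the human baseline, and a
    minimum of two quarters (~180 days) at tier two before tier three."""
    if rec.tier == 1:
        return rec.days_at_tier >= 90 and rec.error_rate < rec.human_baseline
    if rec.tier == 2:
        return rec.days_at_tier >= 180
    return False  # tier three is the ceiling; there is no tier four
```

Encoding the rule this way means a promotion is a reviewable fact about the record, not a judgment call made in the meeting where someone is eager to ship.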
We have watched enough agentic pilots ship — and fail to ship — to see the pattern. The ones that make it to production share four attributes, and they are all operational, not technical: a named owner with skin in the game, a measured human baseline, a written definition of what “working” means, and a staged rollout with gates that are actually enforced.
Ask the three questions. Listen to the answers. Then do one of three things.
Green light. The team has failure modes, a baseline, and a tier-one use case. They have a named owner with skin in the game. Approve the pilot with a 90-day gate and a written definition of what “working” means.
Yellow light. The team has one or two of the three answers but is vague on the others. Send them back for two weeks with a specific list of what is missing. Do not fund a half-specified pilot — it will burn trust on both sides.
Red light. No answers, or answers dressed up in vocabulary. Decline the pilot. Offer to help the team build the answers. If they are not willing to do the operational work, the technical work is not the thing standing between them and success.
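Read as a rubric, the three outcomes reduce to counting clean answers. A toy encoding, with hypothetical names, just to show how little ambiguity the rubric leaves:

```python
# Toy sketch of the green/yellow/red rubric above. The three booleans
# map to the three questions; names are illustrative.
def pilot_decision(has_failure_modes: bool, has_baseline: bool,
                   has_tier_one_use_case: bool) -> str:
    answers = sum([has_failure_modes, has_baseline, has_tier_one_use_case])
    if answers == 3:
        return "green: approve with a 90-day gate and a written success definition"
    if answers >= 1:
        return "yellow: send back for two weeks with a list of what is missing"
    return "red: decline; offer to help build the answers"
```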
Agentic AI is not a technology purchase. It is an operating model change. The technology is further along than most teams are ready for, and the gap between what the model can do and what your operation can absorb is where pilots die.
Three questions, 30 minutes, a lot of discipline saved downstream. If the answers come back clean, the rest is engineering. If they do not, there is no amount of engineering that fixes it.
Before you sign, run the three questions. If the answers are shaky, tell us what you are looking at — we have sat on both sides of this meeting.