
Three questions before you build an AI agent.

Every enterprise leader has been asked to approve an agentic AI pilot this quarter. Most of the ones that get approved will not ship. Three questions, asked in the first 30 minutes, separate the agent that runs in production from the agent that becomes a slide.

Published: March 2026
Topics: AI, Operations
Read time: 8 minutes

There is a specific kind of meeting happening inside every enterprise right now. A VP walks into a room with a deck. The deck proposes an agentic AI pilot — a system that will handle tier-one support, or reconcile invoices, or triage security alerts without human oversight. The business case is plausible. The vendor has been shortlisted. The budget is already soft-allocated. The ask is a signature.

We sit in these rooms, often on the side of the person being asked to sign. And we have learned that three questions, asked in the first 30 minutes, predict with high accuracy whether the pilot will ship into production. None of them is technical. All of them are the sort of thing a chief of staff would ask if the chief of staff had watched five of these projects fail.

Question one: what does the agent do when it is wrong?

This is the one that kills pilots. Not accuracy. Not latency. Not model selection. What happens when the agent produces a confident, articulate, wrong answer, and what does the system do next?

The honest answer in most decks is a version of “we will monitor it and iterate.” That is not an answer. That is a hope. A production-grade agent has three things written down before a single line of code ships:

  1. A failure mode catalog. Not theoretical. Specific. “The agent will sometimes assert that a refund was processed when it was not.” “The agent will sometimes route a high-severity ticket to the wrong queue.” Written in customer language. Owned by a named human.
  2. A detection path for each failure mode. How will you know when it happened? Log parsing? A human review sample? A downstream system that will surface the mismatch within 24 hours?
  3. A remediation path for each failure mode. Who rolls it back? Who tells the customer? Who changes the prompt or the tool schema? What is the SLA on all three?
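The three artifacts above are, at bottom, a structured record per failure mode. A minimal sketch of what one catalog entry might look like as data (the field names, the example failure, and the owner address are all illustrative, not from any real deployment):

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    """One entry in an agent's failure mode catalog."""
    description: str            # written in customer language, not model language
    owner: str                  # the named human accountable for this failure
    detection: str              # how you find out it happened
    detection_sla_hours: int    # how quickly detection must surface it
    remediation: str            # who rolls back, who tells the customer
    remediation_sla_hours: int  # how quickly remediation must complete

# Hypothetical entry for a support agent that handles refunds.
refund_assertion = FailureMode(
    description="The agent tells a customer a refund was processed when it was not.",
    owner="ops.lead@example.com",
    detection="Daily reconciliation of agent transcripts against the payments ledger.",
    detection_sla_hours=24,
    remediation="Ops lead reverses the ticket, emails the customer, files a prompt fix.",
    remediation_sla_hours=48,
)
```

The point is not the code; it is that every field is filled in before launch, and an empty field is a reason to delay.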

If the team proposing the pilot cannot answer this in 30 minutes without hedging, the pilot is not ready. Not because the technology is not there. Because the operating discipline around the technology is not there. That is a different, harder problem to fix, and it does not fix itself by picking a better model.

A system you cannot audit is a system you cannot ship.

Question two: can a human do the same task today, and how often do they get it right?

This question exposes a common mis-specification. Teams propose AI agents for tasks no one has measured. The assumption is that a human does the task, does it fine, and the AI will do it faster or cheaper. When we push on the assumption — actually, how often does a human get this right? — the answer is usually one of three things.

  1. “We don’t measure that.” This is the common case. Nobody has a baseline. Which means there is no way to tell if the agent is better, worse, or indistinguishable from the status quo. Which means there is no way to justify the pilot to the board.
  2. “Humans get it right 85% of the time.” Now we have a number to beat. We also have a new constraint: the agent has to meet or exceed 85%, not because 85% is a magic number, but because shipping below it destroys trust with the downstream teams who will refuse to cooperate on the next pilot.
  3. “Humans get it right 98% of the time, but they are slow and expensive.” This is the interesting case. The agent does not have to be more accurate. It has to hit 95%+ with a throughput or cost advantage. Now we have a real target and a real decision.

Without that baseline, any deployment is unfalsifiable. You cannot tell success from failure, and neither can the board. You are then running on vibes — and vibes do not survive contact with a difficult Q3.
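The three baseline cases above reduce to a small decision rule. A sketch, with the 3-point accuracy tolerance taken from the 98%-versus-95% example (the function name and threshold are illustrative assumptions, not a fixed methodology):

```python
from typing import Optional

def evaluate_against_baseline(human_accuracy: Optional[float],
                              agent_accuracy: float,
                              agent_cheaper_or_faster: bool) -> str:
    """Apply the three baseline cases to a proposed agent pilot."""
    if human_accuracy is None:
        # Case 1: nobody has measured the humans. The pilot is
        # unfalsifiable, so it cannot be justified either way.
        return "no-go: measure the human baseline first"
    if agent_accuracy >= human_accuracy:
        # Case 2: the agent meets or beats the human number outright.
        return "go: agent meets the human baseline"
    if agent_cheaper_or_faster and agent_accuracy + 0.03 >= human_accuracy:
        # Case 3: near-baseline accuracy with a real throughput or
        # cost advantage (the 3-point tolerance is illustrative).
        return "go: near-baseline accuracy with a cost advantage"
    return "no-go: agent underperforms the baseline"
```

Case 1 is deliberately a no-go rather than a maybe: without a baseline there is nothing to evaluate the agent against.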

Question three: what is the blast radius of a bad action?

Most discussion of agent safety is abstract. This question makes it concrete. If the agent does the wrong thing — not catastrophically wrong, just plausibly wrong — what is the worst outcome in the next 24 hours?

We score every proposed agent on a three-tier blast-radius scale, from tier one, where the blast radius is smallest, to tier three, where it is largest.

The rule we apply: no agent ships past tier one until it has been operating at tier one for at least 90 days with a measured error rate below the human baseline. No agent moves to tier three until it has been at tier two for a minimum of two quarters. This sounds slow. It is slow. It is also the difference between an agent that is still running next year and an agent that was quietly retired after an incident nobody wants to write about.
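The gating rule above is mechanical enough to write down. A sketch of the promotion check, assuming two quarters means 180 days (that conversion, and the choice to apply the error-rate condition only at the tier-one gate, as the rule states it, are interpretive assumptions):

```python
def may_promote(current_tier: int, days_at_tier: int,
                error_rate: float, human_baseline_error_rate: float) -> bool:
    """Promotion gate for an agent's blast-radius tier.

    Encodes the rule: at least 90 days at tier one with a measured
    error rate below the human baseline before tier two, and two
    quarters (taken here as 180 days) at tier two before tier three.
    Tier three is the ceiling; there is nothing to promote to.
    """
    if current_tier == 1:
        return days_at_tier >= 90 and error_rate < human_baseline_error_rate
    if current_tier == 2:
        return days_at_tier >= 180
    return False
```

Making the gate a function rather than a judgment call is most of the point: nobody argues an agent past the gate in a status meeting.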

Rule of thumb: the right first question is never “can the model do this?” It is “can our operation absorb the version of this that does it wrong?” The second question only becomes interesting once you have answered the first.

The pilots that ship.

We have watched enough agentic pilots ship — and fail to ship — to see the pattern. The ones that make it to production share four attributes, and they are all operational, not technical.

  1. A named operational owner, not a sponsor. Someone whose quarterly objectives improve or worsen based on whether the agent works. Not a VP who signed off. Not a data scientist. A person running a number the agent touches.
  2. A short, boring first use case. Not the most ambitious one on the roadmap. The one where a human currently spends two hours a day doing something repetitive, the error rate is already known, and the blast radius is tier one. Ship that. Learn. Then move up.
  3. A rollback plan that has been tested at least once. Most teams have a rollback plan on paper. Very few have ever actually triggered it in a live environment. The ones who have are the ones whose next pilot goes smoothly.
  4. A 90-day review cadence with the right stakeholders in the room. Legal, ops, customer support, finance, security. Not quarterly. Not ad hoc. On a calendar. The agent is a new employee and gets a new-employee review schedule.

What to do with the answer.

Ask the three questions. Listen to the answers. Then do one of three things.

Green light. The team has failure modes, a baseline, and a tier-one use case. They have a named owner with skin in the game. Approve the pilot with a 90-day gate and a written definition of what “working” means.

Yellow light. The team has one or two of the three answers but is vague on the others. Send them back for two weeks with a specific list of what is missing. Do not fund a half-specified pilot — it will burn trust on both sides.

Red light. No answers, or answers dressed up in jargon. Decline the pilot. Offer to help the team build the answers. If they are not willing to do the operational work, the technical work is not the thing standing between them and success.

The short version.

Agentic AI is not a technology purchase. It is an operating model change. The technology is ready for more than most teams are ready to absorb, and the gap between what the model can do and what your operation can absorb is where pilots die.

Three questions, 30 minutes, a lot of discipline saved downstream. If the answers come back clean, the rest is engineering. If they do not, there is no amount of engineering that fixes it.


Have an agentic AI pilot on the desk?

Before you sign, run the three questions. If the answers are shaky, tell us what you are looking at — we have sat on both sides of this meeting.
