AI Strategy

Your AI Pilot Worked. Here's Why It Still Won't Ship.

9 June 2026 · 8 min read

There's a statistic floating around that some large share of enterprise AI pilots never reach production. I don't know if the exact number is right, but the shape of it matches what I see. The pilots mostly work. The demos go well. Everyone agrees it's promising. And then — nothing. Six months later the pilot is still a pilot, the sponsor has moved on, and the team quietly stops mentioning it in stand-ups.

What's frustrating is that the reasons are rarely technical. I've been called in to rescue enough of these to have a list, and the list is boring. That's the good news: boring problems have boring fixes.

The five ways pilots die

1. Nobody owns it after the demo

A pilot typically has a champion — someone who wanted it, found budget for it, and showed it to the executive team. Production systems need something different: an owner. Someone whose actual job includes keeping it running, watching its quality, handling the complaints, and arguing for its budget next year.

These are different roles, and the handover between them almost never happens by itself. The single most reliable predictor I've found for whether a pilot ships is whether anyone can answer the question "who gets paged when it's wrong?" If the answer is a shrug, the pilot is already dead. It just doesn't know yet.

2. The pilot ran on clean data

Pilots get the curated dataset. The fifty hand-picked documents, the tidy export, the test tenant. Production gets the real thing: the SharePoint site with three contradictory versions of every policy, the CRM where half the fields are free-text dumping grounds, the ticket history full of "see attached" with no attachment.

The fix is unsatisfying but effective: run at least part of the pilot on the worst data you have, not the best. If the system holds up against your messiest content, production becomes an expansion rather than a surprise. If it doesn't, you've learned that for the price of a pilot rather than the price of a launch.

3. Security and legal saw it last

The pattern: a team builds for three months, gets sign-off scheduled, and then the security review raises questions that should have been design inputs. Where does the data go, what does the vendor retain, what's the story for prompt injection, who audited the permissions on that connector? Each question is reasonable. Each one now requires rework. The momentum dies in the queue.

Inverting this is cheap. A one-hour conversation with your security team in week one — "here's what we want to build, what would make you reject it?" — turns the review from a gate into a checklist. I've watched this single meeting cut months off delivery timelines.

4. Nobody defined what "good enough" means

Pilots get judged by impression: people try it, it seems impressive, thumbs up. Production needs a number. What accuracy on what test set? What rate of escalations to humans is acceptable? At what error rate do we switch it off?

Without an agreed threshold, every individual mistake becomes a referendum on the whole system. One bad answer lands in the wrong inbox and the project is "unreliable" — even if it's right 96% of the time and the process it replaced was right 89% of the time. Set the bar before launch, measure against it, and publish the numbers. Systems with published numbers survive their first bad week. Systems judged on vibes don't.
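One way to make the bar concrete is to write it down as data and check each week's results against it. The sketch below is illustrative only: the three thresholds mirror the questions above, but the class names, field names, and example values are all assumptions, not figures from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchThresholds:
    """Hypothetical bar agreed before launch; the values are placeholders."""
    min_accuracy: float         # minimum share of answers judged correct
    max_escalation_rate: float  # maximum share of requests handed to a human
    kill_error_rate: float      # error rate at which the system is switched off


@dataclass(frozen=True)
class WeeklyResults:
    """Counts from one week of reviewed traffic (assumed to be available)."""
    total: int
    correct: int
    escalated: int


def evaluate(results: WeeklyResults, bar: LaunchThresholds) -> dict:
    """Compare one week of results against the agreed thresholds."""
    if results.total <= 0:
        raise ValueError("need at least one reviewed request")

    accuracy = results.correct / results.total
    error_rate = (results.total - results.correct) / results.total
    escalation_rate = results.escalated / results.total

    # The kill switch is checked first: it overrides everything else.
    if error_rate >= bar.kill_error_rate:
        status = "switch off"
    elif accuracy >= bar.min_accuracy and escalation_rate <= bar.max_escalation_rate:
        status = "meets bar"
    else:
        status = "below bar"

    return {
        "status": status,
        "accuracy": accuracy,
        "escalation_rate": escalation_rate,
    }


if __name__ == "__main__":
    # Placeholder thresholds and counts, chosen only for illustration.
    bar = LaunchThresholds(min_accuracy=0.90, max_escalation_rate=0.15, kill_error_rate=0.20)
    week = WeeklyResults(total=500, correct=480, escalated=40)
    print(evaluate(week, bar))
```

The point of the structure is that the report is the same shape every week, so a single bad answer is read against a published number instead of against memory.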

5. The cost model was a guess

The pilot cost a few hundred dollars a month, so nobody did the maths for ten thousand users. Then someone does the maths, the number has more digits than expected, and the business case needs to be re-litigated from scratch — this time with a sceptical audience.

Token costs in production are an architecture problem more than a pricing problem — caching, routing, and output limits change the bill by multiples. But you have to do that work during the pilot, when changing the architecture is cheap, not after the invoice arrives.
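The maths itself is short enough to do on day one. Here is a rough sketch of a monthly cost projection, under loud assumptions: the per-token prices are made-up placeholders rather than any vendor's rates, a cache hit is treated as costing nothing, and routing between models is not modelled at all. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Usage:
    """Assumed usage profile; every number you plug in is an estimate."""
    users: int
    requests_per_user_per_day: float
    input_tokens: int    # average input tokens per request
    output_tokens: int   # average output tokens per request
    working_days: int = 22


@dataclass(frozen=True)
class Pricing:
    """Placeholder prices in dollars per million tokens, not real vendor rates."""
    input_per_million: float
    output_per_million: float


def monthly_cost(
    usage: Usage,
    pricing: Pricing,
    cache_hit_rate: float = 0.0,
    max_output_tokens: Optional[int] = None,
) -> float:
    """Estimate the monthly bill in dollars.

    Simplification: a cached request is assumed to cost nothing.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")

    requests = usage.users * usage.requests_per_user_per_day * usage.working_days
    billable_requests = requests * (1.0 - cache_hit_rate)

    # An output limit caps the average output length used in the estimate.
    output_tokens = usage.output_tokens
    if max_output_tokens is not None:
        output_tokens = min(output_tokens, max_output_tokens)

    cost_per_request = (
        usage.input_tokens * pricing.input_per_million / 1_000_000
        + output_tokens * pricing.output_per_million / 1_000_000
    )
    return billable_requests * cost_per_request


if __name__ == "__main__":
    pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
    pilot = Usage(users=50, requests_per_user_per_day=20, input_tokens=3000, output_tokens=800)
    production = Usage(users=10_000, requests_per_user_per_day=20, input_tokens=3000, output_tokens=800)

    print(f"pilot:                 ${monthly_cost(pilot, pricing):,.0f}")
    print(f"production, as built:  ${monthly_cost(production, pricing):,.0f}")
    print(f"production, tuned:     ${monthly_cost(production, pricing, cache_hit_rate=0.3, max_output_tokens=400):,.0f}")
```

With these placeholder inputs the pilot comes to roughly $462 a month and the same design at ten thousand users to roughly $92,400; a 30% cache hit rate plus a 400-token output cap brings that to about $46,200. The specific figures mean nothing outside this example. What matters is that the two architectural knobs are parameters you can vary while the design is still cheap to change.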

What shipping teams do differently

The teams I've watched get pilots into production share a few habits, none of them glamorous:

  • They build the pilot inside production constraints. Real auth, real data permissions, real network rules. Slower to start, dramatically faster to finish.
  • They name the owner on day one. Not the champion — the person who'll run it. That person shapes the pilot to be something they can actually operate.
  • They write the launch criteria before the pilot starts. Three or four measurable conditions. When the conditions are met, it ships. Nobody gets to move the goalposts in either direction.
  • They keep the scope embarrassingly small. One workflow, one user group, one system integration. The pilots that try to prove everything prove nothing in time to matter.

The question to ask before you start

If you're about to kick off an AI pilot, the most useful question isn't "will this work?" — pilots almost always work. It's "what would stop this from shipping?" Ask it in week one, write down the answers, and spend the pilot retiring those risks rather than polishing the demo.

If you've got a pilot that worked and stalled anyway, and you want help working out which of these it hit, get in touch.

Tags: AI Strategy, Production, Enterprise AI, Delivery