Why most AI features fail before they ship

An MIT study cited across industry coverage found that 95% of enterprise generative AI pilots deliver no measurable P&L impact. Separately, 42% of companies reported scrapping most of their AI initiatives in 2025, up from 17% the year before, according to S&P Global Market Intelligence. The failures cluster around integration, data quality and governance, not model capability.

If you are a founder or product lead being pushed to "add AI", the useful question is not which model to pick. It is whether the feature you are imagining will survive contact with real users, real margins and real edge cases. This is the playbook we use at Devonic when clients ask us to build AI into their product, written for the people who have to own the result a year after launch.


Start with the friction, not the model

The mistake of picking a technology first

The most common pattern we see: a team decides to "add AI to the product", picks a vendor, then spends three months hunting for a problem the technology can plausibly solve. The output is a chatbot bolted to the dashboard, a summarisation feature nobody asked for, or a recommendation widget competing with the search bar already on the page.

Users do not want AI. They want their work to be faster, less annoying, or possible at all. A model is just one possible tool for that. Treating "AI" as the feature is the equivalent of treating "JavaScript" as a feature. Nobody buys software for what it is written in.

How to identify a problem worth solving with AI

A problem is a candidate for AI when three things are true:

  1. You can name the workflow. Not "support" but "tier-one questions about password resets and shipping status during business hours".
  2. You can measure the current cost. Time per task, error rate, tickets per week, abandonment at a specific step. If you cannot put a number on it now, you will not be able to prove the AI worked later.
  3. Rules alone are too brittle and humans alone are too slow. This is the narrow band where AI genuinely earns its place. If a deterministic rule would work, write the rule. If only a human can judge it, hire or train.

We push clients to write the success metric before they write a single line of prompt. "Reduce time to complete onboarding task X by 40%" is a brief you can ship against. "Add AI to onboarding" is not.


The unit economics trap

Here is a pattern that has burned several teams in the last eighteen months. A SaaS product launches a "chat with your data" feature using a top-tier model, bundles it into the existing plan to drive adoption, and watches engagement metrics climb. Then the finance team runs the numbers.

Mind The Product describes the shape of it well: most users ask small questions, but a handful of power users start pasting giant CSV exports into the chat. For that subset of accounts, the AI feature now costs around forty dollars per month per heavy user in model and retrieval costs, while those users are paying thirty dollars for the whole product. The feature is a hit. The unit economics are a disaster.

This is the structural change AI introduces. Traditional features have a near-zero marginal cost per use. AI features charge you every time a user breathes on them. A good demo and a bad pricing model can quietly turn your best customers into your worst ones.

Modelling inference costs before you commit

Before we build, we run a simple exercise with the client:

  • Estimate calls per user per session, with separate numbers for median and 95th percentile usage.
  • Multiply by the realistic input and output token sizes for the actual feature, not a toy prompt.
  • Apply the provider's pricing across a range of plausible models, including one cheaper fallback.
  • Stress-test against the heaviest 5% of accounts, because that is who will define your margin.
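The exercise above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the segment sizes, token counts and per-million-token prices are all invented for illustration, and the names ModelPrice, UsageProfile, monthly_cost and margin_rows are hypothetical. Substitute your own measured usage and your provider's published rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical prices in USD per million tokens; replace with real provider rates."""
    name: str
    input_per_million: float
    output_per_million: float


@dataclass(frozen=True)
class UsageProfile:
    """Assumed monthly usage for one user segment."""
    label: str
    calls_per_month: int
    input_tokens_per_call: int
    output_tokens_per_call: int


def monthly_cost(profile: UsageProfile, price: ModelPrice) -> float:
    """Inference cost in USD for one user in this segment over one month."""
    input_tokens = profile.calls_per_month * profile.input_tokens_per_call
    output_tokens = profile.calls_per_month * profile.output_tokens_per_call
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


def margin_rows(profiles, prices, plan_price):
    """Return (segment, model, cost, margin) for every segment and model pair."""
    rows = []
    for profile in profiles:
        for price in prices:
            cost = monthly_cost(profile, price)
            rows.append((profile.label, price.name,
                         round(cost, 2), round(plan_price - cost, 2)))
    return rows


# Illustrative numbers only, not measured usage or real prices.
PROFILES = [
    UsageProfile("median", calls_per_month=40,
                 input_tokens_per_call=1_500, output_tokens_per_call=300),
    UsageProfile("p95", calls_per_month=200,
                 input_tokens_per_call=60_000, output_tokens_per_call=800),
]
PRICES = [
    ModelPrice("premium", input_per_million=3.00, output_per_million=15.00),
    ModelPrice("fallback", input_per_million=0.25, output_per_million=1.25),
]

if __name__ == "__main__":
    for row in margin_rows(PROFILES, PRICES, plan_price=30.00):
        print(row)
```

With these made-up inputs, the 95th percentile user costs about $38.40 a month on the premium model against a $30 plan, and about $3.20 on the fallback. The median user costs well under a dollar on either. The point of the table is the gap between the rows, not the specific figures.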

Then we decide, with the client, what to do about the tail: hard usage caps, a separate AI add-on tier, throttling on large inputs, or a cheaper model for the bulk of queries with a smarter model reserved for harder ones. That conversation happens at the design stage, not after the bill arrives. It is exactly the kind of trade-off we work through in digital strategy consulting before any code is written.
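As a rough illustration of how those options can combine, here is a minimal sketch of a per-request policy check. Everything in it is an assumption made for the example: the TailPolicy limits, the model names, and the is_hard flag, which stands in for whatever signal your product uses to decide a query needs the stronger model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TailPolicy:
    """Hypothetical limits; set real values from your own cost modelling."""
    monthly_token_cap: int = 2_000_000
    max_input_tokens: int = 20_000
    cheap_model: str = "cheap-model"   # placeholder name, not a real model
    smart_model: str = "smart-model"   # placeholder name, not a real model


def route_request(policy: TailPolicy, tokens_used_this_month: int,
                  input_tokens: int, is_hard: bool) -> dict:
    """Decide whether to serve a request and which model tier should handle it."""
    # Hard usage cap: stop before the account exceeds its monthly budget.
    if tokens_used_this_month + input_tokens > policy.monthly_token_cap:
        return {"action": "reject", "reason": "monthly_cap_reached", "model": None}
    # Throttle oversized inputs such as a pasted CSV export.
    if input_tokens > policy.max_input_tokens:
        return {"action": "trim_input", "reason": "input_too_large", "model": None}
    # Cheap model for the bulk of queries, smarter model only for harder ones.
    model = policy.smart_model if is_hard else policy.cheap_model
    return {"action": "serve", "reason": "within_limits", "model": model}
```

A separate AI add-on tier would typically show up here as a different TailPolicy per plan rather than as extra branching.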

The CFO question

If your AI feature ships with no usage cap and a flat price, you have implicitly bet that no customer will ever use it intensely. That bet is almost always wrong.


Trust is a design problem, not a model problem


What the Deloitte and Air Canada cases actually show

Two cases get cited a lot. In 2025, Deloitte's Australian arm delivered an AI-assisted report to a federal department with fabricated citations and fabricated court quotes. The firm refunded part of the fee and took a reputational hit. Air Canada's virtual assistant gave a customer incorrect bereavement fare information, and in February 2024 a tribunal held the airline liable and ordered it to compensate the customer.

Both are usually told as AI failure stories. They are actually UX and process failures. The model did what models do: per NN/g, modern LLMs can be factually wrong anywhere from 1% to 30% of the time depending on prompt and context. The failure was that the output looked authoritative enough to publish or act on without a check. A polished interface that hides uncertainty is more dangerous than a rough one that exposes it.

Trust markers and honest interfaces

Consumer comfort with brands using AI dropped from 57% to 46% between 2023 and 2024, according to Statista. NN/g calls 2025 "the year of post-hype AI" and notes that users are fatigued by AI slop. You cannot ship a generic AI sparkle and expect goodwill anymore.

What works in the interface:

  • Show the source. If the AI cites a document, link to the exact passage. If it does not have one, say so.
  • Expose confidence honestly. "I'm not sure" is a feature, not a bug.
  • Make correction easy. Let users flag a wrong answer in one click and route it somewhere a human will see.
  • Constrain the model to your data. Retrieval against your own knowledge base beats open-ended generation for most product use cases.
  • Never present generated content as the final word in any context where being wrong has a cost (legal, medical, financial, refunds).
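One way to make those rules hard to skip is to build them into the shape of the data the interface receives. The sketch below is an assumption-laden illustration, not a prescribed schema: the Answer and Source types, the present function, the 0.7 threshold and the confidence score itself are all hypothetical, and a model-reported confidence should not be assumed to be well calibrated.

```python
from dataclasses import dataclass, field
from typing import List

# Contexts where being wrong has a cost, taken from the list above.
HIGH_STAKES_CONTEXTS = {"legal", "medical", "financial", "refunds"}


@dataclass(frozen=True)
class Source:
    title: str
    url: str
    passage: str  # the exact passage the answer relies on


@dataclass
class Answer:
    text: str
    sources: List[Source] = field(default_factory=list)
    confidence: float = 0.0  # application-defined score in [0, 1]; an assumption


def present(answer: Answer, context: str, min_confidence: float = 0.7) -> dict:
    """Build the payload the UI renders, with uncertainty made explicit."""
    notices = []
    if not answer.sources:
        notices.append("No supporting source was found for this answer.")
    if answer.confidence < min_confidence:
        notices.append("I'm not sure about this answer.")
    return {
        "text": answer.text,
        "sources": [{"title": s.title, "url": s.url, "passage": s.passage}
                    for s in answer.sources],
        "notices": notices,
        # Generated content is never the final word in high-stakes contexts.
        "needs_human_review": context in HIGH_STAKES_CONTEXTS,
        # The UI always offers a one-click "flag as wrong" control.
        "can_flag": True,
    }
```

Because the notices and the review flag travel with every answer, a front end that forgets to render them is a visible bug rather than a silent default.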

This is design work, not modelling work. Most of the AI projects we are proud of involved more time on the interface and the fallback paths than on the prompt itself. Good digital product design for AI looks boring on the surface and rigorous underneath.

"Trust has to be earned directly through the product experience, in how it behaves under uncertainty and how it allows people to correct it."

NN/g, State of UX 2026


Narrow scope wins: how to pick the right workflow

The teams shipping AI features that survive past launch are not building "AI-native products". They are picking one workflow where AI removes real friction and leaving the rest of the product alone.

A useful mental model: most of your product should stay deterministic. Rules, validations, state machines, database queries. AI sits inside that as a recommendation layer at specific decision points, with guardrails around it. This is the hybrid pattern that consistently beats both extremes: jamming AI into rigid systems, or ripping out the rules entirely.

Examples we have seen work:

  • Research synthesis. A team runs thirty user interviews. AI handles transcription, tagging and first-pass theme detection. The researcher validates, edits and decides. Two weeks of work becomes two to three days. The human still owns the conclusion.
  • Support triage. AI routes incoming tickets, suggests a draft reply, and surfaces relevant past tickets. The agent reviews and sends. Tier-one ticket time drops without removing the human judgement on edge cases.
  • Invoice and document checking. AI flags anomalies against rules the team already had. A person confirms. Errors that used to slip through get caught early.

What these have in common: AI does the first pass, a human approves the result, and the system fails safely if the model is wrong. That last part is the one most teams skip.
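The "fails safely" part is easier to see in code. Here is a minimal sketch of the invoice-checking case, assuming a hypothetical invoice dictionary with amounts in cents and a suggest_anomaly callable that wraps whatever model call you use. Neither comes from a real library.

```python
from typing import Callable, Optional


def check_invoice(invoice: dict,
                  suggest_anomaly: Callable[[dict], Optional[str]]) -> dict:
    """Run the existing rules first, then ask the model for a non-binding note."""
    # Deterministic rules: these never depend on the model being available.
    rule_flags = []
    line_total = sum(line["amount_cents"] for line in invoice["lines"])
    if line_total != invoice["total_cents"]:
        rule_flags.append("total_mismatch")
    if invoice["total_cents"] <= 0:
        rule_flags.append("non_positive_total")

    # AI first pass: a suggestion only, wrapped so a failure cannot leak out.
    ai_note = None
    model_failed = False
    try:
        ai_note = suggest_anomaly(invoice)
    except Exception:
        # Fail safe: a model error sends the invoice to a person,
        # it never blocks the pipeline or approves anything.
        model_failed = True

    needs_review = bool(rule_flags) or ai_note is not None or model_failed
    return {
        "rule_flags": rule_flags,
        "ai_note": ai_note,
        "model_failed": model_failed,
        "status": "needs_human_review" if needs_review else "passed_checks",
    }
```

The design choice worth noting is that the model can only add work for a human, never remove it: a flag or a failure routes to review, and nothing is approved on the model's say-so.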

The narrow-scope test

If you cannot describe your AI feature in one sentence that names a specific user, a specific task and a specific improvement, the scope is too wide. Cut it down before you build.


Measuring what matters after launch

In 2024 it was acceptable to define AI success as "percentage of users who clicked the AI button". That was the experimentation phase. It is not enough now. CFOs are asking harder questions about margin, and users are less impressed by novelty.

The metrics that earn their keep are tied to outcomes the business already tracks:

  • Task completion time for the specific workflow the feature touches
  • Tier-one ticket volume for support-adjacent features
  • Conversion or completion rate on the step the AI is supposed to unblock
  • Error rate or rework rate in processes where AI does a first pass
  • Cost per resolved request, including inference, so you can see margin impact, not just usage
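The last metric is simple arithmetic, but it helps to pin down what goes into it. The sketch below is one possible definition, assuming the only costs are inference spend and agent time. The function name, the parameters and the example figures are all hypothetical.

```python
def cost_per_resolved_request(inference_usd: float, agent_minutes: float,
                              agent_rate_per_hour: float,
                              resolved_requests: int) -> float:
    """Blended cost in USD of one resolved request over a reporting period."""
    if resolved_requests <= 0:
        raise ValueError("resolved_requests must be positive")
    labour_usd = agent_minutes * agent_rate_per_hour / 60
    return (inference_usd + labour_usd) / resolved_requests


# Illustrative figures only: the same 1,000 tickets before and after launch.
before = cost_per_resolved_request(0.0, 5_000, 30.0, 1_000)
after = cost_per_resolved_request(120.0, 3_000, 30.0, 1_000)
```

With these invented numbers the cost per resolved request falls from about $2.50 to about $1.62, even though inference spend went from zero to $120. Tracking usage alone would show only the new spend, not the margin improvement.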

McKinsey's 2025 research points to workflow redesign as the single biggest driver of EBIT impact from generative AI. That is consistent with what we see: the AI features that move numbers are the ones embedded into how work actually gets done, not the ones bolted on as a separate experience.

There is also a maintenance reality nobody mentions in the demo. AI features need an ongoing owner. Model providers update their endpoints. User behaviour drifts. Edge cases accumulate. A prompt that worked in March may behave differently in October. Budget for the person who will watch the metrics and tune the system, or accept that quality will quietly erode.


How Devonic approaches AI feature builds

We are a custom software and SaaS studio, so our bias is towards features that ship, get used, and survive the second year. When a client asks us to add AI to their product, our default sequence looks like this:

  1. Problem audit first. We map the workflow, measure the current cost, and challenge whether AI is the right tool. Sometimes the honest answer is a better rule, a clearer interface, or a missing report.
  2. Unit economics modelling. Before design starts, we estimate inference cost per user per tier across realistic and worst-case usage. If the maths does not work, we adjust scope, model choice or pricing now, not later.
  3. Narrow scope and hybrid architecture. AI sits inside deterministic guardrails at a specific decision point. The rest of the system stays predictable.
  4. Honest interface. Sources visible, confidence exposed, correction easy, fallback paths defined for when the model is wrong.
  5. Outcome metrics from day one. We instrument the workflow the AI is supposed to improve, not just the AI calls themselves.
  6. A named owner for the long tail. Model performance, drift, and quality reviews are scheduled, not assumed.

None of that is glamorous. It is also why the features we ship tend to still be in production a year later, doing the job they were built to do. To see what that looks like in practice, browse our work for examples across SaaS, marketplace and content products.