Artificial Intelligence 17 min read

From Pilot to Production: Putting Generative AI to Work

Khoirush Akbar

Principal Consultant, Data & AI · June 2, 2026

A practical operating model for enterprises crossing the GenAI divide.

Executive Summary

Generative AI has cleared the credibility threshold in the boardroom. The question is no longer whether it works, but why so little of it reaches production — and what to do about it.

The most-cited evidence is sobering. MIT's NANDA initiative, in The GenAI Divide: State of AI in Business 2025, found that roughly 95% of enterprise GenAI pilots deliver no measurable impact on profit and loss, with only about 5% reaching scaled production. In their sample, 60% of organizations evaluated GenAI tools, just 20% reached pilot stage, and only 5% made it into production. Importantly, MIT concluded the core barrier is not the model, infrastructure, regulation, or talent — it is the inability of systems to integrate into workflows, retain context, and demonstrate outcomes.

This is good news disguised as bad news. It means the gap between pilot and production is largely an operating-model and engineering-discipline problem, not a fundamental technology limitation — and those are problems enterprises know how to solve.

This article lays out a vendor-neutral approach to crossing that divide: how to select use cases by value, build an AI-ready data and retrieval foundation, operationalize models with LLMOps and evaluation, embed governance by design, and organize teams to treat GenAI as a product, not a project. It includes a reference architecture, the KPIs that matter, a phased roadmap, and the pitfalls that quietly kill most initiatives.

The Business Challenge

Across the enterprises we advise — in banking, manufacturing, energy, logistics, the public sector, and state-owned enterprises (BUMN) — the pattern is strikingly consistent. The organization runs an impressive demo. A copilot summarizes documents, drafts emails, or answers questions over a knowledge base. Executives are encouraged. Budget is approved. And then, six to twelve months later, the initiative quietly stalls. The pilot is "wrapped up," the budget is reallocated, and the underlying problems remain unsolved for the next pilot to rediscover.

Three business realities make this an urgent problem rather than a tolerable one:

1. The bill is coming due. Pilots launched in 2023–2024 are now entering serious budget-review cycles. When a program cannot demonstrate defensible business value, it gets cancelled. Gartner has predicted that, through 2026, organizations will abandon 60% of AI projects that are not supported by AI-ready data, and that over 40% of agentic AI projects will be cancelled by the end of 2027. Boards that approved AI on faith are now asking for returns.

2. The money is going to the wrong places. MIT found that more than half of GenAI budgets are spent on sales and marketing, while the highest, most defensible ROI consistently shows up in back-office and operational automation — reducing outsourcing, compressing cycle times, and removing manual rework. Many enterprises are optimizing for visibility instead of value.

3. "Deploying a model" and "generating value from a model" are not the same thing — and they are routinely treated as if they were. A model that performs in a controlled demo collides with messy data, regulatory constraints, undefined success metrics, and organizational inertia the moment it meets a real workflow.

For a CFO, the question is simple: we have spent on AI — where is the return? For a CIO or CDO, the challenge is to convert experimentation into a repeatable, governed, cost-controlled capability. Both questions have the same answer, and it is not "a better model."

Why Traditional Approaches Fail

The pilot-to-production gap is not random. It is the predictable result of a handful of recurring mistakes.

Demo-driven thinking. A demo is engineered to succeed on a narrow, curated path. Production must handle the long tail of real inputs, edge cases, and adversarial users. Organizations mistake a successful demo for a near-complete product, then are surprised by the 80% of work that remains: integration, evaluation, monitoring, security, and change management.

Treating GenAI as an IT project, not a product. Projects have an end date; products have an owner, a roadmap, and a feedback loop. GenAI systems degrade without continuous evaluation and tuning because data, user behavior, and models all change. A "project" mindset ships once and walks away — which is exactly when a GenAI system begins to drift.

No evaluation, no observability. Most stalled initiatives cannot answer a basic question: is the output good, and is it getting better or worse? Without systematic evaluation (accuracy, groundedness, safety) and production observability (latency, cost, hallucination rate, user feedback), teams are flying blind and cannot earn the trust required to scale.
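To make "systematic evaluation" concrete, here is a minimal sketch of a regression gate over a golden dataset. Everything in it is an illustrative assumption: the `GoldenCase` structure, the token-overlap `groundedness` proxy, the `fake_generate` stand-in for a model call, and the 0.8 threshold are hypothetical. A real harness would likely use stronger scorers, such as model-graded or entailment-based checks.

```python
from dataclasses import dataclass
import re


@dataclass
class GoldenCase:
    question: str
    context: str          # source passage the answer must be grounded in
    required_terms: list  # facts a correct answer must mention


def tokens(text: str) -> set:
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, context: str) -> float:
    """Crude proxy: share of answer tokens that also appear in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


def accuracy(answer: str, required_terms: list) -> float:
    """Share of required terms present in the answer."""
    if not required_terms:
        return 1.0
    found = sum(1 for term in required_terms if term.lower() in answer.lower())
    return found / len(required_terms)


def evaluate(generate, cases, min_score=0.8):
    """Run every golden case through `generate`; return pass rate and details."""
    results = []
    for case in cases:
        answer = generate(case.question, case.context)
        acc = accuracy(answer, case.required_terms)
        grd = groundedness(answer, case.context)
        results.append({
            "question": case.question,
            "accuracy": acc,
            "groundedness": grd,
            "passed": acc >= min_score and grd >= min_score,
        })
    if not results:
        return 0.0, []
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


# Hypothetical stand-in for a real model call: it simply echoes the context.
def fake_generate(question, context):
    return context


CASES = [
    GoldenCase(
        "What is the refund window?",
        "Refunds are accepted within 30 days of purchase.",
        ["30 days"],
    ),
]

pass_rate, results = evaluate(fake_generate, CASES)
print(f"pass rate: {pass_rate:.0%}")
```

Run on every prompt, model, or retrieval change, a gate like this turns "is it getting better or worse?" into a number that can block a release.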

Data that was never AI-ready. Retrieval-augmented systems are only as good as the knowledge they retrieve. Siloed, inconsistent, ungoverned, and poorly documented data produces confident, wrong answers — and a "verification tax" where humans spend so long checking AI output that the efficiency gain evaporates. The MIT and Gartner findings converge here: data readiness, not model quality, is the dominant failure cause. (This is why a trusted, governed data foundation is the real prerequisite.)

Governance bolted on at the end. When security, privacy, data residency, and responsible-AI controls are treated as a launch-gate afterthought, they either block go-live or get waived under pressure — creating risk. In regulated Indonesian contexts (OJK oversight in financial services, the Personal Data Protection Law, sovereignty expectations for government and BUMN), governance designed in late is governance that fails.

The "build everything ourselves" reflex. MIT's data is blunt: strategic partnerships and vendor-led solutions succeed roughly twice as often as internal builds (about 67% vs. 33%). Enterprises consistently underestimate the integration, evaluation, and operations work, and over-invest engineering effort recreating commodity capabilities instead of focusing it on their proprietary advantage.

Crossing the GenAI Divide requires treating productionization as a system with deliberate components — what we call an AI Production System. It rests on seven pillars.

1. Use-case selection by value × feasibility. Stop running a dozen shallow pilots. Score candidate use cases on business value (revenue, cost, risk, experience) against feasibility (data readiness, workflow fit, risk tolerance). Prioritize a small number of high-value, deeply integrated use cases — ideally in back-office and operations, where ROI is most defensible — and commit to taking them all the way to production.

2. An AI-ready data and retrieval foundation. Production GenAI sits on a governed data platform: a lakehouse with a medallion structure, a business glossary and catalog, lineage, quality gates, and a retrieval layer (vector + structured) that supplies models with trustworthy, permission-aware context. This is the single highest-leverage investment, because it determines the ceiling on every use case built on top. (See our modern data platform reference architecture.)

3. LLMOps, evaluation, and observability. Establish an evaluation harness (golden datasets, automated scoring for accuracy/groundedness/safety, regression testing on every change) and production monitoring (cost per request, latency, hallucination and refusal rates, user thumbs up/down). This is how you convert "it seems to work" into "we can prove it works, and we will know the moment it doesn't."

4. Workflow integration with human-in-the-loop. Value is created when AI is embedded inside the workflow — in the CRM, the ERP, the service desk, the underwriting screen — not parked in a separate chat window. High-stakes decisions keep a human in the loop, with the AI accelerating rather than replacing judgment. Adoption is a design problem, not a training afterthought.

5. Governance and responsible AI by design. Access control, PII handling, data residency, audit trails, content guardrails, and model risk management are designed in from day one and mapped to the relevant regulatory regime. The aim is governance that enables safe scale rather than blocking it.

6. The right operating model and ownership. Decentralize execution with clear ownership. The most effective pattern is a small central AI platform/enablement team (providing the foundation, guardrails, and reusable components) supporting product-aligned squads that own specific use cases end-to-end. MIT found clear ownership and decentralized authority — not raw budget — to be decisive.

7. FinOps for AI. Token and inference costs scale with usage and can quietly destroy a business case. Track unit economics (cost per resolved ticket, per generated document, per query), apply model routing (right-size the model to the task), caching, and guardrails so the ROI that justified the project survives contact with production volume.
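As an illustration of the unit-economics idea in the last pillar, the sketch below computes cost per resolved ticket from a request log. The `Request` record, the model names, and the prices in `PRICE_PER_1K` are made-up placeholders, not real vendor rates.

```python
from dataclasses import dataclass


@dataclass
class Request:
    model: str
    input_tokens: int
    output_tokens: int
    resolved: bool  # did this request resolve the ticket without escalation?


# Hypothetical USD prices per 1,000 tokens as (input, output). Not real rates.
PRICE_PER_1K = {
    "small": (0.0005, 0.0015),
    "large": (0.0100, 0.0300),
}


def request_cost(req: Request) -> float:
    """Token cost of a single request under the placeholder price table."""
    in_price, out_price = PRICE_PER_1K[req.model]
    return req.input_tokens / 1000 * in_price + req.output_tokens / 1000 * out_price


def cost_per_resolved_ticket(requests) -> float:
    """Total spend divided by resolved tickets; unresolved requests still cost money."""
    total = sum(request_cost(r) for r in requests)
    resolved = sum(1 for r in requests if r.resolved)
    if resolved == 0:
        return float("inf")
    return total / resolved


log = [
    Request("large", 2000, 500, True),
    Request("large", 2000, 500, False),
    Request("small", 2000, 500, True),
]
print(f"cost per resolved ticket: ${cost_per_resolved_ticket(log):.4f}")
```

Dividing by resolved tickets rather than by requests is the design choice that matters here: failed or escalated requests inflate the unit cost, which is exactly what a business case needs to see.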

Reference Architecture

A production-grade enterprise GenAI architecture is layered, with governance and observability cutting across every layer rather than sitting beside them.

flowchart TB

subgraph EX["Experience Layer"]
    A1["Embedded in CRM / ERP / Service Desk"]
    A2["Copilots & Chat"]
    A3["APIs / Automations"]
end

subgraph OR["Orchestration and Agent Layer"]
    B1["Prompt and Context Management"]
    B2["Tool and Function Calling"]
    B3["Agent Workflows and Routing"]
end

subgraph MD["Model Layer"]
    C1["Foundation Models"]
    C2["Fine-tuned / Domain Models"]
    C3["Model Router and Fallback"]
end

subgraph RK["Retrieval and Knowledge Layer"]
    D1["Vector Search / RAG"]
    D2["Structured Data Access"]
    D3["Permission-aware Retrieval"]
end

subgraph DF["Data Foundation"]
    E1["Lakehouse - Medallion Bronze Silver Gold"]
    E2["Catalog, Glossary, Lineage"]
    E3["Data Quality Gates"]
end

EX --> OR
OR --> MD
OR --> RK
RK --> DF
MD --> RK

subgraph CROSS["Cross-cutting"]
    F1["Security, Privacy and Data Residency"]
    F2["Governance and Responsible AI Guardrails"]
    F3["Evaluation, Observability and FinOps"]
end

How to read it. The experience layer delivers AI where work already happens. The orchestration layer manages context, tools, and multi-step agent workflows. The model layer routes each request to the right model — from a small, cheap model for simple tasks to a frontier model for complex reasoning. The retrieval layer grounds responses in enterprise knowledge with permissions enforced. The data foundation is the governed lakehouse that makes retrieval trustworthy. And cutting across everything: security, governance, and evaluation/observability/FinOps — the three disciplines whose absence explains most production failures.
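The retrieval and model layers described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the `Document` type, the keyword-overlap ranking, the group-based permission check, and the routing rule with its model names are all hypothetical, standing in for a real vector store, identity system, and router.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    allowed_groups: set = field(default_factory=set)


def retrieve(query: str, documents, user_groups: set, top_k: int = 3):
    """Keyword-overlap retrieval that filters by permission BEFORE ranking."""
    query_terms = set(query.lower().split())
    # Drop anything the user may not see, so it can never reach the prompt.
    visible = [d for d in documents if d.allowed_groups & user_groups]
    scored = [(len(query_terms & set(d.text.lower().split())), d) for d in visible]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[:top_k]]


def route_model(query: str, context_docs) -> str:
    """Toy router: short questions over at most one document go to the small model."""
    if len(query.split()) <= 12 and len(context_docs) <= 1:
        return "small-model"
    return "frontier-model"


DOCS = [
    Document("hr-1", "salary bands for engineering staff", {"hr"}),
    Document("kb-1", "how to reset a vpn password", {"hr", "support"}),
]

docs = retrieve("reset vpn password", DOCS, user_groups={"support"})
print([d.doc_id for d in docs], route_model("reset vpn password", docs))
```

Filtering before ranking, rather than after generation, means restricted content is excluded from the model's context entirely instead of being redacted from its answer.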

Technology Considerations

The architecture above is deliberately platform-neutral; it can be realized on any major ecosystem. The right choice depends on your existing estate, regulatory posture, and skills — not on vendor preference.

  • Microsoft / Azure: Azure AI Foundry and Azure OpenAI for models, Microsoft Fabric and OneLake for the data foundation, Purview for governance, and the Power Platform / Copilot ecosystem for embedded experiences. A strong fit where Microsoft 365 and Azure are already entrenched.
  • Databricks: Mosaic AI for model serving and evaluation, Unity Catalog for unified governance and lineage, MLflow for lifecycle management, and Delta Lake for the lakehouse. A strong fit for data- and ML-heavy organizations wanting one governed platform.
  • AWS: Amazon Bedrock for managed foundation models, SageMaker for custom ML, and a Glue/Redshift/Athena/S3 data foundation with Lake Formation governance.
  • SAP-centric estates: SAP Business Data Cloud, Datasphere, and Joule for organizations whose system of record is SAP, keeping AI close to governed business data and processes.
  • Open source and on-premise / sovereign: Orchestration with frameworks such as LangChain or LlamaIndex, open vector stores, open guardrail libraries, and open models served on-premise — relevant where data sovereignty is non-negotiable, as it often is for government, defense, and parts of the BUMN landscape.

For the Indonesian enterprise specifically: weigh data residency and sovereignty (Personal Data Protection Law, sector rules such as OJK requirements in financial services), the availability of in-country cloud regions, language and domain performance in Bahasa Indonesia, and the realistic depth of internal AI engineering skills. These constraints should shape the build-vs-buy-vs-partner decision deliberately, not by default.

Business Benefits

When GenAI is productionized through this kind of system rather than left in pilot purgatory, the benefits become measurable and defensible:

  • Cost reduction in back-office and operational processes — fewer manual touches, reduced outsourcing, compressed cycle times.
  • Revenue protection and growth through faster service, better cross-sell relevance, and shorter sales and underwriting cycles.
  • Productivity uplift where it can be measured — documented case-by-case rather than assumed across the board.
  • Risk reduction through consistent, auditable, policy-compliant outputs in regulated workflows.
  • Faster time-to-value for the next use case, because the foundation, guardrails, and operating model are reusable. The second and third use cases are dramatically cheaper than the first — this is where the compounding return lives.

A disciplined note on numbers: published productivity figures (often cited in the 10–50% range for specific tasks) are illustrative of potential, not guarantees. Credible programs commit to measuring their own baseline and uplift per use case rather than importing someone else's benchmark into a business case.

KPIs Impacted

Productionizing GenAI should move a defined set of metrics. Make these explicit before you build:

Business KPIs

  • Cost-to-serve / cost per transaction
  • Cycle time (e.g., case resolution, document turnaround, time-to-decision)
  • Revenue per employee or per channel
  • Customer satisfaction (CSAT/NPS) and first-contact resolution
  • Risk and compliance exceptions

AI Operational KPIs

  • Pilot-to-production conversion rate (the headline measure of organizational AI maturity)
  • Answer accuracy / groundedness (from the evaluation harness)
  • Hallucination and escalation rates
  • Adoption and active usage within the target workflow
  • Unit economics: cost per request / per resolved task

If you cannot name the KPI a use case is meant to move, it is not yet ready to leave the whiteboard.
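The headline operational KPI, pilot-to-production conversion, is simple arithmetic over a funnel. The sketch below applies it to the 60% / 20% / 5% figures cited in the Executive Summary, expressed per 100 organizations. The `conversion_rates` helper and the stage names are hypothetical.

```python
def conversion_rates(funnel):
    """Stage-to-stage conversion for an ordered list of (stage, count) pairs."""
    rates = {}
    for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
        rates[f"{prev_name} -> {name}"] = count / prev_count if prev_count else 0.0
    return rates


# Per 100 organizations, using the figures quoted in the Executive Summary.
funnel = [("evaluated", 60), ("piloted", 20), ("production", 5)]
rates = conversion_rates(funnel)
for step, rate in rates.items():
    print(f"{step}: {rate:.0%}")
```

On those figures, one in three evaluations becomes a pilot and one in four pilots reaches production. Tracking the same ratios for your own portfolio gives the baseline against which the roadmap can be judged.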

Implementation Roadmap

A practical, phased path — informed by the finding that top performers moved from pilot to full implementation in around 90 days by staying narrow and disciplined.

Phase 0 — Foundation & Prioritization (0–3 months). Run an AI-readiness assessment (data, platform, governance, skills). Score and select 1–2 high-value, feasible use cases. Stand up the minimum data, retrieval, and governance foundation. Define success metrics and baselines.

Phase 1 — First Production Use Case (3–6 months). Build one use case all the way to production, not to demo. Implement the evaluation harness, observability, human-in-the-loop, and FinOps controls. Prove value against the pre-defined KPI. Capture reusable components.

Phase 2 — Scale & Reuse (6–12 months). Add use cases on the now-proven foundation, exploiting reusable retrieval, guardrails, and patterns. Formalize the AI platform team and product-squad operating model. Establish the governance and model-risk processes as standard practice.

Phase 3 — Industrialize (12+ months). Make AI a managed capability: a use-case intake and prioritization process, a portfolio view with FinOps, continuous evaluation, and a center of enablement that lets the business self-serve within guardrails.

The discipline that matters most is in Phase 1: resist the temptation to start a dozen pilots. Depth beats breadth.
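The "score and select" step in Phase 0 can be kept deliberately simple. The sketch below multiplies an average value score by an average feasibility score, using the dimensions named earlier in the article. The 1 to 5 scale, the equal weighting, and the three candidate use cases with their scores are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    value: dict        # 1-5 scores: revenue, cost, risk, experience
    feasibility: dict  # 1-5 scores: data_readiness, workflow_fit, risk_tolerance


def mean(scores: dict) -> float:
    return sum(scores.values()) / len(scores)


def priority(uc: UseCase) -> float:
    """Value x feasibility: a weak score on either axis drags the product down."""
    return mean(uc.value) * mean(uc.feasibility)


def shortlist(use_cases, limit=2):
    """Return the top `limit` use cases by priority, highest first."""
    return sorted(use_cases, key=priority, reverse=True)[:limit]


# Hypothetical candidates and scores, for illustration only.
candidates = [
    UseCase("invoice matching",
            {"revenue": 2, "cost": 5, "risk": 4, "experience": 3},
            {"data_readiness": 4, "workflow_fit": 4, "risk_tolerance": 4}),
    UseCase("marketing copy",
            {"revenue": 3, "cost": 2, "risk": 2, "experience": 4},
            {"data_readiness": 3, "workflow_fit": 2, "risk_tolerance": 3}),
    UseCase("contract review",
            {"revenue": 2, "cost": 4, "risk": 5, "experience": 3},
            {"data_readiness": 2, "workflow_fit": 3, "risk_tolerance": 2}),
]

for uc in shortlist(candidates):
    print(f"{uc.name}: {priority(uc):.2f}")
```

Multiplying rather than adding the two axes is a deliberate choice: a high-value use case with unready data should not be able to outrank a solid, feasible one on value alone.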

Common Pitfalls

  • Boiling the ocean — a dozen shallow pilots that each lack the depth to succeed. Fragmentation guarantees failure.
  • Solving for the demo — optimizing the happy path and ignoring the long tail of real inputs.
  • Skipping evaluation — shipping without a way to know whether quality is improving or degrading.
  • Under-investing in data — expecting the model to compensate for siloed, ungoverned, poor-quality data. It cannot.
  • Bolting on governance late — turning compliance into a launch blocker instead of a design input.
  • Ignoring unit economics — a strong demo that becomes unaffordable at production volume.
  • Treating it as a project — shipping once and walking away, then watching the system drift.
  • Building what you should buy — burning scarce engineering capacity on commodity capabilities instead of proprietary advantage.

Conclusion: Strategic Insights

The "95% of pilots fail" headline is not a verdict on generative AI. It is a verdict on how enterprises are approaching it. The 5% that succeed are not the ones with the best model — they are the ones with the best system: disciplined use-case selection, an AI-ready data foundation, rigorous evaluation, deep workflow integration, governance by design, clear ownership, and controlled economics.

For executive leaders, three insights carry the most weight. First, the constraint is the operating model, not the technology — which means it is squarely within your control. Second, value compounds from the foundation — the first production use case is an investment in every subsequent one, so choose it for depth, not show. Third, the window is narrowing — as competitors and regulators move, "another pilot next quarter" is increasingly an expensive form of standing still.

The organizations that will report AI-driven P&L gains are the ones that stop experimenting and start engineering — treating generative AI as a governed, measured, production capability rather than a sequence of impressive demos.

Key Takeaways

  1. ~95% of enterprise GenAI pilots fail to reach measurable production value (MIT NANDA, 2025) — and the cause is the operating model and data foundation, not the model.
  2. Spend where the ROI is: back-office and operational use cases consistently out-return the sales/marketing pilots that absorb most budgets.
  3. AI-ready data is the dominant success factor. Retrieval quality caps every use case built on top of it.
  4. Evaluation and observability are non-negotiable — you cannot scale trust you cannot measure.
  5. Partner-led approaches succeed ~2x more often than pure internal builds. Focus internal effort on proprietary advantage.
  6. Treat GenAI as a product with an owner and a roadmap, supported by a central enablement team — not as a one-off project.
  7. Go deep, not wide: take one high-value use case all the way to production before scaling. Top performers did it in ~90 days.

Frequently Asked Questions

Why do most enterprise generative AI pilots fail to reach production?

Because the barrier is rarely the model. Research from MIT's NANDA initiative found that around 95% of enterprise GenAI pilots show no measurable P&L impact, and the dominant causes are operating-model and engineering issues: poor workflow integration, no evaluation or observability, and data that was never AI-ready.

What does it take to move a GenAI use case from pilot to production?

A deliberate "AI production system": value-based use-case selection, an AI-ready governed data and retrieval foundation, LLMOps with an evaluation harness and observability, deep workflow integration with human-in-the-loop, governance by design, clear product ownership, and FinOps to control token and inference costs.

Where do enterprises get the best ROI from generative AI?

Consistently in back-office and operational automation (reducing manual rework, outsourcing, and cycle times) rather than the sales-and-marketing use cases that absorb most budgets. The most defensible business cases are operational.

What is an "AI-ready data foundation" and why does it matter?

It is a governed lakehouse (medallion structure, catalog, glossary, lineage, quality gates) plus a permission-aware retrieval layer that supplies models with trustworthy context. It is the single highest-leverage investment because retrieval quality caps the performance of every use case built on top.

Should we build our enterprise AI in-house or buy/partner?

The evidence favors partnering for commodity capability: MIT found vendor-led and partnership approaches succeed roughly twice as often as pure internal builds. Reserve scarce internal engineering for your proprietary advantage, not for recreating commodity infrastructure.

How long does it take to get a GenAI use case into production?

Top performers reached full implementation in around 90 days by staying narrow and disciplined: one high-value use case taken all the way through, rather than a dozen shallow pilots.

Ready to move from pilot to production?

Is your generative AI stuck in pilot purgatory? NDS helps enterprises move from experiments to governed, measurable production — starting with a focused Enterprise AI Readiness Assessment that pinpoints where your highest-value, most-feasible use case lives and what foundation it needs.

Talk to our team about a practical path from pilot to production →  |  Explore our AI Solutions.
