GenAI Pilots vs Production: What Actually Changes
The Problem: A Pilot That Will Not Scale
The pilot was a success. A small team built a workflow on top of an LLM, ran it against a curated set of inputs, showed the results to stakeholders, and got the green light to roll it out broadly.
Six weeks into the rollout, things look different. The cost is three times higher than projected. The accuracy on real-world inputs is noticeably worse than on the pilot dataset. Security has questions that nobody anticipated. A model provider update changed behavior overnight. The team that built the pilot is overwhelmed handling production tickets, and no new features are shipping.
This is the gap between “the model works” and “the system works.” Pilots are about the model. Production is about the system around the model — and the system is almost always more work than the model.
What Actually Changes
Five things shift between a pilot and a production deployment. Most teams underestimate at least three of them.
1. Inputs Stop Being Curated
In the pilot, the team chose the inputs. They came from a known source, were cleaned up, and represented the workflow as it ideally exists. In production, inputs come from real users and real upstream systems. They are messy, ambiguous, sometimes adversarial, and frequently outside the distribution the pilot tested.
A model that handles 95 percent of curated inputs may handle only 70 percent of real ones. The fix is not a better model. It is input validation, fallback paths, and an evaluation set that reflects the production distribution rather than the pilot's.
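As a rough illustration of input validation with a fallback path, the sketch below routes each input either to the model or to a non-model path. The `MAX_CHARS` limit, the 0.9 printable-character threshold, and the `Routed` type are assumptions for the example, not recommended values.

```python
from dataclasses import dataclass

MAX_CHARS = 8000  # assumed limit; tune to your model's context budget


@dataclass
class Routed:
    route: str   # "model" or "fallback"
    reason: str


def route_input(text: str) -> Routed:
    """Decide whether an input goes to the model or to a fallback path."""
    stripped = text.strip()
    if not stripped:
        return Routed("fallback", "empty input")
    if len(stripped) > MAX_CHARS:
        return Routed("fallback", "input too long")
    # Share of printable characters; binary or garbled payloads score low.
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    if printable / len(stripped) < 0.9:
        return Routed("fallback", "mostly non-printable")
    return Routed("model", "ok")
```

The fallback can be a human queue, a rule-based handler, or a polite refusal. Logging the `reason` field also shows which kinds of real inputs the pilot dataset never covered.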
2. Evaluation Has to Become Continuous
Pilot evaluation looks like a spreadsheet: the team rates a few hundred outputs, computes accuracy, and reports the number. Production evaluation has to be ongoing, automated where possible, and tied to the specific failure modes that matter for the use case.
That usually means a labeled evaluation set that gets re-run on every model or prompt change, automated checks for known failure patterns, and a sampling process where humans review a small fraction of live outputs and feed corrections back into the system.
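A minimal sketch of the "re-run on every change" part might look like the following. The labeled examples, the `keyword_stub` stand-in for a model call, and the 0.02 tolerance are all hypothetical; a real harness would call the deployed model and use a much larger set drawn from production traffic.

```python
from typing import Callable

# Hypothetical labeled examples: (input, expected label).
EVAL_SET = [
    ("refund for order 123", "billing"),
    ("app crashes on login", "technical"),
    ("how do I change my plan", "billing"),
    ("password reset email never arrives", "technical"),
]


def evaluate(predict: Callable[[str], str], eval_set) -> float:
    """Return the fraction of examples where predict matches the label."""
    correct = sum(predict(text) == label for text, label in eval_set)
    return correct / len(eval_set)


def passes_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Block a change if quality drops more than `tolerance` below baseline."""
    return score >= baseline - tolerance


def keyword_stub(text: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    billing_words = ("refund", "plan", "invoice")
    return "billing" if any(w in text for w in billing_words) else "technical"
```

Wired into CI, `passes_gate(evaluate(new_predict, EVAL_SET), baseline)` becomes the check that a prompt or model change has to clear before it ships.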
3. Cost Becomes a First-Class Concern
In a pilot, you process a thousand inputs and the cost is a rounding error. In production, you process a million inputs and the cost shows up on someone’s budget review. Token usage compounds across retries, multi-step chains, and retrieval contexts.
Production deployments need cost monitoring at the request level, caching for repeated queries, model tiering (cheap model for easy inputs, expensive model for hard ones), and a clear understanding of what each business outcome costs in inference dollars.
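Caching and model tiering can be sketched in a few lines. The model names, the length-based routing rule, and `call_model` are placeholders; in practice the "easy versus hard" decision is usually something richer than input length, such as a classifier or a confidence score.

```python
from functools import lru_cache

# Hypothetical model names and routing rule; substitute your own.
CHEAP_MODEL = "small-model"
EXPENSIVE_MODEL = "large-model"


def pick_model(text: str, hard_threshold: int = 500) -> str:
    """Route short inputs to the cheap tier, long ones to the expensive tier."""
    return CHEAP_MODEL if len(text) < hard_threshold else EXPENSIVE_MODEL


calls = {"count": 0}


def call_model(model: str, text: str) -> str:
    # Placeholder for a real provider call; counts invocations for the demo.
    calls["count"] += 1
    return f"[{model}] {text[:20]}"


@lru_cache(maxsize=10_000)
def answer(text: str) -> str:
    """Cache by exact input text so repeated queries cost nothing extra."""
    return call_model(pick_model(text), text)
```

An in-process `lru_cache` only helps within one worker and only for exact repeats. A shared cache keyed on normalized input is the more likely production shape, but the cost logic is the same.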
The math that justified the pilot at small scale rarely survives multiplication by production volume. Run the numbers at full scale before you commit to a launch date.
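Running the numbers can be as simple as the arithmetic below. Every figure here (token counts, per-1,000-token prices, and the 1.5 calls-per-request multiplier for retries and chain steps) is invented for illustration and should be replaced with measured values from your own traffic and your provider's price list.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float,
                     calls_per_request: float = 1.0) -> float:
    """Dollar cost of one business request, including retries and chain steps."""
    per_call = ((input_tokens / 1000) * price_in_per_1k
                + (output_tokens / 1000) * price_out_per_1k)
    return per_call * calls_per_request


def projected_cost(requests: int, per_request: float) -> float:
    """Total dollar cost for a given request volume."""
    return requests * per_request


# Illustrative numbers only; real prices and token counts will differ.
per_request = cost_per_request(
    input_tokens=2000, output_tokens=500,
    price_in_per_1k=0.003, price_out_per_1k=0.015,
    calls_per_request=1.5,  # average model calls per request (retries, chains)
)
pilot = projected_cost(1_000, per_request)
production = projected_cost(1_000_000, per_request)
```

With these made-up inputs, a thousand requests cost about $20 and a million cost about $20,000. The per-request figure is the same; only the volume changed, which is why a pilot budget says little about a production one.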
4. Change Control Catches Up
In a pilot, the team owns the model version and the prompts. In production, both are software that needs the same discipline as any other production code: version control, automated tests, staged rollouts, a rollback plan, and a policy for pinning model versions. Without an evaluation harness running on every change, a quality regression looks identical to a quiet day until a downstream consumer notices.
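One way to treat the model version and prompt as versioned software is to bundle them into an immutable release object with a stable fingerprint. The `PromptRelease` type and the "aliases end in latest" naming convention are assumptions for this sketch; check how your provider actually names pinned versus floating versions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRelease:
    model: str          # exact pinned version, never a floating alias
    prompt: str
    temperature: float

    def fingerprint(self) -> str:
        """Stable hash so logs and eval results can name the exact release."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def is_pinned(model: str) -> bool:
    # Assumed convention: floating aliases end in "latest".
    # Adapt this check to your provider's naming scheme.
    return not model.endswith("latest")
```

Stamping every logged request and every evaluation run with the fingerprint makes it possible to say which release produced a bad output, and rolling back becomes redeploying a previous release object.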
5. Security and Compliance Show Up
In a pilot, the team is often working with synthetic data or a small slice of real data, and the security review is light. In production, you are sending real customer data, real internal documents, or real proprietary code to a model provider, and every relevant control needs to be in place: data handling agreements, PII redaction, audit logging, and access control on the system itself.
Prompt injection deserves its own line item. Any text that flows into the model (user messages, retrieved documents, tool responses, even fields on records the model is summarizing) can carry instructions that try to override the system prompt, exfiltrate context, or trigger unintended actions. Defenses range from input sanitization and an instruction hierarchy in the system prompt to running outputs through a separate classifier before they reach downstream actions. The exposure scales with what the model is allowed to do: a model that drafts text for human review has a smaller blast radius than a model with tool access to real systems.
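Because exposure scales with what the model can do, one inexpensive control is to gate tool calls outside the model. The sketch below is a deny-by-default allowlist with a human-approval hold for risky actions. The tool names are hypothetical, and this limits blast radius rather than preventing injection itself.

```python
READ_ONLY_TOOLS = {"search_docs", "get_ticket"}   # hypothetical tool names
NEEDS_APPROVAL = {"send_email", "issue_refund"}   # hypothetical tool names


def authorize(tool: str, approved_by_human: bool = False) -> str:
    """Return "allow", "hold", or "deny" for a tool call the model requested."""
    if tool in READ_ONLY_TOOLS:
        return "allow"
    if tool in NEEDS_APPROVAL:
        return "allow" if approved_by_human else "hold"
    return "deny"  # anything not explicitly listed is refused
```

The check runs in ordinary application code, so text injected into the prompt cannot talk its way past it. The worst an injected instruction can do is request an action that then gets held or denied.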
Plan for this work as part of the production effort, not as a blocker discovered late. Security review timelines often dwarf the engineering timeline for the deployment itself.
A Pre-Production Checklist
Before you call a GenAI system production-ready, you want concrete answers to:
- What does an evaluation set that reflects real input distribution look like, and does the current system meet quality bars on it?
- What is the cost per business outcome at projected production volume, and what is the cost ceiling that triggers a redesign?
- How are model versions pinned, and what is the process when the provider ships an update?
- What fails when the model is down or slow, and is that failure mode acceptable to the business?
- Has security reviewed the data flow, and are PII handling and audit logging in place?
If you cannot answer those five questions, the system is still a pilot, no matter how many users it has.
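The fourth question, what fails when the model is down or slow, is easier to answer once the failure path is written down. A minimal sketch using the standard library's thread pool follows; `call_with_fallback`, the timeout value, and the fallback string are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, arg, timeout_s: float, fallback):
    """Run fn(arg); return `fallback` if it is too slow or raises."""
    future = _pool.submit(fn, arg)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort: a call that is already running cannot be interrupted.
        future.cancel()
        return fallback
    except Exception:
        # Provider errors take the same degraded path as timeouts.
        return fallback
```

A call such as `call_with_fallback(model_call, text, 5.0, "queued for manual review")` makes the degraded behavior explicit, which is the thing the business has to sign off on.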
Moving Forward
Most GenAI work that fails in enterprises does not fail because the model is wrong. It fails because the system around the model was scoped as if it were a demo.
If you have a pilot that is working and a production rollout that is starting to look harder than expected, our AI and GenAI practice focuses on exactly that transition. Reach out if it would help to talk through what stands between you and a launch.