6 weeks
Single-tool, eval, kill-switch, monitoring
- Eval set with passing threshold
- Kill-switch on cost
- 30-day post-launch guarantee
Six weeks for single-tool agents. Ten for multi-tool. Fourteen for compliance-scoped. Below six weeks, you ship a demo. Above fourteen, scope has expanded into platform territory.
Phase breakdown
Implementation is rarely the longest phase. Eval set design and observability consistently are.
| Phase | Single-tool | Multi-tool | Compliance-heavy |
|---|---|---|---|
| Discovery + scope | 3 to 5 days | 5 to 7 days | 7 to 10 days |
| Design + prompt engineering | 1 week | 2 weeks | 2 to 3 weeks |
| Implementation | 2 to 3 weeks | 3 to 4 weeks | 4 to 5 weeks |
| Eval set + tuning | 1 week | 2 weeks | 2 to 3 weeks |
| Observability | 3 to 5 days | 1 week | 1 to 2 weeks |
| Compliance review | — | — | 1 to 2 weeks |
| 30-day monitoring | Parallel | Parallel | Parallel |
| Total | 6 weeks | 10 weeks | 14 weeks |
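As a sanity check on the totals, here is a minimal sketch that sums the phase ranges from the table. It assumes a five-day working week, phases that run back to back, and that 30-day monitoring is excluded because it runs in parallel. The day counts are conversions of the table's ranges, and `PHASES` and `total_weeks` are illustrative names.

```python
# Phase estimates as (low, high) in working days; assumes 1 week = 5 days.
# Order: discovery, design, implementation, eval, observability[, compliance review].
PHASES = {
    "single-tool": [(3, 5), (5, 5), (10, 15), (5, 5), (3, 5)],
    "multi-tool": [(5, 7), (10, 10), (15, 20), (10, 10), (5, 5)],
    "compliance-heavy": [(7, 10), (10, 15), (20, 25), (10, 15), (5, 10), (5, 10)],
}


def total_weeks(phases, days_per_week=5):
    """Sum the low and high ends of every phase and convert days to weeks."""
    low = sum(lo for lo, _ in phases)
    high = sum(hi for _, hi in phases)
    return low / days_per_week, high / days_per_week


for scope, phases in PHASES.items():
    low, high = total_weeks(phases)
    print(f"{scope}: {low:.1f} to {high:.1f} weeks")
# single-tool: 5.2 to 7.0 weeks
# multi-tool: 9.0 to 10.4 weeks
# compliance-heavy: 11.4 to 17.0 weeks
```

The stated totals of 6, 10, and 14 weeks each fall inside the summed range for their column.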
Production-ready
If any of these are missing, you have a demo, not a product. The 6-week floor is a function of all five.
Without a versioned eval set, every prompt change is a guess and every quality regression is invisible. Use RAGAS or a custom harness, and target a pass rate of at least 80%.
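A pass-rate gate can be very small. The sketch below is a toy custom harness, not the RAGAS API: `EvalCase`, `pass_rate`, `meets_threshold`, and the stub agent are all hypothetical names, and the five cases stand in for a real versioned eval set.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    passes: Callable[[str], bool]  # True when the agent's output is acceptable


def pass_rate(cases: Sequence[EvalCase], agent: Callable[[str], str]) -> float:
    """Run every case through the agent and return the fraction that pass."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if case.passes(agent(case.prompt)))
    return passed / len(cases)


def meets_threshold(cases, agent, threshold=0.80) -> bool:
    """Gate a release on the pass rate reaching the threshold."""
    return pass_rate(cases, agent) >= threshold


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption for this sketch)."""
    return prompt.strip().lower()


CASES = [
    EvalCase("refund-01", " REFUND ", lambda out: out == "refund"),
    EvalCase("greet-01", "Hello", lambda out: out == "hello"),
    EvalCase("order-01", "Order 42", lambda out: "42" in out),
    EvalCase("case-01", "Keep Case", lambda out: out == "Keep Case"),  # fails on purpose
    EvalCase("empty-01", "   ", lambda out: out == ""),
]

print(pass_rate(CASES, stub_agent))        # 0.8
print(meets_threshold(CASES, stub_agent))  # True
```

Keeping the cases in version control next to the prompt is what makes the number comparable from one change to the next.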
Cost kill-switch: a per-feature ceiling, automatic shutoff, and alerts at 50%, 80%, and 100% of the ceiling. Five lines of middleware. Without it, one bad day costs $5K to $50K.
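The ceiling, shutoff, and 50/80/100% alerts above can be sketched as follows. This is one possible shape under stated assumptions: `CostKillSwitch` and `BudgetExceeded` are hypothetical names, one instance tracks one feature, and the caller supplies the cost of each request.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a feature has reached its spending ceiling."""


class CostKillSwitch:
    """Tracks spend for one feature against a hard ceiling (hypothetical helper)."""

    def __init__(self, ceiling_usd, alert=print, thresholds=(0.5, 0.8, 1.0)):
        self.ceiling_usd = ceiling_usd
        self.alert = alert  # assumption: any callable that accepts a message string
        self.thresholds = sorted(thresholds)
        self.spent_usd = 0.0
        self._fired = set()

    def guard(self):
        """Call before every model request; refuses once the ceiling is reached."""
        if self.spent_usd >= self.ceiling_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.ceiling_usd:.2f}"
            )

    def record(self, cost_usd):
        """Call after every model request; fires each threshold alert once."""
        self.spent_usd += cost_usd
        used = self.spent_usd / self.ceiling_usd
        for threshold in self.thresholds:
            if used >= threshold and threshold not in self._fired:
                self._fired.add(threshold)
                self.alert(f"budget at {threshold:.0%} (${self.spent_usd:.2f})")
```

In use, the middleware calls `switch.guard()` before each request and `switch.record(cost)` after it. A production version would also need persistent storage and a reset policy for the counter, which this sketch leaves out.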
Every call logged with inputs, outputs, latency, cost, model version. Searchable. You cannot debug what you cannot see.
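A minimal sketch of that log record, assuming one JSON line per call. `logged_call`, `cost_usd_fn`, and `sink` are hypothetical names; the sink could be a file writer, a queue, or whatever your log store ingests.

```python
import json
import time
from datetime import datetime, timezone


def logged_call(model_fn, prompt, *, model_version, cost_usd_fn, sink):
    """Call the model and emit one searchable JSON record for the call."""
    start = time.perf_counter()
    output = model_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input": prompt,
        "output": output,
        "latency_ms": round(latency_ms, 2),
        # assumption: the caller knows how to price a request/response pair
        "cost_usd": cost_usd_fn(prompt, output),
    }
    sink(json.dumps(record))
    return output
```

One flat record per call keeps every field listed above queryable without parsing free text.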
Provider outage, rate limit, timeout. The agent should degrade gracefully, not crash. Vendor parity is the cleanest fallback.
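Graceful degradation can be sketched as an ordered list of providers with a canned reply as the last resort. Everything here is illustrative: `ProviderError` stands in for whatever outage and rate-limit errors your client library raises, and `call_with_fallback` is a hypothetical helper.

```python
class ProviderError(Exception):
    """Hypothetical error type standing in for outages and rate limits."""


def call_with_fallback(providers, prompt, degraded_reply="We are busy right now. Please retry shortly."):
    """Try each (name, callable) in order; degrade to a canned reply if all fail.

    Returns (source, reply, failures), where failures lists the providers
    that errored before a reply was produced.
    """
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt), failures
        except (ProviderError, TimeoutError) as exc:
            failures.append((name, str(exc)))
    return "degraded", degraded_reply, failures
```

The fallback only works if the second provider passes the same eval set as the first, which is the practical meaning of parity here.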
What does the agent do? What are its tools? What does it cost? How do we roll back? One page, owned from day one by the team running it.
Worked example
Single-tool: Help Scout API access · 6 weeks
- Week 1: Discovery, scope, kill-switch design
- Week 2: Prompt design, eval set draft (30 cases)
- Week 3: Implementation, Help Scout integration
- Week 4: Eval run, prompt iteration, observability
- Week 5: User testing with 5 customers, fixes
- Week 6: Production deployment, 30-day monitoring
Compliance-heavy multi-tool · 14 weeks
- Weeks 1-2: Discovery + compliance scoping (HIPAA, EHR access)
- Weeks 3-4: Prompt + tool design with privacy review
- Weeks 5-7: Implementation, BAA signed with vendor
- Weeks 8-9: Eval set with clinician validation
- Weeks 10-11: Observability with audit log
- Weeks 12-13: Compliance final review, sign-off
- Week 14: Deployment, 30-day monitoring
Timeline killers
Each of these is small at the start. Stacked, they double the calendar.
“It should be smart” is not a target. Define a measurable threshold (eval pass rate, latency budget, cost per query) before week one.
Prompt changes shipped without re-running the eval set look fine until users complain. Re-run the suite on every prompt change, with no exceptions for "small" edits.
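One way to make the re-run hard to skip is to fingerprint the prompt and compare it with the fingerprint stored alongside the last eval run. This is a sketch under that assumption; `prompt_fingerprint` and `needs_eval_rerun` are hypothetical names, and where the last fingerprint is stored is up to you.

```python
import hashlib


def prompt_fingerprint(prompt_text: str) -> str:
    """Stable SHA-256 fingerprint of the exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def needs_eval_rerun(prompt_text: str, last_evaluated_fingerprint: str) -> bool:
    """True when the prompt differs from the one the eval suite last ran against."""
    return prompt_fingerprint(prompt_text) != last_evaluated_fingerprint
```

A deploy script can refuse to ship while `needs_eval_rerun` returns `True`, so a one-word prompt edit triggers the same check as a rewrite.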
Each new tool adds 1 to 2 weeks: error handling, eval cases, sandbox tests. “Just one more tool” rarely costs less than a sprint.
A missing kill-switch gets caught at the first cost spike, usually 4 to 6 weeks in. It adds a week of unplanned work and one shaky weekend.
BAA negotiation, audit log requirements, validation cycles. If compliance starts in week 10 of a 14-week plan, you are now on an 18-week plan.
Decision framework
| If you need | Realistic timeline |
|---|---|
| Demo for investors | 2 to 3 weeks (not production) |
| Single-tool production agent | 6 weeks |
| Multi-tool production agent | 10 weeks |
| Compliance-scoped agent | 14 weeks |
| Agent platform | 6+ months |
FAQ
Six weeks for a single-tool agent on existing infrastructure. Ten weeks for a multi-tool agent with custom evaluation. Fourteen weeks for a compliance-scoped agent (HIPAA, SOC 2, financial). Below 6 weeks, you ship a demo, not a production system.
Yes, but production-ready becomes demo-ready. Eval, kill-switch, observability, and 30-day monitoring are what take a demo to production. Cutting these to ship in 4 weeks usually costs 8 weeks of remediation in months 3 to 6.
BAA or DPA negotiation with the model provider, audit log requirements, validation cycles with domain experts (clinicians, lawyers, security teams), and final sign-off. Each of these adds 1 to 2 weeks. They do not parallelise as much as people hope.
Six months or more. An agent platform is a different problem from a single agent: you are building the infrastructure to ship many agents, not one. Most teams start with one agent and slip into platform scope by accident.
Yes. Eval data takes longer than implementation in most cases: ground truth labelling, edge cases, regression cases, and re-runs. Skip it and you will not know when quality regresses. Plan 25 to 30% of total time for eval design and tuning.
Pre-built kill-switch, observability scaffold, eval harness, and provider-parity layer. We bring the platform; you bring the use case. Without this scaffolding, the same scope takes 10 to 12 weeks at most teams.
Free, 48-hour SLA, no sales call
The free Product Audit returns a scoped engagement with a week-by-week plan, kill-switch design, eval set outline, and one “don't build this” recommendation.