1) Shipping “chat” instead of a system of action
- Symptom: Demos that talk, not do; customers can’t complete work end‑to‑end.
- Fix: Bind outputs to typed tool‑calls (JSON‑schema actions) with simulation, approvals, idempotency, and rollback. Measure successful actions and reversals, not messages.
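A minimal sketch of a typed action with simulation, idempotency, and rollback, assuming a single hypothetical "refund" action. The schema, the `RefundExecutor` class, and the in-memory stores are illustrative stand-ins, not a reference to any particular framework.

```python
# Hypothetical action schema: field name -> required Python type.
REFUND_SCHEMA = {"order_id": str, "amount_cents": int}


def validate(payload: dict, schema: dict) -> list:
    """Return validation errors; an empty list means the payload is valid."""
    errors = [f"missing field: {k}" for k in schema if k not in payload]
    errors += [f"unexpected field: {k}" for k in payload if k not in schema]
    errors += [
        f"wrong type for {k}"
        for k, t in schema.items()
        if k in payload and not isinstance(payload[k], t)
    ]
    return errors


class RefundExecutor:
    """In-memory stand-in for a system of record (illustrative only)."""

    def __init__(self):
        self.refunded = {}  # order_id -> total refunded cents
        self.applied = {}   # idempotency key -> payload that was applied

    def execute(self, key: str, payload: dict, dry_run: bool = False) -> dict:
        errors = validate(payload, REFUND_SCHEMA)
        if errors:
            return {"status": "rejected", "errors": errors}
        if dry_run:
            # Simulation: report the effect without changing state.
            return {"status": "simulated", "amount_cents": payload["amount_cents"]}
        if key in self.applied:
            # Same idempotency key seen before: do not apply twice.
            return {"status": "duplicate"}
        oid = payload["order_id"]
        self.refunded[oid] = self.refunded.get(oid, 0) + payload["amount_cents"]
        self.applied[key] = payload
        return {"status": "applied"}

    def rollback(self, key: str) -> bool:
        """Undo a previously applied action; returns False if nothing to undo."""
        payload = self.applied.pop(key, None)
        if payload is None:
            return False
        self.refunded[payload["order_id"]] -= payload["amount_cents"]
        return True
```

Counting `applied` results against `rollback` calls gives the "successful actions and reversals" measure directly, rather than counting messages.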
2) Unpermissioned or stale retrieval (RAG)
- Symptom: Hallucinated or outdated answers; cross‑tenant data risks.
- Fix: Apply ACLs pre‑embedding and at query time; store provenance, timestamps, and jurisdictions; refuse on low/conflicting evidence; show citations in‑product.
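One way to sketch the query-time half of this fix, assuming retrieved chunks already carry tenant, group, provenance, and score metadata. The `Chunk` fields and the `min_score` threshold are assumptions for illustration, and the sketch covers only low evidence, not conflicting evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant: str
    allowed_groups: frozenset
    source: str       # provenance, e.g. a document ID
    updated_at: str   # ISO date of the source
    score: float      # retrieval similarity, assumed to lie in [0, 1]


def answer_context(chunks, tenant: str, user_groups: set,
                   min_score: float = 0.75, min_hits: int = 1) -> dict:
    """Filter by ACL first, then refuse if too little evidence remains."""
    permitted = [
        c for c in chunks
        if c.tenant == tenant and c.allowed_groups & user_groups
    ]
    evidence = sorted(
        (c for c in permitted if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    if len(evidence) < min_hits:
        return {"refused": True, "citations": []}
    return {
        "refused": False,
        "citations": [(c.source, c.updated_at) for c in evidence],
    }
```

Filtering before scoring means a high-scoring chunk from another tenant can never reach the prompt, even if the index itself is shared.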
3) Free‑text actions to production systems
- Symptom: Silent data corruption, irrecoverable mistakes.
- Fix: Never pass free‑text model output directly to a production system. Enforce schema validation, policy‑as‑code gates (eligibility, limits, egress/residency), simulations, and approvals; keep instant undo.
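Policy-as-code gates can be as simple as small functions that each return a denial reason or nothing. The sketch below assumes hypothetical action fields (`amount_cents`, `region`) and context keys; the approval threshold is likewise an illustrative value.

```python
from typing import Callable, Optional

# A policy inspects (action, context) and returns a denial reason or None.
Policy = Callable[[dict, dict], Optional[str]]


def within_limit(action: dict, ctx: dict) -> Optional[str]:
    if action["amount_cents"] > ctx["refund_limit_cents"]:
        return "amount exceeds refund limit"
    return None


def region_allowed(action: dict, ctx: dict) -> Optional[str]:
    if action["region"] not in ctx["allowed_regions"]:
        return "region not permitted"
    return None


def evaluate(action: dict, ctx: dict, policies: list,
             approval_threshold_cents: int = 5000) -> tuple:
    """Return (decision, reasons): 'deny', 'needs_approval', or 'allow'."""
    denials = [msg for p in policies if (msg := p(action, ctx)) is not None]
    if denials:
        return ("deny", denials)
    if action["amount_cents"] >= approval_threshold_cents:
        # Consequential but permitted: route to a human approver.
        return ("needs_approval", [])
    return ("allow", [])
```

Because every gate returns a reason string, the list of denials can be written straight into a decision log.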
4) “Big model everywhere” and cost blowups
- Symptom: Margins collapse as usage grows; latency spikes.
- Fix: Small‑first routing for classify/extract/rank; escalate only when needed. Cache embeddings/snippets/results; cap variants; separate interactive vs batch; enforce budgets and alerts.
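Small-first routing with a result cache might look like the sketch below. The two stub models and the `min_confidence` threshold are placeholders; real models would call an inference endpoint and the cache would need tenant scoping and TTLs.

```python
def route(text, small_model, large_model, cache, min_confidence=0.8):
    """Return (label, tier) where tier is 'cache', 'small', or 'large'."""
    if text in cache:
        return cache[text], "cache"
    label, confidence = small_model(text)
    tier = "small"
    if confidence < min_confidence:
        # Escalate only when the small model is unsure.
        label, _ = large_model(text)
        tier = "large"
    cache[text] = label
    return label, tier


# Stub models for illustration only.
def small_stub(text):
    return ("refund", 0.95) if "refund" in text else ("other", 0.40)


def large_stub(text):
    return ("billing_dispute", 0.90)
```

Logging the returned tier per request yields the router mix and cache hit rate needed for cost dashboards.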
5) No golden evals or CI gates
- Symptom: Regressions in grounding, JSON validity, or safety after each release.
- Fix: Add golden evals for grounding/citations, JSON/action validity, refusal/safety, domain tasks, and fairness slices. Block releases on regressions; keep contract tests for connectors.
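A release gate over golden-eval metrics can be sketched as a comparison against a stored baseline. The metric names and the `tolerance` value below are assumptions; a CI job would fail the build whenever `gate` returns a non-empty list.

```python
import json


def json_validity_rate(outputs: list) -> float:
    """Fraction of model outputs that parse as JSON."""
    if not outputs:
        return 0.0
    valid = 0
    for raw in outputs:
        try:
            json.loads(raw)
            valid += 1
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            pass
    return valid / len(outputs)


def gate(current: dict, baseline: dict, tolerance: float = 0.01) -> list:
    """Return names of metrics that regressed beyond the tolerance."""
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    ]
```

A metric missing from `current` is treated as 0.0, so dropping an eval from the suite also blocks the release.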
6) Ignoring reversal and appeal rates
- Symptom: Trust erosion; hidden ops costs.
- Fix: Track reversal/rollback and appeal/complaint rates as first‑class SLOs; require approvals for consequential steps; add counterfactual explanations and appeals flows.
7) Weak privacy and residency posture
- Symptom: Enterprise deals stall; legal exposure.
- Fix: Data minimization/redaction, tenant‑scoped encrypted caches/embeddings with TTLs, “no training on customer data” by default, region pinning or private/VPC inference, DSR automation.
8) Underestimating integration fragility
- Symptom: Breaks when partner APIs change; flaky automations.
- Fix: Contract tests, canary probes, schema/semantic drift detectors, and self‑healing PRs for mappings. Maintain a tool/connector registry with versions and scopes.
9) Over‑automation too early
- Symptom: Irreversible mistakes; operator distrust.
- Fix: Progressive autonomy: suggest → one‑click with preview/undo → unattended only for low‑risk, reversible steps with sustained quality history.
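The three autonomy levels can be expressed as a small decision function. The thresholds (`min_runs`, `max_reversal_rate`) and the risk labels are illustrative assumptions, standing in for whatever "sustained quality history" means for a given workflow.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"
    ONE_CLICK = "one_click"      # human confirms, with preview and undo
    UNATTENDED = "unattended"


def autonomy_mode(reversible: bool, risk: str, runs: int,
                  reversal_rate: float, min_runs: int = 500,
                  max_reversal_rate: float = 0.01) -> Mode:
    """Pick the highest autonomy level the step has earned."""
    if not reversible or risk == "high":
        return Mode.SUGGEST
    if risk == "low" and runs >= min_runs and reversal_rate <= max_reversal_rate:
        return Mode.UNATTENDED
    return Mode.ONE_CLICK
```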
10) No decision logs or auditability
- Symptom: Painful incident investigations; failed audits.
- Fix: Immutable decision logs linking input → evidence → policy gates → action → outcome, with signer identities, timestamps, hashes; exportable evidence packs.
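Tamper evidence for decision logs is often approximated with a hash chain, where each entry commits to the previous one. The sketch below uses SHA-256 over canonical JSON; the record fields are hypothetical, and real signer identities would need actual signatures, which this does not implement.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record linked to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"prev": prev, "record": record}
    log.append({**body, "hash": _digest(body)})


def verify(log: list) -> bool:
    """Recompute every hash; any edit to an earlier entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {"prev": entry["prev"], "record": entry["record"]}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

An exported evidence pack could then be the relevant slice of the log plus the hash of the entry just before it.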
11) Chasing breadth over a wedge
- Symptom: Shallow features across many surfaces; no deep value.
- Fix: Pick 1–2 high‑volume, reversible workflows with clear ROI (e.g., L1 support actions, AP exceptions). Nail them, then expand adjacently.
12) Pricing that doesn’t map to value
- Symptom: Misaligned incentives; surprise bills.
- Fix: Platform + workflow modules; seats for human users; pooled action quotas with hard caps; optional outcome‑linked components where attribution is clean.
13) Neglecting fairness and user harm
- Symptom: Unequal error/exposure rates; reputational risk.
- Fix: Define protected attributes; monitor subgroup parity (TPR/FPR, exposure, uplift). Provide appeals and counterfactuals; cap intervention frequency.
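Subgroup parity on TPR and FPR can be monitored with a few counters per group. This is a sketch assuming binary labels and a hypothetical row format of `(group, y_true, y_pred)`; what gap is acceptable is a policy decision the code does not make.

```python
from collections import defaultdict


def rates_by_group(rows) -> dict:
    """rows: iterable of (group, y_true, y_pred) with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in rows:
        # 't'/'f' = prediction correct or not; 'p'/'n' = predicted class.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        counts[group][key] += 1
    rates = {}
    for group, k in counts.items():
        positives = k["tp"] + k["fn"]
        negatives = k["fp"] + k["tn"]
        rates[group] = {
            "tpr": k["tp"] / positives if positives else None,
            "fpr": k["fp"] / negatives if negatives else None,
        }
    return rates


def max_gap(rates: dict, metric: str) -> float:
    """Largest difference in a metric between any two groups."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values) if values else 0.0
```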
14) Logging raw prompts/outputs
- Symptom: Data leaks through observability tools.
- Fix: Structured logs with field‑level redaction; short retention; break‑glass access with audit; mask PII and secrets at ingest.
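Field-level redaction at ingest can be sketched as a recursive pass over a structured log event. The sensitive key names and the email pattern below are illustrative assumptions; a production redactor would cover many more PII and secret formats.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical set of field names whose values are always dropped.
SENSITIVE_KEYS = {"password", "api_key", "authorization", "ssn"}


def redact(event: dict) -> dict:
    """Return a copy of the event with secrets and emails masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean
```

Running this before events leave the process keeps raw prompts out of downstream observability tools.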
15) No SLOs or degrade modes
- Symptom: Spiky UX, fire‑drills during incidents.
- Fix: Publish p95/p99 targets per surface; circuit breakers; degrade to suggest‑only under stress; maintain kill switches for models/tools.
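A circuit breaker that degrades to suggest-only, plus a kill switch, could be sketched as follows. The failure count, cooldown, and mode names are assumptions; the caller passes in the current time so the logic stays easy to test.

```python
class Breaker:
    """Degrades to suggest-only after repeated failures (illustrative)."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # time of the most recent trip, if any

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            # Each further failure restarts the cooldown window.
            self.opened_at = now

    def mode(self, now: float, kill_switch: bool = False) -> str:
        if kill_switch:
            return "disabled"
        if self.opened_at is not None and now - self.opened_at < self.cooldown_s:
            return "suggest_only"
        return "full"
```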
16) Vendor lock‑in without a plan
- Symptom: Pricing pressure and roadmap risk.
- Fix: Model gateway abstraction; standardized action schemas; portable embeddings or re‑index plan; champion–challenger models; export APIs.
17) Forgetting FinOps and unit economics
- Symptom: Usage grows but margins don’t.
- Fix: Track GPU‑seconds and partner API fees per 1k decisions, along with router mix and cache hit rate. North‑star metric: cost per successful action, trending down by workflow and tenant.
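The unit-economics arithmetic can be written down in a few lines. The sketch assumes a "successful action" means one that completed and was not later reversed, and that cost is GPU time plus partner API fees; both definitions are assumptions to adapt per workflow.

```python
def total_cost(gpu_seconds: float, gpu_rate_per_s: float, api_fees: float) -> float:
    return gpu_seconds * gpu_rate_per_s + api_fees


def cost_per_successful_action(gpu_seconds, gpu_rate_per_s, api_fees,
                               actions, reversals):
    """Cost divided by actions that were not reversed; None if there are none."""
    successful = actions - reversals
    if successful <= 0:
        return None
    return total_cost(gpu_seconds, gpu_rate_per_s, api_fees) / successful


def cost_per_1k_decisions(gpu_seconds, gpu_rate_per_s, api_fees, decisions):
    """Normalize spend to 1,000 decisions; None if there were no decisions."""
    if decisions <= 0:
        return None
    return 1000 * total_cost(gpu_seconds, gpu_rate_per_s, api_fees) / decisions
```

For example, 1,000 GPU-seconds at 0.001 per second plus 4.00 in API fees across 110 actions with 10 reversals gives 5.00 / 100 = 0.05 per successful action.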
18) Security as an afterthought
- Symptom: Prompt‑injection, egress leaks, tool abuse.
- Fix: Instruction firewalls, allowlists, output filters; tool least‑privilege with JIT elevation; idempotency and replay protection; anomaly alerts (tokens/variants/egress).
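Replay protection for tool calls is commonly built from a signed nonce and timestamp. The sketch below uses Python's standard `hmac` and `hashlib` modules; the message layout, `max_age_s`, and the in-memory nonce set are assumptions, and a real deployment would expire stored nonces.

```python
import hashlib
import hmac


def sign(secret: bytes, nonce: str, ts: int, body: str) -> str:
    """HMAC-SHA256 over nonce, timestamp, and body (layout is illustrative)."""
    message = f"{nonce}.{ts}.{body}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    def __init__(self, secret: bytes, max_age_s: int = 300):
        self.secret = secret
        self.max_age_s = max_age_s
        self.seen = set()  # nonces already accepted

    def accept(self, nonce: str, ts: int, body: str,
               signature: str, now: int) -> bool:
        expected = sign(self.secret, nonce, ts, body)
        if not hmac.compare_digest(expected, signature):
            return False  # tampered or wrongly signed
        if abs(now - ts) > self.max_age_s:
            return False  # too old or too far in the future
        if nonce in self.seen:
            return False  # replay of an accepted request
        self.seen.add(nonce)
        return True
```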
19) Weak buyer evidence
- Symptom: Long sales cycles; “nice demo” feedback.
- Fix: Weekly value recaps: actions completed, reversals avoided, time saved, SLO adherence, spend vs budget; include decision‑log snippets.
20) One‑time ethics/compliance reviews
- Symptom: Drift back to risky behavior.
- Fix: Make grounding/JSON/safety/fairness part of CI gates and weekly ops reviews; keep DPIAs/model cards current; run drills and red‑team tests quarterly.
Quick checklists (copy‑ready)
- Trust & safety
  - Citations with timestamps/jurisdiction or refusal
  - Typed actions with simulation, approvals, rollback
  - Policy‑as‑code: eligibility, limits, egress/residency
  - Decision logs and audit exports
- Reliability & cost
  - Small‑first routing, caches, variant caps
  - p95/p99 SLOs, degrade modes, kill switches
  - Budgets/quotas; dashboards for cost per successful action (CPSA), router mix, and cache hit rate
- Privacy & security
  - Minimization/redaction; tenant‑scoped encrypted caches/embeddings; TTLs
  - Residency/VPC, no‑training defaults; DSR automation
  - Injection/egress guards; least‑privilege tools; idempotency
- GTM & focus
  - Narrow wedge with measurable outcome
  - Weekly “what changed” reports; live ROI metrics
  - Pricing tied to actions with hard caps; enterprise add‑ons
Bottom line: Avoid the traps by engineering for outcomes and trust from day one. Ground every decision in permissioned evidence, execute only schema‑validated actions under policy with rollback, run to explicit SLOs and budgets, and tell the story in customer outcomes and cost per successful action, not tokens or hype.