
AI projects fail at the rung, not the model

Most Australian AI pilots stall because Agency is sold to a buyer who's only ready for Automation. Match the rung to the readiness — here are the four gates that separate a demo from a system that ships.

Willie Prosek · 9 min read

Most AI pilots don't fail because the model is wrong. They fail because the rung is wrong — Agency gets pitched to a buyer who's only ready for Automation, and the project quietly dies in the gap. This piece names the four gates that separate a demo from a system that ships, the rung-readiness pattern we see across the Australian mid-market, and a five-question checklist you can walk into Monday with.

The graveyard nobody shows investors

Every board deck in the last eighteen months has had an AI initiative slide. What you don't see on those slides is the count of pilots that quietly died in Q3 and got rebranded as "Phase 1 learning" in Q4.

Industry surveys put the pilot-to-production failure rate somewhere between 70 and 90 percent depending on who's counting. When we audited twelve Australian mid-market organisations that came to us for a second opinion on a stalled pilot, the pattern was consistent:

  • Eleven of twelve had technically working prototypes
  • Nine of twelve had never named which team owned the system in production
  • Seven of twelve had no observability — nobody could answer "how often is this wrong, and how would we find out?"
  • Five of twelve were burning API costs that would have grown 6–20x at production volume, and nobody had modelled that
  • Zero of twelve had a documented audit trail their legal team would sign off on

The pilots weren't failures of technology. They were failures of transition — and the transition is where the real engineering happens.

Why "the rung" matters

We use Anthropic's AI Fluency framework on every engagement: three modes (Automation, Augmentation, Agency) and four skills (Delegation, Description, Discernment, Diligence). It's free, openly licensed (CC BY-NC-SA 4.0), and credited to Joseph Feller and Rick Dakan. We use it; we don't rebadge it.

The rung-mismatch pattern is straightforward.

A buyer who's never run Claude in a production workflow is not ready for Agency. Agency means an agent acting on bounded authority with human oversight by exception. To make that safe you need the diligence layer — audit trail, governance, escalation rules — and the skill to evaluate AI outputs at production volume. Most organisations buying their first AI work haven't built either yet.
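
In practice, "bounded authority with oversight by exception" is a small amount of code sitting on top of a lot of policy work. A minimal sketch of the shape in Python — the thresholds and action names are entirely hypothetical:

```python
from dataclasses import dataclass

# Hypothetical policy: the agent acts alone inside these bounds;
# anything outside them pauses and escalates to a named human.
AUTO_APPROVE_LIMIT_AUD = 500  # illustrative threshold, not a recommendation
ALLOWED_ACTIONS = {"send_quote", "update_contact", "schedule_followup"}

@dataclass
class ProposedAction:
    name: str
    amount_aud: float
    customer_id: str

def within_authority(action: ProposedAction) -> bool:
    """True if the agent may proceed without a human in the loop."""
    return action.name in ALLOWED_ACTIONS and action.amount_aud <= AUTO_APPROVE_LIMIT_AUD

def execute_or_escalate(action: ProposedAction) -> None:
    if within_authority(action):
        ...  # proceed — and write the decision to the audit log (Gate 1)
    else:
        ...  # pause, notify the named owner, wait for explicit approval
```

The code is trivial; deciding what belongs in ALLOWED_ACTIONS and who gets paged is the actual engineering.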

Sold Agency, they get a demo. They sign. The build ships. Then production reveals what was missing — and the project stalls because the foundation under it isn't there.

The fix isn't a different model. It's the right rung. Most first engagements should sit firmly in Automation or Augmentation. Agency comes after the buyer has shipped one of the lower rungs and has lived data on what their workflow actually looks like in production.

Four gates between a pilot and a production system

Gate 1 — Governance (Diligence)

The first question we ask any organisation running an AI pilot: "If a customer asks you tomorrow why the AI made the decision it made about their account, can you answer?"

Most can't. Not because the answer isn't knowable, but because nobody instrumented the system to retain it. The prompt, the retrieved context, the model version, the tool calls, the output, the downstream action — six artefacts that need to live in a tamper-evident log for three to seven years depending on industry.
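
On disk, that's less exotic than it sounds. A minimal sketch of a tamper-evident record, hash-chaining each entry to the previous one so any edit is detectable — the field names are ours, not a standard:

```python
import hashlib
import json
import time

AUDIT_LOG = "audit.jsonl"  # in production: append-only / WORM storage, not a local file

def append_audit_record(prompt, context, model_version, tool_calls, output, action, prev_hash):
    """Append one AI decision to a hash-chained, append-only log."""
    record = {
        "ts": time.time(),
        "prompt": prompt,               # exact prompt sent to the model
        "context": context,             # retrieved documents / RAG chunks
        "model_version": model_version,
        "tool_calls": tool_calls,       # every external call the agent made
        "output": output,               # raw model output
        "action": action,               # downstream action actually taken
        "prev_hash": prev_hash,         # chains this record to the last one
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["hash"]  # feed into the next record
```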

Australian organisations are about to collide with this hard. The 2024 amendments to the Privacy Act (automated decision-making transparency provisions commence 10 December 2026) will require this audit trail for any AI-influenced decision that significantly affects an individual. APRA CPS 234 already imposes similar expectations on regulated financial entities, and the Insurance Council of Australia is scrutinising claims automation.

If your pilot doesn't produce a complete, queryable record of every AI decision, you don't have a pilot — you have a compliance liability wearing a product mask.

Production checklist:

  • Every model inference logged with prompt, context, output, model version, timestamp, user
  • Logs immutable (append-only, WORM storage or equivalent)
  • Retention matched to legal obligation (3 years minimum, 7 for financial services)
  • Per-user and per-decision query UI so a human can reconstruct any AI-influenced outcome
  • Incident response plan for when the AI is wrong and it matters

Gate 2 — Cost (Discernment)

This is the gate that kills the most pilots quietly. A pilot runs for three months against a handful of test cases at a few dollars a day. The cost model extrapolated to production volume was never stress-tested. Then the pilot moves to production and the CFO gets an invoice.

Two repeating failure modes.

The token blowout. The pilot used a frontier model for everything. It works brilliantly. At production volume it costs 8–12x what the business case assumed. Nobody modelled the economics of routing simple queries to Claude Haiku at a fraction of the cost. The deployment gets pulled "for cost review" and never comes back.

The context creep. Every additional document, every retrieval, every agent step adds tokens. Pilots measure tokens-per-successful-task. Production measures tokens-per-month across a whole organisation. These are different animals. A single badly-pruned RAG index can 5x a deployment's cost overnight.

The production-ready answer isn't "use the cheapest model." It's "route intelligently and instrument cost the same way you instrument latency." Our internal rule: API cost should be 5–8% of the revenue the system touches. If it's above 15%, the architecture is wrong, not the model.
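
Here's a sketch of what that means in code, using the Anthropic Python SDK. The length-based heuristic, model IDs, and prices are illustrative — a real router classifies by task type, and you should verify current model names and pricing against Anthropic's docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative USD prices per million tokens (input, output) — verify before relying on them.
PRICES = {
    "claude-haiku-4-5":  (1.00, 5.00),
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-opus-4-1":   (15.00, 75.00),
}

def pick_model(task: str) -> str:
    """Toy complexity heuristic — a real router would use task type, not length."""
    if len(task) < 500:
        return "claude-haiku-4-5"
    if len(task) < 5000:
        return "claude-sonnet-4-5"
    return "claude-opus-4-1"

def run_task(task: str):
    model = pick_model(task)
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    in_price, out_price = PRICES[model]
    cost = (resp.usage.input_tokens * in_price +
            resp.usage.output_tokens * out_price) / 1_000_000
    # Ship this to the same dashboard as latency — per user, per feature.
    print(f"{model}: ${cost:.5f} for this task")
    return resp.content
```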

Production checklist:

  • Cost dashboard at the per-user and per-feature level, not aggregate only
  • Model routing: Haiku for simple, Sonnet for mid, Opus for complex and governance-critical
  • Prompt caching live (cached prompt reads are billed at a fraction of the base input rate, which can cut input costs 70–90% on the long, repeated contexts agents carry)
  • Monthly budget alerts wired to the product team, not just finance
  • A clear answer to "what is the unit economics of one agent task?"
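
That last question has a concrete, five-line answer. A worked example with assumed numbers — swap in your own pilot measurements and current prices:

```python
# Illustrative unit economics for one agent task — every number here is an assumption.
in_tokens, out_tokens = 12_000, 1_500   # avg tokens per task, measured from the pilot
in_price, out_price = 3.00, 15.00       # USD per million tokens (Sonnet-class; verify)

cost_per_task = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
print(f"per task:  ${cost_per_task:.4f}")                      # ~ $0.0585

tasks_per_month = 40_000                # production volume, not pilot volume
print(f"per month: ${cost_per_task * tasks_per_month:,.0f}")   # ~ $2,340
```

If the monthly figure is news to anyone in the room, Gate 2 isn't passed.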

Gate 3 — Integration (Description)

The dirty secret of organisational AI: getting the model to respond well is maybe 20% of the work. The other 80% is connecting the model to the systems where work actually lives — Microsoft 365, Salesforce, Xero, bespoke case-management systems, practice-management platforms, claims software, legacy SOAP endpoints nobody wants to touch.

MCP (Model Context Protocol) has changed the surface here. A pilot built in early 2024 had to ship bespoke tool integrations for every system. The same system in 2026 plugs into an MCP server that handles Microsoft Graph or Salesforce as a first-class capability, with proper scopes and auth.
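
For orientation, the client side looks roughly like this with the official MCP Python SDK — the server package and tool name here are placeholders for whatever fronts your system of record:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder: launch whichever MCP server fronts your CRM / document store.
server = StdioServerParameters(command="npx", args=["-y", "@your-org/crm-mcp-server"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()   # the full list of what the agent may do
            result = await session.call_tool(    # one scoped, loggable call
                "lookup_account", {"account_id": "A-1042"}
            )
            print(result)

asyncio.run(main())
```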

But MCP doesn't make the production work go away. Production is the moment the AI has to handle the failure modes below; a sketch of the recovery pattern follows the list:

  • Partial system outages (the CRM is down — what does the agent do?)
  • Rate limits (we hit 429s — graceful degradation to which path?)
  • Auth expiry (OAuth tokens die — how is refresh handled, and who gets notified when it fails silently?)
  • Inconsistent data shapes (some contracts arrive as clean PDFs, some as scanned images, some as OCR garbage — can your pipeline handle all three?)
  • Human-in-the-loop checkpoints (when should the agent pause and ask, not proceed?)
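
The first three items on that list reduce to one pattern: retry with backoff, wrapped in a circuit breaker so a failing dependency gets cut off instead of hammered. A minimal sketch with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; let the dependency cool down."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 60):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, retries: int = 3, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.cooldown_s:
            raise RuntimeError("circuit open — degrade gracefully (queue it, or ask a human)")
        self.opened_at = None
        for attempt in range(retries):
            try:
                result = fn(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()  # stop cascading into a broken system
                    raise
                time.sleep(2 ** attempt)          # exponential backoff for 429s and blips
        raise RuntimeError("retries exhausted")
```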

Production checklist:

  • MCP-based integrations with explicit scope documents (least-privilege)
  • Retry logic and graceful degradation documented per integration
  • Circuit breakers so one broken system doesn't cascade
  • Monitoring that pages a human when auth dies, not when users complain
  • Every external tool call logged with latency and outcome

Gate 4 — Team readiness (Delegation)

The last gate is the one nobody wants to talk about: most organisations that run AI pilots don't have the team to run the result.

A pilot can be run by one motivated engineer with Claude and some afternoons. A production AI system needs a different shape of team:

  • Someone owning the prompt library and treating it as versioned, testable infrastructure
  • Someone owning evaluation and regression testing (does the new model version still pass our tests?)
  • Someone owning the governance layer and responding to audit queries
  • Someone on-call when the system misbehaves
  • A product owner who can say no to features and yes to retirements

None of these are full-time jobs in a mid-market organisation. Collectively they're 0.4–0.8 FTE of ongoing responsibility the pilot budget didn't contemplate. When nobody owns the production system, it quietly degrades. Prompt drift, model deprecation, integration rot, cost creep — all manageable with attention, all lethal without it.

Production checklist:

  • Named owner for prompts, evals, governance, on-call, and product decisions (one person in a small team is fine, but named)
  • Prompt library in version control with test cases
  • Weekly evaluation run against a fixed regression set (a minimal harness is sketched after this list)
  • Runbook for the top five failure modes
  • A clear retirement path so dead features don't accumulate
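
The second and third items are less work than they sound. A minimal regression harness in pytest style — the case file, pass criterion, and threshold are placeholders for your own:

```python
import json
import anthropic

client = anthropic.Anthropic()
# Hypothetical fixture: [{"input": "...", "must_contain": "..."}, ...]
CASES = json.load(open("regression_cases.json"))

def run_case(case) -> bool:
    """One regression case: fixed prompt in, cheap assertion out."""
    resp = client.messages.create(
        model="claude-sonnet-4-5",   # pin the version you ship, not "latest"
        max_tokens=512,
        messages=[{"role": "user", "content": case["input"]}],
    )
    text = resp.content[0].text
    return case["must_contain"].lower() in text.lower()

def test_regression_suite():
    passed = sum(run_case(c) for c in CASES)
    # Illustrative bar: fail the build if the pass rate drops below 95%.
    assert passed / len(CASES) >= 0.95, f"only {passed}/{len(CASES)} cases passed"
```

Run it weekly and on every model or prompt change; the fixed set is what makes quality drift visible before customers see it.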

The Australian angle

Three things about deploying production AI in Australia specifically that overseas playbooks don't cover.

Data sovereignty is a first-class concern, not a footnote. AU buyers — financial services, health, legal, government-adjacent — are increasingly asking for Australian-region inference. Claude is available in-region via AWS Bedrock in Sydney (ap-southeast-2) and via Google Cloud's Vertex AI, but most pilots default to a US endpoint because that's what the docs show first. If your pilot's data is traversing a US region by default, that's a Legal conversation before production, not after.

The Privacy Act amendments are not optional. The automated decision-making transparency provisions commence 10 December 2026. Organisations that process personal information in AI systems are going to need ADM transparency, the right of review, and (for significant decisions) the ability to explain. Start instrumenting now. Retrofitting is painful.

The APRA regime reaches further than people think. CPS 234 and the broader prudential standards apply to any regulated entity and cascade to their service providers by contract. If you're selling to or building for a bank, super fund, or insurer, their CPS 234 obligations become your CPS 234 obligations. Vendors learn this two months into a procurement cycle.

What to do Monday morning

Five questions to bring to your next pilot review.

  1. If a customer asks why the AI decided X, can we answer? If not, instrument the audit trail before anything else.
  2. What does this cost at 10x current volume? If you don't know within 20%, build the cost model this week.
  3. Who owns this system when the engineer who built it goes on leave? Name the owner on paper.
  4. What integration assumptions would break this in production? Write them down and test at least three.
  5. What's the retirement plan? If you can't describe what "we turn this off" looks like, you don't understand the system yet.

If those five feel overwhelming, they are. They're also the gap between a working demo and a working system. The organisations that close that gap ship. The organisations that don't, add another slide to next year's board deck.

How we work

Adaptation AI builds tailored Claude-native agents for Australian organisations with document-heavy workflows. Every engagement gets scoped against the A/A/A × 4D framework. You pick how you pay: PAYG, upfront fixed scope, monthly hosted, or buy outright. For organisations burnt by past pilots we also offer free scope and free build with payment only on acceptance.

Eight internal teams. Forty agents in production. One Adelaide boutique that runs its own business on the fleet it sells.

If you have a workflow that's been quietly failing the four gates above, book a free scope-out at adaptation.ai/book.

Want this applied to your workflow?

Free scope, free build, you only pay if it works. 30-min call books straight to a real engineer.

Book a 30-min Scope-out call →

Written by Willie Prosek · founder, Adaptation AI · an Australian Claude-native consultancy building enterprise agent systems on Anthropic's AI Fluency framework.