For the last few months I have been building something I kept calling, almost dismissively, “the harness.” Not a model. Not a fine-tune. The boring scaffolding around a model that decides what it sees, what it can touch, when it stops, and what it remembers for next time.
It turns out the harness is the whole game.
You hand this system a single line. “Build a BMI calculator with a clean UI.” Or “find 10 people looking to invest in AI startups in the Mumbai area.” Or “write a Twitter thread on why Supra is the fastest L1.” Roughly fourteen minutes and four cents later you have a deployed URL, or a researched and sourced list, or a campaign written and ready to send. No human in the loop between the prompt and the result.
This post is about how that actually works, the specific engineering decisions that made it reliable, and the failures that taught me more than any of the successes.
Table of contents
Open Table of contents
The bet: don’t wait for a smarter model
The premise everyone in AI is quietly operating on is that the next model release fixes everything. I think that is a trap. The models we have today are already capable of writing a working Next.js app, researching a topic with real sources, and drafting a campaign. What they are not good at is doing all of that reliably, end to end, without a human catching them when they wander.
So I stopped trying to make the model smarter and started trying to make it dependable. The whole system is an answer to one question:
How do you make a current-generation model reliable enough to ship a real deliverable with nobody watching?
The answer was never “a better prompt.” It was a stack of unglamorous engineering: specialized agents, staged hand-offs, dynamic budgets, checkpoints, hard guardrails, and a memory file that turns yesterday’s failure into today’s rule.
The shape of the system
A task flows through a pipeline of specialized agents. Each one has its own persona, its own model, its own tool permissions, and its own iteration budget.
task ──▶ Planner ──▶ Strategist ──▶ Developer ──▶ Marketer ──▶ QA ──▶ deployed result
(Haiku) (research) (build) (campaign) (verify)
The planner does not build. The developer does not market. The marketer never gets a shell. Each agent does one thing, reads the artifacts the previous stage produced, and hands its own artifacts forward. This is the single most important design decision in the whole thing, and I will come back to why.
The planner: cheap classification first
The very first agent is a planner, and it runs on the cheapest model available (Haiku). Its only job is to read the task and emit structured JSON: what type of task is this, which agents should run, in what order, and a one-sentence instruction for each.
It is pure classification, so paying for a frontier model here would be waste. The routing it produces is deliberately not one-size-fits-all:
web_app | startup → strategist → developer → marketer → qa
research | data_analysis → strategist → developer → qa
marketing_campaign → strategist → marketer → qa
document → developer → qa (skip the strategist entirely)
A simple document does not need a product strategist. A marketing campaign does not need a build step. Routing the pipeline per task type means a simple task never pays for stages it does not need.
The agents themselves
Each downstream agent is a persona with a sharply scoped job:
- Strategist. A “chief product strategist” that researches the fundamentals (it actually fetches docs and verifies API signatures), then writes a
PRD.mdthat becomes the source of truth, plus aRESEARCH.mdfull of copy-pasteable exact function signatures and endpoint URLs. Crucially, anything it could not verify gets explicitly tagged[UNVERIFIED]. - Developer. A “principal engineer” that reads the PRD and writes code in batches, building after each batch.
- Marketer. Writes the campaigns, job posts, email sequences, and social content, and can hit real APIs (SendGrid, Twitter, lead-search providers).
- QA. The only honest agent in the building. It runs the actual build, hits the live URL, reads the real deployed content, and writes either a
RESULT.md(success) or aCORRECTIONS.md(failure). ThatCORRECTIONS.mdis where the magic happens later.
The engine: a research-first agentic loop
Under every agent is the same loop. Iteration one is always orientation: read BRIEF.md, read PRD.md, list the directory, scan which integrations are actually available. Only then does it start acting.
From there it is a standard tool-use loop. Call the model, parse the tool-use blocks, dispatch each tool, feed results back, tally the cost, check whether the agent declared itself done. The tools include the obvious ones (read_file, write_file, bash, web_search, web_fetch, http_request, verify_url) and some less obvious ones (post_tweet with OAuth signing, send_email, search_leads, even make_phone_call via Twilio).
But the loop being standard is not the interesting part. The interesting parts are the constraints wrapped around it.
Six decisions that made it actually work
1. Tool allowlists per agent
The marketer cannot run bash. The developer can. This sounds like a security measure, and it is, but the bigger payoff was focus. When I gave every agent every tool, my developer agent would waste iterations wondering whether it should go do web research (no, that was the strategist’s job, already finished). Take the tool away and the question disappears. The agent knows: research is done, now build.
Unrestricted access looks like freedom. In practice it is paralysis. Constraints do not slow agents down. They tell them what not to deliberate about.
2. Iteration budgets are computed, not fixed
I never hardcoded “every agent gets 30 turns.” The budget is calculated per run from the agent type, the task type, the length and keywords of the request, and the execution mode:
iters = base[agent] * weight[agent][task_type]
if "comprehensive" in task.lower(): iters *= 1.4
if "simple" in task.lower(): iters *= 0.8
if len(task.split()) > 180: iters *= 1.30
iters *= mode_factor # rescue mode = 0.5, get unstuck faster
A simple 50-word task gets about 10 iterations. A comprehensive enterprise build gets 40 or more. It is about twenty lines of arithmetic, and it directly controls both cost and quality without me ever tuning a task by hand. A developer agent might get 26 to 48 iterations; a QA agent 5 to 14. The budget is a lever, not a cap.
3. The developer builds in layers, not in one shot
My first version of the developer wrote all twenty files and then ran the build. A TypeScript error introduced in file #5 did not surface until turn 35, by which point the fix touched half the codebase.
The fix was to force batching:
Turn A: package.json, tsconfig, tailwind/postcss/next config → npm install
Turn B: lib/*.ts (db, types, utils) → npm run build → git checkpoint
Turn C: app/api/* routes → npm run build → git checkpoint
Turn D: components/*.tsx → npm run build
Turn E: app/layout, globals.css, app/page.tsx (mandatory) → final build → deploy
Now errors surface inside one or two iterations of being introduced, and every green build is a git checkpoint the agent can roll back to. A deployable app always exists, even if a feature is incomplete. The rule in the developer’s prompt is blunt: “Never write 10 files then try to build.”
There is a companion rule I am weirdly proud of, the NO RE-RESEARCH rule: if the PRD already covers a technology, read it; do not search for it again. That one line stops the strategist burning eight iterations researching Solidity and the developer burning four more researching the exact same thing.
4. Tell the agent what is actually plugged in
Before every task, the system scans environment variables and injects a list of live integrations into the agent’s context:
Available integrations:
• Clerk (auth, native Vercel Marketplace) [CLERK_SECRET_KEY]
• SendGrid (transactional email) [SENDGRID_API_KEY]
Without this, ask the developer to “build authentication” and it confidently designs against Auth0, Okta, and Firebase, none of which you have keys for. QA then fails because the integration does not exist, and you have burned three iterations on fiction. One injected line of context kills an entire category of hallucination. Boring environment introspection turned out to be the difference between “this AI hallucinates APIs” and “this AI ships real integrations.”
5. Guardrails so failure cannot cascade
Two of these earn their keep constantly.
Pre-build verification before deploy. Running npm run build locally costs a fraction of a cent in tokens. Deploying broken code, having QA hit a 500, and sending the developer on a rescue mishunt costs far more. So the deployer refuses to ship anything that does not build locally first. Cheap verification before an expensive operation, every time.
Stall detection. Agents occasionally get stuck in a thinking loop:
Iteration 18: "I will now search for the Supabase auth docs."
Iteration 19: "I will now search for the Supabase auth docs."
Iteration 20: "I will now search for the Supabase auth docs."
If an agent emits no tool calls for two consecutive turns, it is declared done and terminated. Some of the best engineering in this system is invisible: the circuit breakers that stop it from being expensively stupid at 3am.
And underneath everything, a literal whitelist for filesystem and shell actions. The OS executor only allows mkdir, write_file, append_file, all confined to the workspace root with path-traversal blocked. The shell executor regex-blocks rm -rf /, sudo rm, fork bombs, mkfs, dd to devices, curl | bash, shutdown. The agent gets to be autonomous precisely because the blast radius is bounded.
6. Three model tiers, chosen per stage
The seductive belief about LLMs is that the bigger model is always better. The practical truth is that different stages have different economics. So there are modes:
| Mode | Setup | Cost/task |
|---|---|---|
| Default | Opus/Sonnet everywhere | $0.08 to $0.15 |
| Lite | Haiku everywhere, half the iterations | $0.02 to $0.04 |
| Sweet spot | Haiku for research/QA/marketing, Sonnet for the developer | $0.04 to $0.06 |
The sweet spot is not “cheap” or “quality.” It is cheap for research, quality for execution, fast for verification. My strategist does not need Opus to find an API endpoint. My developer absolutely needs Sonnet to make the UI sing. Matching the model to the job is where the 96% gross margin lives.
The part I did not expect: it learns without training
This is my favourite piece, and it cost nothing to build.
When QA fails a task, it writes a CORRECTIONS.md, a plain table of what the agents assumed versus what was actually true versus the impact. After the run, those corrections get distilled into a single learnings.md file. And every agent, in its very first iteration on every future task, reads that file. The instruction is explicit: “Read this in iteration 1. It overrides your assumptions.”
Here is a real entry, from a task that failed:
## Create an AEO plan for [client]
Status: ❌ no deployment
| Assumption | Reality | Impact |
|-----------------------------------------|--------------------------|--------------|
| Marketer would create src/index.html | No HTML file at all | BUILD FAILED |
| Marketing deliverables in marketing/ | Directory empty | All missing |
### System-level fixes:
- src/index.html is MANDATORY for strategy tasks (>1500 bytes)
- Agents must start writing files by iteration 3 (budget awareness)
What happened: the strategist wrote a genuinely excellent PRD, then the marketer spent its whole budget writing markdown and never produced the one HTML summary that made the deliverable real. It failed. But the next strategy task reads that learning in iteration one, reserves iterations specifically for src/index.html, and does not repeat the mistake.
After enough runs, the system accumulated genuinely useful domain knowledge nobody put in the prompts. For lead-gen tasks it learned to map every claim to two independent sources, to tier contacts by how reachable they are, to write an honest methodology section because users trust lists that admit their limits, and to phase outreach across weeks instead of blasting everyone at once.
After a few hundred tasks this thing knows more about how to actually run a lead-gen campaign than most consultants I have met. And that knowledge came from documented failures, not an expensive training run.
One subtlety that mattered: the learnings have to be filtered. Raw QA notes contain run-specific praise like “the PRD was excellent this time,” which is poison as a permanent rule. A small regex strips those task-state phrases before anything is written to the file. Boring. Essential.
What the whole thing is made of
Nothing exotic. Python for the orchestration (subprocess safety, threading, JSON state), the Anthropic SDK for the model calls across Opus, Sonnet, and Haiku, FastAPI plus a Next.js frontend for the platform, and a deploy chain of Vercel, then Netlify, then GitHub Pages as a three-tier fallback. State lives in plain JSON and SQLite; the learning system lives in a Markdown file you can read with your eyes and diff in git. Cost is tracked per call against a hardcoded pricing table, so every run prints exactly what it spent.
That last choice, keeping the state human-readable, paid off more than any clever abstraction would have. When something goes wrong I open a .md file and read what the system believed.
What I actually learned
- The harness beats the model. None of these techniques are individually novel. Specialized agents, staged hand-offs, computed budgets, checkpoints, guardrails, a learning file. Combined, they turn a capable-but-flaky model into a system that ships. The leverage was never in the weights.
- Constraints are a feature. Every place I removed a choice from an agent (tool allowlists, no-re-research, mandatory build order) I gained reliability without losing anything I actually wanted.
- Make verification cheaper than retry. Pre-build checks, stall detection, integration scanning. Each one spends a fraction of a cent to avoid a failure that would cost far more. At a thousand tasks a month, the invisible circuit breakers are the margin.
- Let failure write the rules. The single highest-leverage component is a Markdown file that turns each documented mistake into a permanent, system-wide correction, at zero marginal cost.
I did not wait for the perfect model. I built the operational wrapper that makes the current one reliably useful. That, more than anything, is the lesson I would hand to anyone trying to build with agents today.
I have spent the last few months here writing up the pieces that sit on top of this harness: how these agents pay for what they use, how you keep them from going off the rails, and how to run them where the cloud cannot reach. This is the thing underneath all of them. If the model is rarely the hard part, this is where the hard part actually lives.