There is a class of company that wants AI to fix its code and is legally forbidden from using almost any of the tools that do it. Banks, hospitals, defence, government. Their code cannot leave the building, which rules out every coding assistant that ships your repository off to an API in another country. The honest answer to “can you give us a coding agent that never touches a US server?” has mostly been no.
So I tried to build the yes. A coding agent that runs fully air-gapped on the customer’s own hardware, works with whatever model you point it at, and, most importantly, never tells you a fix works until it has run your repository’s own tests and watched them pass.
This follows the little YouTube analyser I wrote about earlier, and it pushed the same lesson much harder: the model is rarely the hard part.
Table of contents
Open Table of contents
- The thing that is actually broken about AI coding
- Language-agnostic, because exit codes are universal
- The leap from one file to a whole repo
- The harness is the moat, not the model
- Sovereignty: the frontier model is the teacher, not the product
- Being honest about where the models separate
- What I took away from it
The thing that is actually broken about AI coding
Most AI coding tools suggest. They generate a plausible-looking diff and hand it to you with an implicit “looks right to me.” Plausible and correct are very different things, and the gap between them is exactly where bugs live. A model can write code that reads beautifully and is subtly, confidently wrong.
The fix is almost embarrassingly obvious once you say it out loud: do not let the agent claim success until it has proven success the same way a human would, by running the tests. Not a model grading its own homework. Not a heuristic. The actual test suite that the human reviewer would run.
So the core of the system is a verification-first loop:
solve(repo, task, propose, max_attempts=5):
baseline_ok = run_tests(repo) # confirm the bug is actually there
for attempt in 1..max_attempts:
candidate = propose(task, last_failure, attempt) # the model tries a fix
ok, output = run_tests(repo) # THE GATE: run the real tests
if ok:
winning = candidate # only a verified fix is ever emitted
break
last_failure = output # feed the failure back and iterate
finally:
restore(repo) # dry-run by default; nothing left behind
The gate is the product. A candidate that does not make the tests pass is never emitted. Three outcomes only: the tests already passed and there was nothing to do, a candidate passed and you get that patch, or nothing passed and you get nothing, with the gate held. The agent is structurally incapable of telling you a broken fix is done.
Language-agnostic, because exit codes are universal
A nice side effect of “just run the tests” is that the loop does not care what language it is looking at. It checks an exit code, not a Python traceback. The test command is auto-detected from the stack markers in the repo, Cargo.toml means cargo test, package.json means npm test, go.mod means go test ./..., and so on, or you set it explicitly. It has been proven on Python and Rust, and a test in the suite verifies a fix to a TOML file using nothing but grep -q GOOD config.toml as the verifier. Exit zero is truth, whatever produced it.
One bug I am glad I caught here is worth repeating, because it would have been silent and poisonous. Python caches compiled bytecode. If a model makes an edit that happens to be the same byte length as the original (say >= to > ) within the same second, Python can reuse a stale .pyc and run the old code. The verifier would then run code that is not the code on disk and could certify a wrong answer as correct. The fix is to clear __pycache__ and set PYTHONDONTWRITEBYTECODE=1 on every run. A verifier you cannot trust is worse than no verifier, so that one mattered a lot.
The leap from one file to a whole repo
Fixing a bug in a file you are handed is table stakes. Real bugs hide in files you were not told about. So on top of the verification loop sits an agentic loop with six confined tools: ls, read_file, search (regex across the repo), str_replace, run (a shell command behind a safety filter), and submit (the terminal call the model makes when it believes it is done).
The system prompt teaches a discipline rather than a script: list the files, search for the relevant code, read the target before editing it, make one concrete edit copying the exact text, run the tests, read the failure, adjust, and submit only when they pass. One tool call per step, and never repeat the same call with the same arguments.
The concrete run I like to point at: the bug was “transfers that exceed an account’s balance are applied instead of rejected.” The agent ran ls, read transfer.py, realized the real logic lived elsewhere, searched for def debit, found it in accounts.py (the non-obvious file), read it, saw that debit() always returned success, added a balance guard, ran the tests, and submitted. Verified, in seven steps. It navigated to the right file by understanding the code, not by being told.
The harness is the moat, not the model
Here is the result that genuinely surprised me. Before I hardened the harness, a 7B model would flail on that multi-file bug right up to the 30-step budget and give up. After the harness work, the same 7B model converged in nine steps. The difference was not model size. It was the harness.
Two pieces of that harness did most of the work. The first is whitespace-tolerant editing. Small models constantly slip on indentation, and a strict str_replace rejects the edit and sends the model into a re-read loop. A fuzzy match that compares lines ignoring leading and trailing whitespace lets the edit land anyway. The second is a loop-breaker: the harness tracks repeated tool calls, and on the third identical call it injects a nudge to change approach or submit. Neither requires the model to be trained on anything. They are constraints on the loop, and they are what let a small model win.
The tools are confined the way you would hope. Paths are resolved against the repo root and anything that escapes it raises an error, so read_file("../../etc/passwd") simply fails. The shell filter blocks rm -rf /, sudo rm, fork bombs, mkfs, dd to devices, curl | sh, and shutdown, while letting pytest, cargo test, and git diff through. Tool output is truncated so a runaway grep cannot blow the context window. Every one of these has a unit test, and the dangerous-command denylist is the kind of thing any system needs the moment an LLM gets a shell, because the safe-execution problem is the same everywhere.
Sovereignty: the frontier model is the teacher, not the product
The architectural rule that makes this sellable to a regulated enterprise is one sentence: the frontier model is the factory, not the product.
A model like Opus is used only on my machines, on public code, as a teacher that generates training data. It solves public SWE-bench instances inside the harness, and only the trajectories whose final diff actually passes the tests are kept. Those verified trajectories train a small student model. The shipped product is that small model, running air-gapped via Ollama (or vLLM, or any OpenAI-compatible server) inside the customer’s walls. Their code never touches an external API.
The model itself is pluggable by design. The whole system is parameterized over a propose function, and a policy factory swaps the backend with a flag: local for Ollama, openai for any OpenAI-compatible endpoint including a hosted GLM, llm for Anthropic when teaching. As the code comment puts it, weights are swappable, the loop is the product. That single abstraction is what lets a customer run a 7B locally for the bulk of the work and route only the genuinely hard problems to a frontier model, with verification as the gate either way.
And every verified fix is appended to a corpus on disk, public-data only, never customer code. That corpus is the compounding asset: the more the system runs on public repositories, the better the training data for the next student model, without a single byte of customer code leaving the network.
Being honest about where the models separate
I am wary of self-graded benchmarks, so let me be plain about the numbers. On a small graded suite of eleven tasks across three difficulty tiers, a local 3B and a local 7B both score 9 out of 11. Opus scores 11 out of 11. Both local models clear every easy and medium task and fail the same two hard ones: a full semantic-version comparison and an operator-precedence calculator. Those are reasoning tasks, not code-completion tasks, and 7B hits a ceiling on them that 14B does not meaningfully lift.
That separation is the business case, not an embarrassment. A small local model handles the common, high-volume changes cheaply and privately, dependency bumps, off-by-one fixes, a guard clause in the wrong file. You route the genuinely hard reasoning to a frontier model when it is worth it. The engine is identical; only the model swaps, and nothing ships unless the tests pass.
I will also say plainly what is not done. The credible external number is SWE-bench Verified, real GitHub issues in per-instance Docker images scored by the official harness, and that scored run is the next milestone rather than a result I can quote yet. The self-authored tiers show the engine works and show where models separate; I do not dress them up as something they are not. There are 47 unit and integration tests and a handful of offline smoke checks gating all of this, runnable with no API key and no GPU.
What I took away from it
- Make the agent prove it, do not let it claim it. Running the real test suite as a hard gate is the single decision that turns a plausible-fix generator into something you can trust unattended.
- The harness is the capability. The same small model went from failing to converging because of fuzzy editing and a loop-breaker. Open models commoditize; the loop around them does not.
- Sovereignty is an architecture, not a promise. Keeping the frontier model in the training factory and shipping only the student means “your code never leaves” is a structural fact, not a pinky-swear.
- Verifier bugs are the scariest bugs. A stale-bytecode hazard that could certify a wrong fix as correct is worse than any normal failure, because it fails silently and convincingly.
The thread running through all of this is that the model is rarely the hard part. The boundary you run it inside, and the proof you demand before you believe it, are where the real work lives.