Half a Workday for One Correct Line of Code
seL4 is a microkernel—8700 lines of C code. To prove the functional correctness of those 8700 lines, the research team spent 20 person-years writing 200000 lines of Isabelle proof scripts. That’s 23 lines of proof for every line of implementation. Do the math: roughly half a workday per line of code just for the correctness proof.
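The arithmetic, as a quick sanity check (assuming roughly 220 working days per person-year, which is my assumption, not a figure from the seL4 report):

```python
# Back-of-envelope check on the seL4 verification economics.
proof_lines = 200_000
code_lines = 8_700
person_years = 20
workdays_per_year = 220  # assumption: ~220 working days per person-year

proof_ratio = proof_lines / code_lines  # proof lines per line of C
days_per_line = person_years * workdays_per_year / code_lines

print(f"{proof_ratio:.0f} lines of proof per line of code")  # 23
print(f"{days_per_line:.2f} workdays per line of code")      # 0.51
```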
This number explains something: why formal verification has struggled to break out of high-assurance domains and research communities for the past few decades. Not because it’s useless—quite the opposite, seL4 is probably one of the most trustworthy operating system kernels in the world. The hardest constraint has always been economics: very few projects can absorb the cost of half a day’s verification per line of code. Aerospace, military, medical devices can. Your web app can’t.
Some will point out that seL4’s numbers are from 2009, and toolchains have improved a lot since then. True. But even with a 10x efficiency gain, each line of code still takes roughly half an hour of expert time. For a project in the hundred-thousand-line range, that still means thousands of person-days. For most projects, the cost hasn’t dropped to anything close to negligible.
Vibecoding’s Hallucination Problem
Since 2025, the mainstream way people use AI code generation can be summed up as vibecoding: you describe your intent in natural language, the model generates code, you glance at it, think “looks about right,” and ship it. The workflow is stunningly fast, but it has a structural flaw—hallucinations.
Model-generated code can look reasonable, be syntactically correct, even pass simple tests, yet be logically wrong. A sorting function might work correctly on most inputs but silently return wrong results on edge cases. A concurrency lock implementation might run flawlessly under low load and deadlock under high load. What makes these bugs dangerous is their stealth: the code’s appearance and behavior both point toward “correct”—until it isn’t.
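A hypothetical illustration of the pattern (constructed for this article, not taken from any model’s output): a “sort” that passes a casual glance and a simple test, yet silently corrupts any input containing duplicates:

```python
def sort_items(xs):
    # Looks like a tidy one-liner sort, but set() silently
    # discards duplicate values before sorting.
    return sorted(set(xs))

print(sort_items([3, 1, 2]))     # [1, 2, 3] -- looks correct
print(sort_items([3, 1, 3, 2]))  # [1, 2, 3] -- the duplicate 3 is gone
```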
The moment a scenario demands reliability, hallucinations become the binding constraint, because the entire workflow depends on human judgment. You have to understand the code to judge whether it’s correct. But if you’re capable of reviewing every line for correctness, you could probably write it yourself. This is the vibecoding paradox: for experts, it’s a productivity booster; for non-experts, it’s a confidence trap—you think you’ve verified the code, but really you just thought it looked right.
This isn’t a dismissal of vibecoding’s value. For prototyping, exploratory programming, one-off scripts, vibecoding is already a massive leap in productivity. But for scenarios that demand correctness guarantees, its ceiling is capped by the bandwidth and reliability of human review. Humans reviewing code get tired, lose focus, get fooled by pattern matching. You can’t build system reliability on an unreliable verification step.
The Perfect Verifier: A Counterintuitive Insight
On December 8, 2025, Martin Kleppmann published an article with a counterintuitive observation: in the domain of formal verification, AI’s hallucination problem simply doesn’t matter.
His exact words: “It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof.”
The crux of that sentence is “proof checker.” A mathematical proof is either valid or invalid—there’s no in-between. There’s no such thing as a proof that “looks correct but is actually wrong,” because the checker verifies every derivation step. This is fundamentally different from code: code can be correct on 99% of inputs and wrong on 1%, but a proof can’t—any single incorrect derivation step causes the entire proof to be rejected.
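A minimal Lean 4 sketch of that binary verdict: the first proof type-checks because `Nat.add_comm` is exactly the needed lemma; the commented-out variant is rejected outright, since `rfl` cannot close a goal that does not hold by definition.

```lean
-- Accepted: every derivation step checks.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Rejected: uncommenting this fails to type-check.
-- There is no "looks right"; the checker's verdict is binary.
-- theorem add_comm_wrong (a b : Nat) : a + b = b + a := rfl
```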
This means the proof checker is a perfect verifier—for proof objects in a given formal system, its judgment is deterministic and doesn’t depend on human intuition. “Perfect” here applies strictly to “whether the proof steps are valid,” not to “whether the specification is correct.”
With a perfect verifier, AI’s operating model changes fundamentally. The model doesn’t need to generate a correct proof on the first try—it can generate 100 candidate proofs, and as long as 1 passes the checker, the job is done. Hallucinations? Doesn’t matter. 99 nonsensical attempts plus 1 correct proof equals the same outcome as 1 perfect generation.
This is a profound structural difference. In vibecoding, every generation requires a human to judge whether it’s correct, and an incorrect generation isn’t zero-cost but negative-cost—it misleads you into thinking the code is right, then you deploy it to production. In formal verification, incorrect generations have significantly lower externalities: they don’t silently slip into production the way buggy code does—the checker rejects them outright. Compute costs still exist, but the failure mode is cleaner and amenable to automation.
To put the distinction more precisely: vibecoding’s verifier (human review) is probabilistic, bandwidth-limited, and subject to fatigue; formal verification’s verifier (proof checker) is deterministic, can be run repeatedly at low marginal cost, and doesn’t fatigue like human review. The former has a hard ceiling on reliability; the latter shifts the bottleneck primarily to specification, search, and engineering organization.
Leanstral: From Theory to Data
If this were merely a theoretical possibility, it would be no different from the past two decades of beautiful visions about formal verification. But on March 16, 2026, Mistral released Leanstral, turning theory into concrete numbers.
Leanstral is a language model purpose-built for Lean 4 theorem proving—119B parameters, 6.5B active (MoE architecture), Apache 2.0 open source. Its performance on the FLTEval benchmark is worth unpacking item by item.
Absolute performance. Leanstral’s single-generation score (pass@1) is 21.9. For comparison: GLM5-744B scores 16.6, Kimi-K2.5-1T scores 20.1. A specialized model activating only 6.5B parameters at inference already outperforms general-purpose models one to two orders of magnitude larger in parameter count on single-pass performance. At least on FLTEval, the returns from specialization far exceed the intuitive “bigger is better.”
Parallel sampling gains. pass@2 (generate 2 candidates, take the best) jumps to 26.3. pass@16 reaches 31.9. Each additional sample yields a steady score increase—exactly the structural advantage that a perfect verifier enables. Without a perfect verifier, the value of multiple samples diminishes quickly because you still need a human to pick the best among candidates. With a proof checker, selection is automatic and marginal cost is minimal.
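Why sampling helps so predictably: if each sample passes independently with probability p, the chance at least one of k samples passes grows as 1 − (1 − p)^k. Benchmarks usually report the standard unbiased estimator from n generations of which c passed; the article doesn’t specify FLTEval’s exact protocol, so this is the common convention, with illustrative numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples,
    drawn from n generations with c passes, is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers, not FLTEval's: a 20% per-sample pass rate
# compounds quickly as you draw more samples.
print(pass_at_k(100, 20, 1))   # ~0.2
print(pass_at_k(100, 20, 16))  # ~0.98
```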
Cost—the most critical dimension. Leanstral pass@1 costs $18. Sonnet 4.6 scores 23.7 at $549. Opus 4.6 scores highest at 39.6 for $1650.
The ratios between these numbers tell the real story:
- Leanstral pass@2 scores 26.3, beating Sonnet, at $36—15x cheaper.
- Leanstral pass@16 scores 31.9, beating Sonnet by over 8 points, at $290—still roughly 1.9x cheaper than Sonnet.
- Opus costs $1650 per run, 92x Leanstral’s $18.
FLTEval’s signal, at least for tasks backed by a strong verifier, is clear: the economics systematically favor cheap, sample-many-times models.
This is a completely different competitive landscape from general-purpose code generation. Without a strong verifier backstop, people typically prefer paying for more capable models, because single-generation accuracy matters more—or at minimum, “probably correct” matters. But with a perfect verifier, you don’t need the model to be smart enough to get it right on the first try—you need it cheap enough to try many times. Verification cost per candidate is small and predictable (a checker run is cheap), while generation cost can be driven down through specialized small models to near-negligible levels.
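The trade-off fits in one formula: under sample-until-pass, the number of attempts is geometric, so expected cost is (generation cost + check cost) divided by the per-sample success probability. The unit costs below are made up purely to show the shape:

```python
def expected_cost(p_pass: float, cost_gen: float, cost_check: float) -> float:
    """Expected total cost of sampling until one candidate passes:
    on average 1 / p_pass attempts, each paying generation + checking."""
    return (cost_gen + cost_check) / p_pass

# Made-up unit costs: a cheap specialist with a modest pass rate
# undercuts an expensive generalist with a better one.
print(expected_cost(0.20, cost_gen=1.0, cost_check=0.1))   # ~5.5
print(expected_cost(0.40, cost_gen=30.0, cost_check=0.1))  # ~75.25
```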
Look back at seL4’s numbers: 8700 lines of code, 20 person-years, half a workday per line. If Leanstral-class models can automatically generate and verify proof steps, what happens to that “half a workday” figure? I can’t give a precise number—there’s still a wide gap between FLTEval benchmarks and industrial-grade kernel verification. But the direction and pace of cost reduction are unmistakable.
The Verification Spectrum: From Perfect to Good Enough
At this point, a reasonable objection arises: formal verification is great, but most real-world code doesn’t need mathematical-proof-level correctness guarantees. Does a shopping cart on an e-commerce site need proofs written in Lean 4?
No. But the value of the “perfect verifier model” extends beyond formal proofs. Verifier strength exists on a continuous spectrum, and the proof checker is just the strongest end. What matters is that the same economic logic applies at every point on the spectrum: where the verifier covers critical error classes and verification costs less than manual review, stronger and more automated verification generally lowers the quality bar for generation and makes parallel sampling more worthwhile.
On March 16, 2026, Peter Lavigne wrote an article arguing that property-based testing plus mutation testing should be used to verify AI-generated code. His core claim: “We should treat the output like compiled code”—that is, treat AI output as a compiled artifact, not as human-written source code.
This analogy hits the nail on the head. You wouldn’t read compiler-generated assembly line by line to judge its correctness—you run tests. Likewise, AI-generated code shouldn’t rely on human line-by-line review; you should build an automated verification pipeline.
Laying out verification approaches along a spectrum:
- Formal proofs (Lean, Isabelle, Coq): Perfect verifier. Once the proof passes, correctness is guaranteed in the mathematical sense. Hallucination impact: zero.
- Property-based testing + mutation testing: Not perfect, but when properties are well-designed, they expose logical errors more effectively than standard unit tests. Property-based testing checks whether properties hold under randomly generated inputs; mutation testing deliberately modifies the code to see if the test suite can actually catch errors—essentially “testing your tests.” Hallucination impact: significantly reduced.
- Strong type systems: Rust’s ownership system eliminates entire classes of memory safety issues at compile time. Dependent types can encode more business invariants. Not as strong as formal proofs, but broad coverage at extremely low cost. Hallucination impact: partially eliminated.
- Traditional unit tests: The weakest form of automated verification. Only covers the cases you thought of. Hallucination impact: limited reduction.
- Human review: The vibecoding status quo. Hallucination impact: depends on the reviewer’s ability and stamina.
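The property-based idea in a minimal hand-rolled sketch, using the two classic sorting properties: output ordered, output a permutation of the input. Real tools (Hypothesis for Python, QuickCheck for Haskell) add smarter generators and shrinking; this is just the skeleton:

```python
import random
from collections import Counter

def check_sort_properties(sort_fn, trials: int = 500) -> bool:
    """Property check: on random inputs, the output must be ordered
    AND a permutation of the input (catches dropped duplicates)."""
    rng = random.Random(0)  # fixed seed so the check is reproducible
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 10))]
        ys = sort_fn(xs)
        if any(a > b for a, b in zip(ys, ys[1:])):
            return False  # violated: output is ordered
        if Counter(ys) != Counter(xs):
            return False  # violated: output is a permutation of input
    return True

print(check_sort_properties(sorted))                      # True
print(check_sort_properties(lambda xs: sorted(set(xs))))  # False
```

Mutation testing plays the complementary role: deliberately break the implementation and confirm the property suite goes red, which is exactly what the second call demonstrates for a duplicate-dropping mutant.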
The higher up the spectrum, the stronger the “perfect verifier model” effect. But even in the middle of the spectrum, stronger verification significantly changes the economics. If property-based testing can automatically catch 80% of logical errors, humans only need to review the remaining 20% of potential issues—a fivefold reduction in review workload.
The key insight here: the leverage for improving code reliability isn’t “get AI to generate better code”—it’s “make verification stronger and more automated.” The former is constrained by the slow improvement and diminishing returns of model capability (going from 90% correct to 95% is hard; to 99% even harder). The latter can achieve step-function improvements through engineering.
Concretely: payment settlement rules, protocol state machines, cryptographic libraries, concurrency kernel components are better suited for formal proofs or model checking. Meanwhile, backend CRUD, report generation, and glue code are more realistically served by property-based testing plus mutation testing. The criterion for choosing a verification approach isn’t “stronger is always better” but “how expensive is a bug?”
Confronting the Blind Spot: Where Does the Specification Come From?
I need to confront the weakest point in this article’s argument.
Formal verification requires a correct specification—you’re proving that code conforms to some spec. But where does the spec come from? Who ensures the spec is right?
If AI generates the code and also generates the spec, then proves the code matches the spec—you get a self-consistent system that might be completely divorced from actual requirements. The proof checker can only tell you “the code indeed does what the spec says,” not “the spec says what you actually want.”
Worse still, many real-world bugs aren’t “the implementation is wrong” but “the requirements changed later” or “the spec missed edge cases from the start.” Formal verification can’t discover omissions for you—it can only make the question “does the implementation match the spec?” airtight.
This isn’t a problem that technology can route around. A specification’s correctness ultimately depends on human understanding of requirements, and this challenge is structurally isomorphic to what vibecoding faces: humans must be able to judge whether something matches their intent.
But the difference lies in what’s being reviewed. In vibecoding, you need to review the entire implementation—the concrete logic of every line of code. In vericoding, the focus of human review shifts more toward the specification and abstraction boundaries, rather than checking every line of implementation. Reviewing “this sorting function should return a sorted array, and the output should be a permutation of the input” is orders of magnitude easier than reviewing quicksort’s actual partitioning implementation.
Is the specification correct? That question will always need a human answer. But vericoding compresses the problem from “are these thousands of lines of implementation right?” to “are these few lines of spec right?” The review surface area can shrink dramatically.
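That few-line spec is concrete enough to write down. A Lean 4 sketch, assuming Mathlib for `List.Sorted` (`mySort` is a hypothetical generated implementation; the spec, not the code, is the whole review surface):

```lean
import Mathlib.Data.List.Sort

-- The entire human review surface: two properties.
-- Any `mySort` the model generates must come with a proof of this.
def SortSpec (mySort : List Nat → List Nat) : Prop :=
  ∀ xs : List Nat,
    (mySort xs).Sorted (· ≤ ·) ∧ (mySort xs).Perm xs
```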
This doesn’t eliminate the necessity of human judgment—it puts human judgment where it works best: high-level semantics rather than low-level implementation.
The Word “Vericoding”
The term “vericoding” comes from a September 2025 paper (arXiv: 2509.22908). The paper constructed a benchmark spanning three verification languages—Lean, Dafny, and Verus—with over 12000 tasks.
The significance of this paper isn’t purely technical. It did something cultural: it gave a name to the “AI generation + formal verification” workflow.
Part of why vibecoding spread so quickly is that it gave a catchy name to a workflow. Vericoding does the symmetric thing, but points in the opposite direction: not “coding by vibes” but “coding with verification.” With a name, a practice can be discussed, promoted, and placed alongside vibecoding as an explicit alternative for comparison.
Reality won’t leap from vibecoding to formal proofs overnight. But the primary direction of investment should shift from “keep betting the model makes fewer mistakes” toward “build stronger automated verification.”
The Cultural Gap
The technical signal is already strong enough. Lean 4 is a mature proof language, Leanstral is Apache 2.0 open source, and the proof checker runs on a laptop. At least for specific tasks and cost structures, signs of economic viability have already appeared.
But most developers won’t start writing Lean proofs tomorrow. The barrier isn’t technology—it’s culture.
Formal verification comes from mathematics and theoretical computer science. Its vocabulary—dependent types, tactics, inductive propositions—belongs to a different knowledge system for most developers writing TypeScript and Python. This isn’t an intelligence gap; it’s an exposure gap. Most CS curricula don’t teach formal verification, most companies’ engineering cultures don’t embrace formal verification, and most toolchains don’t integrate formal verification.
Kleppmann predicts that AI will bring formal verification into the mainstream. I think the direction is right, but cultural inertia may be greater than he expects. Technology can mature within a year (Leanstral’s release proves this). The more realistic resistance is that most teams don’t have a habit of writing specs, haven’t integrated verification into CI, and don’t have budget to pay for “machine-provable correctness.” These organizational-level changes are typically much slower than technological ones.
But there’s reason for optimism: Leanstral is open source, Lean 4’s toolchain is actively developing, and the community is growing. If someone builds a good enough IDE integration on top of these—letting developers run formal proofs the way they run pytest—the cultural gap might close faster than expected.
Historically, every usability leap in verification tools has triggered a jump in adoption. Type checking went from optional to default in about a decade (TypeScript’s rise is a microcosm). Linting went from “some people use it” to “not using it is unprofessional” in even less time. The usability leap for formal verification hasn’t happened yet, but AI might be exactly that catalyst—not because AI teaches developers to write proofs, but because AI writes the proofs for them, and developers only need to write the specification.
What ultimately drives change won’t be papers or blog posts (this one included), but a sufficiently low-friction tool. If someday “write a correctness proof for this code” truly approaches the cost of “run a linter,” vericoding will likely permeate everyday development as naturally as type checking did.
Back to Economics
The core argument of this article in one sentence: the real leverage in AI code generation isn’t on the generation side—it’s on the verification side.
Vibecoding pushes on the generation side—smarter models, larger context windows, more precise instruction following. This path has a diminishing-returns problem: going from 90% correct to 95% correct is progress, but you still need humans to find the remaining 5% of errors. The cost of human review doesn’t decrease linearly as errors become fewer—the cognitive load of reviewing an entire codebase is essentially fixed.
Vericoding pushes on the verification side. With a strong enough verifier, the importance of generation quality drops significantly—wrong outputs are automatically rejected, correct ones automatically accepted. With an imperfect verifier (like property-based testing), generation quality still matters less—most errors are caught automatically, and humans only need to handle what slips through.
At the intersection of these two paths, the economics change fundamentally.
seL4’s 20 person-years and 200000 lines of proof scripts are no longer the only reference point for formal verification. Leanstral achieves a 21.9 FLTEval score for $18, and 31.9 for $290. These numbers are still a distance from industrial-grade application—FLTEval benchmark problems and real-world project complexity aren’t in the same league—but the cost curve is falling steeply, and the three variables of small models, parallel sampling, and verifiers are each improving independently.
The economics of formal verification are moving from “only aerospace and defense can afford it” toward “startups can consider it.” How fast? I’m not sure. But the direction is not in question.
When formal verification becomes cheap enough, looking back at the vibecoding era will probably feel like how we look back at the era before type checking—not that it was unusable, but: why would you take that risk?