Probabilistic Compiler
Fifty years of software discipline, at a thousand times the clock rate.
The unit was always the probabilistic compiler
Engineers ask whether agent-coding is the next jump up the abstraction stack - assembly to C, C to Python, Python to prompt - and the question is right about the shape. Layers stack. What it gets wrong is an assumption hidden underneath: that any layer below the prompt was ever deterministic.
Translating business requirements into running systems was always a probabilistic act. The unit of analysis was always the probabilistic compiler - the human production function - and the engineering organization built around it was always the apparatus of pushing probable outcomes rightward on an acceptability frontier.
“Probabilistic compiler” needs an early disambiguation. It does not mean the probabilistic programming compilers of Stan, Pyro, and LLaMPPL, which translate probabilistic programs into inference algorithms. It does not mean Karpathy’s “Software 3.0” claim that natural language is now the programming layer and the LLM is the new compiler the prompt runs against. The claim here is older and more boring.
The compiler is the human production function plus the engineering organization around it, and it has been probabilistic since the first time two engineers, handed the same ticket, produced two different artifacts.
There are real deterministic layers in software production. The compiler that turns C into assembly is deterministic. Linters, type checkers, CI pipelines, automated test suites are all deterministic. They are guard rails and signposts installed around the probabilistic core, several of them simply multiplicative: they take higher-level abstractions and make execution faster, more consistent, more enforceable. The engineering organization is the full apparatus humans built around the probabilistic core: a mix of deterministic tooling where compliance can be encoded, and probabilistic human processes (code review, RFCs, planning, taste) where judgment cannot. None of the deterministic layers is the unit of analysis. The human production function is.
The acceptability frontier is the surface inside the probabilistic compiler that matters. Outputs distribute across it from the left edge, where work just clears the bar, to the right edge, where work exists that everyone could point at as good engineering even if they would have made slightly different choices. The job of the engineering organization has always been to push the distribution rightward. The same shape, repeated for fifty years, across every change in the substrate execution ran on.
What changes with LLMs is not the unit. The unit is still the probabilistic compiler. What changes is the cycle time. Most of the playbook still applies. Some of it does not. The rest of this piece is the work of sorting which parts survive directly, which parts have to be composed explicitly because slowness used to do the composing for free, and what an engineer or leader does on Monday morning with that information.
What fifty years of software development actually taught us
The disciplines that emerged across fifty years of building and iterating on software systems are not contingent on the substrate. They survive substrate changes because they encode the examination shape, not the production medium. They are the platonic truths of software development. Platonic in the classifier sense, not the metaphysical one. They show up the same way in adjacent knowledge-work crafts where the production function is also human and probabilistic.
Multi-stage examination
Multi-stage examination is the first one. The discipline of asking the right question about the same concern at design, planning, build, and review. Fagan named it as an industrial practice at IBM in 1976: design and code inspections executed in strict order, with entry and exit criteria, documented defect-removal effectiveness of eighty to ninety percent, and resource savings of up to twenty-five percent. Adopted at IBM, Hewlett-Packard, Motorola, NASA. The discipline predates the name; what Fagan did was make it legible. Beck named the smallest-scope version of it with TDD in 2002: pair every doing-skill with a reviewing-skill at the smallest unit. Brooks named the deepest version of the truth in 1986: the essential complexity of the conceptual construct is what survives every change in tooling; accidental complexity is what tooling eats.
Taste-encoding
Taste-encoding is the second one. The artifact whose value is encoding judgment rather than enforcing compliance. Style guides that say “we prefer X over Y because Z” - the because is the load-bearing word - make any execution layer pointed at them sharper. Style guides that say “follow PEP 8” produce a wash.
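The difference is easiest to see side by side. A hypothetical style-guide excerpt - every rule and rationale below is invented for illustration, not taken from any real guide:

```text
style-guide.md (hypothetical excerpt)

Compliance rule (a wash - any linter already enforces it):
- Follow PEP 8.

Taste-encoding rules (the because is load-bearing):
- Prefer returning a typed result object over raising for expected
  failures, because callers are then forced to handle the unhappy path
  explicitly and review can see which failures were considered.
- Prefer composition over inheritance for service clients, because our
  retry and auth wrappers need to stack without a diamond.
```

The second kind gives any execution layer - human or agent - a reason it can generalize from; the first gives it nothing a linter had not already said.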
Biasing trajectory
Biasing trajectory is the third one. First steps - what to build, how to decompose, what to test for - dominate trajectory. Execution amplifies whatever it was pointed at, including the wrong thing. The most expensive bug is the one in the architectural decision; the cheapest is the one in the line of code. Senior engineering converges, across organizations, on putting disproportionate weight at the bearings stage - because that is where the trajectory is set.
Trust-distribution
Trust-distribution is the fourth one and the most often misread. Code review was never primarily a quality mechanism. It was the surface humans built around their own probabilistic output to make it reviewable and accountable to other humans. The function was trust-distribution; quality emerged as a byproduct. Stripe ran on this principle as it scaled from a few engineers to thousands. Review was the surface through which judgment got distributed across a codebase maintained by hundreds of contributors, the same culture that produced Patrick McKenzie’s principle that if one finds oneself nitpicking a pull request, one should generally fix the linter or find something productive to do. Engineering leaders who price review as a quality mechanism see it as the thing to automate. Leaders who price it as a trust-distribution mechanism see it as the thing whose substrate needs replacing when the substrate of accountability changes.
These are platonic because they encode the shape of the work rather than the medium it runs on. Editorial discipline in publishing - developmental, line, copy, proofreading, executed in strict order, with each stage refusing premature compression into the next - is the same shape. These disciplines did not emerge because someone wrote them down. They emerged because organizations that did not run them produced visibly worse software than organizations that did. The discipline is what pushes the distribution rightward on the acceptability frontier.
What LLMs change, and what they don’t
The change is to the substrate, not to the unit. Cycle time is the measure of it. The same engineer who would have spent three days writing a function gets the artifact back in five minutes. The probabilistic compiler that the function passes through is now running at a clock-rate three orders of magnitude faster than the one humans were used to. The unit of analysis has not changed. Outputs still distribute across an acceptability frontier; the engineering organization’s job is still to push the distribution rightward; most of the playbook still applies.
The parts that survive directly are the ones that encoded the examination shape rather than the production medium. Taste-encoding scales straight through. The artifact’s job is to encode judgment for any execution layer, and the new execution layer reads it just as well as the old one. The bearings work matters more, not less, when the execution layer is cheap. Multi-stage examination scales through if it is composed explicitly. That last condition is where the work is.
The parts that need new composition are the parts that depended on slow human work as substrate. Multi-stage examination existed in human engineering organizations because the work was slow enough to invite intervention at every stage. When a function takes three days, there is time to ask testability questions at design, decomposition questions at planning, taste questions during execution, and reconciliation questions at review. The discipline was invisible because slowness produced it by default. Teams did not name multi-stage examination as a discipline; they just had it. Speed up the substrate by three orders of magnitude and the slowness is gone. The discipline is not deficient. The conditions that produced it for free have evaporated.
The temptation to one-shot is the failure mode the discipline was always there to prevent, and the temptation arrives precisely because the substrate is fast. The same engineer who would never accept a function written without testability questions at design will accept one from an agent because the prompt felt complete and the artifact already exists in five minutes. The machine shop being built around the LLM gets read as an automated assembly line, a factory floor where uniform output gets pressed out at scale. That is the category mistake. The LLM is not a factory machine. It is another probabilistic substrate, with its own distribution across the acceptability frontier. The machine shop you build around it is itself a probabilistic compiler, just operating at a much faster clock rate.
The clearest precedent the audience already has cached for this kind of substrate change is the spreadsheet. VisiCalc and Lotus 1-2-3 collapsed the cost of running a calculation by orders of magnitude; analyst headcount went up, not down; demand for analysis exploded. The optimistic reading of LLMs is that software work goes the same way. The narrow defensible difference - and it is narrow - is that the spreadsheet automated one layer of the analyst’s stack, while the LLM automates most of the execution layer simultaneously, including the apprenticeship rungs that produced the next generation of seniors. Software is the case where the cheap-execution layer eats the manufacturing pipeline for its own future seniors. The continuity thesis holds. The apprenticeship-rung consequence is real and named in the next section.
Repricing the apparatus
The engineering apparatus reprices asymmetrically, and the asymmetry is not random. Each component sorts by who its consumer was: a future human reading the artifact, a human catching defects another human produced, or the discipline itself.
Negative half-life is the tier of components whose consumer was a future human and is no longer. Onboarding RFCs and README architecture overview sections were built to replay context for the next human in the room. The new contributor reads the codebase cold in seconds, and the prose drifts from reality faster than humans can update it. The artifact is often worse than nothing because it confidently mis-describes the system the agent is about to operate on. Code comments for “future engineers” persist as a loyalty signal - a real engineer wrote this - exactly the apparatus surviving past its consumer.
Short half-life is the tier of components whose consumer was a human catching defects another human produced. Line-level code review for correctness collapses because bugginess gets tested into the floor by the same loop that produced the code. Mechanical refactoring as senior-engineer load collapses because it is the task agents are best at. The load-bearing entry in this tier is the apprenticeship rung.
The “good first issue” labor was the layer juniors used to grow on. New-grad hiring at Big Tech is down twenty-five percent year-over-year and over fifty percent from 2019, and employment in software for ages twenty-two to twenty-five is down roughly twenty percent from its late-2022 peak. The pipeline that manufactured next-generation seniors lost its substrate. The leadership tier has to rebuild the pipeline at plan-time and review-time, pulling juniors into bearings work earlier than the old apprenticeship ladder placed them, or live with the consequences in 2031.
Long half-life is the tier of components whose consumer is the discipline itself. Taste-encoding documents compound. Plan-time decomposition compounds. Integration tests migrate from safety net to executable spec. Same tests, different role; pricing inverts from “did they pass” to “does the suite encode what the system should be doing.” The substrate change makes these components more valuable, not less, because the floor-quality of execution against them just got cheap and the principles themselves are the differentiator.
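What the inversion looks like in a test file, as a pytest-style sketch - the payments endpoint is hypothetical and the `client` and `seeded_charge` fixtures are assumed, not real:

```python
# Hypothetical integration test read as an executable spec: its value is
# the invariant it encodes ("capture is idempotent"), not the lines it
# happens to execute along the way.

def test_capture_is_idempotent(client, seeded_charge):
    """Replaying a capture request must never double-charge the customer."""
    first = client.post(f"/charges/{seeded_charge.id}/capture")
    second = client.post(f"/charges/{seeded_charge.id}/capture")

    assert first.status_code == 200
    assert second.status_code == 200  # the replay is accepted, not rejected
    assert first.json()["amount_captured"] == second.json()["amount_captured"]
```

Read as a safety net, this test either passes or it does not. Read as a spec, it is the sentence in the suite that says what the system must keep true no matter who, or what, rewrites the handler.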
The asymmetry is already metered. METR’s 2025 randomized controlled trial on experienced open-source contributors found developers nineteen percent slower in completion time when allowed to use AI tools, while the same developers believed they had been twenty percent faster. Cursor’s analysis across tens of thousands of users found organizations merging thirty-nine percent more pull requests after the agent became their default. Both findings price the same thing. The execution layer compressed; the examination layer absorbed the saved time and more; the gains land where the apparatus was already shaped to absorb the throughput, and fail to land where it was not.
Apparatus components without a clear answer yet sit outside this sort. Observability under probabilistic execution, on-call rotations for code that agents wrote, junior onboarding when the rung is gone: these are the load-bearing structural questions the half-life frame does not yet price. Each is its own essay.
The cleanest historical precedent for this shape of repricing is cloud and the emergence of SRE. AWS launched in 2006; enterprise adoption ran from 2008 to 2020; the execution layer of sysadmin work compressed from weeks to API calls. The unit of analysis - running a reliable service - did not change. The discipline survived and strengthened. The role-name changed from sysadmin to SRE or Platform Engineer, with comp tracking the new shape upward. Google’s Site Reliability Engineering opens on the continuity claim: SRE is what happens when you ask a software engineer to design an operations team. The discipline was always engineering. Cloud just compressed the labor that used to disguise it.
Composing for the new pace
Editorial discipline is the mature analog for multi-stage examination, encoded in publishing for over a century as a four-stage intervention executed in strict order. Developmental editing asks whether the thesis, structure, and scope are right and refuses to engage prose-level work until the architecture is settled. Line editing asks whether each sentence carries its weight, in the right register, at the right cadence, and refuses to engage grammar until the prose is doing the right work. Copy editing asks whether the work is internally consistent and grammatically clean. Proofreading catches what survived. Each stage refuses premature compression into the next. The depth of intervention decreases monotonically as you move downstream.
The strongest editorial fact is barely controversial inside the publishing industry: writers-with-editors produce visibly better work than writers-without-editors. The New Yorker has long maintained a fact-checking department of sixteen checkers who speak to every person mentioned in every story, even when not quoted. The discipline survives editor turnover, writer churn, and deadline pressure because it is encoded in the institution rather than in any individual editor. The shape is platonic. The substrate has changed across a century of publishing - typewriter to word processor to collaborative cloud editor - and the discipline survived every one of them, because what is being encoded is the examination architecture, not the production medium.
Engineering has half of this. Code review is final-pass-only. Architectural review is plan-stage-only and episodic. The two middle stages - idea-stage examination as part of how problems get framed, and iterative examination during execution - are improvised or skipped. Production organizations got away with the gap for fifty years because slow human work made the gap invisible. The discipline was carried out as a byproduct of the substrate.
Testing is the operational example for the new pace. Testing is not a skill the LLM installs once against a coverage target. Testing is a concern examined at design (is this thing testable?), at planning (what tiers of tests, at what assertion targets?), at execution (what does a great test of this kind look like at this tier?), and at review (did the plan survive contact with the code?). Coverage targets are the final-pass-only version of testing: a metric for review-stage compliance that pretends the other three stages did not matter. Hand an LLM a coverage target and you get tests that hit the lines and miss the failure modes. The system breaks anyway.
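A minimal illustration of the gap, built around a hypothetical retry helper. The first test is what a coverage target buys; the second encodes the failure mode the function exists for:

```python
import pytest

def fetch_with_retry(fetch, attempts=3):
    """Call fetch(), retrying on TimeoutError up to `attempts` times."""
    for i in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if i == attempts - 1:
                raise  # budget exhausted: surface the failure to the caller

# Coverage-target test: executes the happy-path lines, proves nothing
# about the behavior the function was written to provide.
def test_fetch_returns_value():
    assert fetch_with_retry(lambda: 42) == 42

# Failure-mode test: encodes the design-stage answer ("retry exactly
# `attempts` times, then surface the timeout").
def test_fetch_retries_then_surfaces_timeout():
    calls = []
    def flaky():
        calls.append(1)
        raise TimeoutError
    with pytest.raises(TimeoutError):
        fetch_with_retry(flaky, attempts=3)
    assert len(calls) == 3  # retried exactly the budgeted number of times
```

Both tests move a coverage number. Only the second would catch an agent that “simplified” the retry loop into a single attempt.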
Building a machine shop that veers right on the acceptability frontier is the work of installing the multi-stage discipline as context. The verification surface in production is not a single review gate at the end. It is a composed layer - evals, drift detection, replay, fallback paths, human-in-the-loop escalation - each examining the same output from a different angle, the way the editorial discipline examines a manuscript at idea, structure, sentence, and proof. Cursor describes its agent harness as exactly this: offline benchmarks, online A/B tests, custom acceptance metrics, LLM-based sentiment analysis. Stripe encodes the same discipline organizationally: pre-installing Cursor on developer machines and shaping every agent invocation with Cursor Rules. Human judgment is the scarce resource; the discipline of multi-stage examination is what frees it from work tooling can encode.
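A sketch of the composed shape, assuming invented gate names and stub checks - in a real system the stubs would be eval suites, drift detectors, and replay harnesses. The structure is the point: each gate examines the same artifact from its own angle, in a fixed order:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    gate: str
    passed: bool
    detail: str = ""

def run_gates(artifact: str,
              gates: list[tuple[str, Callable[[str], tuple[bool, str]]]]) -> list[Verdict]:
    """Examine the same artifact through each gate in order, failing fast."""
    verdicts = []
    for name, check in gates:
        ok, detail = check(artifact)
        verdicts.append(Verdict(name, ok, detail))
        if not ok:
            break  # later gates assume the earlier ones held
    return verdicts

# Stub examiners (all hypothetical) standing in for the real layers:
def offline_eval(a): return (len(a) > 0, "benchmark score above baseline")
def drift_check(a):  return ("TODO" not in a, "matches the plan it was generated from")
def replay(a):       return (True, "clean run against logged traffic")

gates = [("offline_eval", offline_eval),
         ("drift_check", drift_check),
         ("replay", replay)]

for verdict in run_gates("def handler(event): return ack(event)", gates):
    print(verdict)
```

The fail-fast ordering mirrors the editorial discipline: there is no reason to line-edit a manuscript whose thesis is wrong, and no reason to replay traffic against code that has already drifted from its plan.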
Build your machine shop the way you have always built your engineering organization. Multi-stage examination, applied explicitly, installed as context. Anything you build that pretends the LLM is a factory machine will produce factory-shaped output: uniform, on-spec, and on the wrong side of the acceptability frontier.
References
The disciplines and receipts cited above, grouped by section.
The platonic truths (§2)
Michael Fagan, “Design and Code Inspections to Reduce Errors in Program Development,” IBM Systems Journal 15:3 (1976). The original treatment of multi-stage examination as an industrial practice, with the defect-removal and resource-savings numbers cited in §2.
Kent Beck, Test-Driven Development: By Example (Addison-Wesley, 2002). The smallest-scope version of multi-stage examination — pair every doing-skill with a reviewing-skill at the unit level.
Fred Brooks, “No Silver Bullet: Essence and Accident in Software Engineering” (UNC TR86-020, 1986). The deepest version of the truth: essential complexity is what survives every tooling change; accidental complexity is what tooling eats.
Patrick McKenzie, “What Working At Stripe Has Been Like,” Kalzumeus (March 2019). Source for the §2 trust-distribution principle: “if one finds oneself nitpicking a pull request, one should generally either a) fix the linter or b) find something productive to do with one’s time.”
The Software 3.0 dispatch (§1)
Andrej Karpathy, “Software Is Changing (Again),” YC AI Startup School (June 2025). The “Software 3.0” framing that the §1 disambiguation sets this piece’s claim apart from: natural language as the programming layer, the LLM as the new compiler.
Repricing the apparatus (§4)
SignalFire, State of Talent Report 2025. Source for the new-grad hiring numbers: Big Tech down twenty-five percent year-over-year, over fifty percent from 2019.
Erik Brynjolfsson, Bharat Chandar, Ruyu Chen, “Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence,” Stanford Digital Economy Lab (August 2025). ADP-payroll-microdata source for the age-22-25 software-developer employment decline.
METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (July 2025). Randomized controlled trial on sixteen experienced open-source contributors across mature repositories. Headline finding: nineteen percent slower in completion time when AI tools were allowed, against a twenty percent perceived speedup. The cleanest single receipt for “execution compresses, examination does not.”
Cursor, “The Productivity Impact of Coding Agents” (2026). Eligible-vs-baseline cohort analysis across tens of thousands of users; thirty-nine percent more pull requests merged at organizations after the Cursor agent became the default. The positive receipt for the same asymmetry — gains land where the apparatus is shaped to absorb them.
Amazon Web Services launch press release (March 14, 2006). The starting line for the cloud-era apparatus-repricing precedent.
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (eds.), Site Reliability Engineering (Google / O’Reilly, 2016). The book that opens on the continuity claim cited in §4: SRE is what happens when you ask a software engineer to design an operations team.
Composing for the new pace (§5)
Brent Cunningham, “The World’s Largest Fact-Checking Operation,” Columbia Journalism Review. The documented sixteen-checker department at The New Yorker, the discipline encoded at institutional level rather than in any individual editor.
Anysphere, “Continually Improving the Cursor Agent Harness,” Cursor engineering blog. The composed verification surface in production — offline benchmarks, online A/B tests, custom acceptance metrics, LLM-based sentiment analysis.
Anysphere, “How Stripe Rolled Out a Consistent Cursor Experience for 3,000 Engineers,” Cursor case study. Stripe’s organizational encoding of the multi-stage discipline — pre-installation and Cursor Rules as substrate.
John McPhee, “Checkpoints,” The New Yorker (February 9, 2009). The literary-canon treatment of the same fact-checking discipline cited in §5.

