The Next Great AI Product Will Not Be an Agent Factory
Factories optimize known workflows. Probes discover the ones companies never wrote down
Most enterprise AI products begin with a confession. They ask the customer to specify the work.
Glean asks you to know what to search for. Notion AI asks you to know what to draft. Every agent builder asks you to know which workflow to script. The product’s competence is bounded by the customer’s ability to articulate their own work, which is the thing the product was meant to compensate for in the first place.
This is the central error of the agent era.
Labor is not the bottleneck of knowledge work. Attention is.
Attention, in the operational sense, is the scarce capacity to notice, interpret, prioritize, and route ambiguous signals before they become named work. Every dashboard nobody opens, every Slack thread that scrolls past, every escalation that arrives late, is a withdrawal from the same account.
The bottleneck nobody is naming
Consider the production function of any team inside a modern company. A RevOps team watches pipeline by stage, conversion per segment, three or four dashboards that update on Monday. That is the part of the work the team has learned to articulate. It is almost always a minority of what the team’s performance depends on.
The rest sits in the long tail. The Gong call where a customer mentioned a competitor by name and nobody on the deal team noticed. The Slack thread where someone in support flagged a recurring billing edge case three weeks before it broke. The pull request where a senior engineer typed “this is the third time this month we’ve changed this function” and merged anyway. These signals exist in every company’s systems of record. They go unwatched not because they are useless, but because attention has always been too expensive to spend continuously.
This is Amdahl’s Law applied to organizations. You can accelerate the named subprocesses of a business by an order of magnitude and the business will get only marginally faster if those subprocesses are not where the real constraint lives. The serial pathologies sit in the unarticulated transformations, the silent handoffs, the decisions that were never written down. The current generation of AI products cannot touch that surface. They were architected to optimize what their customers can already describe; they are pricing themselves against a ceiling they helped install.
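A back-of-the-envelope version makes the ceiling concrete. Amdahl’s formula for overall speedup is

S = 1 / ((1 − p) + p / k)

where p is the fraction of the business’s throughput that lives in the named, accelerable workflows and k is the factor you accelerate them by. If the articulated work is p = 0.3 and you make it ten times faster, S = 1 / (0.7 + 0.03) ≈ 1.37: a 10× factory buys a 37 percent faster company, and even infinite k caps out at 1 / 0.7 ≈ 1.43. The numbers are illustrative, but the asymptote is the point.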
Klarna ran the cleanest version of the factory play in 2024: AI handled 2.3 million chats a month, did the work of roughly seven hundred full-time agents, dropped resolution time from eleven minutes to under two, and was projected to add forty million dollars in operating profit. By May 2025 the company was rehiring; its CEO told the press that the cost-cutting push had over-rotated at the expense of quality. The factory deployment could not see what the human agents were doing in their third minute, which was silently reframing the customer’s literal question into the customer’s actual problem. That reframe is where the value lived. The factory frame could not price it because it had no line item for work the customer never thought to describe.
The wrong product
By “factory” I mean any AI product whose architectural premise is that the workflow is already known. Name the task, script the steps, attach the tools, run the agent, sell it as a labor unit. The category drifted there because the financial story is clean: factories convert operating expense to capital expense, they replace human units with software units, they produce a number on a slide a CFO can underwrite.
The factory language is useful precisely because it gives the financing layer something legible: fixed assets, throughput, utilization, payback. NVIDIA selling “AI factories” is the same move at chip-vendor scale, and at infrastructure scale the frame may be the right one. Training compute, data centers, and the capex behind them genuinely do behave like factories. The problem is letting the language slide downstream into the application layer, where roughly six hundred billion dollars of 2026 AI spend is being underwritten against an architecture that is not, in fact, a factory.
Every company is a unique organism. Its production function is a recursive composition of dozens of role-specific functions, each with its own tacit knowledge and emergent behavior accumulated over years.
These eccentric details are not noise. They are the company.
A factory presupposes a canonical workflow that can be stamped out across customers; the canonical workflow does not exist at the level of detail that matters. Two companies running the same support process have different escalation rules and different unwritten standards for what “resolved” means. Two engineering teams running the same SDLC have different test conventions, different deploy quirks, different load-bearing patterns left behind by incidents nobody documented.
The eccentric load-bearing parts of a company’s work cannot be parameterized in advance. They have to be discovered, in place, by something watching.
Factories optimize what you can describe. Probes find what you couldn’t. Every other architectural choice falls out of that one.
Self-assembling software
Imagine software that lands inside a company the way a Von Neumann probe lands on a planet. It does not arrive with a configured workflow. It arrives with a goal: extend the company’s production function. It also arrives with the capacity to figure out where spending attention will do that.
It reads the company’s systems of record: the Gong calls, the Linear tickets, the deploy logs, the database schemas, the Slack archive. It constructs its own map of the local substrate. Where the map has gaps, it does not ask the human to teach from scratch; it looks for where the company has already solved the problem before, surfaces the evidence, and asks the human to grade it. This dialogue interface, the Inquiry, is how the probe converts the company’s own history into its working memory.
“I see twelve Linear tickets for billing escalations over the last quarter. No runbook is documented anywhere, but I found the Slack threads and PR diffs from the four most recent resolutions. Here is the pattern I am reading across them. Walk me through where I have it right and where I am missing context, and I will use this as the seed dataset for the agent that handles these going forward.”
The human’s cognitive load is editing, not authoring; reviewing, not explaining. That difference is roughly an order-of-magnitude reduction in onboarding cost, and it is the reason the probe can build context at customer scale without a million-dollar professional-services engagement attached to it.
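A minimal sketch of what an Inquiry might reduce to as a data structure, assuming a hypothetical probe codebase; the essay describes the interface, not an implementation, and every name below is invented:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source: str   # e.g. "slack", "linear", "github"
    ref: str      # permalink or ticket id
    excerpt: str  # the span the probe read the pattern from

@dataclass
class Inquiry:
    observation: str                # "12 billing-escalation tickets last quarter"
    inferred_pattern: str           # the probe's draft of the undocumented runbook
    evidence: list[Evidence] = field(default_factory=list)
    human_grade: str | None = None  # filled in by the deputy: correct / partial / wrong
    corrections: list[str] = field(default_factory=list)

def to_seed_dataset(graded: list[Inquiry]) -> list[dict]:
    """Graded inquiries become the seed dataset for the sprite that will
    handle this pattern: the probe authored, the human edited."""
    return [
        {"pattern": q.inferred_pattern, "grade": q.human_grade,
         "corrections": q.corrections}
        for q in graded
        if q.human_grade is not None
    ]
```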
The product is the probe. Not the agents inside it. The whole self-bootstrapping, self-harnessing, self-replicating system that lands at the customer and grows. The agents are sprites: small, many, embedded, alive. They are the probe’s hands, the means by which it bootstraps itself into the local terrain. The probe has the goal. The probe has the taste: the learned local model of what is urgent, what is noise, who should be interrupted, and how much confidence is enough. It carries the eval history and the escalation judgment that decide what matters. The sprites are how it acts. The humans the probe works with become deputies: not operators of the system but its inheritors, accruing the judgment the probe is learning and propagating it laterally to colleagues.
“Self-replicating” deserves its qualifier. What replicates is patterns, through a typed case base, with human approval at the boundary. The probe does not fork itself onto your hardware; it proposes patterns that other probes have learned, your humans decide which ones land, and the local probe absorbs them under your permission model. Discovery is broad. Action is narrow. The probe reads widely, proposes often, writes rarely, escalates whenever confidence or blast radius crosses a threshold. It earns autonomy the way a new employee does: through observed judgment over time, scoped to surfaces a human has already signed off on. Without that, none of the rest is shippable inside an enterprise.
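In code, the discovery-broad, action-narrow posture looks like a routing gate. A sketch under assumed thresholds; the numbers and names are illustrative, not from any shipped system:

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    confidence: float       # the probe's calibrated confidence in this action
    blast_radius: int       # rough count of systems/people the action touches
    surface_approved: bool  # has a human already signed off on this surface?

def route(action: ProposedAction) -> str:
    if not action.surface_approved:
        return "escalate"   # autonomy is earned per surface, never assumed
    if action.blast_radius > 2:
        return "escalate"   # wide writes always go to a human
    if action.confidence < 0.9:
        return "propose"    # draft it, let a deputy approve
    return "act"            # narrow, approved, high confidence
```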
The architectural property that distinguishes this from every other product on the market is that human articulation skill is no longer the ceiling. The probe assembles itself.
Four layers
The probe has four layers, and the load-bearing one is the one nobody is shipping today.
Substrate is the local map: connectors to every system of record, a model of how data transforms across them, and the Inquiry that surfaces tribal knowledge onto the map. Onboarding the probe is the way a company finally writes down what was in everyone’s heads. This is not a feature. It is the byproduct nobody else is selling.
The same byproduct serves humans. A factory product asks the customer to design a generic training flow; the probe watches the examples, objections, support tickets, and workflows that actually exist inside the company, and assembles employee onboarding around reality instead of abstraction.
Drivers are the atomic capabilities the substrate exposes. Reconcile a ledger entry. Open a pull request. File a support ticket. Run a deploy. Every customer has these systems of record, so the probe ships with a curated library and day-one value is real before any local learning has occurred. Drivers are what compress the main job. The probe runs them with the substrate underneath them, which is the part the factory products cannot reach. When a sprite opens a pull request, it knows the team’s review conventions, the unwritten rule about touching the migration directory before a Wednesday deploy, the specific senior engineer who must be tagged on changes near the billing subsystem. The same atomic capability is doing different work, because the context underneath it is different.
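A sketch of why the same driver does different work at different companies, with hypothetical types; the point is that the atomic capability is fixed while the substrate underneath it is not:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Substrate:
    review_conventions: list[str]       # learned from merged PR history
    required_reviewers: dict[str, str]  # path prefix -> who must be tagged
    deploy_rules: list[str]             # e.g. "no migration changes before Wednesday deploy"

class Driver(Protocol):
    def run(self, payload: dict, ctx: Substrate) -> dict: ...

class OpenPullRequest:
    def run(self, payload: dict, ctx: Substrate) -> dict:
        reviewers = [
            owner for prefix, owner in ctx.required_reviewers.items()
            if any(f.startswith(prefix) for f in payload["files"])
        ]
        # The driver is generic; the reviewers, conventions, and deploy
        # rules it applies are this company's, read from the substrate.
        return {"title": payload["title"],
                "reviewers": reviewers,
                "checklist": ctx.review_conventions + ctx.deploy_rules}
```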
The noticing loop is the proposer. It reads the human reaction stream, consults the substrate, borrows patterns from the playbook, and drafts new sprites to try. Its operating principle is the one most agent products get wrong.
You do not manage an agent’s performance. You compound the context the agent is operating in, and the agent gets better.
Every escalation is a labeled example. Every correction is a dataset entry. Every accepted resolution is an eval point. The probe accumulates its own evaluation substrate as it operates, calibrated to this specific company’s taste. Over time the loop also learns which categories of signal deserve sustained attention going forward. Sprites that earn attention survive. The rest are recycled. Drift detection is native, because the probe can see when its automated outputs are diverging from how its humans actually resolve similar cases.
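A sketch of the accounting behind that claim, with an invented event schema; the essay names the principle, this shape is an assumption:

```python
from collections import defaultdict

# evals: sprite_id -> list of labeled points, accumulated as the probe operates
evals: dict[str, list[dict]] = defaultdict(list)

def ingest(event: dict) -> None:
    """Every escalation, correction, and acceptance is a labeled eval point."""
    evals[event["sprite_id"]].append({
        "kind": event["kind"],  # "escalation" | "correction" | "accepted"
        "case": event["case"],
        "label": event.get("fixed_output", event.get("output")),
    })

def drifting(window: int = 50, threshold: float = 0.2) -> set[str]:
    """Native drift detection: a sprite whose recent outputs keep getting
    corrected is diverging from how the humans resolve similar cases."""
    flagged = set()
    for sprite, points in evals.items():
        recent = points[-window:]
        if recent:
            corrected = sum(p["kind"] == "correction" for p in recent)
            if corrected / len(recent) > threshold:
                flagged.add(sprite)
    return flagged
```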
The playbook is the typed case base across customers: typed patterns, eval traces, correction histories, escalation profiles. Patterns generalize across the fleet; data stays home. Customer two onboards faster than customer one. By customer fifty the probe arrives with priors strong enough that bootstrapping is fast. The moat is not the model. The moat is the corpus.
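The split between what travels and what stays home can be drawn as two types with no bridge between them. A sketch; the schema is assumed, not documented:

```python
from dataclasses import dataclass

@dataclass
class PlaybookPattern:
    """What crosses customers: shape, not substance."""
    pattern_id: str
    trigger: str                  # e.g. "recurring escalation with no runbook"
    resolution_shape: list[str]   # abstract steps, stripped of identifiers
    eval_summary: dict            # acceptance/correction rates, not raw traces
    escalation_profile: str       # where prior deployments handed off to humans

@dataclass
class LocalCase:
    """What stays home: this customer's actual tickets, threads, diffs."""
    pattern_id: str
    raw_evidence: list[str]       # never serialized across the fleet boundary
```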
Elastic attention
AWS made compute elastic. The probe does the same for attention.
The easiest way to feel the shift is to ask what you would watch if attention had no marginal cost. Would you inspect every deal in the pipeline every morning, not for the fields already in Salesforce but for the weak signals that suggest the deal has changed? Would you monitor every customer’s website, every competitor’s pricing page, every API doc, every changelog, every Gong call, every product event, every anomalous log line after a major release? Today the answer is no, not because the work is unimportant, but because no team can afford the attention. So companies build rituals around scarcity: weekly pipeline review, quarterly competitive analysis, sampled call review, dashboards someone checks when something already feels wrong. Elastic attention makes the default different. The question stops being “is this worth assigning a human to?” and becomes “what would we learn if this surface were watched continuously, with taste?”
A sprite that drafts most of a vendor contract and waits for counsel to fill the gaps is capacity that costs nothing at rest and scales instantly when demand spikes. That elasticity shows up across three surfaces, and a factory product cannot ship any of them.
The first is compression of the main job. The engineer’s standard loop of reading specs, designing, writing, testing, reviewing, deploying, and monitoring runs several times faster when sprites operating on substrate context can take the boring middle of every step.
The second is the long tail of latent builds: the internal tools the engineer always wanted and never had the bandwidth for, the better runbook, the eval suite that should have existed two years ago, the dashboard that surfaces the metric that mattered all along. The probe lowers the threshold on “is this worth building?” from a week to an afternoon, and capability stock compounds in the background of the company’s main work.
The third is novel attention to problems no one was watching: the Gong call, the Slack thread, the PR diff that the team’s bandwidth could not reach.
All three require the substrate, the drivers tuned to local convention, and the noticing loop running over both. The probe ships all three because all three fall out of the same architectural property.
Elastic attention without taste is sprawl.
Hand a human a thousand agents and they will run out of ideas after a handful. How many people on any team are actually hitting their AI usage limits today? There are limits to the average employee’s ability to articulate and amplify their own production function.
The probe is defined by its taste for how to deploy elastic attention.
The right financial frame is option accounting. The factory frame underwrites task replacement: agents do forty percent of the work, headcount should drop forty percent, payback is fast. This is the math driving the wave of ill-fated layoffs across the industry. It is also the framing the architecture asks you to reject.
The math does not survive contact with the work. The industry assumption it runs on, that LLMs are a labor substitute, has been wrong since the earliest public deployments. Klarna was the canonical early case. In May 2026 Cloudflare announced eleven hundred layoffs as a redesign for the agentic-AI era. The eighteen months in between produced a continuous stream of similar announcements, all underwritten by the same task-replacement math. The pattern is not that the cuts failed.
It is that task accounting prices the wrong thing.
Most of the value of a senior employee is not the tasks she executes but the option to direct her judgment at work that has not surfaced yet. The migration that hasn’t happened. The boundary that hasn’t broken. The strategic call that hasn’t been needed.
Pre-LLM, those options were expensive to hold because the carrying cost was full headcount. Post-LLM, the carrying cost dropped and the volatility rose, and every option in the firm got more valuable on both sides at once. Selling the headcount that carries the option in order to book a short-term cost saving is selling the asset at the bottom.
Under task accounting you cut headcount and pocket the savings. Under option accounting you hold the headcount and let the probe leverage every senior judgment across ten times the surface area it used to cover. The companies getting this right will look less like they are replacing work and more like they are increasing the clock speed of the organization.
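The contrast is easiest to see in numbers, all of them invented for illustration; nothing below comes from the essay or any real deployment:

```python
# Task accounting: cut headcount in proportion to automated tasks.
payroll = 10_000_000           # annual cost of a senior team
task_share_automated = 0.40    # share of named tasks agents now execute
savings = payroll * task_share_automated  # $4M booked, and the option sold

# Option accounting: hold the headcount and widen the surface each
# senior judgment covers, discounting the new surface heavily.
judgment_value = 25_000_000    # value the team's directed judgment produced before
surface_multiple = 10          # coverage with the probe underneath the team
marginal_discount = 0.5        # assume new surfaces are worth half the old
option_value = judgment_value * (surface_multiple * marginal_discount - 1)
# = $100M of newly reachable value against $4M of booked savings
```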
Ramp is the closest public example of the better posture. Not because of any one external product, but because its internal AI work starts from acceleration rather than replacement. Glass gives every employee a configured AI workspace; Inspect is a background coding agent wired into Ramp’s development environment, internal tools, observability, and verification loop, with enough speed to work in parallel with the engineer rather than queueing behind her. The important part is not the agent. It is the harness around it, and the management philosophy behind it: use AI to compress the work so capable people can direct judgment across more surface area. Ramp has not yet built the probe. It is pointing the machinery at the right economic object.
Under the ceiling
The bottleneck was never labor. It was the ceiling every other product installed above attention: the requirement that the customer name the work before software could help with it. The probe ships under that ceiling. It watches before it acts, learns before it automates, discovers the work before anyone has to manage it.
The next great enterprise AI product will not be the one that hands every employee a thousand agents. It will be the one that knows which three should exist before anyone has thought to ask.
Appendix: sources
Klarna case study and walkback
OpenAI, Klarna’s AI assistant does the work of 700 full-time agents — original 2024 case study with the 2.3M chats, 700 FTE-equivalent, sub-2-minute resolution, and $40M figures. https://openai.com/index/klarna/
Bloomberg, Klarna Turns From AI to Real Person Customer Service (May 2025) — initial walkback and CEO statement on cost-vs-quality tradeoff. https://www.bloomberg.com/news/articles/2025-05-08/klarna-turns-from-ai-to-real-person-customer-service
Reuters, Europe’s AI poster child Klarna taps the brakes on chatbots (September 2025) — broader reassessment of Klarna’s AI-substitution thesis and shift back toward human hiring. https://www.reuters.com/business/europes-ai-poster-child-klarna-taps-brakes-chatbots-2025-09-10/
The “AI factory” frame and 2026 capex
NVIDIA glossary, What is an AI Factory? — NVIDIA’s own definition of the AI-factory frame as infrastructure for training, fine-tuning, and inference at scale. https://www.nvidia.com/en-us/glossary/ai-factory/
Reuters, Big Tech investors to gauge payoff as AI spending set to hit $600 billion (April 2026) — source for the 2026 AI-capex figure. https://www.reuters.com/business/retail-consumer/big-tech-investors-gauge-payoff-ai-spending-set-hit-600-billion-2026-04-28/
Cloudflare’s May 2026 layoffs
Business Insider, Read the memo: Cloudflare is laying off 1,100 employees to prepare for “the agentic AI era” (May 2026) — primary reporting on the eleven-hundred-headcount reduction and Cloudflare’s framing of it as an AI-era org redesign. https://www.businessinsider.com/cloudflare-announces-1100-layoffs-amid-ai-focus-shift-2026-5
Ramp’s acceleration-first internal AI posture
Ramp blog, Eric Glyman on using AI to radically boost internal productivity — Ramp’s framing of internal AI work as employee empowerment, with the limiting factor named as adoption rather than model capability. https://ramp.com/blog/ai-for-internal-productivity
Modal, How Ramp built a full-context background coding agent on Modal — case study on Inspect: sandboxed environments, internal-stack integration, parallel capacity, verification loop, share of merged PRs. https://modal.com/blog/how-ramp-built-a-full-context-background-coding-agent-on-modal
Concepts referenced
Amdahl’s Law — Gene Amdahl, 1967. The classical statement that speedup of a system is bounded by the fraction of the work that remains serial.
Von Neumann probe — the speculative self-replicating spacecraft concept, used here as a metaphor for software that arrives with a goal and the capacity to bootstrap itself into local terrain.
Elastic compute — the original AWS economic shift: provisioning compute capacity in seconds, surrendering it when idle, and pricing it as a utility.