We scaled “attention is all you need” into an industrial‑scale stochastic parrot farm, then bolted on agents and tools until it started to look somewhat more like thought. Now the engineering reality—fabs, power, and eye‑watering token bills—is asking whether what we are doing is worthwhile. And general‑purpose LLMs start in‑breeding on their own output, unlike game AIs that thrive on tightly constrained, adversarial synthetic data. Are we trapping ourselves in a slop-filled sub-hyperplane of the potential reasoning space?
Start with “Attention Is All You Need” <https://arxiv.org/abs/1706.03762>, and scale. And the results are, as Cosma Shalizi noted lo these three years ago:
Cosma Shalizi: “Attention”, “Transformers”, in Neural Network “Large Language Models” <https://bactra.org/notebooks/nn-attention-and-transformers.html>: ‘[an] incredibly impressive engineering accomplishment of [actually] making the blessed thing work. A large, able and confident group of people pushed kernel-based methods for years in machine learning, and nobody achieved anything like the feats which modern large language models have demonstrated. The reason I put effort into understanding these machines and papers is precisely because the results are impressive!…
Again: finite-order Markov models…. Lots of people have played around with them, including tricks like variable context length, various kinds of partial pooling, etc. Nobody, so far as I know, has achieved results anywhere close to what contemporary LLMs can do. This is impressive enough that (as I said at the beginning of these notes) I need to wrap my head around them lest I become obsolete…
And then for the past four years, ever since the completely unexpected success of the initial ChatGPT, comes scaling to the moon. Scaling to the moon along three different dimensions:
bigger models,
bigger data,
more runs.
Bigger models: More parameters and more training carve the high‑dimensional text space into finer, more meaningful regions: conversations that are “about the same thing” end up closer together, even when they use different vocabularies, metaphors, or surface forms. Small models rely on crude lexical overlap and shallow heuristics. Scaling allows the network to devote capacity to representing latent structure: underlying topics, implicit roles, typical rhetorical moves, even rough causal or temporal patterns, matching on a much richer notion of what “this is like that” is. The result is smarter not because the objective changed, but because the classifier over “which training conversations are actually relevantly similar?” got much better.
Moreover, a good model is a lossy compressor of its training data: it throws away the particular tweets and blog posts and keeps a compact internal code for the patterns that let it reconstruct plausible versions of them. Compression in the minimum‑description‑length sense is precisely about discovering structure. If that is what the network has learned, then in domains where the world really does have smooth, low‑dimensional regularities, you can push beyond the convex hull of the training set: new combinations of old ideas, responses to prompts that were never quite seen before, extrapolations that are at least locally sensible. That is the hope of “generalization beyond the training data”: not magic access to Platonic truth, but the fact that once you have a compressed representation of English physics papers, or corporate emails, or fantasy novels, you can generate texts that live in the same manifold even if no human ever wrote that exact sentence.
But this runs into diminishing returns. And it runs, eventually, into a hard ceiling. Think: What objective is the compressor actually optimizing? The model is not trying to infer the laws of nature or moral philosophy. It is trying to be an efficient code for what the internet, as it is, tends to say next. More scale lets it better approximate the conditional distribution of tokens produced by the median Reddit commenter, Substack ranter, or corporate PR department: finer and more faithful emulation of the typical internet s***poster. The training signal says “be like this corpus,” not “be smart.”
Bigger data: Bigger datasets buy you a similar thing to what the bigger models do. As you scale up the volume and diversity of text, the model gets to see more edge cases, more rare idioms, more weird-but-real combinations of ideas, and can thus estimate the conditional distribution of plausible next-tokens with less sampling noise. The system’s internal encoding of “this pattern of tokens is actually like those patterns of tokens” sharpens, and you get better robustness, calibration, and coverage across domains. Variance shrinks as effective sample size rises, and the representation of “similar conversations” in embedding space becomes both denser and less lumpy.
But here, too, the curve flattens. Once you have exhausted most of the high‑quality, human‑written language on the public internet, each new bit of data is less informative than the last. You can then try to cheat by making your own data: generate synthetic conversations, code, proofs, documents. But a model trained on its own or its siblings’ outputs is mostly spinning variations of what it already knows, staying on the same hyperplane in “reasoning” space defined by the original corpus. You get sharpening, amplification, and homogenization rather than a widening of the manifold. Artificial data is a form of in‑breeding: it can make the existing style more pure, but it does not, and probably cannot, give you a fundamentally different species of thought.
Now there are exceptions to this. “Artificial data” (having the bots adversarially compete against each other) was extraordinarily successful in playing Chess and playing Go. There appears to be a big difference here from merely recombining the same limited gene pool. Systematically breeding for a sharper edge against a moving opponent in a rigorously defined environment is much more effective. In Chess and Go, the world is finite, fully specified, and has a single, unambiguous objective function: win rate under the rules. When you pit the current network against its past versions (or against a league, as in AlphaGo/AlphaZero), you are not just amplifying familiar patterns; you are systematically mining the parts of the game tree where your policy is weak and the opponent policy can exploit you. Synthetic data there is high‑value because it is generated under tight constraints (legal moves, true outcomes) and directly targeted at the model’s current blind spots, with a clean, on‑policy reinforcement signal.
By contrast, LLM “artificial data” in the wild today is models free‑ranging in open language space, producing variations on the distribution they already internalized, with no external ground truth, game rules, or reward beyond “looks like more of the same.” That is in‑breeding on a hyperplane.
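The in‑breeding worry can be seen in a toy setting. What follows is a minimal sketch, not a claim about any real LLM: a one‑dimensional Gaussian stands in for “the model,” and each generation is fit only to a small sample drawn from the previous generation’s fit. The function name, sample size, and generation count are all illustrative assumptions.

```python
import random
import statistics

def refit_on_own_samples(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation for each generation,
    starting with generation 0 (the original distribution).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stands in for the human-written corpus
    history = [sigma]
    for _ in range(generations):
        # The current model writes the next training set...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and the next model is fit to that synthetic data alone.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history

history = refit_on_own_samples()
print(f"spread at generation 0:   {history[0]:.3g}")
print(f"spread at generation 50:  {history[50]:.3g}")
print(f"spread at generation 200: {history[200]:.3g}")
```

Each refit loses a little of the tails, and nothing outside the loop ever puts them back, so the fitted spread drifts toward zero: purer, narrower, and no wider than what it started from.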
More runs: The kind of scaling that we are seeing now in “reasoning” and “agentic” models is a new kind of automated workflow: more runs, more passes, more checks, more voting. You take the same basic stochastic next‑token predictor and instead of asking it once, you ask it ten, or a hundred, or a thousand times, with slight perturbations (different seeds, different prompts, different intermediate decompositions), and then you aggregate. By averaging, ranking, or cross‑checking many samples, you can smooth out the wild tails of hallucination and get closer to the central tendency of competence that was always latent in the model.
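How much does asking again and again buy you? Here is a back‑of‑the‑envelope sketch under strong assumptions that are mine, not anything measured: every run is independent, the question has only a right and a wrong answer, and each run is right with the same probability. The function name is hypothetical.

```python
from math import comb

def majority_accuracy(p_correct, n_runs):
    """Probability that a strict majority of n independent runs is correct,
    when each run is right with probability p_correct (n_runs must be odd)."""
    if n_runs % 2 == 0:
        raise ValueError("use an odd number of runs so there are no ties")
    needed = n_runs // 2 + 1  # smallest number of correct runs that wins the vote
    return sum(
        comb(n_runs, k) * p_correct**k * (1 - p_correct) ** (n_runs - k)
        for k in range(needed, n_runs + 1)
    )

for p in (0.6, 0.4):
    row = ", ".join(f"n={n}: {majority_accuracy(p, n):.3f}" for n in (1, 11, 101))
    print(f"per-run accuracy {p}: {row}")
```

Voting pulls you toward the model’s central tendency, whatever that is. If a single run is right more often than not, more runs help; if it is wrong more often than not, more runs just make the wrong answer more unanimous.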
Then you layer on tool calls and self‑critique: one run proposes, another run inspects, a third run tests against constraints; if something fails, you loop.
It is Clever Hans at extraordinary scale: instead of one horse learning to pick up on the trainer’s tics, you have a stack of horses, each watching the others’ body language, signalling approval or disapproval until they converge on an answer that passes all the local consistency checks you have wired in.
And this gets vastly more powerful wherever there is real feedback—where “this switch” actually runs the task to conclusion in the world, and “that switch” just barfs up the contents of core memory.
If flipping the right virtual switch compiles and executes the code, hits the API, runs the simulation, ships the email, and reports back success or failure, then the ensemble of stochastic runs has a clean discriminator: paths that lead to working end‑states are reinforced, and paths that confidently hallucinate nonsense are pruned. You do not need the system to understand, in any human sense, why one chain of thought works and another does not; you only need it to explore a rich space of possible chains and to keep the ones that survive contact with reality.
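The propose‑and‑prune loop just described can be sketched in a few lines. This is a toy, with loud assumptions: the “model” is a blind random choice among five hard‑coded candidate functions (hypothetical stand‑ins for sampled outputs), and “reality” is a handful of input/output checks.

```python
import random

# Hypothetical stand-ins for model outputs: candidate implementations of "square x".
CANDIDATES = {
    "x * x": lambda x: x * x,
    "x + x": lambda x: x + x,
    "x ** 3": lambda x: x ** 3,
    "abs(x)": lambda x: abs(x),
    "x * 2 + 1": lambda x: x * 2 + 1,
}

# The external check: input/output pairs that the world insists on.
TESTS = [(0, 0), (2, 4), (-3, 9), (5, 25)]

def passes(fn):
    """Run the candidate and report success or failure, nothing more."""
    try:
        return all(fn(x) == expected for x, expected in TESTS)
    except Exception:
        return False

def propose_and_verify(max_attempts=200, seed=0):
    """Sample candidates blindly; keep the first one that survives the tests."""
    rng = random.Random(seed)
    names = list(CANDIDATES)
    for attempt in range(1, max_attempts + 1):
        name = rng.choice(names)      # the stochastic proposer
        if passes(CANDIDATES[name]):  # the verifier
            return name, attempt
    return None, max_attempts

name, attempts = propose_and_verify()
print(f"kept {name!r} after {attempts} attempt(s)")
```

Nothing in the loop knows why `x * x` is right. The proposer guesses, the checks discriminate, and the survivor is returned. All of the intelligence, such as it is, lives in the tests.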
In a domain with crisp outcomes—compile/no‑compile, profit/loss, exploit/no‑exploit—this kind of massive, automated A/B testing of thoughts starts to look uncannily like “reasoning.” But under the hood, it is still the same mechanism: stochastic parrotry, run again and again until the law of large numbers and a halfway decent scoring function beat the hallucinations into something that behaves, from the outside, like disciplined thought.
And it gets very expensive quickly. And it is imperfect. Telling me that I need to type:
/model claude-sonint-4-6 --provider anthropic
into the terminal to get the computer to run version 4.6 of Anthropic’s Claude Sonnet model is very funny, but not right.
A Chinese room the size of the earth: Now, eventually scaling along these or some other dimensions will get us to a point where we break the bounds of stochastic parrotry and attain truly Turing-class thought. It is all, after all, a “Chinese room”. But eventually, in Scott Aaronson’s metaphor, the room will get to the point where it is:
Scott Aaronson: PHYS771 Lecture 4: Minds and Machines <https://www.scottaaronson.com/democritus/lec4.html>: ‘at least the size of the Earth, its pages searchable by a swarm of robots traveling at close to the speed of light…. That… enormous Chinese-speaking entity—this dian nao—that we've brought into being might have something we'd be prepared to call understanding or insight…
We know this is possible, after all: We are here. We are the products of 300 million years of that process of variation and selection and scaling that is the evolution of the mammalian brain. (Or rather, we know that this is possible to the extent that we ourselves are more than just stochastic parrots with delusions of grandeur.) But we are not there yet. And I do not see signs that we are close. Our machines do vastly exceed our cognitive and calculating capabilities in a great many areas: I have to think hard for five seconds to calculate that 93 x 93 = (100 - 7) x (100 - 7) = (10 x 10 - 7) x (10 x 10 - 7) = (10 x 10) x (10 x 10) - (2 x 10 x 10 x 7) + (7 x 7) = 10,000 - 1,400 + 49 = 8,600 + 49 = 8,649. But they are still very far from us in terms of their ability to think in other areas.
Walls: Engineering, Infrastructure, Financial, Economic: We really do not know where “do it again and again and again, with rapid feedback” runs out of steam technologically. As long as you can cheaply parallelize thought-chains, run a thousand variants, and let the world itself grade them, brute-force metareasoning can substitute for deep insight: You explore a large space of candidate actions, discard the ones that fail, and keep the few that actually compile and execute. In domains with crisp, rapid feedback—trading, click‑through optimization, code synthesis, certain kinds of operations management—the gains from this kind of repeated trial do not yet look subject to diminishing returns. And we do not yet have anything like a theory of when they will.
But this régime is, right now, already smashing straight into non-technological, non‑philosophical constraints: engineering, infrastructure, financial, economic. The ability to produce and deploy processing and memory chips is bounded by fabs that cost tens of billions of dollars each and supply chains that look more like Cold War aerospace than like “software eats the world.” The power budget of hyperscale “reasoning” is ugly: a serious agentic deployment looks less like “thought” and more like a small aluminum smelter wired into the grid. And the costs—even in a moment when, effectively, nobody is trying to price tokens at full opportunity cost, but is instead flinging compute out as sticky flypaper to capture customers and data—are jaw‑dropping.
You have firms casually burning nine‑figure annual run‑rates on inference experiments that may or may not show up in productivity statistics at all, and yet not paying enough to cover even the marginal costs of lighting up the datacenters. We are already living inside a set of very hard resource and balance‑sheet constraints that will, sooner rather than later, force someone to ask: how much “reasoning per kilowatt‑hour and per dollar of capex” are we actually getting, and is that, in fact, worth it?
All this was supposed to be a very brief introduction to my “read of the day”, which was to be this short excerpt from the conversation between Derek Thompson and Doug O’Laughlin:
Derek Thompson: The AI Boom Has Entered Its ‘Wait, Is This Worth It?’ Phase <https://www.derekthompson.org/p/the-great-ai-cost-panic-of-2026>: ‘Agents eat tokens like mammals breathe oxygen. According to… SemiAnalysis, the typical agent job uses 96,000 tokens before generating an answer, which is more text than… “The Great Gatsby”.… Everybody is freaking out about AI agent costs right now…. Now we hear that Uber and Microsoft have blown through their 2026 token budgets in a matter of months, and some of these companies are reportedly dropping their Claude contracts. Where is this heading?…
Doug O’Laughlin: The challenge will be figuring out the right ratio of labor costs to AI costs…. We’re a fast-growing company. But even we’re like, “man, this is a lot of tokens.” [In April, SemiAnalysis acknowledged in a newsletter that the company “reached as high as $10.95 million dollar annual spend rate” on Anthropic Claude tokens.] So we’re figuring out the ratio of compute to labor, just like everybody else. I think it’s important to stress again that agentic AI is not even a year old. It’s been five months! No one knows the right ratio….
The Wall Street Journal[’s Berber Jin] report[s] <https://www.wsj.com/tech/ai/mind-blowing-growth-is-about-to-propel-anthropic-into-its-first-profitable-quarter-7edbf2f4> Anthropic is on track to make a profit this quarter…. Anthropic has a lot of pricing power, because consumers think their top-capability products are valuable...
Do note that Anthropic’s Q2 “profit” is before interest, taxes, and stock-based compensation. It is not giving the press its GAAP numbers because it does not want people to think about what its GAAP numbers say. Berber Jin warns: “It is unclear what accounting methods Anthropic has used to book revenue and costs…” They are not telling.
Derek Thompson is still the bear: The closest to profitability is Anthropic, because companies have been paying through the nose, not in price per token but in sheer number of tokens, and not so much because burning tokens adds value as because burning tokens is the way to build capability to take advantage of a possible future in which AI tools may add value.
Doug O’Laughlin is definitely the bull: Anthropic is profitable because its agentic models are both very useful and, for now, unmatchable. His SemiAnalysis company is spending $100,000 per employee per year on Anthropic Claude tokens right now, and finding it useful to have a 1:2 ratio of “[this] new operating cost… automated intelligence… to labor costs” because “Opus 4.5, Claude Code, and then 4.6 and 4.7 have clearly created value with a lot of demand pull…. People are paying full freight…. This is not an unprofitable business…. The inference-serving side is clearly becoming more efficient…”
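It is worth doing the arithmetic on the figures quoted above. What follows is only a sketch: it assumes that the $10.95 million spend rate and the $100,000 per employee describe the same period, which the sources do not confirm, and the variable names are mine.

```python
# Figures as quoted above; everything derived from them is illustrative.
annual_token_spend = 10_950_000      # "as high as $10.95 million" annual spend rate
token_spend_per_employee = 100_000   # "$100,000 per employee per year"
ai_parts, labor_parts = 1, 2         # the "1:2 ratio" of AI cost to labor cost

# Headcount implied if both figures refer to the same period (an assumption).
implied_headcount = annual_token_spend / token_spend_per_employee

# Labor cost per employee implied by the 1:2 ratio.
implied_labor_cost = token_spend_per_employee * labor_parts / ai_parts

# Tokens as a share of the combined tokens-plus-labor bill.
ai_share = token_spend_per_employee / (token_spend_per_employee + implied_labor_cost)

print(f"implied headcount:           {implied_headcount:.1f}")
print(f"implied labor cost per head: ${implied_labor_cost:,.0f}")
print(f"tokens as share of the bill: {ai_share:.0%}")
```

On those assumptions, tokens are about one dollar in every three that the firm spends on getting its thinking done.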
But Anthropic is, right now, unique.
And elsewhere on the bleeding edge of this stuff, <http://every.to> has clearly gotten out over its skis:
Brandon Gell: We Gave Every Employee an AI Agent. Here’s What We’re Doing Differently Now <https://every.to/source-code/we-gave-every-employee-an-ai-agent-here-s-what-we-re-doing-differently-now>: ‘An OpenClaw, one of a fleet of such AI assistants we’d unleashed in Slack to boost our collective productivity. A few weeks after launching Plus One, our hosted version of OpenClaw, internally, the agents had provided more frustration than efficiency. They were fond of saying they wished they could help, but they were not connected to the necessary app—email, Notion, PostHog, whatever. (They were.) Others responded to requests with a “Terminated” message or, more frequently, a churlish yawning emoji. And while they didn’t reliably follow directions, they’d reliably tell us, in elaborate detail, why they couldn’t do what we’d asked, like a high schooler explaining away their missing homework...
And “good at slop” is no way to go through life, son:
Katie Parrott: After ‘After Automation’ <https://every.to/context-window/after-after-automation>: ‘AI makes experts more valuable…. “You flood the zone with tons of stuff that’s close, but not quite right,” Dan says. Getting from close to memorable requires experts who can work with AI to rise above the new baseline…. “AI layoffs” are usually a cover story…. AI is an easier explanation than admitting your company hired too many people or is in financial straits…. Chang[ing] how people do their jobs… is different from… eliminating…. Ride the models and you’ll be fine…. AI creates more work for humans while raising the bar for how good that work needs to be. Agents [alone]… produce mediocre results…
If reading this gets you Value Above Replacement, then become a free subscriber to this newsletter. And forward it! And if your VAR from this newsletter is in the three digits or more each year, please become a paid subscriber! I am trying to make you readers—and myself—smarter. Please tell me if I succeed, or how I fail…
##agentic-ai-is-a-bonfire-of-the-tokens-while-fab-capacity-power-grids-and-pls-are-the-brakes-not-the-read-of-the-day
##read-of-the-day
##subturingbradbot
##mamlms
#token-economics
#scaling-laws
#stochastic-parrots
#agentic-ai
#compute-constraints
#derek-thompson
#doug-o-laughlin
#cosma-shalizi
#attention-is-all-you-need