<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Epoch AI: Gradient Updates]]></title><description><![CDATA[Gradient Updates is a newsletter by Epoch AI, providing shorter-form commentary on important issues around AI.]]></description><link>https://epochai.substack.com/s/gradient-updates</link><image><url>https://substackcdn.com/image/fetch/$s_!ZsOK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca617831-3128-496f-8aac-33d1fadda48f_176x176.png</url><title>Epoch AI: Gradient Updates</title><link>https://epochai.substack.com/s/gradient-updates</link></image><generator>Substack</generator><lastBuildDate>Sat, 18 Apr 2026 06:13:50 GMT</lastBuildDate><atom:link href="https://epochai.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Epoch Artificial Intelligence, Inc]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[epochai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[epochai@substack.com]]></itunes:email><itunes:name><![CDATA[Epoch AI]]></itunes:name></itunes:owner><itunes:author><![CDATA[Epoch AI]]></itunes:author><googleplay:owner><![CDATA[epochai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[epochai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Epoch AI]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[What does the war in Iran mean for AI?]]></title><description><![CDATA[A prolonged Hormuz crisis probably won't derail the compute buildout]]></description><link>https://epochai.substack.com/p/what-does-the-war-in-iran-mean-for</link><guid isPermaLink="false">https://epochai.substack.com/p/what-does-the-war-in-iran-mean-for</guid><dc:creator><![CDATA[Josh You]]></dc:creator><pubDate>Fri, 10 Apr 2026 16:40:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ELM6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb5dd-d71c-4d3b-98e2-b873a74a503d_1362x876.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/war-in-iran-and-ai/">Epoch AI</a>.</em></p><div><hr></div><p><em>Disclaimer: My background is in the economics of AI and compute. I&#8217;m not an expert in war, diplomacy, or oil and gas, but I&#8217;m familiar with what economic inputs matter for AI. I am also writing about a very dynamic situation. So specific claims about the situation in Iran and Hormuz and its impacts on supply chains should be read as tentative and based on relatively quick research.</em></p><p>Since the US and Israel went to war with Iran at the end of February, shipping through the Strait of Hormuz &#8212; the sole sea route out of the Persian Gulf &#8212; has mostly shut down. 
This has disrupted around <a href="https://x.com/JavierBlas/status/2033540355216359476?s=20">10%</a> of the world&#8217;s supply of oil, as well as exports of natural gas, helium, urea, aluminum, and other commodities. Iran has also struck targets in the Gulf states, notably oil and gas facilities and a few <a href="https://www.datacenterdynamics.com/en/news/iran-attack-hits-an-oracle-data-center-in-dubai-causes-limited-damage/">data centers</a>.</p><p>On April 8, the US and Iran agreed to a <a href="https://apnews.com/article/iran-us-israel-trump-lebanon-april-8-2026-38d75d5e4f1c7339a1456fc99415bb2a">two-week ceasefire</a> that would reopen the Strait of Hormuz, though it is not clear how quickly shipping will resume or whether the ceasefire will hold. As of writing on April 9, Hormuz shipping has not <a href="https://www.nbcnews.com/world/iran/strait-hormuz-shipping-traffic-effectively-standstill-iran-ceasefire-rcna267391">meaningfully resumed</a> amid reports of possible ceasefire violations, and a prolonged conflict still seems very plausible to me.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> If the Iran war and Hormuz closure continue, how might this affect AI?</p><p>I will mainly take a compute-centric view of this question. Compute is the main physical input to AI, and other inputs like labor and data are much less likely to be disrupted by this conflict. So I&#8217;ll focus on how this conflict could affect the production and deployment of AI chips and data centers, assuming Hormuz stays closed for many more months. (But note that war is unpredictable, and I won&#8217;t pretend to know what the longer-term <em>n</em>th-order effects of this conflict may be.)</p><p>Several components of the AI compute supply chain might be directly affected by this war: energy shocks could make it harder to power chip fabrication plants (fabs) and data centers, helium shortages could impact fabs, and direct attacks on the Gulf states could disrupt data center construction and investment flows for AI.</p><table><thead><tr><th>Supply chain component</th><th>Risk of disruption (main sources of risk)</th><th>Importance to AI/compute</th></tr></thead><tbody><tr><td>Fabs in Asia</td><td>Low (energy, helium)</td><td>Critically high</td></tr><tr><td>US data centers</td><td>Low (energy)</td><td>High</td></tr><tr><td>Data centers in Europe and Asia, excluding the Gulf</td><td>Medium (energy)</td><td>Medium</td></tr><tr><td>Gulf state data centers</td><td>Medium-to-high (direct attacks, geopolitical)</td><td>Medium</td></tr><tr><td>Capital funding from Gulf states</td><td>High (oil/gas revenue disruption, risk aversion)</td><td>Medium-to-high</td></tr></tbody></table><p>These aren&#8217;t the only exports disrupted by the conflict (e.g. the Persian Gulf is also an important source of <a href="https://www.reuters.com/markets/commodities/iran-war-rattles-global-aluminium-supply-chain-2026-03-19/">aluminum</a>) and there&#8217;s certainly a risk that I missed something important. But the economics of chip fabrication and data centers will generalize to other inputs.</p><p>Overall, I don&#8217;t think a prolonged conflict in Iran will be a <em>huge</em> deal for the AI industry or for the global compute buildout, unless the war escalates dramatically.
That said, an extended conflict will be broadly harmful to the global economy, with diffuse second-order effects on AI demand, revenue, and investment.</p><h1>The Iran War&#8217;s impact on energy</h1><p>Oil is not commonly used to generate electricity &#8212; the relevant fuel that normally passes through Hormuz is natural gas, which must be chilled into <em>liquefied natural gas</em> (LNG) to be transported by sea.</p><p>The closure of Hormuz disrupted <a href="https://www.brookings.edu/articles/the-iran-conflicts-energy-shocks-are-not-yet-fully-realized/#:~:text=However%2C%20the%20conflict%20will%20also,Rebuilding%20will%20take%20years">20%</a> of the world&#8217;s LNG supply, causing a ~70% increase in natural gas prices in <a href="https://apnews.com/article/energy-eu-oil-gas-iran-supply-65e520c30d94e7b6184e69d37a7cc09a?utm">Europe</a> and <a href="https://oilprice.com/Latest-Energy-News/World-News/Asia-Burns-More-Coal-as-Middle-East-War-Sends-LNG-Prices-to-3-Year-Highs.html">Asia</a> (live tracker <a href="https://www.reuters.com/markets/quote/JKMc1/">here</a>), which both depend on LNG imports for natural gas.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> This could significantly increase the cost of electricity: Taiwan, Japan, and South Korea all get around 30-40% of their electricity from gas (though this share is only 3% for mainland China), and much of Europe has a high share as well.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><figure><img src="https://substackcdn.com/image/fetch/$s_!pgS4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0244c40-352c-4e07-9449-f509a99b5ca9_1600x1130.png" alt="Chart: share of electricity generated from gas, by country"></figure><p style="text-align: center;"><em>Courtesy of Ember Energy Institute and OWID, who also provide a nice world <a href="https://ourworldindata.org/grapher/share-electricity-gas">map</a></em></p><p>By contrast, the US has been insulated from the LNG shock, with prices <a href="https://x.com/JavierBlas/status/2039415057210057111?s=20">stable</a> since before the war.
This is because the US produces more gas than it consumes (it&#8217;s the <a href="https://www.eia.gov/todayinenergy/detail.php?id=64844">largest</a> LNG exporter in the world) <em>and</em> its LNG exports are bottlenecked by the capacity of its exporting <a href="https://www.marketplace.org/story/2026/04/03/why-are-natural-gas-prices-stable-in-the-us">infrastructure</a>. US LNG exports were already at capacity before the war, so in the short run European and Asian buyers cannot bid up US natural gas to the global average price.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> In short, <strong>the Hormuz closure has significantly increased electricity costs in much of Europe and Asia, but not in the US</strong>.</p><p>However, there are two major steps of the compute supply chain that depend on electricity in Asia and Europe: chip fabs and data centers.</p><h3>Energy for fabs</h3><p>Most of the world&#8217;s AI chips are manufactured in East Asia. TSMC in Taiwan handles nearly all leading-edge wafer fabrication and chip packaging, while South Korea hosts two of the three main players in the world&#8217;s memory fab capacity (SK Hynix and Samsung).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>However, despite both regions&#8217; dependence on LNG, I think it is pretty unlikely that a prolonged Hormuz crisis will interrupt AI chip production.</p><p>Fabs consume a lot of energy, but the returns to that energy are enormous. TSMC used <a href="https://esg.tsmc.com/file/public/e-appendix_6.pdf">25,000 GWh</a> of electricity in 2024, around <a href="https://www.datacenterdynamics.com/en/news/tsmc-could-account-for-24-of-taiwans-electricity-consumption-by-2030/#:~:text=Here's%20some%20information%20about%20TSMC's%20electricity%20consumption:,adopts%20the%20energy%2Dintensive%20extreme%20ultraviolet%20lithography%20processes.">8%</a> of Taiwan&#8217;s total electricity consumption. But at the <a href="https://www.taipower.com.tw/2764/2804/2805/59356/normalPost">average price for industrial customers</a> of NT$3.81 (~11 cents USD) per kWh, TSMC&#8217;s energy bill in 2024 would have been just $3 billion, or under 3% of its <a href="https://finance.yahoo.com/news/tsmc-achieves-record-revenue-2024-085914403.html">2024 revenue</a>.</p><p>Making AI chips is <em>very</em> profitable, so there&#8217;s a lot of headroom for power prices to increase. Chipmakers have very good margins (&gt;50% operating margins for both <a href="https://pr.tsmc.com/english/news/3281">TSMC</a> and <a href="https://mis-prod-koce-homepage-cdn-01-blob-ep.azureedge.net/web/attach/116964051831620258.pdf">SK Hynix</a>). And that&#8217;s before the similarly excellent margins at the chip designer/seller level, with Nvidia&#8217;s gross margins at <a href="https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-fourth-quarter-and-fiscal-2026">70-75%</a>. Energy is such a small fraction of the final chip price that TSMC could absorb a doubling of power costs, or pass it to Nvidia, who could easily absorb it too.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>So power prices in Asia would have to increase <em>a lot</em> to threaten chip fabs, e.g. by well over ten-fold.</p>
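<p>To make that headroom concrete, here&#8217;s a rough back-of-envelope in Python. It mostly restates the figures cited above and in footnotes 6 and 8; the ~$90B revenue figure and the energy-share-of-cost step are my own rough additions, so treat this as an illustration, not a model of TSMC&#8217;s actual cost structure.</p><pre><code class="language-python">
# Back-of-envelope: how much headroom do fabs have on power prices?
tsmc_energy_gwh = 25_000      # TSMC electricity use in 2024 (cited above)
price_usd_per_kwh = 0.11      # ~NT$3.81 industrial tariff
tsmc_revenue_usd = 90e9       # 2024 revenue, roughly $90B (assumption)

energy_bill = tsmc_energy_gwh * 1e6 * price_usd_per_kwh   # GWh -> kWh
print(f"Energy bill: ${energy_bill / 1e9:.1f}B, "
      f"~{energy_bill / tsmc_revenue_usd:.0%} of revenue")  # ~$2.8B, ~3%

# Average power draw implied by annual consumption (footnote 8):
print(f"Average draw: {tsmc_energy_gwh / 8760:.1f} GW")     # ~2.9 GW

# Margin waterfall from footnote 6: a $40k GPU at a 75% gross margin
# leaves ~$10k for suppliers (mainly TSMC); at a 50% operating margin,
# TSMC's all-in cost is ~$5k per GPU.
tsmc_cost_per_gpu = 40_000 * (1 - 0.75) * 0.50              # $5,000
# If energy is ~3% of TSMC revenue, it is ~6% of that ~$5k cost, so a
# ten-fold power price adds roughly 9 * 6% = ~54% to the cost:
extra_cost = tsmc_cost_per_gpu * 0.06 * 9
print(f"A 10x power price adds ~${extra_cost:,.0f} per $40k GPU")  # ~$2.7k
</code></pre><p>A few thousand dollars of extra cost on a $40k GPU would sting, but it fits comfortably inside the margin stack described above.</p>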
<p>A more-than-ten-fold increase is far greater than the ~70% rise we&#8217;ve seen so far.</p><p>Could gas prices continue to climb? I&#8217;m not qualified to say what the upper bound might be. The research group Oxford Institute for Energy Studies <a href="https://www.oxfordenergy.org/publications/modelling-the-impact-of-the-strait-of-hormuz-closure-on-global-gas-flows-and-prices">estimated</a> that a one-year closure could push gas prices in Europe to $40 per million BTU, almost <a href="https://ycharts.com/indicators/europe_natural_gas_price">quadruple</a> the pre-war price (though still lower than the 2022 peak following Russia&#8217;s invasion of Ukraine, which to my knowledge did not broadly shut down Europe&#8217;s most profitable factories). And Taiwan might be even more dependent on LNG than Europe. In general, demand for fuel or electricity is very <em><a href="https://fredblog.stlouisfed.org/2016/05/energy-demand-and-supply/">inelastic</a></em> in the short run, meaning that supply shocks can have disproportionate effects on prices.</p><p>There are also policy dynamics at play: Taiwan <a href="https://www.taipower.com.tw/2764/2804/2805/59356/normalPost">subsidizes</a> residential power relative to industrial power, and in a severe crunch Taiwan and South Korea could ration power away from commercial customers to meet basic needs. But I suspect that chip fabs would be the highest-priority industrial customers in both countries, with other businesses rationed first.</p><p>And the US government could intervene in a severe shortage. The government of Taiwan has <a href="https://www.reuters.com/business/energy/taiwan-says-it-has-assurances-over-lng-supplies-major-country-2026-04-04/">said</a> that it has received supply assurances from a &#8220;major&#8221; LNG-producing nation to meet its needs. It didn&#8217;t say which producer this was, but in any case AI chips are so important to the US&#8217;s economic and strategic interests that the US government may step in to ensure that Taiwan gets the LNG it needs if chip production is threatened.</p><p>Overall, it looks unlikely to me that the energy restrictions will disrupt AI chip production unless the scale of the energy shock dramatically escalates.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> The energy crunch in Taiwan and Korea would have to be truly catastrophic, not just severe.</p><h3>Energy for data centers</h3><p>AI data centers are disproportionately located in the US, which is insulated from the LNG shock. But a significant share are in Europe and Asia (estimating the precise distribution is out of scope here), where the war could cause serious energy problems for the buildout of AI data centers.</p><p>AI data centers need much more electricity than fabs: the data centers running TSMC&#8217;s AI chips draw significantly more power than TSMC&#8217;s several gigawatts of direct electricity use.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>Still, as with chip fabs, it&#8217;s unlikely that <em>already-built</em> AI data centers in Europe or Asia will shut down.</p>
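<p>A quick sanity check of why, using the single published estimate of H100 running costs cited in footnote 9 (the figures are that source&#8217;s, and rental rates vary):</p><pre><code class="language-python">
# Can an energy price spike force an already-built data center offline?
# Figures from the SemiAnalysis estimate cited in footnote 9.
h100_energy_cost = 1_000      # USD/year per H100 at prevailing US prices
h100_rental_rate = 2.0        # USD per GPU-hour, rough market rate

rental_revenue = h100_rental_rate * 8760      # ~$17.5k/year per GPU
print(f"Energy / rental revenue: {h100_energy_cost / rental_revenue:.1%}")

# Even large multiples of today's energy prices leave running the GPU
# cash-flow positive once the hardware is already paid for:
for multiple in (2, 4, 10):
    print(f"{multiple}x energy prices: ${h100_energy_cost * multiple:,}/yr "
          f"vs ${rental_revenue:,.0f}/yr in rental revenue")
</code></pre>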
<p>At normal prices, energy costs are a modest share of a data center&#8217;s total cost of ownership or revenue potential.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> Once the upfront capital cost to build an AI data center has been incurred, energy prices would have to surge massively to force a shutdown.</p><p>However, <em>planned</em> data centers are a different story. The hyperscaler clouds <a href="https://www.geekwire.com/2025/amazon-web-services-profits-squeezed-as-ai-arms-race-drives-spending-surge/">don&#8217;t</a> <a href="https://www.datacenterdynamics.com/en/news/google-cloud-sees-strong-quarter-revenue-up-28-yoy/">enjoy</a> Nvidia/TSMC-level margins, and some &#8220;neoclouds&#8221; (smaller, newer AI cloud companies) are reportedly on at least <a href="https://www.reuters.com/commentary/breakingviews/neoclouds-fine-print-is-silver-lining-sorts-2025-10-22/">somewhat shaky</a> financial footing. Even if power is a minority of total costs, a 10&#8211;20% increase in total costs from an energy crunch could be enough to kill some projects.</p><p>It&#8217;s not clear if this would make a big difference in equilibrium. Nvidia could cut prices if enough customers can no longer afford to power its chips. <a href="https://newsletter.semianalysis.com/p/the-great-gpu-shortage-rental-capacity">Rising</a> GPU-hour prices due to demand growth could rescue marginal projects. And the decision to build a data center depends on forecasted energy costs over the coming years; over time Asia and Europe could build up alternative energy sources, or the US could ramp up LNG exports, reducing global prices. Finally, total data center capacity may be fungible by location, with the US eventually absorbing data centers diverted from Europe and Asia.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p><p>Still, an extended Hormuz closure that causes a large, longer-term energy shock in much of Asia and Europe would plausibly play at least a moderate role in slowing the growth of AI compute investment in the medium term.</p><h1>Helium</h1><p>Helium is even more outside my wheelhouse than oil and gas, but it deserves a mention. Helium isn&#8217;t just used for party balloons and blimps. It&#8217;s <a href="https://www.instituteforenergyresearch.org/fossil-fuels/helium-is-instrumental-in-semiconductor-manufacturing/">essential</a> for chip fabrication, serving as an inert gas and a means to cool wafers, with one source <a href="https://j2sourcing.com/blog/helium-crisis-semiconductor-manufacturing-electronic-components-2026/">claiming</a> that 24% of global helium is used for semiconductors. Helium is also necessary for many other uses like MRI machines, cooling superconductors for scientific research, and fiber optics; see this <a href="https://www.construction-physics.com/p/helium-is-hard-to-replace">post</a> from Construction Physics for an overview.</p><p>The Persian Gulf supplies around one-third of the world&#8217;s helium, <a href="https://global.morningstar.com/en-gb/stocks/iran-war-threatens-ai-chip-supply-critical-minerals-risk">primarily from Qatar</a>. These exports were blocked by the Hormuz closure, so in relative terms this disruption is even greater than for oil and LNG.</p><p>However, the loss of Qatar&#8217;s helium supply doesn&#8217;t look likely to interrupt AI chip production.
The New York Times spoke to <a href="https://www.nytimes.com/2026/03/27/business/helium-chips-iran-war.html">several experts</a> who believed that chipmakers would be able to outbid other users of helium, though the logistics may be challenging. South Korea claims it has <a href="https://www.reuters.com/business/energy/helium-stocks-south-koreas-chipmakers-last-until-june-sources-say-2026-03-31/">sufficient</a> helium reserves to last through June, with SK Hynix and Samsung working to secure supply from the US.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><p>I don&#8217;t know enough about helium to be <em>too</em> confident here. But, as with energy, it seems unlikely that the most profitable subsets of the semiconductor industry &#8212; AI chips and the requisite high-bandwidth memory &#8212; will be the ones squeezed out by a 33% helium shock.</p><h1>Gulf data centers and investment flows</h1><p>Over the past month, Iran has been targeting many neighboring Gulf states<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> with drone and missile attacks, including direct attacks on (non-AI) <a href="https://www.datacenterdynamics.com/en/news/iran-attack-hits-an-oracle-data-center-in-dubai-causes-limited-damage/">Oracle and AWS</a> data centers in the region. At least in the case of AWS, these attacks were serious enough to cause <a href="https://www.aboutamazon.com/news/aws-bahrain-region-middle-east-conflict">service disruptions</a>. And Iran&#8217;s Islamic Revolutionary Guard Corps has <a href="https://www.datacenterdynamics.com/en/news/iran-threatens-to-attack-openais-stargate-data-center-in-uae-war/">directly threatened</a> to destroy <a href="https://openai.com/index/introducing-stargate-uae/">Stargate UAE</a>, the gigawatt-scale data center that is currently <a href="https://epoch.ai/data/data-centers/satellite-explorer/OpenAIStargateUAE/">under construction</a> on behalf of OpenAI in the United Arab Emirates.</p><p>These attacks may cause disruptions or outright cancellations of AI data centers in the Gulf states. This could be due partly to direct damage, but more likely to the perception that the Gulf region will face <a href="https://www.reuters.com/world/middle-east/how-dubais-safe-haven-status-is-being-put-test-2026-03-02/">greater geopolitical risks</a> even after this conflict ends.</p><p>This is potentially quite significant. Stargate UAE alone is a large project, initially planned for &gt;1 GW of power capacity with an eventual goal of 5 GW. But I am personally somewhat skeptical that the Gulf was ever going to host more than, say, 10% of the world&#8217;s AI data centers.
Gulf states like Saudi Arabia and the UAE are very wealthy and energy-abundant relative to their small populations, but they are fundamentally much smaller in terms of GDP (<a href="https://en.wikipedia.org/wiki/Gulf_Cooperation_Council#Member_states">~2%</a> of the world total), total electricity production, workforce, or geographic area than regions like the US, the rest of Asia, or Europe.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><figure><img src="https://substackcdn.com/image/fetch/$s_!h9RY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c7cf286-bb8f-4578-af34-835ae928f1f7_1600x1130.png" alt="Chart: electricity generation by country and region"></figure><p style="text-align: center;"><em>Courtesy of Ember Energy Institute and <a href="https://ourworldindata.org/grapher/electricity-generation?tab=line&amp;country=USA~OWID_EU27~SAU~ARE~IND~JPN">Our World in Data</a></em></p><p>Indeed, the Gulf states don&#8217;t actually seem to be great places to build data centers. Amelia Michael, a researcher with GovAI, <a href="https://ameliakmichael.substack.com/p/why-the-uae">observes</a> that the UAE has higher energy costs than the US, along with environmental challenges and a limited local workforce.</p><p>One possible driver of Gulf AI megaprojects is that the Gulf is a very important source of <em>capital funding</em> for AI: companies plan data centers there to cultivate ties with deep-pocketed investors and sovereign wealth funds. These Gulf AI investments have reportedly added up to <a href="https://www.theinformation.com/articles/iran-war-imperils-300-billion-gulf-ai-spending?rc=9mzoog">$300 billion</a> in total.</p><p>This spigot of funding may be interrupted if the Iran War continues, which could be more consequential than the direct impact on data centers. If the Gulf proves unsuitable for data centers, that capital can flow elsewhere &#8212; but only if the capital is still there.
Gulf economies are built on oil, and a prolonged export disruption will massively crimp the cash that would otherwise have flowed into AI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> Oil exports haven&#8217;t been <em>completely</em> cut off: Saudi Arabia and the UAE are still <a href="https://www.reuters.com/business/energy/saudi-crude-exports-yanbu-approach-capacity-hormuz-workaround-2026-03-30/">exporting</a> &gt;5M barrels of oil per day through pipelines bypassing Hormuz, but that&#8217;s a fraction of the <a href="https://x.com/JavierBlas/status/2033540355216359476?s=20">20M barrels/day</a> that formerly flowed through the strait.</p><p>Gulf investors may also become more risk-averse, prioritizing investments they perceive to be safer, though as of March 12, both the UAE and Saudi Arabia have <a href="https://www.reuters.com/world/middle-east/some-gulf-states-reviewing-sovereign-investments-offset-economic-shock-iran-war-2026-03-11/">said</a> they don&#8217;t plan to change their investment strategies.</p><p>In the near term, funding disruptions could affect OpenAI and Anthropic, both reportedly planning <a href="https://www.nytimes.com/2026/01/14/technology/ai-ipo-openai-anthropic-spacex.html">IPOs</a> as soon as this year, with potentially many billions of dollars of Gulf funding at stake. One sign that Gulf investments are important to AI is that Anthropic decided to seek them despite moral and reputational risks. Anthropic CEO Dario Amodei wrote in a <a href="https://www.wired.com/story/anthropic-dario-amodei-gulf-state-leaked-memo/">leaked memo</a> last year about his decision to tap the &#8220;truly giant amount of capital in the Middle East, easily $100B or more&#8221;, despite his moral reservations about enriching &#8220;dictators&#8221; (and the subsequent leak suggests that this decision was controversial internally).</p><p>But I also don&#8217;t want to overstate the importance of Gulf funding. The Gulf states are a massive source of concentrated capital, with the sovereign wealth funds of the UAE, Saudi Arabia, Kuwait, and Qatar adding up to <a href="https://www.wsj.com/finance/a-guide-to-the-gulfs-trillions-of-dollars-of-sovereign-wealth-c17b505e">over $5 trillion</a>. Because many of these funds invest heavily in tech/venture startups, they are hugely important to the startup industry. But ultimately, the Gulf states aren&#8217;t <em>that</em> rich relative to the entire rest of the world.</p><p>The overall pace of the global compute buildout will be mostly determined by <a href="https://epoch.ai/data-insights/hyperscaler-capex-trend/">hyperscaler</a> capex, financed by the immense cash flows of tech giants like Microsoft, Google, Meta, and Amazon<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> and by a huge global <a href="https://www.fortunebusinessinsights.com/corporate-bond-market-113826">bond market</a>. Total AI capital spending may soon exceed $1 trillion per year. Gulf investment flows into AI are on the order of tens of billions annually: significant, but not critical.</p><h1>Takeaways</h1><p>Overall, I think the Iran War won&#8217;t be a huge deal for the AI sector, even if Hormuz stays closed for many more months. Chipmaking is massively profitable, so TSMC, Samsung, and SK Hynix will most likely be able to secure the power, helium, and other resources they need.
Some data center plans in Europe or Asia may be cancelled due to rising costs, but data centers in the US will be insulated from the energy shock. I think the most important direct effect may be the disruption of Gulf investment flows, but those flows aren&#8217;t the biggest source of funding for AI.</p><p>But this all assumes the conflict doesn&#8217;t escalate dramatically. A wider regional war, spillovers to <a href="https://www.nytimes.com/2026/04/07/world/europe/ukraine-attacks-russian-oil-exports.html">other theaters</a>, or severe second-order macroeconomic effects could change the picture in ways I can&#8217;t model here. The Iran War (if it continues to be a war) is a dynamic situation, and there&#8217;s a real risk that this post will be outdated shortly after publication. I will keep &#8220;monitoring the situation&#8221; with interest.</p><div><hr></div><p><em>Thanks to Anson Ho, Luke Emberson, Jaime Sevilla, and Lynette Bye for their feedback on this post.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>And there will be prolonged <a href="https://www.reuters.com/business/energy/ceraweek-oil-execs-warn-long-term-damage-iran-war-us-downplays-crisis-2026-03-23/">after-effects</a> even if there&#8217;s an immediate, lasting peace deal and reopening of the Strait, due to damage to oil and gas infrastructure caused by the war and the impact of oil-well &#8220;shut-ins&#8221;: shutting down an oil well can damage the well itself. All my claims in this post are conditioned on &#8220;the war and Hormuz closure do not actually end this week&#8221;, but the distinction may not be too critical.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Not all natural gas is liquefied, so the 20% shock to LNG is a smaller share of overall world gas, but LNG is necessary to export gas to an island like Taiwan or a pseudo-island like South Korea.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In general, since gas is just one input into electricity generation in these countries, the percentage increase in gas prices should be an upper bound on the percentage increase in power prices.
However, the respective price increases can be similar in a gas shortage: to clear an electricity market, prices need to equal the marginal cost of generating power, so power prices will be closely linked to the cost of gas generation (fuel plus other operating expenses).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This is not the case for oil; the price of oil <a href="https://www.cnbc.com/quotes/@CL.1">in the US</a> is similar to <a href="https://tradingeconomics.com/commodity/brent-crude-oil">global prices</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This simplifies a very complex upstream supply chain, including semiconductor manufacturing equipment. But TSMC and the memory companies are probably the most energy-intensive part of the chipmaking process.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>If Nvidia sells a GPU for $40k at a gross margin of 75%, then it makes a gross profit of $30k and pays its suppliers (mainly TSMC) $10k. If TSMC has an <a href="https://www.investopedia.com/ask/answers/122314/what-difference-between-gross-margin-and-operating-margin.asp">operating margin</a> of 50%, that means it costs them $5k to produce the GPU, including marginal production cost and overhead operating costs. I cite operating margin instead of <a href="https://www.investopedia.com/ask/answers/122314/what-difference-between-gross-margin-and-operating-margin.asp">gross margin</a> for TSMC/Hynix to ensure I&#8217;m including all fab-related costs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>For example, there could be escalating direct attacks on oil and gas infrastructure on both sides&#8212;Iran has already attacked LNG facilities in <a href="https://www.reuters.com/business/energy/iran-attack-damage-wipes-out-17-qatars-lng-capacity-three-five-years-qatarenergy-2026-03-19/">Qatar</a>&#8212;leading to severe long-term damage. There is a <a href="https://www.iea.org/about/oil-security-and-emergency-response/strait-of-hormuz">pipeline to Oman</a> bypassing Hormuz with a capacity of ~20% of pre-war Gulf LNG exports, so such attacks could reduce Gulf gas exports even further in the short term.
Or there could be spillover to other theaters; Ukraine has <a href="https://www.nytimes.com/2026/04/07/world/europe/ukraine-attacks-russian-oil-exports.html">escalated</a> its attacks on Russian oil production to try to prevent Russia from benefiting from the recent surge in oil prices.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>This can be found by <a href="https://www.wolframalpha.com/input?i=25000+gigawatt-hours+%2F+8760+hours">dividing</a> TSMC&#8217;s consumption of <a href="https://esg.tsmc.com/file/public/e-appendix_6.pdf">25,000 GWh per year</a> by the number of hours in a year.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>One <a href="https://newsletter.semianalysis.com/p/ai-datacenter-energy-dilemma-race">estimate</a> of the energy cost of running 20k Nvidia H100s, using prevailing power prices in the US, is $20.7M per year ($1000 per H100 per year), while an H100 can be rented out for roughly $2/hour, or over $15,000 per year.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Availability of suitable sites with grid connections and infrastructure may be a constraint as well as raw energy supply, so reducing the geographic scope of data center construction will probably have some effect on the global compute buildout.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>The US is the <a href="https://www.construction-physics.com/p/helium-is-hard-to-replace">largest producer</a> of helium outside of the Middle East, and also has around one-third of the world&#8217;s supply.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>&#8220;Gulf states&#8221; is common shorthand for the <a href="https://en.wikipedia.org/wiki/Gulf_Cooperation_Council">six wealthy Arab-majority monarchies</a> that border the Persian Gulf: Saudi Arabia, United Arab Emirates, Bahrain, Qatar, Kuwait and Oman.
Iraq and Iran itself are also major oil exporters on the Persian Gulf, but have a rather different vibe.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>To clarify, while I would expect Asia (ex-Gulf and ex-China) to eventually build more AI data centers than the Gulf states even absent this war, the impact of the Iran war on Gulf data centers may be more significant than the energy impact on Asian data centers, so the disruption to Gulf data centers could still be larger.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>My simple model here is that because profits from the Gulf oil and gas industry are reinvested into sovereign wealth funds or privately-held investment funds, they are the main source of outgoing capital flows from the Gulf. But it&#8217;s conceivable that even if oil money stops coming in for a while, Gulf investors that are highly bullish on AI could sell their non-AI investments so they can plow more into AI, though this seems quite risky.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>See chart 12 in <a href="https://www.understandingai.org/p/16-charts-that-explain-the-ai-boom">Understanding AI&#8217;s</a> &#8220;16 charts that explain the AI boom&#8221;</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Keeping up with the GPTs]]></title><description><![CDATA[Can Chinese and open model companies compete with the frontier through e.g. distillation and talent?]]></description><link>https://epochai.substack.com/p/keeping-up-with-the-gpts</link><guid isPermaLink="false">https://epochai.substack.com/p/keeping-up-with-the-gpts</guid><dc:creator><![CDATA[Anson Ho]]></dc:creator><pubDate>Wed, 08 Apr 2026 04:05:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qvk2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1615edc-e04f-4313-a82c-5c02d2fc4016_1026x1283.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Gradient Updates shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><div><hr></div><p>If the last decade of AI has taught us one lesson, it&#8217;s that scaling compute builds better models. This sounds great &#8212; until you realize your competitors have ten times more compute than you.</p><p>This is the situation that many Chinese and open model companies find themselves in; relative to frontier companies, they&#8217;re &#8220;compute-poor&#8221;. 
Just last year, Anthropic spent over ten times more on compute than MiniMax and Zhipu AI <em>combined</em>, and the gap is even wider for OpenAI:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k6Iv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a11b492-e8b0-4304-83e0-92881c0c72ab_1026x1283.png"><img src="https://substackcdn.com/image/fetch/$s_!k6Iv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a11b492-e8b0-4304-83e0-92881c0c72ab_1026x1283.png" width="1026" height="1283" alt=""></a><figcaption class="image-caption"><em>Data from Epoch&#8217;s <a href="https://epoch.ai/data/ai-companies">data on AI companies</a> and <a href="https://epoch.ai/data-insights/company-spending-breakdown">Data Insights</a>.</em></figcaption></figure></div><p>You don&#8217;t need to be an AI expert to see that this is a huge handicap. With less compute, it&#8217;s harder to run experiments, train bigger models, and serve many users.</p><p>But compute-poor AI labs have an ace up their sleeve. Even lacking frontier-level compute, they can try to <a href="https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress">use theirs more efficiently</a> to punch well above their weight. That&#8217;s how DeepSeek stayed on the heels of OpenAI despite using a fraction of the training compute (at least on some benchmarks), driving the stock market bananas.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>The big question is: are these efficiency gains big enough for compute-poor labs to <em>really</em> compete, potentially even leapfrogging the compute-rich labs?</p><h2>Breaking down the efficiency gains</h2><p>To figure this out, we need to look at how compute-poor labs can try to boost their compute efficiency. As far as I can tell, there are three main ways to do this:</p><ol><li><p>Develop new algorithms and improve data faster than compute-rich labs</p></li><li><p>Replicate innovations from frontier labs</p></li><li><p>Leverage compute-rich labs&#8217; model capabilities &#8212; e.g. train on synthetic data generated by rival models</p></li></ol><p>Out of these three approaches, the first is the only one that gives them a shot at <em>overtaking</em> the frontier &#8212; the latter two mainly help play catch-up. But catching up could still be important because it makes it easier to leapfrog the frontier later on. So in theory, all three approaches could matter a lot for competition.</p><p>Whether they matter a lot <em>in practice</em> depends on one key condition: whether they improve efficiency <em>asymmetrically</em>, helping compute-poor labs much more than compute-rich ones. After all, if both sides benefit equally from the same trick, a ten-fold compute gap stays a ten-fold compute gap. The toy calculation below makes this concrete.</p>
So let&#8217;s see how well each holds up.</p><h3>Approach 1: Innovate faster than the compute-rich labs</h3><p>If compute-poor companies can&#8217;t outspend the compute-rich frontier, maybe they can out-think them. Such labs often have many brilliant researchers spinning up new algorithmic innovations, like DeepSeek&#8217;s <a href="https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture">MLA</a> and <a href="https://arxiv.org/abs/2402.03300">GRPO</a> &#8212; things that <a href="https://x.com/dwarkesh_sp/status/1927485816307142737">demonstrate great research taste</a>.</p><p>Being compute-bottlenecked also pushes these labs to be efficient, experimenting with risky algorithmic &#8220;moonshots&#8221; that dramatically save compute. In contrast, frontier labs might tend to stick to tried-and-true recipes &#8212; if their training runs are many times larger, then it&#8217;s also many times more costly for them to take a risk that fails.</p><p>But compute-rich labs have their own efficiency advantages. For example, running experiments often requires lots of compute, such as to <a href="https://www.latent.space/p/noam-brown?r=5fx8fq&amp;selection=93bcbca8-4261-49af-9e71-b5bbdfc0ad40&amp;utm_campaign=post-share-selection&amp;utm_medium=web&amp;aspectRatio=instagram&amp;textColor=%23ffffff&amp;triedRedirect=true&amp;_src_ref=epoch.ai#:~:text=Noam%20%5B01%3A04,inspiration%20for%20us.">test if</a> <a href="https://www.youtube.com/watch?v=6nJZopACRuQ&amp;t=1223s">algorithms scale well</a>. That&#8217;s why most compute spending at AI labs <a href="https://epoch.ai/data-insights/openai-compute-spend">goes to</a> <a href="https://epoch.ai/gradient-updates/r-and-d-vs-training-compute">experiments</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GUnu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e1474dc-b986-4cdb-b3bf-ce71828ae07c_1600x1459.png"><img src="https://substackcdn.com/image/fetch/$s_!GUnu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e1474dc-b986-4cdb-b3bf-ce71828ae07c_1600x1459.png" width="1456" height="1328" alt=""></a></figure></div><p>And as I explained in <a href="https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress#:~:text=III.%20What%20drives%20software%20progress%3F%20(Or%2C%20why%20all%20the%20estimates%20we%20just%20saw%20are%20misleading)">my last essay</a>, some innovations like the Transformer give bigger efficiency gains as you scale up training compute.
So these innovations benefit the compute-rich more than the compute-poor.</p><p>What&#8217;s more, it&#8217;s not just the compute-poor labs that have talent &#8212; the compute-rich labs do as well, and they probably have the best talent. They can pay <a href="https://www.nytimes.com/2025/07/31/technology/ai-researchers-nba-stars.html">exorbitant</a> amounts for the cr&#232;me de la cr&#232;me, often <a href="https://www.wired.com/story/mark-zuckerberg-welcomes-superintelligence-team/">poaching</a> researchers from other labs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> They also seem to be paying more for junior roles. For instance, early-career researcher salaries at OpenAI and Anthropic are around twice as high as at DeepSeek, even after accounting for <a href="https://wdi.worldbank.org/table/4.16">purchasing power</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> I&#8217;d guess this is part of why the <a href="https://carnegieendowment.org/emissary/2025/12/china-ai-researchers-us-talent-pool">vast majority</a> of Chinese AI researchers who go to US institutions tend to stay there. And <a href="https://jasmi.news/p/china-2025">anecdotally</a>, Chinese AI researchers see the US as the most desirable place to work (at least for now).</p><p>In practice, I think this is why many of the <a href="https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress/#:~:text=Here%20you%20can,making%20further%20innovations.">most</a> <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models/">notable</a> innovations have come from those with the most compute. For example, innovations like modern Transformer-based LLMs, scaling laws, and reasoning models were developed by OpenAI and Google &#8212; organizations that fall squarely within the &#8220;compute-rich&#8221; category.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>So I don&#8217;t see why I should expect compute-poor labs to find new software innovations <em>much</em> faster than compute-rich labs &#8212; on the contrary, I think the opposite is more likely.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h3>Approach 2: Replicate innovations from frontier labs</h3><p>If they can&#8217;t beat the frontier labs at developing <em>new</em> software innovations, why not replicate them? Knowledge (like algorithms) in frontier AI labs could &#8220;spill over&#8221; into compute-poor labs, keeping the latter in the competition without needing to pay big R&amp;D costs to find the innovations in the first place. 
That would be a big deal: research costs are the main reason why AI companies <a href="https://epoch.ai/gradient-updates/can-ai-companies-become-profitable">aren&#8217;t yet profitable</a>, and skipping them makes it easier to later leapfrog the frontier with new innovations or to close the compute gap.</p><p>This could work in a few ways, including AI researchers gossiping, researchers moving from compute-rich to compute-poor labs, or researchers at compute-poor labs reconstructing innovations from frontier labs based on public information.</p><p>This could lead to especially large and fast spillovers if there are &#8220;<a href="https://x.com/sriramk/status/2037185774626509086">four minute mile</a>&#8221; effects &#8212; after one AI lab makes a breakthrough, other labs realize they can do it too, so they pour effort into reimplementation.</p><p>There is evidence this already happens, such as case studies of people broadly reimplementing frontier lab breakthroughs: despite limited public information,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> a small team of <a href="https://www.bakerlab.org/">academics</a> got pretty close to reproducing AlphaFold 2 within seven months of its release.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Another example is reasoning models: after OpenAI released the first reasoning model (o1), <a href="https://huggingface.co/Qwen/QwQ-32B">several</a> <a href="https://web.archive.org/web/20250128034747/https://ai.google.dev/gemini-api/docs/thinking">different</a> <a href="https://arxiv.org/abs/2501.12948">AI</a> <a href="https://www.anthropic.com/news/claude-3-7-sonnet">labs</a> followed suit within months, sometimes using several times less training compute.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>But are these dynamics big enough to bridge the gap? Probably not, for several reasons.</p><p>For starters, innovation economists seem to think that these kinds of spillover effects are <a href="https://cep.lse.ac.uk/textonly/_new/staff/vanreenen/pdf/ecta9466.pdf">a</a> <a href="https://www.ecb.europa.eu/press/conferences/ecbforum/shared/pdf/2024/EFCB_2024_Dyevre_paper.en.pdf">big</a> <a href="https://www.nber.org/system/files/chapters/c15155/c15155.pdf">deal</a>,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> but I doubt even they would say such spillovers are strong enough to make up for a ten-fold difference in compute.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p><p>Moreover, even if compute-poor companies <em>instantly</em> knew how to reimplement a closed model&#8217;s innovations upon release, they&#8217;d need to launch a new training run that takes (say) three <a href="https://epoch.ai/data-insights/longest-training-run/">months</a> to finish &#8212; three months where the compute-rich labs are gaining further ground. And this is ignoring the time it takes to figure out what things actually work, as in the case of reasoning models and AlphaFold 2. The rough arithmetic below shows how much ground even a short lag concedes.</p>
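<p>A back-of-the-envelope sketch of that lag, assuming (purely for illustration &#8212; my assumption, not a figure from this essay) that frontier effective compute grows around 4&#215; per year:</p><pre><code class="language-python"># How much ground does a three-month reimplementation lag concede?
# Assumes, for illustration only, that frontier effective compute
# grows ~4x per year; the exact growth rate is an assumption.
growth_per_year = 4.0
lag_years = 3 / 12

gap_factor = growth_per_year ** lag_years
print(f"{gap_factor:.2f}x")  # ~1.41x: the frontier pulls ~40% further ahead per copy cycle</code></pre>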
<p>In practice, companies like MiniMax and Zhipu AI are still spending <a href="https://epoch.ai/gradient-updates/r-and-d-vs-training-compute">much more</a> on experiment compute than on final training runs, meaning spillovers aren&#8217;t enough to cut out those experiment compute costs:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ewgY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc704ccb7-10ef-44f8-999d-10dc941817a7_1026x1283.png"><img src="https://substackcdn.com/image/fetch/$s_!ewgY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc704ccb7-10ef-44f8-999d-10dc941817a7_1026x1283.png" width="1026" height="1283" alt=""></a></figure></div><p>And finally, the mechanisms for these spillovers don&#8217;t even favor compute-poor labs that much. Just as the compute-poor can copy the compute-rich, the compute-rich can copy the compute-poor, especially if their models are open &#8212; there&#8217;s a reason why <a href="https://www.latent.space/p/noam-brown#:~:text=Noam%20%5B01%3A04,inspiration%20for%20us.">big AI labs still follow the academic literature</a>. And while researchers often <a href="https://7min.ai/exodus/">move</a> from compute-rich to relatively compute-poor labs (e.g. from OpenAI to Mistral AI), they also go the other way. This is especially true for people who move from compute-poor academia to compute-rich industry.</p><p>So once again, I&#8217;m skeptical that &#8220;copying&#8221; innovations boosts compute efficiency that much more for the compute-poor than the compute-rich &#8212; at the very least, I doubt it&#8217;s enough to make up for a ten-fold gap in compute.</p><h3>Approach 3: Leverage the capabilities of frontier models</h3><p>Compute-rich labs have already developed expensive frontier models, and they&#8217;re letting people use them. So why not use them to improve compute-poor models?</p><p>In practice, the biggest &#8220;oomph&#8221; here probably comes from getting frontier models to generate lots of outputs, and then training on that data. This is often (inaccurately) called &#8220;distillation&#8221;,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> and it lets you transfer the capabilities of frontier models into smaller models. A minimal sketch of the recipe follows.</p>
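<p>Concretely, the recipe is: sample outputs from a stronger &#8220;teacher&#8221;, then fine-tune a smaller &#8220;student&#8221; on them with the ordinary next-token loss. The model names, prompt, and hyperparameters below are illustrative placeholders, not any lab&#8217;s actual pipeline:</p><pre><code class="language-python"># Sequence-level "distillation" sketch: fine-tune a small student on text
# sampled from a stronger teacher. Names and scales are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # same tokenizer family

prompt = tok("Prove that the square root of 2 is irrational.", return_tensors="pt")

# 1) Generate synthetic training data from the teacher (in practice, millions
#    of prompts sampled across the domains you care about).
with torch.no_grad():
    ids = teacher.generate(**prompt, max_new_tokens=512, do_sample=True)

# 2) Train the student on the teacher's outputs with plain next-token
#    cross-entropy. No access to the teacher's weights or logits is needed,
#    which is why API outputs alone are enough to imitate a closed model.
#    (A real pipeline would also mask the prompt tokens in `labels`.)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
loss = student(input_ids=ids, labels=ids).loss
loss.backward()
opt.step()</code></pre><p>(The &#8220;inaccurately&#8221; is because classic distillation matches the teacher&#8217;s full output distribution, i.e. its logits; training on sampled text like this is closer to plain supervised fine-tuning on synthetic data.)</p>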
<p>A notable recent example comes from Anthropic, who <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">accused</a> DeepSeek, Moonshot, and MiniMax of distilling from Claude&#8217;s outputs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><p>The question is, just how much of a difference does distillation make in practice? The evidence is patchy, but if we squint at the numbers I think they weakly suggest that companies can potentially get several-fold compute efficiency gains.</p><p>One piece of evidence is what happens when you distill DeepSeek-R1 into Qwen2.5 models:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lPQl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f60f08e-5c64-484a-af14-aba5972686e8_1026x1283.png"><img src="https://substackcdn.com/image/fetch/$s_!lPQl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f60f08e-5c64-484a-af14-aba5972686e8_1026x1283.png" width="1026" height="1283" alt=""></a></figure></div><p>On MATH and GPQA Diamond, we see <em>enormous</em> efficiency gains &#8212; requiring several times less pre-training compute for GPQA, and multiple orders of magnitude less for MATH. So at first glance, distillation seems like it could be a huge deal for crossing the ten-fold compute gap to the frontier.</p><p>The problem is that this anchors too much on specific benchmarks. It&#8217;s suspicious that the efficiency gains vary dramatically between MATH and GPQA. So what if we instead require capabilities to be retained across a <em>range</em> of benchmarks?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> Once again, the evidence is shoddy:</p><ul><li><p><a href="https://arxiv.org/abs/2512.13961">Olmo 3</a> outperformed the best open models at its scale across multiple benchmarks, while using 6&#215; fewer tokens. This involved a pipeline in which distillation was only one component, so we can read the 6&#215; as an upper bound on the gains from distillation itself.</p></li><li><p><a href="https://epoch.ai/blog/inference-economics-of-language-models">Based on how quickly they&#8217;re run</a>, it&#8217;s plausible that OpenAI mini models had several times fewer parameters than the original models they were distilled from. At the same time, they arguably perform <a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/">only slightly worse</a> on various benchmarks, and are just edged out in head-to-head comparisons.
For example, human voters preferred GPT-5 to GPT-5-mini only around <a href="https://www.rival.tips/compare/gpt-5/gpt-5-mini">56% of the time</a>. (That said, I think this framing is a bit optimistic &#8212; there are <a href="https://epoch.ai/models/gpt-5-mini/">other</a> <a href="https://epoch.ai/models/gpt-5/">benchmarks</a> where the mini models are quite a bit worse.)</p></li></ul><p>It&#8217;s not clear to me how comparable these examples are &#8212; the exact efficiency gains depend on how much data was used, how comprehensive the benchmarks are, and so on. Annoyingly, they&#8217;re also based on different metrics &#8212; one helps you use less data, the other fewer parameters &#8212; and it&#8217;s not clear exactly how this maps onto <em>compute</em> savings. (As a very rough guide, pre-training compute scales as ~6 &#215; parameters &#215; tokens, so a 6&#215; saving in tokens at a fixed model size, or in parameters at a fixed token count, is roughly a 6&#215; saving in compute.) But I think they weakly suggest that you can potentially get several-fold compute efficiency gains, even if we account for a broad set of benchmarks or domains.</p><p>And you don&#8217;t need much data to make this work: it&#8217;s possible to distill <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> with around 4 billion tokens.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> For comparison, MiniMax may have been able to get 100 billion tokens of data from interactions with Claude.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> Nathan Lambert <a href="https://www.interconnects.ai/p/how-much-does-distillation-really#:~:text=Estimates%20will%20vary%2C%20but%20if%20each%20response%20had%2010%2D25K%20tokens%20per%20exchange%2C%20the%20total%20tokens%20across%20these%20two%20labs%2C%20mostly%20with%20MiniMax%2C%20would%20be%20150%2D400%20billion%20tokens">points out</a> that this much data could &#8220;meaningfully improve a models&#8217; post-training&#8221;, which sounds about right to me. So this is probably the most compelling of all the potential explanations we&#8217;ve seen for how compute-poor AI labs could try to keep up with the frontier.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a></p><p>That said, there are several caveats to bear in mind. First, there are limits to how well distillation works. As the model gets smaller, it&#8217;s much harder to distill capabilities into it.
We can see this when <a href="https://arxiv.org/pdf/2501.12948#page=60.62">distilling DeepSeek-R1</a> into different versions of <a href="https://arxiv.org/abs/2412.15115">Qwen2.5</a> &#8212; performance on benchmarks like AIME, MATH, and GPQA falls as the model gets smaller:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_wZj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba06eed0-1020-4f45-8cf6-8b472ad967e8_1408x486.png"><img src="https://substackcdn.com/image/fetch/$s_!_wZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba06eed0-1020-4f45-8cf6-8b472ad967e8_1408x486.png" width="1408" height="486" alt=""></a></figure></div><p>To be more concrete, I&#8217;d be pretty surprised if a distilled model could be a hundred times smaller but still capture the same broad set of capabilities as its &#8220;teacher&#8221; model.</p><p>A second caveat is that distilled models may be more likely to be &#8220;<a href="https://www.ikot.blog/the-illusion-of-parity">benchmaxxed</a>&#8221; &#8212; they&#8217;re optimized for a specific set of benchmarks, and don&#8217;t do so well in general. For example, OpenAI&#8217;s &#8220;mini&#8221; models often do worse on things that involve lots of <a href="https://openai.com/index/introducing-simpleqa/">knowledge recall</a>.</p><p>Third, compute-rich frontier labs could keep the most powerful models to themselves or to select customers, making it hard for the compute-poor to keep up via distillation.</p><p>Fourth and finally, it&#8217;s hard for distillation to be the full story for how models develop reasoning capabilities on a range of environments. As Nathan Lambert <a href="https://www.interconnects.ai/p/how-much-does-distillation-really">points out</a>, this typically requires large-scale RL where the model learns from its own trial-and-error, rather than purely distilling from a stronger model. So while distillation certainly helps reasoning, compute-poor labs will probably still need to establish <a href="https://epoch.ai/gradient-updates/state-of-rl-envs">RL environment</a> infrastructure and do the RL training themselves.</p><p>Overall, leveraging frontier model capabilities probably does meaningfully boost efficiency for compute-poor labs relative to the compute-rich.
I&#8217;d weakly guess that it doesn&#8217;t get them all the way to covering a 10&#215; compute gap &#8212; more likely it narrows the gap severalfold.</p><h3>Putting things together</h3><p>Here&#8217;s where I land on the three approaches:</p><ol><li><p><strong>Coming up with new software innovations: </strong>This doesn&#8217;t seem to asymmetrically favor compute-poor AI labs &#8212; in fact it might favor the compute-rich.</p></li><li><p><strong>Replicating innovations from frontier labs: </strong>Spillovers often go in both directions, and in general I&#8217;m skeptical that the effects of this are large enough to make up for a ten-fold difference in compute.</p></li><li><p><strong>Distilling frontier model capabilities: </strong>This could help the compute-poor get very close to broad frontier capabilities with several times less compute. I doubt it delivers a full <em>ten</em>-fold saving, but it is a meaningful efficiency boost, and you can get much bigger savings if you only care about specific narrow capabilities.</p></li></ol><p>So if we put everything together, I think compute-poor labs probably can&#8217;t fully make up for their 10&#215; compute disadvantage to compete at the frontier. The compute gap is just too large, and most approaches don&#8217;t help the compute-poor that much more than the compute-rich.</p><p>But it&#8217;s not something that we can totally rule out, especially if there are big algorithmic breakthroughs like reasoning models. And of course, I&#8217;m basing my opinion on pretty limited evidence &#8212; I&#8217;d love to hear if people know of good data or arguments for or against my position.</p><h2>What does this mean for the future of AI?</h2><p>So far, I&#8217;ve mostly been talking about &#8220;compute-poor&#8221; AI labs, which have around ten times less compute than those at the frontier. This framing helps us analyze the efficiency gains from things like distillation and replicating innovations, and it underscores that compute is the dominant factor in who &#8220;wins&#8221; in AI competition. But we still need to address a big question: who exactly are these &#8220;compute-poor&#8221; labs?</p><h3>Compute-poor = Chinese AI labs?</h3><p>One type of &#8220;compute-poor&#8221; AI lab, at least today, is Chinese AI companies. For example, we saw earlier that MiniMax and Zhipu AI collectively spent <a href="https://epoch.ai/data-insights/company-spending-breakdown">less than a tenth</a> of the compute of Anthropic in 2025.</p><p>So does this mean that Chinese labs are destined to be relegated to the sidelines, unable to compete with American AI companies? Not necessarily, because it depends on how much compute American and Chinese AI labs can get their hands on. If the compute gap narrows, it becomes easier to compete with the frontier. If it widens, the opposite is true.</p><p>This of course hinges on a huge open question: how will the compute gap evolve over time?</p>
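<p>One way to see why this question matters so much: if frontier compute keeps growing at a steady exponential rate, a fixed multiplicative compute gap corresponds to a roughly constant <em>time</em> lag behind the frontier. A quick conversion (the growth rate here is an illustrative assumption, not an Epoch figure):</p><pre><code class="language-python">import math

# If frontier compute grows ~4x/year (illustrative assumption) and a follower
# sits at a fixed 10x compute deficit, the follower is effectively running
# the frontier's playbook with a constant time delay:
growth_per_year = 4.0
compute_gap = 10.0

lag_years = math.log(compute_gap) / math.log(growth_per_year)
print(f"{lag_years:.1f} years")  # ~1.7 years behind at any given moment</code></pre>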
<p>So far, I think it hasn&#8217;t grown massively, and I think this is why Chinese labs haven&#8217;t fallen far behind the frontier in capabilities, as measured by the <a href="https://epoch.ai/benchmarks/eci">Epoch Capabilities Index</a> (a fancy aggregation of benchmark scores):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!poEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623fec00-b113-44b7-8d89-1142eb71e439_1026x1283.png"><img src="https://substackcdn.com/image/fetch/$s_!poEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623fec00-b113-44b7-8d89-1142eb71e439_1026x1283.png" width="1026" height="1283" alt=""></a></figure></div><p>But this interpretation is a little debatable. For example, the gap in training compute has grown, though not <em>dramatically</em> so:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NgxV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d08502-4b73-4eab-b7cc-1a1f390c10ae_1026x1283.png"><img src="https://substackcdn.com/image/fetch/$s_!NgxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d08502-4b73-4eab-b7cc-1a1f390c10ae_1026x1283.png" width="1026" height="1283" alt=""></a></figure></div><p>But this comparison is rather confusing, for a few reasons:</p><ul><li><p>The best US and Chinese models aren&#8217;t always trained with the most compute. For example, Grok 3 and 4 are probably also <a href="https://epoch.ai/gradient-updates/ai-progress-is-about-to-speed-up/#:~:text=I%20think%20the%20correct,training%20compute%20than%20them.">inefficient</a> for their size, and I&#8217;d guess the same is true for Doubao Pro.</p></li><li><p>This is looking at <em>training</em> compute rather than overall compute budgets.
I&#8217;ve mostly been talking about the latter throughout this essay, but sadly we don&#8217;t have enough data on it.</p></li><li><p>Another complication is that different AI labs were scrambling to figure out reasoning, which gives a <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models/">much larger efficiency boost</a> <a href="https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress/#:~:text=Here%20you%20can,making%20further%20innovations.">compared to most other innovations</a>.</p></li></ul><p>Overall, I&#8217;d weakly guess that the total compute gap hasn&#8217;t changed much over time, and that this is why the capabilities gap hasn&#8217;t changed much either.</p><p>But there are some reasons to think that the gap will grow. As I&#8217;m writing this, American hyperscalers are driving a data center buildout that&#8217;s larger than the Manhattan Project and Apollo Program at their peaks. As part of this, frontier labs will soon have data centers that <a href="https://epoch.ai/data/data-centers">each cost tens of billions of dollars</a> and contain millions of GPUs. In contrast, Chinese AI companies are hampered by export controls and by being <a href="https://epoch.ai/gradient-updates/why-china-isnt-about-to-leap-ahead-of-the-west-on-compute">behind</a> in domestic AI chip production. This could widen the compute gap in the near term, and hence lead to a divergence in the capabilities of US and Chinese AI models. On some <a href="https://x.com/htihle/status/2020854971860746703/photo/1">measures</a> this is already happening, and Tang Jie (CEO of Zhipu AI) even recently <a href="https://www.wsj.com/tech/ai/china-ai-race-us-chips-9e74b957?mod=tech_lead_pos5">said</a> the following:</p><blockquote><p>&#8220;The truth may be that the gap [between US and Chinese AI] is actually widening.&#8221;</p></blockquote><p>On the other hand, Chinese AI labs could continue to <a href="https://www.the-substrate.net/p/where-will-china-get-its-compute">get more compute</a> by smuggling chips, or by renting it through the cloud. And the further we look into the future, the greater the odds of China catching up in domestic AI chip production.</p><p>So in my book, the overall picture of US-China AI competition looks like this: in the short run (say 2-3 years), I think it&#8217;s unlikely that the gap will narrow much, because of export controls and the challenges of rapidly scaling up domestic chip production. But the longer run is more uncertain. It depends on hard-to-predict chip export controls. It depends on how hard it is to smuggle and rent chips. And it depends on how much China wants to <em>really</em> push for AGI by scaling up its compute stocks &#8212; there are <a href="https://www.chinatalk.media/p/how-china-hopes-to-build-agi-through">signs</a> in China&#8217;s <a href="http://news.cn/politics/20260313/085af5de5a4b4268aa7d87d90817df2f/c.html">most recent Five-Year Plan</a> that the winds are shifting in this direction.</p><p>And there&#8217;s an additional wildcard that makes things even more uncertain &#8212; frontier AI companies can run more of the best AIs to speed up their own AI research, relative to their competitors.
Right now these gains <a href="https://metr.org/blog/2026-02-24-uplift-update/">are</a> <em><a href="https://aleximas.substack.com/p/what-is-the-impact-of-ai-on-productivity">maybe</a></em><a href="https://aleximas.substack.com/p/what-is-the-impact-of-ai-on-productivity"> noticeable</a> but not game-changing, though that&#8217;ll probably change in the next few years. If so, this could widen the capability gap between the compute-rich and the compute-poor, even if the compute gap stays fixed.</p><h3>Compute-poor = Open models?</h3><p>Another type of &#8220;compute-poor&#8221; AI lab is the kind developing open models. Like Chinese AI companies, these labs&#8217; training compute hasn&#8217;t fallen <em>super</em> far behind the frontier, though the gap has grown&#8230;</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2AL1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02acb150-ae0e-43eb-b4bb-ab1a240e2150_1026x1283.png" width="1026" height="1283" alt=""></figure></div><p>&#8230;and the capabilities gap looks like a constant-ish lag:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qvk2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1615edc-e04f-4313-a82c-5c02d2fc4016_1026x1283.png" width="1026" height="1283" alt=""></figure></div><p>Perhaps this isn&#8217;t so surprising. There&#8217;s quite a lot of overlap between open-model labs and Chinese AI companies, so we should expect many of their challenges to be the same. For example, a big open question is how the compute gap between open-model labs and frontier companies will change over time, just as we saw with Chinese labs.</p><p>But the future of open models at the frontier depends on a separate open question (no pun intended): will future AI models be open-weight?</p><p>This is probably even harder to answer than the question about compute scaling.
One issue is that companies seem to flip-flop on whether to release open models. For example, Alibaba has recently shown <a href="https://www.reuters.com/commentary/breakingviews/china-ai-labs-face-growing-open-source-dilemma-2026-03-25/">signs of</a> <a href="https://recodechinaai.substack.com/p/alibabas-qwen-lead-just-stepped-down">tilting</a> further toward closed releases.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a></p><p>But my favourite example of this has got to be Meta &#8212; in July 2024, Mark Zuckerberg wrote an essay called &#8220;<a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/">Open Source AI is the Path Forward</a>&#8221;. He raised multiple arguments for open models: privacy, safety through broad scrutiny of AI models, and leveraging the wider AI community to improve Meta&#8217;s models. Now Meta seems to be backing away from this strategy &#8212; first with <a href="https://www.nytimes.com/2025/06/10/technology/meta-new-ai-lab-superintelligence.html">reported</a> discussions about abandoning the open-weight Llama 4 Behemoth in favor of a closed model, and now with their next-generation &#8220;Avocado&#8221; model <a href="https://www.bloomberg.com/news/articles/2025-12-10/inside-meta-s-pivot-from-open-source-to-money-making-ai-model">allegedly</a> set to be released with closed weights. Someone has even written an article called &#8220;<a href="https://spyglass.org/open-source-ai-was-the-path-forward/">Open Source AI </a><em><a href="https://spyglass.org/open-source-ai-was-the-path-forward/">was</a></em><a href="https://spyglass.org/open-source-ai-was-the-path-forward/"> the Path Forward</a>&#8221;. (Last-minute pre-publication edit: apparently Meta now <a href="https://www.axios.com/2026/04/06/meta-open-source-ai-models">plans</a> to &#8220;eventually&#8221; offer open versions of their next generation of AI models, so maybe Open Source AI is <em>eventually</em> the Path Forward.)</p><p>Sometimes labs go the other way too. After DeepSeek-R1&#8217;s release as an open-weights model, several Chinese labs followed suit. For example, Baidu abandoned its <a href="https://www.yicaiglobal.com/news/future-of-llms-is-closed-source-baidu-ceo-says">commitment to closed models</a> with its ERNIE 4.5 model, <a href="https://www.fool.com/earnings/call-transcripts/2025/02/18/baidu-bidu-q4-2024-earnings-call-transcript/#:~:text=So%2C%20we%20have,stack%20AI%20capabilities.">citing DeepSeek as a motivator</a> for the decision.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a></p><p>The upshot is that the capability gap between open models and the frontier depends a lot on which labs choose to be &#8220;open&#8221;.
This depends on things like <a href="https://www.chinatalk.media/i/172613241/open-source-as-strategic-imperative">government strategies</a>, the ideologies of company CEOs, and business incentives, as Nathan Lambert <a href="https://www.interconnects.ai/p/the-next-phase-of-open-models">points</a> <a href="https://x.com/natolambert/status/2029049751472357631">out</a>:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!cNhU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff1aa56-8bdf-48c6-9c68-6dc450f66831_1190x772.png" width="1190" height="772" alt=""></figure></div><h3>The bottom line</h3><p>So what have we learned from all of this discussion? For me the primary takeaway is this: compute is the biggest factor determining which companies can compete at the capabilities frontier &#8212; efficiency matters too, but it&#8217;s probably not enough to make up for ten times less compute.</p><p>At first glance this seems to paint a gloomy picture for Chinese and open-model labs, but I think it&#8217;s not so clear-cut. This is because the label of &#8220;compute-poor&#8221; may not map onto them very cleanly in the future. Maybe it&#8217;s hard for them to compete with the &#8220;compute-rich&#8221; frontier labs <em>today</em>, but in the future this&#8217;ll depend on how much compute they have, and on which labs choose to be &#8220;open&#8221; or &#8220;closed&#8221;.</p><p>Moreover, I&#8217;ve been framing the whole analysis around competition at the frontier of capabilities, but strictly speaking that might not even be the only priority of these AI labs!
For example, we could imagine Chinese models simply finding a different part of the &#8220;capabilities vs. cheapness&#8221; Pareto frontier:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QhHz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288686f9-a909-4347-9c22-15f3a7ea5f51_1026x1283.png" width="1026" height="1283" alt=""></figure></div><p>In fact, if you look at token usage on <a href="https://openrouter.ai/rankings?view=month">OpenRouter</a>, you&#8217;ll notice that a bunch of the most popular language models are actually Chinese (though this is kinda misleading &#8212; for example, most people who use Claude probably don&#8217;t use it through OpenRouter):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!i2t0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0295cc-d39f-4242-9819-3283b08f872b_1600x734.png" width="1600" height="734" alt=""></figure></div><p>I happen to think that competing along the dimension of raw capabilities matters a ton as the world gets closer and closer to building AGI.
But it&#8217;s not the only thing that matters, and it&#8217;s very possible that different AI labs might not even be &#8220;racing&#8221; toward the same thing at all.</p><div><hr></div><p><em>I&#8217;d like to thank JS Denain, Jaime Sevilla, David Owen, Ben Cottier, Luke Emberson, Lynette Bye, and Stefania Guerra for their feedback and support on this post.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Notably, my colleagues at Epoch <a href="https://epoch.ai/data/ai-models">estimate</a> that DeepSeek-R1 was trained with 4 &#215; 10<sup>24</sup> FLOP, compared to a speculative guesstimate of 4 &#215; 10<sup>25</sup> FLOP for o1, assuming the base model is GPT-4o and post-training compute costs are relatively small. That said, many claims that DeepSeek-R1 was super cheap are <a href="https://newsletter.semianalysis.com/p/deepseek-debates">overstated</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Though I suppose we still need to see what happens with Meta Superintelligence Labs after their big hiring spree.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For example, junior research scientists at OpenAI and Anthropic tend to earn in the ballpark of <a href="https://web.archive.org/web/20260326192700/https://openai.com/careers/research-scientist-research-engineer-early-career-cohort-san-francisco/">$350,000</a> <a href="https://web.archive.org/web/20260326193447/https://job-boards.greenhouse.io/anthropic/jobs/5076616008">per year</a>.
In contrast, junior researchers at DeepSeek earn around <a href="https://www.zhipin.com/zhaopin/e7b8a2f92a29adbf1nN72tW-Ew~~/">&#165;50,000-100,000 per month</a>, which in US dollars works out to $100,000-200,000 per year (note that annual compensation covers 14 monthly payments rather than 12, because of two extra salary-equivalent payments, e.g. at year-end or as bonuses).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Perhaps there&#8217;s a counterargument that you didn&#8217;t necessarily <em>need</em> a lot of compute to come up with these innovations in the first place. But even if that&#8217;s true, it&#8217;s still telling that, in actuality, the innovations were developed by compute-rich AI labs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>But even if I&#8217;m wrong about this, a ten-fold gap in compute is <em>a lot</em>, and it&#8217;s rare to find original software innovations that can make up for it. One exception <em>might</em> be <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models">reasoning models</a>, but notably these were invented by OpenAI &#8212; hardly a compute-poor AI lab!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>As I understand it, the main thing that was <a href="https://fabianfuchsml.github.io/alphafold2/">known</a> was some high-level details about the model architecture. RoseTTAFold was developed knowing that AlphaFold 2 used end-to-end prediction from MSAs with attention and an equivariant structure module, but not the specific architecture, loss functions, or training innovations, which were only revealed when the AlphaFold 2 paper was published on the same day as RoseTTAFold (July 15, 2021).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I say &#8220;pretty close&#8221; because their attempt (<a href="https://www.science.org/doi/10.1126/science.abj8754">RoseTTAFold</a>) doesn&#8217;t quite match AlphaFold 2&#8217;s performance &#8212; if we fish out the numbers using <a href="https://automeris.io/">WebPlotDigitizer</a>, we see that RoseTTAFold scores 80.3% compared to AlphaFold 2&#8217;s 90.2% on the &#8220;Average TM-score&#8221; metric in the CASP14 competition. On the other hand, RoseTTAFold may have used around half the compute &#8212; according to the paper&#8217;s <a href="https://www.biorxiv.org/content/10.1101/2022.12.09.519842v1.supplementary-material?versioned=true">supplementary material</a>, it was trained for 4 weeks on 64 NVIDIA V100 GPUs. <a href="https://epoch.ai/blog/estimating-training-compute">Assuming</a> 16-bit precision and a utilization of 30%, this works out to 1.4e21 FLOP. In contrast, my colleagues <a href="https://epoch.ai/data/ai-models">estimate</a> that AlphaFold 2 needed 3e21 FLOP, which is around twice that.
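</p><p>If you want to reproduce that 1.4e21 figure, the standard back-of-the-envelope estimate is GPU count &#215; peak FLOP/s &#215; utilization &#215; training time. A minimal sketch (the ~31 TFLOP/s number is my assumption for a V100&#8217;s FP16 non-tensor-core peak, not something stated in the sources):</p><pre><code># Back-of-the-envelope training compute: GPUs x peak FLOP/s x utilization x seconds
n_gpus = 64                  # NVIDIA V100s, per the RoseTTAFold supplementary material
peak_flops = 31.4e12         # assumed FP16 (non-tensor-core) peak per V100, in FLOP/s
utilization = 0.30           # assumed hardware utilization
seconds = 4 * 7 * 24 * 3600  # 4 weeks of training

print(f"{n_gpus * peak_flops * utilization * seconds:.1e}")  # ~1.5e21 FLOP
</code></pre><p>That lands at roughly 1.5e21 FLOP, in the same ballpark as the ~1.4e21 quoted above (the exact value depends on which peak throughput you assume).</p><p>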
I don&#8217;t know how the performance of AlphaFold 2 and RoseTTAFold scales with more training compute, but I&#8217;d guess that AlphaFold 2 was more compute-efficient overall.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>A caveat: I&#8217;ve heard through word of mouth that Anthropic was actually able to reimplement reasoning models within about two weeks of o1&#8217;s announcement, which doesn&#8217;t sound crazy to me. I&#8217;m not sure what this means for the time lag between o1 and other reasoning models though &#8212; probably OpenAI had &#8220;finished&#8221; developing o1 some time before they announced it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Note that I&#8217;m referring to spillovers between firms in the same industry (&#8220;technological spillovers&#8221;) rather than spillovers from firms in the same region (&#8220;<a href="https://www.nber.org/system/files/working_papers/w3993/w3993.pdf">geographical spillovers</a>&#8221;). Both matter, and in fact the latter probably differentially benefits the cluster of compute-rich AI labs in the Bay Area relative to other AI companies.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>It&#8217;s not directly comparable to the metrics we&#8217;re looking at here, but we can try to get a sense of the effect sizes. According to one <a href="https://cep.lse.ac.uk/textonly/_new/staff/vanreenen/pdf/ecta9466.pdf">paper</a>, if other firms working on similar tech increase their total R&amp;D by 10%, your own firm&#8217;s market value also grows by 2-3%.
That doesn&#8217;t sound like the kind of thing that lets you compete with ten-fold less compute!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>If it did, it would also disincentivize frontier labs from doing R&amp;D <a href="https://www.nber.org/system/files/chapters/c2144/c2144.pdf">&#224; la Kenneth Arrow</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Strictly speaking this is an abuse of terminology &#8212; distillation <a href="https://arxiv.org/abs/1503.02531">as originally defined</a> means training a smaller model on a &#8220;teacher&#8221; model&#8217;s full probability distribution over all possible outputs, not just its final generated outputs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Another piece of evidence is &#8220;<a href="https://www.dbreunig.com/2025/05/30/using-slop-forensics-to-determine-model-ancestry.html">slop forensics</a>&#8221;, used to figure out which models are related to one another (and hence which models inherited capabilities from frontier models).</p><p>This works by looking at which words or phrases show up disproportionately often in a given model&#8217;s outputs, to pick up on that model&#8217;s &#8220;signature&#8221;. Then it compares these signatures to see the relations between models, which gives something like an evolutionary tree. If you stare at the outputs, you could <a href="https://x.com/sam_paech/status/1928187246689112197">conclude</a> that DeepSeek switched from training on synthetic data from OpenAI models to synthetic outputs from Gemini. I&#8217;m not sure how much to update on this kind of evidence though. For example, I&#8217;ve also heard <a href="https://x.com/teortaxestex/status/2026130112685416881">suggestions</a> that Anthropic&#8217;s models might be trained on data from DeepSeek&#8217;s models! So what&#8217;s the overall picture?</p>
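<p>As a minimal sketch of the flavor of analysis involved (my own toy illustration, not the actual slop-forensics pipeline, which works with phrase statistics and bioinformatics-style tree building), you could compute each model&#8217;s most over-represented words and compare the overlap between models:</p><pre><code># Toy "slop forensics": each model's over-represented words form a signature,
# and signature overlap hints at which models are related.
from collections import Counter

def words(texts):
    return [w for t in texts for w in t.lower().split()]

def signature(texts, baseline_freq, top_k=10):
    counts = Counter(words(texts))
    total = sum(counts.values())
    # Over-representation: the model's word frequency relative to the pooled corpus.
    score = {w: (c / total) / baseline_freq[w] for w, c in counts.items()}
    return set(sorted(score, key=score.get, reverse=True)[:top_k])

# Toy outputs; in practice these would be thousands of sampled responses.
outputs = {
    "model_a": ["let us delve into this rich tapestry", "we delve into the tapestry"],
    "model_b": ["i will delve into the tapestry of ideas", "a rich tapestry"],
    "model_c": ["here is a plain answer", "nothing fancy about this answer"],
}
pooled = Counter(w for ts in outputs.values() for w in words(ts))
total = sum(pooled.values())
baseline = {w: c / total for w, c in pooled.items()}

sigs = {m: signature(ts, baseline) for m, ts in outputs.items()}
for a in sigs:
    for b in sigs:
        if a &lt; b:  # each unordered pair once
            overlap = len(sigs[a] &amp; sigs[b]) / len(sigs[a] | sigs[b])
            print(a, b, round(overlap, 2))  # higher overlap = more "related"
</code></pre></div></div>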
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Note that this uses the more accurate definition of &#8220;distillation&#8221;, with access to probability distributions over tokens, rather than just final output tokens.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Here the constraint is to maintain around 95% of the original benchmark performance, and broad capabilities across a range of different domains. You can probably do a lot better than that if you only need to keep capabilities on a particular set of benchmarks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>The <a href="https://arxiv.org/abs/2501.12948">paper</a> describes using 800,000 examples with around 5,000 tokens each, which works out to around 4 billion tokens.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>MiniMax <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">allegedly</a> had at least 10 million exchanges with Claude to gather its data. If each &#8220;exchange&#8221; had around <a href="https://www.interconnects.ai/p/how-much-does-distillation-really#:~:text=Estimates%20will%20vary%2C%20but%20if%20each%20response%20had%2010%2D25K%20tokens%20per%20exchange%2C%20the%20total%20tokens%20across%20these%20two%20labs%2C%20mostly%20with%20MiniMax%2C%20would%20be%20150%2D400%20billion%20tokens">10,000 tokens</a>, as Nathan Lambert suggests, this works out to 100 billion tokens in total.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Some kinds of distillation involve much more data. For example, some Llama 4 models were <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">trained with &#8220;codistillation&#8221;</a>, where the smaller model is trained alongside a much larger one that&#8217;s still learning itself, so it sees as much data as the larger model. This kind of distillation is probably much harder for compute-poor labs to replicate just by externally accessing frontier models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>As far as I can tell, it&#8217;s not the case that Alibaba now only releases closed models &#8212; they have a hybrid structure where most models are open, but their best models are closed.
This isn&#8217;t totally new either &#8212; it&#8217;s been the case <a href="https://www.cometapi.com/alibabas-qwen-is-it-truly-open-source/">since mid-2024</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Though I&#8217;m not sure if they&#8217;ve reverted to a closed-weight approach with <a href="https://ernie.baidu.com/blog/posts/ernie5.0/">ERNIE 5.0</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[What do frontier AI companies' job postings reveal about their plans?]]></title><description><![CDATA[A fast increase in go-to-market roles, and hints about upcoming products]]></description><link>https://epochai.substack.com/p/what-do-frontier-ai-companies-job</link><guid isPermaLink="false">https://epochai.substack.com/p/what-do-frontier-ai-companies-job</guid><dc:creator><![CDATA[JSD]]></dc:creator><pubDate>Thu, 26 Mar 2026 15:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!O98n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd4478e5-f99a-4ac2-a843-2725b74978f7_1026x1283.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/ai-lab-job-postings">Epoch AI</a>.</em></p><div><hr></div><p>AI companies guard their strategies closely. Their hiring pages, however, are public.</p><p>And those postings contain clues about what products a company is developing, who it hopes to sell them to, and which bottlenecks it sees coming. A posting for a &#8220;Camera ISP Software Engineer&#8221; suggests a device with a camera. A search for &#8220;Forward Deployed Engineers&#8221; hints at the challenges of deploying AI inside companies. A cluster of roles mentioning robotics implies ambitions well beyond chatbots.</p><p style="text-align: justify;">We analyzed open roles at the leading foundation labs, including OpenAI, Anthropic, xAI and Google DeepMind. Here is what we found:</p><ul><li><p style="text-align: justify;">First, <a href="https://docs.google.com/document/d/19i41rG9NR6rV0JkHoyVMX06h_GmdrtXBvFfUiRY84-Y/edit?tab=t.262k7agn5hfz#heading=h.5brt0rid9fsu">sales and sales-related hiring has increased sharply</a> at both Anthropic and OpenAI over the past year. Anthropic&#8217;s go-to-market share of open roles grew from 17% to 31% and OpenAI&#8217;s from 18% to 28%. This increase has been particularly concentrated in technical roles that help clients deploy AI in their companies.</p></li><li><p style="text-align: justify;">Second, open roles can provide insight into the <a href="https://docs.google.com/document/d/19i41rG9NR6rV0JkHoyVMX06h_GmdrtXBvFfUiRY84-Y/edit?tab=t.262k7agn5hfz#heading=h.pezvcegrhike">product roadmap at the labs</a>. For example, OpenAI and DeepMind are both investing in hardware products, such as robotics and consumer devices.
In contrast, Anthropic appears to be focusing on improvements to its core products instead.</p></li><li><p style="text-align: justify;">Third, career pages can give insight into how the leading AI companies are using different strategies to <a href="https://docs.google.com/document/d/19i41rG9NR6rV0JkHoyVMX06h_GmdrtXBvFfUiRY84-Y/edit?tab=t.262k7agn5hfz#heading=h.ealsrd3co8wv">acquire key inputs such as compute or data</a>. For example, OpenAI has 21 open roles related to custom silicon, while Anthropic, which does not have an internal effort aimed at developing its own chips, has none.</p></li></ul><p>A few caveats before we dive in. Job postings tell us about roles a company is trying to fill, not about its current staff. For example, a team with 20 open roles could be large and growing, or a brand-new team that doesn&#8217;t exist yet. We also can&#8217;t tell how many people will be hired for each posting: a single &#8220;Research Engineer&#8221; listing might yield one hire or ten, or never get filled.</p><h1><strong>Go-to-market is the top hiring category at OpenAI and Anthropic</strong></h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!O98n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd4478e5-f99a-4ac2-a843-2725b74978f7_1026x1283.png" width="1026" height="1283" alt=""></figure></div>
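<p style="text-align: justify;">The analysis behind a chart like this boils down to bucketing job titles into categories and computing each category&#8217;s share. As a minimal sketch of that kind of categorization (the keyword taxonomy and job titles below are hypothetical illustrations, not Epoch&#8217;s actual methodology):</p><pre><code># Hypothetical sketch: bucket postings into coarse categories via title keywords.
from collections import Counter

CATEGORIES = {  # illustrative keyword taxonomy, not Epoch's actual one
    "go-to-market": ["sales", "account", "solutions architect",
                     "forward deployed", "success engineer"],
    "research": ["research scientist", "research engineer"],
    "infrastructure": ["data center", "silicon", "supercomputing"],
}

def categorize(title: str) -> str:
    t = title.lower()
    for category, keywords in CATEGORIES.items():
        if any(k in t for k in keywords):
            return category
    return "other"

# Toy job titles; in practice these would be scraped from a careers page.
postings = ["Account Director, Digital Natives", "Research Engineer, Codex",
            "Forward Deployed Engineer", "Data Center Mechanical Engineer"]
shares = Counter(categorize(p) for p in postings)
for category, n in shares.most_common():
    print(f"{category}: {n / len(postings):.0%}")
</code></pre>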
srcset="https://substackcdn.com/image/fetch/$s_!O98n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd4478e5-f99a-4ac2-a843-2725b74978f7_1026x1283.png 424w, https://substackcdn.com/image/fetch/$s_!O98n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd4478e5-f99a-4ac2-a843-2725b74978f7_1026x1283.png 848w, https://substackcdn.com/image/fetch/$s_!O98n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd4478e5-f99a-4ac2-a843-2725b74978f7_1026x1283.png 1272w, https://substackcdn.com/image/fetch/$s_!O98n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd4478e5-f99a-4ac2-a843-2725b74978f7_1026x1283.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;">Sales and sales-related hiring has increased sharply at both Anthropic and at OpenAI over the past year: Anthropic&#8217;s share of open roles dedicated to go-to-market grew from 17% to 31%, while OpenAI&#8217;s grew from 18% to 28%. Perhaps unsurprisingly for companies with rapidly increasing revenue competing over a largely untapped market, sales-related roles now represent the largest category of open roles at both companies. Research, by comparison, now makes up only 12% of open roles at Anthropic and only 7% at OpenAI.</p><p style="text-align: justify;">One subcategory has grown especially fast: technical roles dedicated to helping customers actually adopt AI. 
Both companies are now hiring roles such as <a href="https://web.archive.org/web/20260306135207/https://openai.com/careers/ai-success-engineer-edu-san-francisco/">AI Success Engineers</a>, <a href="https://web.archive.org/web/20260310210044/https://openai.com/careers/partner-ai-deployment-engineer-san-francisco/">Partner AI Deployment Engineers</a>, <a href="https://web.archive.org/web/20260310210050mp_/https://job-boards.greenhouse.io/anthropic/jobs/4977624008">Solutions Architects</a> and <a href="https://web.archive.org/web/20260309022933mp_/https://job-boards.greenhouse.io/anthropic/jobs/5012991008">Forward Deployed Engineers</a>, designed to help customers find opportunities to use AI and integrate it into their businesses. Over the past year, the share of open roles dedicated to adoption increased from 5% to 11% at Anthropic and from 11% to 17% at OpenAI. This suggests customers are struggling to fully utilize OpenAI and Anthropic&#8217;s products, and that an important part of bridging this gap is educating customers about what is possible.</p><p style="text-align: justify;">The geographic distribution of sales roles also tells us something about where these companies see their market. More than half of each company&#8217;s sales roles are located in the U.S., with 52% of open roles at Anthropic and 55% at OpenAI. Neither company discloses its geographic revenue split, but the hiring concentration suggests the U.S. remains the primary market for both Anthropic and OpenAI by a wide margin.</p><p style="text-align: justify;">Internationally, both companies are hiring aggressively across Europe and Asia-Pacific. Anthropic tilts more toward Europe, where 29% of its open sales roles are located, versus 21% for OpenAI. OpenAI tilts more toward Asia-Pacific, at 24% versus Anthropic&#8217;s 19%. Within Asia-Pacific, the growth is concentrated in <a href="https://web.archive.org/web/20260303204816/https://job-boards.greenhouse.io/anthropic/jobs/5055488008">Japan</a>, <a href="https://web.archive.org/web/20260225002840/https://job-boards.greenhouse.io/anthropic/jobs/5014500008">South Korea</a>, <a href="https://web.archive.org/web/20260225002927/https://job-boards.greenhouse.io/anthropic/jobs/5117581008">India</a>, <a href="https://web.archive.org/web/20260307235220/https://openai.com/careers/ai-success-engineer-singapore-singapore/">Singapore</a>, and <a href="https://web.archive.org/web/20260210184604/https://openai.com/careers/ai-deployment-engineer-sydney-australia/">Australia</a>. Notably absent are China, the Middle East, Latin America, and Africa. The labs&#8217; focus on global sales suggests that they do not expect to be crowded out of the European and Asia-Pacific markets by national champions.</p><p style="text-align: justify;">Government sales appear to be another important priority. OpenAI and Anthropic each have 10 <a href="https://web.archive.org/web/20260316200123/https://openai.com/careers/search/?c=9dd96fff-cd6b-431f-ba41-7a73ab3dea57">sales roles</a> covering federal civilian, defense, and state and local government.
Of these, 1 of OpenAI&#8217;s roles specifically <a href="https://web.archive.org/web/20260312013141/https://openai.com/careers/government-account-director-defense-washington-dc/">targets national security</a>, as do <a href="https://web.archive.org/web/20260225003908/https://job-boards.greenhouse.io/anthropic/jobs/5118189008">2</a> <a href="https://web.archive.org/web/20260227015238/https://job-boards.greenhouse.io/anthropic/jobs/5126080008">of</a> Anthropic&#8217;s. xAI has 2 open sales roles targeting international governments, located in <a href="https://web.archive.org/web/20260317133006/https://job-boards.greenhouse.io/xai/jobs/5074107007">London</a> and <a href="https://web.archive.org/web/20260306021853/https://job-boards.greenhouse.io/xai/jobs/5050658007">Dubai</a>, and 1 open role targeting the <a href="https://web.archive.org/web/20260306021850/https://job-boards.greenhouse.io/xai/jobs/4793939007">US government</a>. These roles underscore the importance of government as a future revenue stream for the foundation labs.</p><p style="text-align: justify;">Unlike at Anthropic, OpenAI and xAI, where open sales roles can give insight into current go-to-market priorities, DeepMind&#8217;s job ads reveal little about its sales motion, since Google&#8217;s existing sales organization handles distribution for Gemini.</p><h1><strong>Job postings shed light on new product bets at OpenAI and DeepMind</strong></h1><p>Open roles can also offer a window into what each lab is building. Anthropic has 5 <a href="https://web.archive.org/web/20260310171808/https://job-boards.greenhouse.io/anthropic/jobs/4816198008">open</a> <a href="https://web.archive.org/web/20260310171808/https://job-boards.greenhouse.io/anthropic/jobs/5126691008">product</a> <a href="https://web.archive.org/web/20260303165343/https://job-boards.greenhouse.io/anthropic/jobs/5098025008">and</a> <a href="http://web.archive.org/web/20260310171808/https://job-boards.greenhouse.io/anthropic/jobs/5098506008">engineering</a> <a href="http://web.archive.org/web/20260225003250/https://job-boards.greenhouse.io/anthropic/jobs/4985920008">roles</a> aimed at improving Claude Code, while OpenAI has <a href="https://web.archive.org/web/20260323165515/https://openai.com/careers/software-engineer-codex-app-san-francisco/">10</a> <a href="https://web.archive.org/web/20260323165504/https://openai.com/careers/software-engineer-codex-cloud-san-francisco/">similar</a> <a href="https://web.archive.org/web/20260219030422/https://openai.com/careers/product-manager-codex-san-francisco/">roles</a> <a href="https://web.archive.org/web/20260224171653/https://openai.com/careers/research-engineer-codex-san-francisco/">aimed</a> at improving Codex. OpenAI and <a href="https://web.archive.org/web/20260323185508/https://job-boards.greenhouse.io/anthropic/jobs/5140492008">Anthropic</a> each have an engineering role dedicated to building product add-ons for financial services.
OpenAI also has 3 <a href="https://web.archive.org/web/20260210185436/https://openai.com/careers/full-stack-engineer-health-ai-san-francisco/">roles</a> <a href="https://web.archive.org/web/20260210185157/https://openai.com/careers/lead-product-designer-health-ai-san-francisco/">targeting</a> <a href="https://web.archive.org/web/20260210185207/https://openai.com/careers/researcher-health-ai-san-francisco/">new</a> features for <a href="https://openai.com/index/introducing-chatgpt-health/">ChatGPT Health</a> and <a href="https://openai.com/index/openai-for-healthcare/">OpenAI for Healthcare</a>.</p><p>That said, the view from job postings is imperfect for existing products. It can be hard to tell whether a role is expanding an existing feature or building something new, and platform or infrastructure roles often span multiple products. Therefore, job postings are most informative when they reveal new bets.</p><p style="text-align: justify;">First, OpenAI&#8217;s postings show it is investing heavily in a consumer hardware <a href="https://www.theinformation.com/articles/inside-openai-team-developing-ai-devices">device</a>, with 15 open roles for the project. The postings reveal a fair amount about what the device looks like: a <a href="https://web.archive.org/web/20260310212010/https://openai.com/careers/camera-isp-software-engineer-consumer-devices-san-francisco/">Camera ISP Software Engineer role</a> describes building imaging systems for a battery-powered portable device, a <a href="https://web.archive.org/web/20260311150716/https://openai.com/careers/senior-research-engineerscientist-on-device-transformer-models-san-francisco/">Research Engineer role</a> focuses on running transformer models directly on the device, and an <a href="https://web.archive.org/web/20260323190250/https://openai.com/careers/operating-systems-engineer-or-consumer-devices-san-francisco/?__cf_chl_rt_tk=ONdDc4l7KYnkrBlzRW6buU8uAGCermtHQ5G7tlbnwWo-1774292570-1.0.1.1-SyJT9CzE1IfnvbMXrDQ.r9s88.ZvPX2V3RncK6_6H2s">Operating Systems Engineer role</a> references custom silicon. Taken together, these roles suggest something like a portable device equipped with a camera, running its own AI chip, and designed to run AI models on the edge. Moreover, two Singapore-based <a href="https://web.archive.org/web/20260310212337/https://openai.com/careers/hardware-exploratory-singapore-singapore/">hardware</a> and <a href="https://web.archive.org/web/20260310212234/https://openai.com/careers/operations-exploratory-singapore/">operations</a> roles suggest a move toward manufacturing. DeepMind is also making a hardware bet, with two open roles for the <a href="https://web.archive.org/web/20260310213231/https://job-boards.greenhouse.io/deepmind/jobs/7094457">development of XR glasses</a>, one of which suggests <a href="https://web.archive.org/web/20260310214125/https://job-boards.greenhouse.io/deepmind/jobs/7651945">voice commands</a> will be a key interaction mode.</p><p style="text-align: justify;">Second, both OpenAI and DeepMind are betting on robotics. OpenAI has seven robotics roles that indicate work on <a href="https://web.archive.org/web/20260310212421/https://openai.com/careers/simulation-environments-engineer-san-francisco/">training robots in simulation</a> at scale and <a href="https://web.archive.org/web/20260210184843/https://openai.com/careers/simulation-realism-engineer-san-francisco/">improving simulation realism</a>.
The postings also suggest some robots may have <a href="https://web.archive.org/web/20260210185227mp_/https://openai.com/careers/technical-soft-goods-designer-san-francisco/">soft components or coverings</a>, and that production is scaling up. DeepMind is hiring for 9 roles dedicated to its robotics program, which suggest it is <a href="https://web.archive.org/web/20260310213305/https://job-boards.greenhouse.io/deepmind/jobs/7576917">building a humanoid robot</a> with <a href="https://web.archive.org/web/20260310213222/https://job-boards.greenhouse.io/deepmind/jobs/7647241">dexterous hands</a>.</p><p style="text-align: justify;">Beyond hardware, OpenAI has 2 incubation-stage <a href="https://web.archive.org/web/20260210185245/https://openai.com/careers/full-stack-software-engineer-social-products-san-francisco/">roles</a> focused on social products, and <a href="https://web.archive.org/web/20260323190541/https://openai.com/careers/full-stack-software-engineer-jobs-platform-san-francisco/">one</a> for a <a href="https://openai.com/index/expanding-economic-opportunity-with-ai/">jobs platform</a> designed to help people train, certify their skills, and get matched with employers. Anthropic has a <a href="https://web.archive.org/web/20260225003329/https://job-boards.greenhouse.io/anthropic/jobs/5096878008">general research product manager</a> role dedicated to creating entirely new product categories, and <a href="https://web.archive.org/web/20260227062635/https://job-boards.greenhouse.io/anthropic/jobs/5127559008">another</a> aimed at developing new consumer products.</p><h1 style="text-align: justify;"><strong>Job postings also offer clues about how labs secure compute and data</strong></h1><p style="text-align: justify;">Open roles at the labs also give insight into how each company approaches its raw inputs: compute and data. The clearest divide is between labs that are building their own compute infrastructure and labs that are contracting for it.</p><p style="text-align: justify;">OpenAI has 21 open roles, mostly engineering, associated with its internal chip development effort, which represents about 3% of its currently listed roles. Anthropic, which is not developing its own custom silicon, has taken a different path: it has multiple roles focused on overseeing datacenter design and construction with external partners, including a <a href="https://web.archive.org/web/20260309001936/https://job-boards.greenhouse.io/anthropic/jobs/5140329008">Data Center Mechanical Engineer</a> responsible for directing cooling and mechanical system design produced by external firms, and a <a href="https://web.archive.org/web/20260323234145/https://job-boards.greenhouse.io/anthropic/jobs/5157023008">Data Center Design Execution Lead</a> who bridges Anthropic&#8217;s technical requirements and third-party delivery partners. Anthropic also has 3 open legal roles dedicated to <a href="https://web.archive.org/web/20260217070930/https://job-boards.greenhouse.io/anthropic/jobs/5044994008">negotiating datacenter</a> and <a href="https://web.archive.org/web/20260224233700/https://job-boards.greenhouse.io/anthropic/jobs/5045003008">colocation</a> contracts.</p><p style="text-align: justify;">Another input that features prominently in hiring is <a href="https://epoch.ai/gradient-updates/state-of-rl-envs">RL training environments</a>.
Anthropic has multiple roles focused on building environments for training models on new capabilities, including an <a href="https://web.archive.org/web/20260226201551/https://job-boards.greenhouse.io/anthropic/jobs/4951064008">Environment Scaling</a> team that builds RL environments and manages vendor relationships, and a <a href="https://web.archive.org/web/20260321020533/https://job-boards.greenhouse.io/anthropic/jobs/5061517008">Universes</a> team building ultra-realistic long-horizon settings for agentic training. OpenAI is also hiring researchers for a <a href="https://web.archive.org/web/20260221181508/https://openai.com/careers/researcher-synthetic-rl-san-francisco/">Synthetic RL</a> team developing RL training methods using self-play, simulators, and synthetic feedback.</p><p style="text-align: justify;">xAI&#8217;s hiring suggests a distinct approach to <a href="https://x.ai/careers/open-roles?dept=4025087007,4064287007,4064300007,4064289007,4064288007,4064290007,4064286007">human data</a>: in contrast to OpenAI, Anthropic, and DeepMind, which have no dedicated human labeler roles, it has 27 open human data roles, which suggests it prefers to keep its data labeling operations in-house. It&#8217;s also notable that xAI is comfortable advertising these roles publicly. Other labs also rely on human labeling at scale, but tend to outsource it or keep the hiring less visible.</p><h1 style="text-align: justify;"><strong>Conclusion</strong></h1><p style="text-align: justify;">Job postings are an imperfect signal, but they&#8217;re one of the few public windows into how the leading AI labs are evolving. The picture they paint right now is of companies investing heavily in selling and deploying their products, diversifying into new product categories, and competing for key inputs like compute and data. As the labs continue to grow and their strategies diverge, their career pages will remain one of the best places to watch it happen.</p><div><hr></div>
<p><em>Thanks to Lynette Bye for helpful comments and editing.</em></p>]]></content:encoded></item><item><title><![CDATA[Final training runs account for a minority of R&D compute spending]]></title><description><![CDATA[New evidence following the MiniMax and Z.ai IPOs]]></description><link>https://epochai.substack.com/p/final-training-runs-account-for-a</link><guid isPermaLink="false">https://epochai.substack.com/p/final-training-runs-account-for-a</guid><dc:creator><![CDATA[JSD]]></dc:creator><pubDate>Tue, 24 Mar 2026 16:16:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WC1W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0309abb9-75f4-44bf-b6aa-35bf13eaced4_1026x1283.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/r-and-d-vs-training-compute">Epoch AI</a>.</em></p><div><hr></div><p>In the popular picture of how AI companies use compute, there are two big buckets: training and inference.</p><p>But in reality, the R&amp;D side is more complex. The final training run &#8212; the one that produces the model with a name &#8212; is only the last step in a long, expensive process of exploration. Before that run begins, companies burn through compute on running experiments at various scales, generating synthetic data, testing which ideas work before committing to a final run, and training models that are never released.</p><p>This distinction matters. When people discuss <a href="https://epoch.ai/gradient-updates/three-issues-undermining-compute-based-ai-policies">compute thresholds</a> or the cost of training a frontier model, they often mean the final training run. However, the full cost of <em>developing </em>that model is much higher. And if most of the spending is exploration rather than execution, then a competitor who learns what works from the frontier could replicate the results for a fraction of the original cost.</p><p>So there&#8217;s more to R&amp;D compute than final training runs, but how much? Last year, we <a href="https://epoch.ai/data-insights/openai-compute-spend">estimated</a> the breakdown of OpenAI&#8217;s compute spending: around $5 billion went to R&amp;D compute in 2024, with only about $500 million &#8211; roughly 10% &#8211; going to the final training runs that produced released models such as GPT-4.5. The rest went to scaling experiments, synthetic data generation, basic research, or other R&amp;D workloads.</p><p>However, that estimate only looked at a single company (OpenAI) for one year (2024).
It&#8217;s hard to know how general this pattern is, since companies rarely disclose how they allocate their compute.</p><p>But in early January, two smaller Chinese AI companies, MiniMax and Z.ai, disclosed their R&amp;D compute spending as part of their IPO processes. Since most of MiniMax&#8217;s and Z.ai&#8217;s models are publicly released, we can estimate how much compute went to their final training runs, and compare.</p><p>We find that the pattern holds: for all three companies, final training runs account for a minority of R&amp;D compute spending, despite major differences in scale, country, and business model. We also explore what these ratios might tell us about catch-up growth: companies that can learn from the frontier should in theory need less experimentation and devote more of their R&amp;D compute to training. MiniMax&#8217;s higher ratio is consistent with this, but Z.ai&#8217;s is not, and with only three data points, the evidence remains inconclusive.</p><h1><strong>Breaking down MiniMax and Z.ai&#8217;s compute spending</strong></h1><p>Our analysis relies on two types of data: R&amp;D compute spending and final training run compute spending.</p><p>The R&amp;D compute spending comes directly from MiniMax&#8217;s and Z.ai&#8217;s IPO documents. Both companies went public on the Hong Kong Stock Exchange in early January 2026. Since their disclosed financials do not cover the full year 2025, we instead rely on the most recent periods available: for MiniMax, we use reported compute spending from Q4 2024 to Q3 2025, and for Z.ai, we use H2 2024 to H1 2025.</p><p>Following the procedure in <a href="https://epoch.ai/data-insights/openai-compute-spend">our OpenAI estimate</a>, we match the reported spending window to the training compute of models released one quarter later. For MiniMax, we look at the training compute of models released in 2025. For Z.ai, we look at the training compute of models released in 2024 Q4 &#8211; 2025 Q3. This accounts for the fact that firms spend the R&amp;D compute for a model several months before its public release.</p><p>Most of the final training run compute spending comes from our <a href="https://epoch.ai/data/ai-models">database of AI models</a>, with several assumptions. Epoch reports the models&#8217; training run compute in FLOP.
To convert FLOP into USD, we use the following equation:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ixj4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9169c538-715c-4ce4-95ee-a63ea9874d46_1033x98.png" width="1033" height="98" alt="Cost (USD) = training compute (FLOP) / (GPU peak FLOP/s &#215; MFU &#215; 3,600 s/hour) &#215; price per GPU hour"></figure></div><p>In words: a model&#8217;s training compute divided by each GPU&#8217;s realized throughput (peak FLOP/s &#215; MFU) gives the training time in GPU-seconds; converting to GPU hours and multiplying by the price per GPU hour gives the cost.</p><p>Where:</p><ul><li><p>GPU peak FLOP/s = 9.89e14 (H800 dense BF16/FP16), unless developers specifically state the model is trained with FP8 precision.</p></li><li><p>MFU (Model FLOP Utilization) is assumed to be 0.15&#8211;0.35 for pre-training and supervised fine-tuning, and 0.05&#8211;0.20 for RL (reinforcement learning), since RL is inference-heavy and inference has lower utilization than training. We propagate this MFU uncertainty by Monte Carlo simulation (repeatedly sampling the uncertain inputs and recomputing the cost to obtain an uncertainty interval).</p></li><li><p>Price/GPU hour is assumed to be $1.50&#8211;$3.00. We propagate this price uncertainty by Monte Carlo as well.</p></li></ul>
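<p>To make the arithmetic concrete, here is a minimal Python sketch of this conversion. This is not the code we used, and the run size, MFU, and price in the example are hypothetical, chosen only to illustrate the formula:</p><pre><code># Minimal sketch of the FLOP-to-USD conversion described above.
# The peak-throughput constant is the H800 dense BF16/FP16 figure
# from the text; the example inputs below are hypothetical.

H800_PEAK_FLOPS = 9.89e14  # dense BF16/FP16 peak, FLOP/s
SECONDS_PER_HOUR = 3600

def training_run_cost_usd(flop, mfu, price_per_gpu_hour):
    """Cost of a training run: implied GPU hours times the hourly price."""
    effective_flops = H800_PEAK_FLOPS * mfu            # realized FLOP/s per GPU
    gpu_hours = flop / effective_flops / SECONDS_PER_HOUR
    return gpu_hours * price_per_gpu_hour

# Hypothetical run: 3e24 FLOP at MFU 0.25 and $2.00/GPU hour, about $6.7M.
print(f"${training_run_cost_usd(3e24, 0.25, 2.00):,.0f}")
</code></pre>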
<p>Sometimes, firms do not disclose enough information for our <a href="https://epoch.ai/data/ai-models">database of AI models</a> to pin down training compute precisely. In those cases, we estimate a 90% confidence interval for the training compute of those models.</p><p>Most model estimates do not significantly affect the total final training compute, except for MiniMax&#8217;s Hailuo 02. We estimate a 90% CI for its compute spending of [$5M, $51M], with a median of $18M &#8211; about 40% of MiniMax&#8217;s total 2025 final training compute spending. We think this is reasonable, since Hailuo accounted for 32.7% of MiniMax&#8217;s 2025 revenue, and training spend on a model tends to correlate with its commercial importance. Note that if we exclude the Hailuo model family entirely from the final training run compute, MiniMax&#8217;s training-to-R&amp;D ratio is more on par with OpenAI&#8217;s, with a median of 12.2% and a 90% CI of [6.4%, 23.5%].</p><p>Lastly, data for OpenAI&#8217;s compute allocation comes from our previous <a href="https://epoch.ai/data-insights/openai-compute-spend">data insight</a>.</p><h1><strong>Final training runs are a small fraction of R&amp;D compute spending</strong></h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!WC1W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0309abb9-75f4-44bf-b6aa-35bf13eaced4_1026x1283.png" width="1026" height="1283" alt="Chart: final training runs as a share of R&amp;D compute spending for OpenAI, MiniMax, and Z.ai"></figure></div>
<p>With these estimates in hand, here&#8217;s what we find. For all of OpenAI, Z.ai, and MiniMax, final training runs account for a small fraction of total R&amp;D compute spending: 9.6% for OpenAI, 22.6% for MiniMax, and 12.3% for Z.ai. This pattern holds despite major differences between the three companies. OpenAI spent $5 billion on R&amp;D compute in 2024, while MiniMax spent $141 million and Z.ai $216 million. They operate in different countries, under different regulatory environments, and with different business models.
Yet in all three cases, the bulk of R&amp;D compute spending went to something other than the final training runs for released models. This shows that our <a href="https://epoch.ai/data-insights/openai-compute-spend">earlier findings</a> on OpenAI in 2024 were not a fluke.</p><p>An important caveat is that these figures measure compute spending, not the actual FLOP performed. The distinction matters because training runs and research often achieve different levels of MFU (Model FLOP Utilization), i.e., how efficiently the GPUs are used. R&amp;D workloads tend to have lower utilization than final training runs, because experiments involve overhead like waiting between runs, failed jobs, and debugging, all of which leave GPUs idle. Final training runs, in contrast, are optimized to keep hardware consistently busy. As a result, even if the two activities incur the same spending, R&amp;D could use fewer actual FLOP: at the same price per GPU hour, a workload running at 0.1 MFU performs only a third as many FLOP per dollar as a training run at 0.3 MFU.</p><h1><strong>R&amp;D compute and catch-up growth</strong></h1><p>So far, we&#8217;ve focused on what these three companies have in common: final training runs are a small share of R&amp;D compute spending in all three cases. But the ratios aren&#8217;t identical. Can we learn anything from the differences?</p><p>A major difference between OpenAI and Z.ai/MiniMax is that OpenAI operates at the frontier of AI capabilities and spends an order of magnitude more on compute. At the frontier, companies face many plausible research directions, so they need to spend heavily on experiments to figure out which ideas work at scale.</p><p>If a company is farther from the frontier, it should be able to learn what works from leading companies, skip much of this costly experimentation, and allocate more of its R&amp;D compute to actual training runs. There are many ways this can happen: ideas spread through publications and rumors, competitors can train on data generated by frontier models, they can analyze a model&#8217;s behavior to reverse-engineer how it was built, or they simply learn that something is <em>possible</em> and focus their efforts accordingly.</p><p>This predicts that lagging companies should have higher training-to-R&amp;D ratios than frontier companies. MiniMax fits this prediction: its ratio appears higher than OpenAI&#8217;s or Z.ai&#8217;s. This is also consistent with Anthropic&#8217;s <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">disclosure</a> that MiniMax ran a large-scale distillation campaign against Claude, extracting over 13 million exchanges (more than DeepSeek or Moonshot). Z.ai, on the other hand, doesn&#8217;t clearly fit the catch-up prediction, since its training-to-R&amp;D ratio looks similar to OpenAI&#8217;s.</p><p>Overall, we still have only three data points, with wide error bars on each. So these results don&#8217;t provide particularly strong evidence for or against the catch-up story. The pattern is suggestive in MiniMax&#8217;s case, but we can&#8217;t draw confident conclusions about what drives the differences in training-to-R&amp;D ratios across companies.
Better understanding the business models and incentives of Z.ai and MiniMax could help explain why the two companies seemingly differ in their R&amp;D compute allocation.</p><h1><strong>Conclusion</strong></h1><p>Final training runs account for a minority of R&amp;D compute spending across all three companies we examined: this confirms that the pattern we found for OpenAI in 2024 extends to smaller companies in a different country. The differences between companies are harder to interpret with so few data points.</p><p>We&#8217;d like to do this analysis for more companies, but the binding constraint is getting reliable data on compute spending. IPO disclosures have been our best source so far, and there may be more opportunities soon when OpenAI, SpaceX, and Anthropic IPO.</p><div><hr></div><p><em>Thanks to Konstantin Pilz and Lynette Bye for helpful comments and editing.</em></p><h1>Appendix</h1><h2>R&amp;D Compute Spending</h2><p>MiniMax and Z.ai disclose their R&amp;D compute spending in their IPO documents. Z.ai&#8217;s figures are reported in RMB; to convert them to USD, we use an exchange rate of 7.2 RMB/USD, the average rate in 2025.</p><h2>Final Training Run Compute Spending</h2><p>By the &#8220;final training run compute&#8221; of a model, we mean the compute used to train that model specifically, including pre-training and post-training.
If both a base model and a fine-tuned version of it were released, we do not double-count the compute used to train the base model.</p><p>To estimate the cost of final training runs in USD, we use the aforementioned equation:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!zQon!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcb5a179-3a04-48df-ac31-7ae32a924f89_1044x75.png" width="1044" height="75" alt="Cost (USD) = training compute (FLOP) / (GPU peak FLOP/s &#215; MFU &#215; 3,600 s/hour) &#215; price per GPU hour"></figure></div><p>We pair this with a Monte Carlo simulation of 50,000 draws to propagate uncertainty across our three input parameters (a sketch of the sampling scheme follows the assumptions below):</p><ul><li><p>MFU, modeled as log-normally distributed, with the 10th and 90th percentiles anchored at 0.15 and 0.35 for pre-training and supervised fine-tuning, and 0.01 and 0.10 for RL post-training.</p></li><li><p>Price/GPU hour, modeled as log-normally distributed, with the 10th and 90th percentiles anchored at $1.50 and $3.00.</p></li><li><p>FLOP counts for models missing training compute information. We explain the distribution we use in more detail below.</p></li></ul><p>We also assume GPU peak FLOP/s = 9.89e14 FLOP/s (H800 dense BF16/FP16), unless developers specifically state the model is trained with FP8 precision, in which case we use GPU peak FLOP/s = 1.513e15 FLOP/s. MiniMax-M2 is the only model trained with FP8 precision in our sample.</p><p>A key methodological choice concerns how uncertainty in MFU and price/GPU hour is handled across models. We assume these are largely market- or period-level conditions rather than model-specific parameters. Thus, in each Monte Carlo draw, we sample one common MFU and one common price/GPU hour and apply them to all models.</p>
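<p>For concreteness, here is a minimal Python sketch of this sampling scheme (again, not our actual code; the per-model FLOP counts below are hypothetical placeholders). It anchors each log-normal at the stated 10th and 90th percentiles and shares one MFU draw and one price draw across all models within each iteration:</p><pre><code># Sketch of the Monte Carlo described above (hypothetical inputs).
import numpy as np

Z90 = 1.2815515655  # 90th percentile of the standard normal

def lognormal_from_p10_p90(p10, p90, size, rng):
    """Sample a log-normal whose 10th/90th percentiles equal p10/p90."""
    mu = (np.log(p10) + np.log(p90)) / 2
    sigma = (np.log(p90) - np.log(p10)) / (2 * Z90)
    return rng.lognormal(mean=mu, sigma=sigma, size=size)

rng = np.random.default_rng(0)
n_draws = 50_000
model_flop = np.array([3e24, 8e23, 1.5e24])  # hypothetical final-run FLOP

# One common MFU and price per draw, applied to every model.
# Pre-training anchors shown; RL post-training would use 0.01/0.10.
mfu = lognormal_from_p10_p90(0.15, 0.35, n_draws, rng)[:, None]
price = lognormal_from_p10_p90(1.50, 3.00, n_draws, rng)[:, None]

gpu_hours = model_flop / (9.89e14 * mfu) / 3600  # H800 BF16/FP16 peak
total_cost = (gpu_hours * price).sum(axis=1)     # sum over models, per draw

p5, p50, p95 = np.percentile(total_cost, [5, 50, 95])
print(f"median ${p50/1e6:.0f}M, 90% CI [${p5/1e6:.0f}M, ${p95/1e6:.0f}M]")
</code></pre><p>The common-draw choice matters: sampling MFU and price independently per model would shrink the uncertainty in the total, since per-model errors would partially cancel.</p>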
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We pair this with a Monte Carlo simulation of 50,000 draws to propagate uncertainty across our three input parameters:</p><ul><li><p>MFU, modeled as log-normally distributed, with the 10th and 90th percentiles anchored at 0.15 and 0.35 for pre-training and supervised fine-tuning, and 0.01 and 0.10 for RL post-training.</p></li><li><p>Price/GPU hour, modeled as log-normally distributed, with the 10th and 90th percentiles anchored at $1.50 and $3.00.</p></li><li><p>FLOP counts for models missing training compute information. We will explain in more detail the distribution below.</p></li></ul><p>We also assume GPU peak FLOP/s = 9.89e14 FLOP/s (H800 dense BF16/FP16), unless developers specifically state the model is trained with FP8 precision, in which case we use GPU peak FLOP/s = 1.513e15 FLOP/s. Minimax-M2 is the only model trained with FP8 precision in our sample.</p><p>A key methodological choice concerns how uncertainty in MFU and price/GPU hour is handled across models. We assume these are largely market- or period-level conditions rather than model-specific parameters. Thus, in each Monte Carlo draw, we sample one common MFU and one common price/GPU hour and apply them to all models.</p><p>FLOP counts are primarily from our <a href="https://epoch.ai/data/ai-models">AI models dataset</a>. Caveats and details are in <a href="https://github.com/CherylWu3/rd-vs-training-compute/blob/9abbcace984a76127854e8c35aaa5fbc15982a3c/model_flop.md">model_flop.md</a> in our <a href="https://github.com/CherylWu3/rd-vs-training-compute">GitHub repository</a>.</p>]]></content:encoded></item><item><title><![CDATA[The least understood driver of AI progress]]></title><description><![CDATA[An opinionated guide to &#8220;algorithmic progress&#8221; and why it matters]]></description><link>https://epochai.substack.com/p/the-least-understood-driver-of-ai</link><guid isPermaLink="false">https://epochai.substack.com/p/the-least-understood-driver-of-ai</guid><dc:creator><![CDATA[Anson Ho]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:44:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2OnP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress">Epoch AI</a>.</em></p><div><hr></div><p>AI software progress is one of those things that everyone vaguely knows about, but only a handful of people in the world truly understand its significance. Consider that many of the most fervent debates in AI to date depend enormously on it: How did <a href="https://peterwildeford.substack.com/p/ten-takes-on-deepseek">DeepSeek</a> seem to catch up to OpenAI&#8217;s o1 within months while using less training compute? <a href="https://www.astralcodexten.com/p/what-happened-with-bio-anchors">When will the world develop AGI</a>? 
And if we automate AI research, will AI progress accelerate like crazy a la <a href="https://situational-awareness.ai/">Situational Awareness</a> and <a href="https://ai-2027.com/">AI 2027</a>?</p><p>I don&#8217;t know your stances on these questions, but I do know that you can&#8217;t have a well-informed opinion on them without understanding software progress. So I figured I should write a post describing the most important things that you need to know, starting from the basics and leading up to the current frontier.</p><p>Here are the main takeaways, one for each section of the post:</p><ol><li><p><strong>AI software progress is about reducing the training compute you need to get to the same level of capability</strong>, through better algorithms or data. This is commonly called &#8220;algorithmic progress&#8221; including by myself, but as we&#8217;ll see, this is probably a bit of a misnomer.</p></li><li><p><strong>Almost all the evidence points to very fast software progress</strong>: each year, the training compute needed to get to the same capability declines several times &#8212; possibly even ten times or more. But existing estimates are also incredibly uncertain, because they depend on limited observational data and dubious statistical assumptions. Relaxing some of these assumptions could change some estimates by close to an order of magnitude!</p></li><li><p><strong>Recent evidence suggests that these estimates might not measure what we thought they did. </strong>Software progress has often been framed as continually finding new algorithmic innovations, which cumulatively bring massive efficiency gains. But <strong>most software progress might actually be due to data quality improvements</strong> (hence &#8220;algorithmic progress&#8221; may be a misnomer). And <strong>much of the measured efficiency gains might come from scaling up just a small handful of &#8220;scale-dependent&#8221; algorithmic changes</strong> &#8212; innovations that have a greater impact at higher training compute scales &#8212; rather than from discovering lots of new algorithms. Both of these effects were mostly unaccounted for in prior literature.</p></li><li><p>Some argue that automating AI research could lead to a <strong>&#8220;software intelligence explosion&#8221;</strong>, with AIs recursively improving themselves. Previous analyses suggest this is plausible but uncertain. However, some of these analyses are based on <strong>overly conservative estimates of software progress, which makes an explosion seem more likely </strong>when corrected. On the other hand, they also ignore scale-dependent innovations, which pose a<strong> compute bottleneck &#8212; it may be hard to get fast software progress without also scaling up training compute, potentially making an explosion less likely. </strong>The net effect isn&#8217;t totally clear: the compute bottleneck argument is suggestive but the empirical evidence is shaky, and there are plausible reasons automated AI researchers could overcome it.</p></li><li><p><strong>There are still big open problems about AI software progress</strong>, like the rate of progress in post-training, and the strength of compute bottlenecks. I think it&#8217;ll be hard to make progress on these questions, but it&#8217;s still worth trying to answer them because AI software progress is so overwhelmingly important.</p></li></ol><p>Now let&#8217;s dig into the details.</p><h2>I. 
AI software progress: Doing more with what we have</h2><p>First things first, what do we mean by &#8220;AI software progress&#8221;? You&#8217;re probably familiar with some examples: when the field of AI switched from LSTMs to Transformers, that was software progress. When OpenAI released the first reasoning models that took time to think before responding to users, that was software progress. It seems simple enough: with better algorithms and data (&#8220;software&#8221;), you can train better AI systems.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> There are other kinds of software progress too (like in inference), but for this post I&#8217;ll focus primarily on training.</p><p>But what exactly does it mean to have &#8220;better&#8221; software? Historically, people have defined this through the lens of AI scaling, where using more training compute is the main thing that increases capabilities. Then you define software progress relative to this: if you have better software, then you need <a href="https://arxiv.org/abs/2005.04305">less compute to reach the same level of capabilities</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>To see what I mean, let&#8217;s say we&#8217;re training a Transformer with a certain amount of compute, shown as a red dot below. If we change the amount of training compute, we get a relationship between training compute and capabilities, which we can draw as a line:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!p9bd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e413eb-5367-462d-be4e-7668f90861c8_1024x820.png" width="1024" height="820" alt="Chart: a line relating training compute (x-axis) to capabilities (y-axis), with one training run marked as a red dot"></figure></div>
<p>Now let&#8217;s introduce a new software innovation, say <a href="https://arxiv.org/abs/2104.09864">rotary embeddings</a> or training data filtering.
Under this definition of software progress, the innovation reduces the training compute we need to reach a certain capability, shifting the curve to the left by some factor, shown as the &#8220;compute efficiency gain&#8221;:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tAPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1f3787-5bdb-4fe7-ab02-0fd54292b7af_1024x820.png" width="1024" height="820" alt="Chart: the compute-capability line shifts left after the innovation; the horizontal shift is labeled the compute efficiency gain"></figure></div>
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What&#8217;s more, this algorithmic shift also means that we can do more with the same amount of compute, reaching the blue level of capabilities:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2OnP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2OnP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png 424w, https://substackcdn.com/image/fetch/$s_!2OnP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png 848w, https://substackcdn.com/image/fetch/$s_!2OnP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png 1272w, https://substackcdn.com/image/fetch/$s_!2OnP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2OnP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c5ffda-685c-472d-80e3-aaa90a59469e_1024x820.png" width="1024" height="820" 
<p>If we had wanted to reach that blue level of capabilities using the original Transformer, we would&#8217;ve needed to use &#8220;compute efficiency gain&#8221; times more training compute:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Kv0v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png" width="1024" height="820" alt="Chart: reaching the blue capability level on the original line requires the compute efficiency gain times more training compute"></figure></div>
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kv0v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 424w, https://substackcdn.com/image/fetch/$s_!Kv0v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 848w, https://substackcdn.com/image/fetch/$s_!Kv0v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 1272w, https://substackcdn.com/image/fetch/$s_!Kv0v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kv0v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png" width="1024" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kv0v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 424w, https://substackcdn.com/image/fetch/$s_!Kv0v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 848w, https://substackcdn.com/image/fetch/$s_!Kv0v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 1272w, https://substackcdn.com/image/fetch/$s_!Kv0v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f6914cb-c8f7-437d-8415-446a52055fca_1024x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 
11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is why researchers often define the &#8220;effective compute&#8221; = &#8220;compute efficiency gain&#8221; &#215; training compute. Because of the software innovation, we <em>in effect</em> have a larger compute budget that we can use to reach higher levels of capability. We can do more with the compute we currently have.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>Of course, this picture is oversimplified. For example, this assumes that the lines are parallel &#8212; more on that later. But even though it&#8217;s oversimplified, I think it helps clarify why it&#8217;s so important to understand software progress. For example, if AI software progress is really fast, then frontier labs could build AGI with much less compute than today&#8217;s software, which could mean building AGI much sooner. And if these innovations are easily replicable by many actors, the implications extend further: shortly after one lab makes a big algorithmic breakthrough, another lab catches up. Shortly after frontier LLMs become good enough to help novices develop bioweapons, the novices will have enough compute to train these LLMs themselves. No wonder some people are so nervous about this!</p><p>But whether these implications actually hold depends on the numbers. How fast is software progress in practice? If we&#8217;re right about this theoretical framing, then this could be one of the most important questions for what AI will look like over the next few years. So let&#8217;s take a look at the evidence.</p><h2>II. 
To answer this question, I searched through the literature for AI software progress estimates, some of which I contributed to.[5] In the case of pre-training, this looks *very* fast, on the order of several times per year:
[Figure: estimates of the rate of pre-training software progress from the literature. You can find the full data and sources in the Appendix.]

Again, this means that we need *several times* less training compute each year to reach the same capability — that's super fast! But I don't think we should trust these specific numbers very much because they're full of problems, and you don't have to squint very hard to see them. First of all, the error bars look almost comically wide in the graph above — across the different estimates, they range from around 1.1× to 300× per year! That's why I say "several times per year" rather than any particular number — it seems pretty plausible based on the range of estimates, but I don't want to signal too much confidence.

Why are these estimates so uncertain? I think the biggest reason is that there's not enough high-quality data out there. All the estimates in the graph above are based on observational approaches, which build a dataset of existing AI models and then try to back out an estimate from that. To do this, each model in the dataset needs three ingredients: (1) a training compute estimate, (2) a measure of its "capabilities", and (3) a record of when it was released.[6] But in practice we often don't know the training compute, and it's hard to find a capability metric that's used for a long time, because things like benchmarks saturate too quickly to capture long-run trends.
So we're left with less data than we'd like, and more uncertainty than we'd want.[7]

These uncertainties are compounded by how we estimate the rate of progress. For example, one approach is what I like to call the "lines on a graph" approach — pick a capability level, find models that achieve it with less training compute over time, and measure how fast that compute requirement falls.[8] That looks like this (hence the name of the approach):

[Figure: the "lines on a graph" approach to estimating the rate of AI software progress. The rate of progress is given by the slope of the line. (Source: https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks)]
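As a toy illustration of the mechanics (with invented data points, not the studies' actual data), the fit looks something like this:

```python
import numpy as np

# "Lines on a graph": for one fixed capability level, record the training
# compute of successive models that reach it. These points are invented.
years = np.array([2020.0, 2021.5, 2023.0, 2024.0])
flop = np.array([3e23, 4e22, 6e21, 1e21])   # compute needed to hit the level

# Fit log10(compute) against time; the slope is how fast the requirement falls.
slope, intercept = np.polyfit(years, np.log10(flop), 1)
gain_per_year = 10 ** (-slope)   # >1 means less compute is needed each year

print(f"Compute requirement falls ~{gain_per_year:.1f}x per year")
```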
Unfortunately, because we don't have large datasets, these "lines on a graph" are fit with only a handful of data points, making the estimates really noisy. On occasion this can yield ludicrously fast estimates, like [30,000× per year](https://www.lesswrong.com/posts/yXLqrpfFwBW5knpgc/catch-up-algorithmic-progress-might-actually-be-60-per-year) or even [5^12× per year](https://arxiv.org/html/2512.00193v1)![9] We can mark these sorts of cases as obviously dubious outliers and focus more on the median trend across the lines-on-graphs, but it somewhat undermines my trust in this approach.

The other way to estimate the rate of software progress is what I'll call the "fancy statistics" approach. Here the idea is to have some more complicated model relating compute, capabilities, and time, fit the parameters to the data, and then use that to back out an estimate. The key difference is that it's more explicit about what this relationship looks like, whereas the "lines on a graph" approach is a bit more agnostic about that.
So if the assumed relationship is wrong, that could throw off the estimates.

But for the most part, both approaches share similar weaknesses. For example, in [one of my own papers](https://arxiv.org/abs/2403.05812) I modeled software progress as growing exponentially and depending only on time. In doing so, I implicitly assumed that "software quality" is the same across all labs at any point in time, which is clearly false — some labs have better algorithms and data than others. Parker Whitfill [argued](https://arxiv.org/abs/2508.11033) that this could bias the estimates to be substantially faster or slower — for instance, growth rates might be as much as 9× slower than what my paper estimates! Sadly for me, that's potentially quite a devastating rebuttal.

Even if we set aside these uncertainties, there are still plenty of problems with all the estimates I showed you. For example, all of them fail to fully capture software improvements in post-training,[10] which is especially egregious because reasoning post-training is now all the rage among frontier models. But the sad reality is that there's not enough public post-training compute data to let us reliably work out a growth trend.[11]

One last problem I'll raise has to do with how we measure "capabilities". Typically this means looking at AI benchmarks, but models can be "bench-maxxed" to do well on benchmarks in a way that fails to generalize to real-world use cases. So if we measure software progress using only these metrics, we'd end up overstating growth rates — what if we're just getting much better at the things we can easily measure, and not so much elsewhere? My guess is that this should make us slightly more skeptical about fast progress, but not by that much, because we do see models generalizing to some extent. I wish I knew how to be more precise than that.[12]

Okay, so there are myriad problems and uncertainties with existing software progress estimates, some of them potentially quite devastating. Now what? If we could be so far off the mark, should we trust these numbers at all?

I think there's still a decent case for software progress being "several times per year". For one, the estimates above are based on different approaches and all suggest that these rapid growth rates are possible — I doubt that's just a coincidence. It's also consistent with [numbers](https://cdn.openai.com/papers/ai_and_efficiency.pdf) in computer vision reported by frontier AI companies, and with [numbers reported by Dario Amodei](https://www.darioamodei.com/post/on-deepseek-and-export-controls). What's more, we can just look at example case studies.
Here's one [example](https://x.com/EpochAIResearch/status/1953212437794021561) from last year, which suggests a growth rate of 2-5× per year:

[Figure: gpt-oss-20b does substantially better than GPT-3 on MMLU, despite using the same amount of training compute. If we look at the relationship between pre-training compute and MMLU performance among non-reasoning GPT models, we can back out a rate of algorithmic improvement from this — it works out to around 2-5× per year. (Source: https://x.com/EpochAIResearch/status/1953212437794021561)]
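For a sense of how a case study like this gets annualized, here's the back-of-envelope version. The 1,000× compute gap below is an invented stand-in, not the actual MMLU fit:

```python
# Suppose a 2025 model matches, at equal training compute, the performance
# that a 2020-era compute-performance trend says would have required ~1,000x
# more compute. (Invented numbers, purely to show the arithmetic.)
compute_ratio = 1_000.0
years_elapsed = 5.0

rate = compute_ratio ** (1 / years_elapsed)
print(f"~{rate:.1f}x per year")   # ~4.0x, inside the 2-5x range
```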
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>(<a href="https://x.com/EpochAIResearch/status/1953212437794021561">Source</a>) gpt-oss-20b does substantially better than GPT-3 on MMLU, despite using the same amount of training compute. If we look at the relationship between pre-training compute and MMLU performance among non-reasoning GPT models, we can back out a rate of algorithmic improvement from this &#8212; this works out to around 2-5&#215; per year.</em></figcaption></figure></div><p>Overall I&#8217;d say that the evidence looks pretty shoddy, so I don&#8217;t think it makes sense to <em>confidently</em> declare that &#8220;software progress is 3&#215; per year&#8221; or any particular number. But I think we have enough evidence to think that software progress might really be several times a year, and to make a best guess contextualized with a lot of uncertainty.</p><p>So here&#8217;s my best guess: after accounting for all training compute (including post-training), I think we&#8217;re seeing software progress at around 10&#215; per year, and my 80% credible interval would probably range from 2&#215; to 50&#215; per year. But again, this is the kind of thing that might change depending on the day and what I had for breakfast that morning.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://epochai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://epochai.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>III. What drives software progress? (Or, why all the estimates we just saw are misleading)</h2><p>If we think about it a bit more, these estimates seem quite fishy: have model architectures and training algorithms really changed enough to justify something like a 10&#215; per year rate of progress?</p><p>Consider <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT-2</a> and <a href="https://arxiv.org/abs/2412.19437">DeepSeek-V3</a>. These two models were released almost six years apart, which at 10&#215; per year would imply a million-fold difference in compute efficiency. That seems crazy, especially because they don&#8217;t look <em>that</em> different on an algorithmic level. 
Sure, DeepSeek-V3 is a mixture-of-experts model, uses different [variants of attention](https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture) and positional encoding schemes, and so on — but how is that supposed to explain a factor of *one million*?

I think there are two plausible factors, both of which play an important role. One is that these improvements aren't just coming from better algorithms — the bulk of them comes from better *data*. To my knowledge, the best [argument](https://www.beren.io/2025-08-02-Most-Algorithmic-Progress-is-Data-Progress/) for this position comes from Beren Millidge, who brings several pieces of evidence to the table:

- We've shifted from gnarly, uncurated web data to heavily processed and often [synthetic data](https://vintagedata.org/blog/posts/synthetic-pretraining), which is strongly targeted towards the tasks that we want models to do well on. This includes [techniques](https://arxiv.org/abs/2406.11794) like filtering data and finding the right data mix to get better performance.

- Researchers have been throwing tons of effort into getting better training data. For example, Surge AI had revenue of [over $1 billion](https://www.inc.com/brian-contreras/surge-ai-startup-bootstrapped-1-billion-revenue-venture-capital-vc/91205888) last August, and Scale AI was probably in a similar boat. I'd also add that this checks out with how OpenAI and Anthropic are spending in the [billions](https://www.theinformation.com/articles/anthropic-openai-developing-ai-co-workers?rc=spkbjw) on data-related costs — this includes [RL environments](https://epoch.ai/gradient-updates/state-of-rl-envs), which are now very much a part of the AI zeitgeist.

To my mind, the strongest point here is about synthetic data (and distillation). These are techniques that could help models reach certain capabilities with much less training compute.[13] Here's an illustrative example from an [article](https://epoch.ai/gradient-updates/three-issues-undermining-compute-based-ai-policies) I co-authored last year:

> The efficiency gains from distillation can be dramatic. [DistilBERT](https://arxiv.org/pdf/1910.01108v4), an early demonstration of this technique, preserved 97% of its teacher model's capabilities while using 40% fewer parameters and requiring merely 3% of the compute budget that went into training the original BERT model.

In other words, in this case distillation made it possible to reduce training compute by something like thirty to forty times while keeping roughly the same capabilities.[14] Synthetic data can help push beyond this — a good [example](https://arxiv.org/abs/2412.08905) that Millidge raises is the Phi series of models. So if you stack these kinds of gains, you can sort of see how this could help explain orders-of-magnitude improvements in software efficiency (see the sketch below).[15]
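Here's the stacking logic in sketch form. The distillation multiplier is backed out of the DistilBERT example above (1/0.03 ≈ 33×); the other two multipliers are numbers I made up:

```python
# Independent efficiency gains multiply, to a rough first approximation.
distillation = 1 / 0.03    # ~33x, from the DistilBERT example above
data_curation = 10.0       # invented multiplier for filtering/synthetic data
architecture = 5.0         # invented multiplier for algorithmic improvements

total = distillation * data_curation * architecture
print(f"Stacked gain: ~{total:,.0f}x")   # ~1,667x from just three factors
```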
But I'm still not totally sure, because [some](https://epoch.ai/blog/algorithmic-progress-in-language-models) [of](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) the studies I listed in the previous section deliberately (try to) exclude distilled models and synthetic data while still finding absurdly fast rates of software progress. I'm not yet totally convinced that the numbers add up.

This brings us to the second explanation for ultra-fast software progress, namely that software improvements depend on the scale of training compute. To see what I mean, let's go back to the picture we drew earlier to illustrate software progress:

[Figure]
This depicts software improvements as a parallel shift to the left. But what if it's not parallel?
What if the slope changes, for example like this?

[Figure]
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This illustrates how software progress can depend on the scale of training compute &#8212; in this case, models trained with more compute see larger efficiency gains (and also larger capability gains).</p><p>Now the line could also become less steep, and it doesn&#8217;t have to be a straight line &#8212; maybe there are some diminishing returns to log(training compute), for example. But the important thing is that the size of the compute efficiency gain could change with scale.</p><p>So how might this explain the really fast estimates of software progress? The idea is that software improvements may have made the line steeper, and since we&#8217;ve been scaling up training compute, we take advantage of these larger gains at scale. 
The most direct evidence for this is the paper [*On the Origin of Algorithmic Progress in AI*](https://arxiv.org/abs/2511.21622).[16] Here's one example where they compare the scaling of a modern Transformer architecture (purple) and an old-school LSTM (green):

[Figure]
This meshes pretty well with the stylized stick-figure diagrams that I've been showing you. The main difference is that the y-axis is flipped: a lower "validation loss" means a higher capability. We can see that the Transformer has a steeper scaling slope, and the efficiency gains grow with training compute scale. At 10^15 FLOP the efficiency gain is 6.3×, whereas at 3 × 10^16 FLOP the efficiency gain is 26×.

And these gains grow *a lot* as you scale up training compute — if you naively extrapolate the purple line to 10^23 FLOP, the total efficiency gain is over 20,000×! If you work through the numbers more carefully, this works out to an annual improvement of 2.23× per year at 10^23 FLOP scales — so it's easier to see how you can measure compute efficiency gains of several times per year.[17]
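To see roughly how a scale-dependent gain turns into a per-year rate, here's the shape of that calculation, under the strong assumption that the gain grows as a power of frontier compute. Both constants below are my own placeholders, not the paper's fitted values:

```python
# If the efficiency gain scales as compute**gamma and frontier training
# compute grows by growth_per_year, the measured software progress rate is
# growth_per_year**gamma. Both numbers below are placeholder assumptions.
gamma = 0.55             # assumed scale-dependence exponent
growth_per_year = 4.5    # assumed frontier compute growth factor per year

annual_gain = growth_per_year ** gamma
print(f"~{annual_gain:.2f}x per year")   # ~2.3x with these placeholders
```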
To my mind, these results lend quite a bit of weight toward the "scale-dependence" argument for how measured software progress has been so fast. I also think it has some intuitive appeal, because there's been evidence for a while now that some AI architectures [scale better](https://arxiv.org/abs/2106.09488) [than others](https://arxiv.org/abs/2207.10551).

But I don't think this is really a slam-dunk argument either. For starters, I don't want to over-index on one paper. The estimate of 2.23× per year is also substantially lower than my 10× per year best guess earlier, though this could change if the authors look at a broader set of innovations, or if they consider higher compute scales.[18] There are also some suspicious aspects to this study: for example, it uses scaling experiments between 10^13 FLOP and 10^18 FLOP (or under 10^17 FLOP for LSTMs) to infer efficiency improvements as far out as 10^23 FLOP — so it involves heroically extrapolating across five orders of magnitude. This is a problem because the result about scale-dependence might itself depend on which compute scale we're looking at, which is very meta.[19] Ideally, it'd be great to do something like this experiment at larger compute scales, while also experimentally accounting for the gains from better data and post-training. Both of these are things that the paper glosses over.[20]

Overall, I'd say that both data quality improvements and scale-dependence play an important role — it's not clear to me that the numbers totally add up with either approach alone. And each approach fails to explain certain observations — for example, capabilities have improved in the last year [without much training compute growth](https://epoch.ai/gradient-updates/why-gpt5-used-less-training-compute-than-gpt45-but-gpt6-probably-wont). I think this is probably better explained by data quality improvements than by scale-dependent innovations (though these aren't mutually exclusive, since better data could itself change the scaling slope).

This means that almost all existing estimates of software progress were misleading. Many efficiency gains that we thought were from algorithmic improvements actually came from better data.
And many efficiency gains that we thought came from continually inventing lots of new algorithms could've come from scaling up a tiny handful of algorithmic innovations!

In fact, this seems to be the case in practice, if you break down the 20,000× gain between the LSTM and Transformer that we saw earlier:
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Note that the <a href="https://arxiv.org/abs/2511.21622v1">paper</a> in question uses the Deep Learning Era trendline from Epoch&#8217;s &#8220;<a href="https://epoch.ai/data/ai-models">Notable AI models</a>&#8221; dataset as the frontier of training compute. This is why the &#8220;compute frontier&#8221; is close to 10<sup>23</sup> FLOP in 2024-2025.</em></figcaption></figure></div><p>Here you can see that <em>most efficiency gains came from two scale-dependent innovations</em> &#8212; (1) the shift from LSTMs to the <a href="https://arxiv.org/abs/2001.08361">Kaplan</a> <a href="https://arxiv.org/abs/2005.14165">Transformer</a>, and (2) &#8220;Chinchilla Rebalancing&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a> Other innovations seem scale-independent and also contribute very little overall &#8212; they barely shift the curve to the left and don&#8217;t change the slope. So even if we measure fast software progress, it might just be because we&#8217;re scaling quickly &#8212; it may not have much to do with making further innovations.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://epochai.substack.com/p/the-least-understood-driver-of-ai?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://epochai.substack.com/p/the-least-understood-driver-of-ai?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>IV. How this impacts the software intelligence explosion debate</h2><p>Everything we&#8217;ve seen so far bears on one of the most important debates in AI &#8212; namely, how likely is a <a href="https://www.forethought.org/research/three-types-of-intelligence-explosion">software intelligence explosion</a>? 
The idea is that once you automate AI research,[22] you'll get a rapid feedback loop: the AIs get smarter (or more numerous), their research efforts drive more AI software progress, making the AIs even smarter (or more numerous) still, and so on.[23]

This matters a lot because, if it's true, it could engender an enormous amount of AI and societal change in the span of years. What does "enormous" mean? Opinions on this vary a lot, but they often include things like [eliminating most cancers](https://www.darioamodei.com/essay/machines-of-loving-grace), [compressing a century of technological progress into a decade](https://www.forethought.org/research/preparing-for-the-intelligence-explosion), or [the extinction of humanity](https://ai-2027.com/summary). Saying that the stakes are high would be the understatement of the century.

So how do the results we've seen — about the rate of software progress and its drivers — impact this debate? The short answer is that some [existing](https://epoch.ai/blog/do-the-returns-to-software-rnd-point-towards-a-singularity) [analyses](https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments) of the software intelligence explosion are based on overly conservative estimates of software progress, and essentially all of them are based on faulty assumptions.

To see why, consider [the](https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments) [model](https://epoch.ai/epoch-after-hours/economics-of-ai) that people usually use to study the software intelligence explosion, which relates AI research effort to AI software progress.
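Concretely, prior work writes this down with a Jones-style law of motion. A common form (my paraphrase of the setup in the linked analyses, not a quote from either piece) is:

$$\frac{\dot S}{S} = \theta \, R^{\lambda} \, S^{-\beta}$$

where $S$ is software efficiency, $R$ is research input, and $\theta$, $\lambda$, $\beta$ are positive constants. The ratio $r = \lambda/\beta$ summarizes how well research effort converts into software progress: if research input grows in proportion to $S$ itself (as it would with automated AI researchers), $r > 1$ yields accelerating, explosive growth, while $r < 1$ yields progress that eventually peters out.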
The core parameter that relates these two is the &#8220;returns to AI software R&amp;D&#8221;, which tells you how well you can turn more research effort into more software progress.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a></p><p>This parameter has
a key threshold: if it&#8217;s greater than 1, then you get a software intelligence explosion. And in practice, <a href="https://epoch.ai/blog/do-the-returns-to-software-rnd-point-towards-a-singularity">literature</a> <a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments">estimates</a> suggest that might be the case (though it&#8217;s really uncertain):</p>
<div class="captioned-image-container"><figure><figcaption class="image-caption"><em>(<a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments">Source</a>) Estimates of the returns to AI software R&amp;D are very uncertain but straddle 1. This suggests that an intelligence explosion is plausible, but it&#8217;s also plausible that the feedback loop of AIs improving themselves simply fizzles out.</em></figcaption></figure></div>
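<p>To make the threshold concrete, here is a minimal simulation sketch of the feedback loop (my own toy illustration, not code from any of the analyses linked above). It assumes that each doubling of cumulative research effort multiplies compute efficiency by 2<sup>r</sup> (the definition in footnote 24), and that once AI research is automated, effort accrues at a rate proportional to current efficiency:</p><pre><code>def simulate(r, t_max=4.0, dt=0.001, cap=1e12):
    # Toy model: software efficiency S gains 2**r per doubling of
    # cumulative research effort E (so S = E**r, starting from S = E = 1),
    # while automated researchers generate effort at a rate proportional
    # to S itself. All numbers are illustrative, not calibrated estimates.
    S, E, t = 1.0, 1.0, 0.0
    while t < t_max and S < cap:
        E_next = E + S * dt        # effort accrues faster as the AIs improve
        S *= (E_next / E) ** r     # 2**r efficiency gain per doubling of E
        E, t = E_next, t + dt
    return t, S

for r in (0.7, 1.0, 1.3):
    t, S = simulate(r)
    print(f"r = {r}: efficiency {S:.3g} at t = {t:.2f}")
# r < 1: the loop fizzles into ever-slower growth; r = 1: steady
# exponential growth; r > 1: efficiency explodes in finite time
# (it hits the cap well before t_max).
</code></pre><p>The point is only qualitative: the threshold at 1 separates a feedback loop that fizzles from one that runs away.</p>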
<p>However, as <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion#you-might-need-fast-growing-computing-power-to-discover-better-algorithms">pointed out</a> by Daniel Eth and Tom Davidson, these estimates rely on an overly conservative estimate of software progress of 3&#215; per year (this is the <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">Ho et al. 2024</a> estimate that we saw earlier). I now think that software progress is closer to 10&#215; per year, so this change should increase the estimates of this parameter several-fold,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a> making a software intelligence explosion seem much more likely.</p><p>But there&#8217;s also a good reason to think that these are <em>drastic overestimates</em>. That&#8217;s because existing models of the software intelligence explosion don&#8217;t account for how a lot of software progress could come from scaling training compute. Recall that most of the observed efficiency gains between LSTMs and modern Transformers came from scaling up training compute for just two innovations, rather than from humans discovering tons of new ones.</p><p>Instead, existing models either assume that research effort is the <a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments">only</a> <a href="https://epoch.ai/blog/do-the-returns-to-software-rnd-point-towards-a-singularity">thing</a> that matters, or assume that it depends on <a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments#:~:text=Proxy%202%3A%20Some%20function%20of%20cognitive%20labor%20and%20experimental%20compute">a</a> <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion#you-might-need-fast-growing-computing-power-to-discover-better-algorithms">mix</a> <a href="https://www.aifuturesmodel.com/#section-experimentthroughput">of</a> research effort and the compute used to run experiments.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a> But to my knowledge, none explicitly account for training compute scaling being a <em>source</em> of software progress, so they could heavily overstate the importance of research effort.</p>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f16ae81-e068-4067-98b4-d316a164b65a_1500x858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DAM9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f16ae81-e068-4067-98b4-d316a164b65a_1500x858.png 424w, https://substackcdn.com/image/fetch/$s_!DAM9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f16ae81-e068-4067-98b4-d316a164b65a_1500x858.png 848w, https://substackcdn.com/image/fetch/$s_!DAM9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f16ae81-e068-4067-98b4-d316a164b65a_1500x858.png 1272w, https://substackcdn.com/image/fetch/$s_!DAM9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f16ae81-e068-4067-98b4-d316a164b65a_1500x858.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In effect, this means that existing estimates overstate the returns to software R&amp;D, and makes the software intelligence explosion seem much less likely. One way to think of this is as a &#8220;compute bottleneck&#8221; &#8212; it&#8217;s hard to make super fast software progress without also scaling up training compute.</p><p>There are some reasons to expect that this training compute bottleneck can be overcome. For example, in principle there may be innovations that <em>really</em> boost performance at the same training compute, hence driving a lot of software progress. 
This could look like software innovations that scale really well, even if they don&#8217;t look great at small scales.</p>
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Alternatively, you could in principle find a ton of scale-independent innovations with big capability gains:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kqLZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kqLZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 424w, https://substackcdn.com/image/fetch/$s_!kqLZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 848w, https://substackcdn.com/image/fetch/$s_!kqLZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 1272w, https://substackcdn.com/image/fetch/$s_!kqLZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kqLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png" width="1024" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!kqLZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 424w, https://substackcdn.com/image/fetch/$s_!kqLZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 848w, https://substackcdn.com/image/fetch/$s_!kqLZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 1272w, https://substackcdn.com/image/fetch/$s_!kqLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167a2aac-d92e-4f07-918b-b46690a8f2d9_1024x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And of course you could get huge capability gains from a combination of scale-dependent and scale-independent improvements. If doing these things is easy, then you could skyrocket capabilities without increasing training compute very much.</p><p>My personal inclination is to say that these things are probably quite hard &#8212; otherwise, why haven&#8217;t we found these innovations already? But I can&#8217;t theoretically rule it out, and you could always say &#8220;maybe <em>you</em> can&#8217;t find these innovations, but if you had a country of geniuses in a data center, they certainly would! Not to mention what would happen if AGI is as smart as Von Neumann+++&#8230;&#8221;.</p><p>Moreover, the strength of training compute bottlenecks depends on the &#8220;scale-dependence&#8221; explanation for rapid AI software progress. But it may apply less well to software progress that&#8217;s driven by data quality improvements &#8212; perhaps you could have a software intelligence explosion just by building better RL environments! 
I don&#8217;t know to what extent this is possible, and in principle data quality innovations could also be scale-dependent, but I think it&#8217;s still a very important consideration.</p><p>There&#8217;s a lot more to the debate about the software intelligence explosion and its bottlenecks than I can get into here. But overall I think the upshot is this: some prior work on the software intelligence explosion is based on outdated estimates of software progress, and correcting these makes an &#8220;explosion&#8221; seem more likely. On the other hand, essentially all existing work is also based on a heavily flawed model, and correcting that makes the &#8220;explosion&#8221; seem less likely. Personally I now think the software intelligence explosion is less likely than before looking into this, though I also think that the bottlenecks aren&#8217;t strong enough to preclude it altogether.</p><h2>V. Conclusion</h2><p>Let&#8217;s recap all the main things that we&#8217;ve seen:</p><ul><li><p>Software improvements are about reducing the training compute you need to get to the same capability, or doing more with the training compute you have.</p></li><li><p>This progress is very fast, with compute efficiency improving several times per year. But these estimates have gigantic error bars, due to data limitations and statistical assumptions.</p></li><li><p>The numbers might also be somewhat misleading. Much of the measured progress is plausibly driven by data quality improvements rather than algorithmic innovations. And it&#8217;s likely that a large share of measured efficiency gains reflects scaling up a small handful of scale-dependent innovations, rather than the continual discovery of new algorithms.</p></li><li><p>If most efficiency improvements came from a small handful of scale-dependent innovations, then existing models of the software intelligence explosion may be flawed. Correcting for underestimated software progress makes an explosion seem more likely, but accounting for training compute bottlenecks from scale-dependent innovations pushes in the other direction.</p></li></ul><p>Given all of this background, we finally have enough ammunition to answer the three questions I raised at the start of the essay:</p><blockquote><p><strong>Q: How did DeepSeek seem to catch up to OpenAI&#8217;s o1 within months, despite being trained with less compute?</strong></p><p>A: If software really improves several times a year, it&#8217;s less of a mystery why DeepSeek was able to get close to o1 after a few months, <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">especially if there was a lot of distillation involved</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-27" href="#footnote-27" target="_self">27</a> Peter Wildeford describes this in more detail <a href="https://peterwildeford.substack.com/p/ten-takes-on-deepseek">here</a>.</p><p><strong>Q: When will the world develop AGI?</strong></p><p>A: No one knows for sure, but it&#8217;s likely that software progress matters tremendously. For example, one of the most prominent models of AGI timelines is Ajeya Cotra&#8217;s <a href="https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines">bioanchors model</a>, which originally had a median timeline of around 2050.
But as Scott Alexander <a href="https://www.astralcodexten.com/p/what-happened-with-bio-anchors">points out</a>, this may have been over a decade too long because Cotra underestimated AI software progress (Cotra&#8217;s most recent estimate is <a href="https://80000hours.org/podcast/episodes/ajeya-cotra-transformative-ai-crunch-time/#:~:text=Ajeya%E2%80%99s%20timeline%3A%20AI%20R%26D%20automation%20in%20the%20early%202030s%2C%20then%20things%20move%20fast">probably somewhere in the 2030s</a>).</p><p><strong>Q: If we automate AI research, will AI progress accelerate like crazy a la <a href="https://situational-awareness.ai/">Situational Awareness</a> and <a href="https://ai-2027.com/">AI 2027</a>?</strong></p><p>A: The answer is essentially Section IV. That is, nobody really knows, but you can form your opinion on it by answering two subquestions: (1) &#8220;What are the returns to AI software R&amp;D?&#8221;, and (2) &#8220;Will compute be a major bottleneck to the software intelligence explosion?&#8221;. I lean towards the answer being &#8220;no&#8221; because I&#8217;d guess compute bottlenecks will be strong, but I think there&#8217;s quite a lot of room for disagreement.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-28" href="#footnote-28" target="_self">28</a></p></blockquote><p>I hope these questions also help reiterate why AI software progress is such a big deal. These are foundational questions about competition between AI firms, and what happens to the world if we develop what would be the most influential technology in history. And the answers to them can be enormously swayed by the speed and nature of software progress &#8212; the problem is that we understand it so poorly that we can&#8217;t narrow down our uncertainty.</p><p>So what should we do? As a starting point, I want people to truly appreciate just how important software progress is. After that, we can try to do the best we can to understand it better. Perhaps that means running ablation experiments at larger compute scales, seeing what happens to software progress if researchers are compute-bottlenecked, or estimating software progress in post-training. And maybe some AI researchers already have good answers to some of this stuff, what do I know?</p><p>I think it&#8217;s going to be very hard to reduce our uncertainty, but at least in my mind, this topic is important enough that it&#8217;s worth giving it a shot.</p><div><hr></div>
<p><em>I&#8217;d like to thank Aaron Scher, JS Denain, Josh You, Luke Emberson, Jaime Sevilla, David Owen, Parker Whitfill, Hans Gundlach, Brendan Halstead, Joshua Turner, Eli Lifland, Alex Fogelson, and Lynette Bye for helpful feedback and support. Special thanks to Aaron for inspiring me to write this post in the first place.</em></p><h2>Appendix: Estimates of software progress</h2><p>The table below shows all the estimates used to generate the bar chart at the beginning of section II, as well as some additional estimates from post-training.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-29" href="#footnote-29" target="_self">29</a></p><div class="captioned-image-container"><figure><figcaption class="image-caption"><em>[Table image: the full list of software progress estimates, including the additional post-training estimates.]</em></figcaption></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!BxFp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfca31e-cb7a-4e06-80a8-955e63c5447e_1438x1504.png 424w, https://substackcdn.com/image/fetch/$s_!BxFp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfca31e-cb7a-4e06-80a8-955e63c5447e_1438x1504.png 848w, https://substackcdn.com/image/fetch/$s_!BxFp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfca31e-cb7a-4e06-80a8-955e63c5447e_1438x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!BxFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfca31e-cb7a-4e06-80a8-955e63c5447e_1438x1504.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Sometimes people call this &#8220;<a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">algorithmic progress</a>&#8221; but I want to account for the possibility that these improvements are due to better data.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There are other ways to define &#8220;software progress&#8221; besides looking at training compute, which I&#8217;ll get to later in the post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Here I&#8217;m just using &#8220;capabilities&#8221; as an abstract concept to illustrate what I mean in a simple way. 
But if you want something more concrete, you can consider the <a href="https://epoch.ai/benchmarks/eci">Epoch Capabilities Index</a>, or log(<a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">METR time horizon</a>). You could also consider something like benchmark performance, but then the assumption of a straight-line relationship doesn&#8217;t make as much sense.</p><p>Another note: in principle it&#8217;s also possible that some capabilities can&#8217;t be accessed with infinite training compute &#8212; this is just one of the (many) assumptions of this simplified picture.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This definition of effective compute breaks down somewhat if you include distilled models &#8212; that&#8217;s because the compute efficiency gain largely comes from being able to use data from an existing larger model. In particular, if you&#8217;ve only trained models up to 10<sup>25</sup> FLOP, distillation would shift the curve left, but it doesn&#8217;t allow you to access higher capabilities with the same 10<sup>25</sup> FLOP.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I&#8217;ll primarily focus on LLMs for this post, because these are the most prominent AI systems (e.g. ChatGPT probably has close to <a href="https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption">1 billion weekly active users</a> by now), and are the main kind of model pushing towards &#8220;AGI&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Strictly speaking, <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">Ho et al. 2024</a> uses data on parameter counts and dataset sizes rather than training compute (which is related to the product of the two).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>It&#8217;s also hard to get data about counterfactuals just from observational data &#8212; for example, it doesn&#8217;t really tell us what happens if we scale up GPT-4 by ten times, using the techniques available at the time.
We could try to figure this out by running experiments, but that&#8217;s a different approach altogether.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Specifically we consider a narrow <em>range</em> of capabilities rather than a single capability level, otherwise we wouldn&#8217;t find any models at all!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>The 30,000&#215; number comes from the &#8220;Results table&#8221; in the <a href="https://www.lesswrong.com/posts/yXLqrpfFwBW5knpgc/catch-up-algorithmic-progress-might-actually-be-60-per-year">post</a>, with a capability threshold of 45. The 5<sup>12</sup>&#215; per year improvement arises due to the following problem: GPT-5 was released about a month after Grok 4 and used around 5&#215; less training compute, while achieving similar performance. Compounding a 5&#215; monthly gain over twelve months gives 5<sup>12</sup>&#215; (roughly 2.4 &#215; 10<sup>8</sup>&#215;) per year. But it&#8217;s not clear that this actually reflects genuine software progress.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://arxiv.org/abs/2511.19492">Whitfill, Snodin, and Becker 2025</a>, <a href="https://arxiv.org/abs/2512.00193">Ho et al. 2025</a>, and <a href="https://www.lesswrong.com/posts/yXLqrpfFwBW5knpgc/catch-up-algorithmic-progress-might-actually-be-60-per-year">Scher 2025</a> include some gains from post-training, whereas <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">Ho et al. 2024</a> explicitly only considers pre-training. In practice, all papers may reflect pre-training to a large extent, because there is little public data on post-training compute, which is needed to properly estimate post-training software progress.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>The main estimate I know of is one from Anthropic&#8217;s <a href="https://www-cdn.anthropic.com/872c653b2d0501d6ab44cf87f43e1dc4853e4d37.pdf#page=20.43">Responsible Scaling Policy</a>, which incidentally is also 3&#215; per year. Sadly this number doesn&#8217;t look like it&#8217;s meant to be taken seriously &#8212; Anthropic describes it as an &#8220;informal estimate&#8221;, and gives no reasoning for it.</p><p>There is some other work that looks at the gains from individual post-training innovations rather than a trend. For example, one <a href="https://epoch.ai/blog/ai-capabilities-can-be-significantly-improved-without-expensive-retraining">analysis</a> looks at the efficiency gain from a range of post-training innovations, and finds that they are usually between 5&#215; and 30&#215; on relevant benchmarks. In some cases, the software efficiency gain can even go up to 100&#215;!
But I think this is likely very specific to the task, and is also outdated, since the paper was written before reasoning models were a thing.</p><p>More recently, Arden Berg and I <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models">looked</a> specifically at the software efficiency gain from early reasoning models, such as o1 and DeepSeek-R1. After some detective work, we estimated that you&#8217;d need around 10&#215; more pre-training compute to match the same performance without RL training, at least on benchmarks like GPQA diamond and MATH Level 5. In practice this will depend on how much RL training you do, the amount of inference scaling, and the domain, and in any case it doesn&#8217;t give us a trend in software efficiency over time.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Incidentally, this is one of the reasons that Ajeya Cotra shifts her estimate of software progress downward in her <a href="https://niplav.site/doc/cs/ai/alignment/policy/forecasting/forecasting_tai_cotra_2020.pdf#page=125.55">bioanchors report</a>, and Tom Davidson employs a similar argument in his work on <a href="https://coefficientgiving.org/research/what-a-compute-centric-framework-says-about-takeoff-speeds/#:~:text=There%E2%80%99s%20massive%20uncertainty,speed%20up%20takeoff.">takeoff speeds</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>This depends on how exactly we <a href="https://epoch.ai/gradient-updates/three-issues-undermining-compute-based-ai-policies">measure &#8220;training compute&#8221;</a>. For example, both model distillation and synthetic data generation often rely on using a larger model to help develop a smaller one. But typically when we count training compute, we focus only on the small model, which neglects the compute needed to train the large model. When we use these (understated) training compute estimates to measure software progress, we pick up a big efficiency gain.</p></div></div>
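<p>A toy version of this accounting problem, with numbers that are entirely made up for illustration: if the distilled model is credited only with its own training compute, the teacher&#8217;s cost silently drops out of the efficiency calculation.</p><pre><code>teacher_flop = 1e26   # hypothetical compute for the large teacher model
student_flop = 5e24   # hypothetical compute for the distilled student
true_cost = teacher_flop + student_flop

# Naive accounting credits the student with student_flop alone, so the
# measured "efficiency gain" is inflated by the ratio of true cost to
# counted cost:
print(f"compute understated by {true_cost / student_flop:.0f}x")  # 21x
</code></pre>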
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>This is probably less relevant for frontier models, though I&#8217;m not sure that&#8217;s universally true. For example, I think there&#8217;s some chance that o3 was in part distilled from GPT-4.5.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Interestingly enough, my coauthors and I <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">tried decomposing software progress in language models</a> and found that most of it comes from data efficiency improvements (as opposed to parameter efficiency), which is consistent with this hypothesis. But I think this is pretty weak evidence because the confidence intervals are super wide, and sometimes the fit would suggest most gains were from parameter efficiency instead, after making small tweaks to the optimization setup. The overall compute efficiency result seemed a lot more stable though.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Another example that shows the scale-dependence of software progress in RL is Figure 2 in <a href="https://arxiv.org/pdf/2510.13786">Khatri et al. 2025</a>, which shows how improvements to RL algorithms change the slope of training curves.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Unfortunately, existing estimates just don&#8217;t really account for this. The &#8220;fancy statistics&#8221; approaches tend to assume away scale dependence, and the &#8220;lines on a graph&#8221; approaches don&#8217;t have enough high-quality data to reliably identify it. I only know of <a href="https://arxiv.org/abs/2505.04075">two</a> <a href="https://www.lesswrong.com/posts/tJAD2LG9uweeEfjwq/estimating-efficiency-improvements-in-llm-pre-training">exceptions</a> to this, which try to account for efficiency contributions in a very granular way.</p><p>Note that the &#8220;lines on a graph&#8221; approach could try to measure this in principle, where you would see faster software progress at larger scales than at smaller scales. But besides being limited by noisy data, there&#8217;s an additional complication: at smaller scales you can run more experiments, and that pushes towards faster progress at small scales. This is perhaps part of the reason why <a href="https://x.com/JasonBotterill/status/2017818802457874786">Claude Sonnet models can end up better than Opus</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>My current impression is that there&#8217;s no clean way to directly compare the numbers from existing estimates (which ignore scale-dependence) and the numbers from this paper &#8212; they&#8217;re measuring somewhat different things. The paper instead proposes a plausible justification for several-times-per-year estimates of software progress.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>Having spoken with the first author of the paper (Hans Gundlach), my impression is that the project was pretty compute-constrained, hence the small-scale experiments.
Somebody give them more compute!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Specifically, the paper doesn&#8217;t account for post-training improvements, and it uses a literature estimate for data improvements (rather than accounting for this experimentally).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>The latter refers to the shift from using <a href="https://arxiv.org/abs/2001.08361">Kaplan scaling laws</a> to <a href="https://arxiv.org/abs/2203.15556">Chinchilla scaling laws</a> in Transformers &#8212; Chinchilla uses a different ratio of training data and parameters to get higher performance with the same compute.</p><p>Interestingly, the estimate shown in the figure (from <a href="https://arxiv.org/abs/2511.21622">Gundlach et al. 2025</a>) implies a higher compute efficiency gain from the shift from Kaplan to Chinchilla scaling laws than <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">Ho et al. 2024</a> does. I suspect this is because these were estimated in different ways &#8212; the former estimates the gain using the loss function / scaling law reported in <a href="https://arxiv.org/abs/2001.08361">Kaplan et al. 2020</a>, whereas the latter uses the <em>stated recommendations</em> for how to allocate training compute across parameters and data. In principle these should be the same, but perhaps there&#8217;s a difference between them in practice.</p></div></div>
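<p>For a concrete sense of what the rebalancing changes, here is a standard back-of-the-envelope calculation (my own sketch, using the common approximations C &#8776; 6ND and the Chinchilla rule of thumb D &#8776; 20N; it is not taken from either paper):</p><pre><code>def chinchilla_allocation(C):
    # Compute-optimal split under C = 6*N*D with D = 20*N, so C = 120*N**2.
    N = (C / 120) ** 0.5   # parameters
    D = 20 * N             # training tokens
    return N, D

N, D = chinchilla_allocation(1e24)  # a 1e24 FLOP training run
print(f"~{N:.1e} parameters, ~{D:.1e} tokens")  # ~9.1e+10 parameters, ~1.8e+12 tokens
</code></pre><p>Kaplan-style recommendations put far more of the same compute into parameters and correspondingly less into data, which is why rebalancing alone bought a sizable efficiency gain.</p>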
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>I think the proponents of this view also think that AI will lead to substantial speedups well before AI research is fully automated, but the real hyperspeed feedback loop comes when AI can fully do AI R&amp;D, removing the last human bottlenecks in the chain.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>Note that the case of more numerous AIs involves improvements in inference efficiency, which is separate from the training efficiency estimates I discussed earlier.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>Technically speaking, this parameter tells us how compute efficiency changes when you double cumulative research effort. If doubling cumulative research effort also doubles compute efficiency, then the returns to R&amp;D are 1. If it quadruples, then the returns are 2. In general, if the output increases by 2<sup>r</sup>, then the returns are r.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>One caveat is that I&#8217;m uncertain what the rate of software progress was prior to the recent rise in post-training. I think it&#8217;s possible that progress has sped up in recent years.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>Why include experiment compute at all? The idea is that if you don&#8217;t scale up your innovations, you don&#8217;t know how they&#8217;re going to work, so you need <a href="https://epoch.ai/data-insights/openai-compute-spend">a lot of compute</a> to run large-scale experiments to test this out. This would thus pose a big compute bottleneck, and it seems to be backed up by claims from frontier lab AI researchers. For example, Alex Paino (who helped <a href="https://www.youtube.com/watch?v=6nJZopACRuQ&amp;t=1223s">pre-train GPT-4.5</a>) describes how lots of software innovations &#8220;look good at small scale, but don&#8217;t look good at large scale.&#8221; Relatedly, some innovations like <a href="https://www.forethought.org/research/could-advanced-ai-accelerate-the-pace-of-ai-progress-interviews-with-ai#:~:text=For%20example%2C%20Reinforcement%20Learning%20from%20Human%20Feedback%20may%20not%20improve%20GPT%2D2%20or%20GPT%2D1%2C%20even%20though%20it%20works%20with%20GPT%2D3%20or%20GPT%2D4.">RLHF</a> and <a href="https://www.latent.space/p/noam-brown#:~:text=happened%20earlier%2C%20but-,if%20you%20try%20to%20do%20the%20reasoning%20paradigm%20on%20top%20of%20GPT%2D2%2C%20I%20don%27t%20think%20it%20would%20have%20gotten%20you%20almost%20anything,-%E2%80%A6%20if%20you%20ask">reasoning training</a> plausibly would&#8217;ve been a waste of time with small models like GPT-2, but likely become very important at the scales of GPT-4 and GPT-5. The ideas for both of these could&#8217;ve come about (<a href="https://arxiv.org/abs/1706.03741">or actually did come about</a>) at smaller scales, but scaling up was essential to seeing if they would actually work.</p><p>But note that some argue that experiment compute bottlenecks haven&#8217;t actually been that strong historically, <a href="https://arxiv.org/abs/2507.10618">if you look at the numbers</a>. Moreover, you might expect AI systems to substantially reduce experiment compute costs because they have much better &#8220;<a href="https://blog.ai-futures.org/p/ai-futures-model-dec-2025-update#:~:text=Davidson%20and%20Houlden%20focuses,improvements%20in%20research%20taste.">research taste</a>&#8221;, further making it more likely that you&#8217;ll find innovations that scale super well with training compute.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-27" href="#footnote-anchor-27" class="footnote-number" contenteditable="false" target="_self">27</a><div class="footnote-content"><p>Though catch-up between US and Chinese models isn&#8217;t all about distillation either, as Nathan Lambert recently <a href="https://www.interconnects.ai/p/how-much-does-distillation-really">argued</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-28" href="#footnote-anchor-28" class="footnote-number" contenteditable="false" target="_self">28</a><div class="footnote-content"><p>It also depends on what &#8220;accelerate like crazy&#8221; means &#8212; I think AI progress would definitely speed up, but I&#8217;m more skeptical of something like &#8220;5 orders of magnitude of software progress in a year&#8221;.
I&#8217;d probably still give it something like a 15% chance though!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-29" href="#footnote-anchor-29" class="footnote-number" contenteditable="false" target="_self">29</a><div class="footnote-content"><p>Note that Parker Whitfill noticed a bug in the code for <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">Ho et al. 2024</a>, although this didn&#8217;t change the numbers very much. Moreover, the reported confidence interval for this study was recalculated using updated data, so it&#8217;s wider than the one shown in the original paper (although the original central estimate remains the same).</p></div></div>]]></content:encoded></item><item><title><![CDATA[How persistent is the inference cost burden?]]></title><description><![CDATA[RL scaling might be better than it looks, and inference costs to reach a capability level fall fast]]></description><link>https://epochai.substack.com/p/how-persistent-is-the-inference-cost</link><guid isPermaLink="false">https://epochai.substack.com/p/how-persistent-is-the-inference-cost</guid><dc:creator><![CDATA[JSD]]></dc:creator><pubDate>Mon, 16 Feb 2026 16:47:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CPij!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb5830e-a8be-471c-a4a5-5f41ac2bc602_1600x680.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/how-persistent-is-the-inference-cost-burden">Epoch AI</a>.</em></p><div><hr></div><p>Toby Ord has written a <a href="https://www.tobyord.com/writing/how-well-does-rl-scale">thoughtful post</a> on how RL and inference compute scale for frontier AI models.</p><p>As I understand it, the core of his argument is:</p><ul><li><p>(1) RL scaling primarily bears fruit by enabling models to productively use longer outputs, which means you need to scale inference compute to realize the gains</p></li><li><p>(2) RL scaling itself delivers poor returns, requiring roughly 10,000x more compute to match what 100x more inference provides.</p></li></ul><p>Combined with the fact that inference costs are per-use and can&#8217;t be amortized like training costs, this paints a picture of a significant and persistent economic burden as we shift away from pretraining scaling.</p><p>There&#8217;s a lot I agree with in Toby&#8217;s analysis, and I find the framing useful. However, I think both claims above may be overstated. On (1): even though inference costs are per-use, the dollar cost to reach a given capability level falls rapidly over time, so the burden might not be so persistent. On (2): the RL scaling data is thin, and there&#8217;s likely been substantial compute efficiency progress in RL since o1 and o3.</p><h2>I. 
What I agree with</h2><p>I agree with Toby that for tasks that models can solve at all by increasing inference compute, it frequently seems more efficient to scale inference compute than to increase RL compute.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Scaling reinforcement learning also makes it possible to perform new tasks the model could not complete before with any amount of test-time compute (for example, many <a href="https://evaluations.metr.org/gpt-5-1-codex-max-report/#:~:text=Comparison%20of%20time%20horizon%20achieved%20by%20tokens%20used%20for%20different%20models">METR tasks</a>). However, as Toby <a href="https://www.lesswrong.com/posts/xpj6KhDM9bJybdnEe/how-well-does-rl-scale?commentId=hxrE5JmLwjbe5qXvF">argues</a>, RL unlocks these new tasks in large part by allowing the model to productively use longer outputs (chain of thought, tool calls, or writing) when completing them. Often, these harder tasks inherently require more steps to complete, and just scaling reinforcement learning does not particularly seem to make more steps happen within individual forward passes (see also this related <a href="https://blog.redwoodresearch.org/p/distinguish-between-inference-scaling">post</a> by Ryan Greenblatt).</p><p>Thus, to solve these new, harder tasks with your newly RL-scaled model, you have to pay a large inference cost &#8212; a variable cost that, unlike training costs, is not amortized by serving more users.</p><h2>II. Fixed-capability costs fall fast</h2><p>Toby <a href="https://www.tobyord.com/writing/how-well-does-rl-scale#:~:text=But%20perhaps%20more,into%20more%20intelligence.">suggests</a> that with RL scaling approaching its limits, inference scaling is all that remains<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, and inference scaling gives models more time to think rather than making them smarter per step. He argues this means the compute scaling that has driven recent AI progress will substantially weaken, while also imposing persistent per-use costs.</p><p>I agree that per-use costs can be significant in the short run<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, but I&#8217;m not convinced they&#8217;re persistent, since the price of inference to reach a given capability level drops rapidly. In a related <a href="https://www.tobyord.com/writing/hourly-costs-for-ai-agents">post</a>, Toby looks at the cost of running models on the METR suite, and how that has changed as time horizons have increased. However, all the models considered in this analysis were state-of-the-art at release, so it does not account for how much cheaper inference can become.</p><p>First, individual forward passes can become less expensive over time. Improvements to model training mean that smaller models can match the capability of earlier larger ones. Distillation is a salient example: once a frontier model can generate high-quality reasoning traces, you can train smaller, cheaper models to imitate those traces. On top of that, a steady stream of algorithmic improvements in inference efficiency (e.g. 
<a href="https://arxiv.org/abs/2211.17192">speculative decoding</a>, <a href="https://docs.vllm.ai/en/latest/design/paged_attention/">paged attention</a>, <a href="https://arxiv.org/abs/2512.02556">sparse</a> <a href="https://arxiv.org/abs/2502.11089">attention</a>, <a href="https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html">offloading</a> or <a href="https://github.com/NVIDIA/kvpress">compressing</a> the KV-cache, etc.) all reduce the cost per token.</p><p>At the same time, over time models can reach the same level of capabilities while needing to do fewer forward passes. Models can be trained to reason more concisely: Anthropic, for instance, <a href="https://newsletter.semianalysis.com/p/deepseek-debrief-128-days-later#:~:text=Claude%20has%20the%20lowest%20amount%20of%20total%20output%20tokens%20for%20leading%20reasoning%20models%20and%20showed%20an%20impressive%20improvement%20over%20Claude%203.7%20Sonnet.%C2%A0%C2%A0">substantially reduced</a> the verbosity of Claude&#8217;s reasoning between Claude Sonnet 3.7 and Claude Sonnet 4<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>Hardware improvements also help: the cost per FLOP <a href="https://epoch.ai/data/machine-learning-hardware">decreases</a> with each new generation of GPU. Combined with the algorithmic gains above, we can see the net effect on concrete tasks. On <a href="https://epoch.ai/benchmarks/frontiermath">FrontierMath</a>, reaching roughly 27% accuracy required about 43 million output tokens with o4-mini with high reasoning effort in April 2025, but only about 5 million tokens with GPT-5.2 with low reasoning effort in December 2025. Even accounting for the difference in price per output token, that&#8217;s roughly a 3x cost reduction over eight months. Overall, the trend seems to be very roughly a <a href="https://arxiv.org/abs/2511.23455">5&#8211;10x cost reduction per year</a> for reaching a given capability level<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. Concretely, suppose the first AI that can <a href="https://epochai.substack.com/i/187149776/task-1-replicating-an-interactive-web-interface-for-an-economic-model">perfectly replicate the web interface for a complex economic model</a> can only do so for $50,000. While that is a high initial price, if the most aggressive cost reductions hold, it would become $5,000 a year later, and $500 after two years.</p><p>This changes the economic picture: even if it is initially costly for a user to reach a given capability level, this gets much cheaper over time. Nevertheless, this decrease might slow down in the future. There is likely a <a href="https://x.com/karpathy/status/1938626382248149433">number of parameters</a> below which models are simply too small to have strong general agentic capabilities regardless of how they are distilled. And heavily distilled models do seem to be more brittle overall. The fast cost reductions we observe might partly be an artifact of measuring capabilities through benchmarks, where distilled models tend to perform disproportionately well. That said, even with these caveats, 5&#8211;10x per year is a very fast trend. Overall, this makes me more interested in tracking the capabilities of smaller models and trends in falling inference costs: stay tuned as we will have more rigorous analysis on this soon!</p><h2>III. 
<h2>III. The returns to RL scaling might be higher</h2><p>Toby&#8217;s core estimate comes from comparing the two panels in OpenAI&#8217;s original o1 announcement chart. Both panels show performance on AIME with a logarithmic x-axis spanning roughly two orders of magnitude (100x). The left panel shows RL training scaling, while the right panel shows inference scaling. Toby observes that the slope of the RL panel is roughly half that of the inference panel: 100x more inference takes you from around 20% to around 80%, while 100x more RL training only takes you from around 33% to around 66%. At that rate, matching the jump from 20% to 80% would require 10,000x more RL compute. Toby then checks this against the o3 training curve and the o1-to-o3 and o3-to-GPT-5 comparisons, which are broadly consistent.</p>
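<p><em>Made explicit, the back-of-the-envelope behind that 10,000x figure looks like this (a sketch using the rounded numbers above; the exact slopes aren&#8217;t public):</em></p><pre><code>inference_gain = 80 - 20  # percentage points per 100x (two OOMs) of inference compute
rl_gain = 66 - 33         # percentage points per 100x of RL training compute

# Orders of magnitude of RL compute needed to match the 60-point inference jump
ooms_needed = 2 * inference_gain / rl_gain
print(f"{10 ** ooms_needed:,.0f}x")  # ~4,300x; an exactly halved slope gives 10**4 = 10,000x
</code></pre>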
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fb5830e-a8be-471c-a4a5-5f41ac2bc602_1600x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CPij!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb5830e-a8be-471c-a4a5-5f41ac2bc602_1600x680.png 424w, https://substackcdn.com/image/fetch/$s_!CPij!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb5830e-a8be-471c-a4a5-5f41ac2bc602_1600x680.png 848w, https://substackcdn.com/image/fetch/$s_!CPij!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb5830e-a8be-471c-a4a5-5f41ac2bc602_1600x680.png 1272w, https://substackcdn.com/image/fetch/$s_!CPij!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb5830e-a8be-471c-a4a5-5f41ac2bc602_1600x680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It also makes sense to me that OpenAI wouldn&#8217;t have been laser-focused on maximizing RL compute efficiency in the early days of o1 and o3. RL compute was still a small fraction of total training cost, so the returns were good enough without needing to optimize compute aggressively. Since then, there&#8217;s likely been considerable algorithmic progress (in large part through better and more diverse <a href="https://epoch.ai/gradient-updates/state-of-rl-envs">RL tasks and environments</a>), which likely improved the slope of RL scaling. 
Much of the effort since o1 and o3 has also gone into applying RL to new domains beyond math and coding, rather than pushing RL as far as possible within those two domains.</p><h2>Conclusion</h2><p>Toby's discussion of RL scaling versus inference scaling is useful, and the core observation that RL gains come largely with longer chains of thought is well-taken. But the picture he paints may overstate how much of a bottleneck this will be for AI progress. The cost to reach a given capability level falls fast, so the inference cost burden is more transient than it might appear from looking only at frontier models at launch. And the RL scaling data we have is still thin, so I would treat the 10,000x figure with a grain of salt. This has made me more interested in studying how quickly cheaper models catch up to frontier capability levels, and how inference costs for a fixed task decrease over time.</p><p><em>Thanks to Toby Ord, Ryan Greenblatt, Greg Burnham, Anson Ho, David Owen, Tao Lin and Josh You for helpful comments.</em></p><div><hr></div>
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In fact, the returns to inference compute scaling might be even better than Toby mentions, since his estimates focus on increasing reasoning effort but mostly do not account for other ways of increasing inference compute, such as best-of-n sampling.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>At least, unless or until there is a <a href="https://forum.effectivealtruism.org/posts/JAcueP8Dh6db6knBK/the-scaling-series-discussion-thread-with-toby-ord?commentId=eHEfpwDgbkCnv3gzP#:~:text=It%20is%20possible%20that%20there%20will%20always%20be%20a%20new%20paradigm%20of%20compute%20scaling">new paradigm of compute scaling</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>As Ryan Greenblatt <a href="https://blog.redwoodresearch.org/p/distinguish-between-inference-scaling">points out</a>, even for these initial costs, it matters how they compare to the cost of humans solving the task.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Relatedly, single-forward-pass reasoning capabilities appear to be<a href="https://blog.redwoodresearch.org/p/measuring-no-cot-math-time-horizon"> increasing exponentially</a> (thanks to Greg Burnham for pointing this out). This also suggests that tasks that once required long chains of thought could be handled with far fewer tokens. 
Note that this trend does not seem to be driven by RL itself, and might trade off with the previous trend, to the extent that it requires larger models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Our<a href="https://epoch.ai/data-insights/llm-inference-price-trends"> earlier data insight</a> on inference price trends isn&#8217;t ideal evidence here, since it doesn&#8217;t account for reasoning models or tasks where longer outputs provide genuine value, and is probably unusually vulnerable to benchmark climbing/contamination.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How close is AI to taking my job?]]></title><description><![CDATA[Beyond benchmarks as leading indicators for task automation]]></description><link>https://epochai.substack.com/p/how-close-is-ai-to-taking-my-job</link><guid isPermaLink="false">https://epochai.substack.com/p/how-close-is-ai-to-taking-my-job</guid><dc:creator><![CDATA[Anson Ho]]></dc:creator><pubDate>Sat, 07 Feb 2026 02:54:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eaa93762-5658-409a-b3ff-430f7549ab8e_2454x992.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job">Epoch AI</a>.</em></p><div><hr></div><h2>I. Searching under the streetlight</h2><p>How can we anticipate when AI will be able to do our jobs? AI researchers have mainly tried to answer this question by building complex AI benchmarks. The problem is that this approach is fundamentally flawed.</p><p>A good example of this is OpenAI&#8217;s <a href="https://openai.com/index/gdpval/">GDPval</a>. On paper, it&#8217;s a cool benchmark that captures AI performance on a wide range of real-world job tasks in the US economy. The benchmark tasks were meticulously constructed to be realistic, involving the hard work of hundreds of experts and likely millions of dollars &#8212; placing it among the most expensive economics papers of all time.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> If there&#8217;s one benchmark that could be the leading indicator of AI job automation, it&#8217;s GDPval.</p><p>Unfortunately, the benchmark seems to have fallen prey to the same issue plaguing most other benchmarks. Shortly after release, AI models beat the human baseline &#8212; GPT-5.2 reached <a href="https://evals.openai.com/gdpval/leaderboard">parity</a> with industry experts, and Claude Opus 4.6 likely does <a href="https://www.anthropic.com/news/claude-opus-4-6">even better</a>. And yet, the actual economic impacts of AI remain muted. The benchmark doesn&#8217;t fully reflect the economic effects, and so it&#8217;s falling short in its role as a leading indicator for automation.</p><p>This isn&#8217;t really OpenAI&#8217;s fault; rather, it&#8217;s a fundamental challenge with AI benchmarks. 
These need to be designed so that you can run evaluations automatically, quickly, and regularly, but this constrains us to tasks that are &#8220;clean&#8221; and close-ended. So when we try to generalize from them to the messy workflows and fuzzy tasks of the real world, we inevitably encounter many issues.</p><p>If this is right, then the problem is that we&#8217;ve mostly been searching under the streetlight, primarily tracking progress on the things that benchmarks can easily measure. But it also suggests that there&#8217;s a way to search beyond the streetlight: we just need to sacrifice a bit of rigor and ease of evaluation, and in exchange we can get a better sense of automation on <em>truly</em> real-world tasks.</p><p>Here&#8217;s the idea: pick out a set of <em>actual</em> work tasks that people do on a regular basis, see how well AI can do them, and check regularly how much better it is getting. This kind of subjective AI evaluation on realistic tasks isn&#8217;t a totally novel idea &#8212; for example, in AI safety, models are often &#8220;<a href="https://red.anthropic.com/">red-teamed</a>&#8221; in a way that involves subjectively rating their risk levels. There are also <a href="https://aleximas.substack.com/p/what-is-the-impact-of-ai-on-productivity">many studies</a> looking at AI uplift on real-world tasks. But I think there&#8217;s room for a lot more of this kind of analysis, especially because all the benchmarks seem to be running into the same bottlenecks. So I think it&#8217;s an underrated way of tracking AI&#8217;s potential impacts in actual jobs.</p><p>So to demonstrate how this could work, I decided to try getting today&#8217;s best AI agents to do my job.</p><h2>II. Trying to automate my job away (for science)</h2><p>The first step is to pick out three tasks that people have actually done at Epoch, with a heavy bias towards my own work. I thought it&#8217;d be good to focus on things that capture the full gamut of important AI bottlenecks &#8212; long-horizon planning and execution, creativity and taste, and so on. So these are tasks that should be relatively difficult for AIs today, but play an important role in my work.</p><p>After that, I spent 30&#8211;60 minutes trying to get AIs to autonomously solve each task and analyzed where current models seem to be falling short. I could&#8217;ve done a lot more to improve AI performance on each task, but I wanted this to broadly reflect how I&#8217;ve been using AI in practice given my time and work constraints. Then I roughly forecasted when I thought AIs would do each task to my liking (i.e. such that I&#8217;d personally use their outputs in practice), and how well AIs would do by the end of the year.</p><p>So if we put things together, this should give a more concrete picture of how close AI is to taking my job. 
Here&#8217;s where I ended up:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WoqK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f0cca8-74a5-4397-b8b1-90a0e75308a6_2558x1340.png"><img src="https://substackcdn.com/image/fetch/$s_!WoqK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f0cca8-74a5-4397-b8b1-90a0e75308a6_2558x1340.png" width="1456" height="763" alt=""></a><figcaption class="image-caption"><em>My probabilities that an AI successfully completes the given task by the specified date. For a task to be &#8220;completed&#8221;, the final product needs to be good enough for me to want to use it in practice. Tasks are done autonomously by the AI except for the second task, where two rounds of comments are provided.</em></figcaption></figure></div>
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>My probabilities that an AI successfully completes the given task by the specified date. For a task to be &#8220;completed&#8221;, the final product needs to be good enough for me to want to use it in practice. Tasks are done autonomously by the AI except for the second task, where two rounds of comments are provided.</em></figcaption></figure></div><p>To be clear, this doesn&#8217;t mean that I expect to lose my job by early 2029 &#8212; more on that later. But first, let&#8217;s dive into the details of the tasks.</p><h3>Task 1: Replicating an interactive web interface for an economic model</h3><p>About a year ago, we at Epoch released a complex economic model of AI automation, which we called the GATE model. And by &#8220;complex&#8221;, I mean that it has over forty parameters, is solved by some clunky variant of gradient descent, and has a bunch of add-ons that can make the simulation especially unstable.</p><p>To accompany the release, we provided an <a href="https://epoch.ai/gate">interactive web interface</a> that lets users input their own parameters, look at a bunch of fancy graphs about how economic growth could go crazy, and read some documentation. 
The point is, it&#8217;s by no means a simple thing to implement, and the frontend isn&#8217;t a piece of cake either.</p><p>Here&#8217;s what that looks like:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ole_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73dc92a7-f3a3-48fe-9535-fd2aefe14be7_1600x900.png"><img src="https://substackcdn.com/image/fetch/$s_!Ole_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73dc92a7-f3a3-48fe-9535-fd2aefe14be7_1600x900.png" width="1456" height="819" alt=""></a></figure></div>
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So the first task is this: <strong>Can an AI model build a functional replica that&#8217;s as good as the published web playground?</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> The AI is given access to the academic <a href="https://arxiv.org/abs/2503.04941">paper</a> and the original webpage, which includes a bunch of documentation. It&#8217;s also provided with descriptions on which graphs need to be created, as well as parameter ranges for each of the parameters.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>I asked Claude Code to implement this, mainly because it&#8217;s widely regarded as the best tool for coding. Initially it complained about the message being too long if I dumped in the whole paper PDF, so I gave it the <a href="https://arxiv.org/abs/2503.04941">arXiv URL</a>. 
It then spent a while looking up the equations in the paper, quickly wrote some HTML and JavaScript, and this was the result:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gyim!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8028af8-9e8c-4420-80ac-9114b92661cc_1600x1178.png"><img src="https://substackcdn.com/image/fetch/$s_!Gyim!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8028af8-9e8c-4420-80ac-9114b92661cc_1600x1178.png" width="1456" height="1072" alt=""></a></figure></div>
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This looks pretty cool to start off with, and it&#8217;s certainly a functioning website, but it goes downhill from there. The most crucial issue is that many of the model predictions differ vastly from the official simulation, which suggests something went wrong in the reimplementation. For example, the GWP change is way lower than what you observe in the mainline scenario in our actual GATE model.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Another issue is that the website has missing features, like a &#8220;comparison mode&#8221; that allows users to compare different parameter settings.</p><p>The big question is: how well will AI systems do on this task in the future? I think we can think about this in several ways.</p><p>One is to consider the trend in <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">METR&#8217;s time horizons</a>, which I think applies reasonably well given that this is a software task that&#8217;s pretty well-defined. By the end of the year, we expect AI to be able to do tasks roughly one day long with a 50% success rate.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> In comparison, I&#8217;d guess that this task would take several days for a person familiar with the paper and is able to play around with the web interface. So this task seems plausibly solvable by December, but it&#8217;s less likely than not.</p><p>Still, it&#8217;s a clunky task with a lot of moving parts. Perhaps the hardest part is the backend, which involves some nitty-gritty optimization with numerical stability problems &#8212; you&#8217;d probably need to stare at many graphs and iterate to get it right. The frontend has its own challenges, but I think Claude Code&#8217;s implementation isn&#8217;t so bad on this front.</p><p>Putting things together, I think that models have about a 10% chance of implementing this as well as the existing live version by the end of the year. In my mind, this&#8217;ll probably be because models flunk some part of the code implementation for the economic model&#8217;s dynamics, or there are some non-obvious bugs that need a bunch of people trying diverse things on the website to identify. 
<p>So my guess is that our web dev team will continue to produce interactive web interfaces that are of a similar complexity, like Epoch&#8217;s <a href="https://epoch.ai/tools/distributed-training">distributed training simulator</a> and webpage for <a href="https://epoch.ai/data/data-centers#">frontier data centers</a>, at least in terms of core functionality. And if you work on things of a similar difficulty, I think you&#8217;ll probably be in the same boat.</p><h3>Task 2: Writing an article</h3><p>Now let&#8217;s move on to the second task: Can AI write articles for this newsletter?</p><p>To test this, I asked Claude Opus 4.5 to try and write one of my previous <a href="https://epoch.ai/gradient-updates/how-well-did-forecasters-predict-2025-ai-progress">articles</a>, which summarized how well people forecast AI progress in 2025 (notably, this means that Claude didn&#8217;t need to come up with the post idea itself). I gave it the list of questions and resolutions from the AI Digest (who ran the survey), all the results, and asked it to look at our previous posts to get a sense of our style.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Its initial attempt at this wasn&#8217;t very good &#8212; it didn&#8217;t include graphs or links to relevant sources, and it even left out some of the questions from the survey entirely! The writing style was a tad stiff, the structure was odd (e.g. the survey demographics were put near the end of the article), and the results weren&#8217;t very well justified. For example, here&#8217;s Claude explaining why forecasters significantly underestimated progress on the cybersecurity benchmark Cybench:</p><blockquote><p><em>Cybersecurity evaluations have received less public attention than coding or math benchmarks, and forecasters may have had weaker mental models of the rate of progress in this domain. It also suggests that capability improvements on agentic tasks may be occurring faster than the broader forecasting community appreciates.</em></p></blockquote><p>As far as I can tell, there isn&#8217;t any justification for any of the claims in the paragraph &#8212; don&#8217;t ask me how Claude came to these conclusions!</p><p>That said, it&#8217;s a bit unfair of me to judge it purely on its zero-shot performance &#8212; when I&#8217;m writing I usually get lots of helpful comments from others. So I gave Claude two rounds of feedback, totalling around forty comments, each about one to two sentences long. 
For example:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gw10!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14bbecc-8ea8-422a-8002-87905a687805_1600x751.png"><img src="https://substackcdn.com/image/fetch/$s_!Gw10!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14bbecc-8ea8-422a-8002-87905a687805_1600x751.png" width="1456" height="683" alt=""></a></figure></div>
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Unfortunately, while this solved some problems (like adding graphs and links), new issues emerged. Sometimes Claude would fail to resolve subtle graphical issues that would normally involve multiple rounds of iteration, like putting text in the right part of the graph. And there were many such problems littered throughout the article &#8212; even after my feedback, the writing was still stilted, had many inaccuracies, and terms that weren&#8217;t explained to my liking.</p><p>In the end, the post wasn&#8217;t unreadable, but there were so many small errors that I felt better off going back to square one and rewriting the post from scratch. So, at least in the near future, I&#8217;ll continue to be writing articles.</p><p>But just how near is the &#8220;near future&#8221;? <strong>When will AI essentially be able to write a full article of this complexity that I&#8217;ll really like, after giving a few rounds of comments?</strong> Unlike Task 1, I don&#8217;t see a clear trend that we can extrapolate to forecast progress. So I think we&#8217;ll probably need to turn to the age-old tradition of &#8220;vibes-based forecasting&#8221;.</p><p>On the one hand, there are a bunch of reasons to expect AI to get a lot better on this task very soon. One is that AI writing has improved a ton over the last few years, and it&#8217;s not crazy at all to expect that this continues. Labs have also been making efforts to improve writing, as we&#8217;ve seen with models like GPT-4.5, and people are increasingly building <a href="https://surgehq.ai/blog/hemingway-bench-ai-writing-leaderboard">writing</a> <a href="https://eqbench.com/creative_writing.html">benchmarks</a>.</p><p>What&#8217;s more, the task of writing an article isn&#8217;t just about the quality of the <em>writing</em> itself. In many ways, Claude was actually more bottlenecked on things like data analysis and graphing, as well as insufficient context about my writing preferences. But I expect AIs to get vastly better at practically any kind of coding-related task over the next few years, and they&#8217;ll also have a <a href="https://epochai.substack.com/p/the-huge-potential-implications-of">greater degree of continual learning</a>, which would help surmount these barriers. Surely it&#8217;s not that hard to give AIs context about my writing preferences and Epoch&#8217;s worldview?</p><p>On the other hand, there are also reasons for slower progress, at least relative to domains like coding. 
For one, labs are probably going to focus on things that are more lucrative than writing, like finance and coding. Consider that there are around <a href="https://www.bls.gov/ooh/Computer-and-Information-Technology/Software-developers.htm">1.7 million</a> software engineering jobs in the US, with a median pay of <a href="https://www.bls.gov/ooh/Computer-and-Information-Technology/Software-developers.htm">$133,000</a>. That&#8217;s roughly five times the number of writing jobs (around 350,000), at nearly double the median pay (in the ballpark of $70,000).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>This relative focus on coding becomes even stronger because writing is one of those fuzzy tasks where it&#8217;s hard to say what&#8217;s good or bad. So improvements in writing are hard to measure, making it difficult for AI companies to &#8220;hill-climb&#8221; their way to becoming Ernest Hemingway. This is also part of the reason why reasoning <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models">seems to help more</a> on tasks like math than writing.</p><p>Personally, I find the arguments for slower progress more compelling. I&#8217;d probably give a 5% chance that AI will be better at writing these posts than me by the end of the year, and maybe 50% by late 2028 or early 2029. For posts that are more complex and require coming up with novel ideas, I&#8217;d probably go a bit later than that, but that depends on how well I can actually write those posts!</p><p>This means that I&#8217;m more bearish on AI progress for this task compared to the previous one &#8212; that&#8217;s interesting because I actually find this task a lot easier. Perhaps this is a case of &#8220;<a href="https://epoch.ai/gradient-updates/moravec-s-paradox">Moravec&#8217;s Paradox</a> strikes again&#8221; &#8212; AI is often good at the things humans find hard, and bad at the things we find easy.</p><h3>Task 3: Publishing an article</h3><p>The third and final task may be even more susceptible to Moravec&#8217;s Paradox: <strong>Can AI port an article from Google Docs to Substack and Epoch&#8217;s website?</strong> This requires agentic computer use, which seems to trip up AI models in strange ways.</p><p>Most of this task is boring grunt work: many tedious steps across different platforms. We start off with a finalized draft of the post and thread written up in Google Docs, which then gets ported to two places:</p><ul><li><p>Epoch&#8217;s website: This involves making a new branch on the GitHub repo for Epoch&#8217;s website, creating a new markdown document with the right metadata, copying all the content in, formatting the tables, attaching images, and so on.</p></li><li><p>Substack: This one is easier, but there are some Substack oddities. You can&#8217;t make tables, so we have to take screenshots of the tables from our website&#8217;s preview. And footnotes don&#8217;t get copied over from Google Docs, so you need to add them manually.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></li></ul><p>Overall, it usually takes me about two hours to do this task. If only it were as simple as a single copy and paste, life would be so much easier &#8212; or so I thought. 
To my surprise, Claude with its Chrome browser extension struggled to copy and paste from Google Docs to Substack:</p><ul><li><p>First it tried to download the file. It clicked on <code>File</code>, but then complained that no window had opened, even though I could see it opening. Not sure what&#8217;s going on there.</p></li><li><p>Then it tried to highlight everything in the Google Doc and do a copy+paste operation. But for some reason it wasn&#8217;t able to do this for all the text, and had to resort to copying text block by block, scrolling down the page.</p></li><li><p>This was taking forever, so I tried restarting the task, (naively?) hoping that things would be better this time. Unfortunately it got worse: it ran into the same issue, but instead of copying text block by block, it opted for the strategy of scrolling down the page and <em>reading</em> the text screenshot by screenshot!</p></li></ul><p>After it started doing that, I threw in the towel and decided I&#8217;d try another agent. This time it was ChatGPT Atlas&#8217; turn. Initially it seemed to get stuck identifying which tabs were open, but after another try it managed to get going. To my surprise, it successfully copied the main post onto Substack &#8212; after watching Claude&#8217;s agonizing failed attempts to copy and paste text, this was mind-blowing.</p><p>Alas, I only managed to feel the AGI for about fifteen more seconds before ChatGPT started making some really silly mistakes. It transferred the footnotes, but without using Substack&#8217;s footnote feature, and it didn&#8217;t add all of them. Even worse, the footnotes it did add were all made up, and there was a bizarre formatting issue where ChatGPT placed the cursor in the wrong place:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!JUTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b4e822-86e9-47df-8927-8cce35cc3f35_884x990.png" width="884" height="990" alt=""><figcaption class="image-caption"><em>ChatGPT Agent successfully ports the main text of our <a href="https://epoch.ai/gradient-updates/can-ai-companies-become-profitable">last Gradient Update</a>, but messes up the footnotes.</em></figcaption></figure></div><p>So at the moment, AI agents seem to fail quite miserably at this task, and they are also extremely slow. 
This is consistent with Claude&#8217;s <a href="https://www.twitch.tv/claudeplayspokemon">performance</a> on Pok&#233;mon, as well as with the finding that METR&#8217;s time horizons are <a href="https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/">40-100&#215;</a> <a href="https://metr.org/notes/2026-01-22-time-horizon-limitations/">shorter</a> for visual computer use tasks than their <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">headline numbers</a>. For example, this could bring a five hour (300 minute) time horizon down to a three minute one.</p><p>But while the time horizons are much shorter, the <em>growth rate</em> is about the same as in METR&#8217;s main results, with roughly two doublings each year. Today&#8217;s agents are also <a href="https://time.com/7345903/ai-chatgpt-claude-gemini-pokemon/">much better</a> at Pok&#233;mon than they were a year ago. If these trends hold, we should expect to see very fast progress in the near future.</p>
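<p>To make that extrapolation concrete, here&#8217;s a minimal sketch of the arithmetic (the three-minute starting horizon and the doubling rate are illustrative assumptions taken from the discussion above, not official METR figures):</p><pre><code>from math import log2

# Illustrative assumptions, not official METR numbers:
# a ~3-minute computer-use horizon today, doubling roughly twice a year
horizon_minutes = 3.0
doublings_per_year = 2.0

# Years until agents reach a two-hour computer-use horizon,
# roughly the length of my publishing task
target_minutes = 120.0
years = log2(target_minutes / horizon_minutes) / doublings_per_year
print(f"{years:.1f} years")  # ~2.7 years at this growth rate
</code></pre><p>That back-of-the-envelope number lands in the same ballpark as my mid-2028 guess below, though of course both the starting horizon and the growth rate are highly uncertain.</p>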
<p>And people are trying to make it happen &#8212; labs have been creating agents like <a href="https://openai.com/index/introducing-chatgpt-agent/">ChatGPT Agent</a>, <a href="https://www.kimi.com/blog/kimi-k2-5.html">Kimi K2.5&#8217;s Agent Swarm</a>, and <a href="https://x.com/claudeai/status/2010805682434666759">Claude Cowork</a>, and we&#8217;re still in the early days of AI agents. There are also huge incentives for labs to push in this direction, overcoming bottlenecks to AI adoption among <a href="https://www.itu.int/en/mediacentre/Pages/PR-2025-11-17-Facts-and-Figures.aspx">billions</a> of computer users worldwide.</p><p>Of course, there are some bottlenecks that we need to account for. One bottleneck that I learned about the hard way is reliability. When ChatGPT Agent was messing around on Substack, there was a point when the cursor hovered over the button to publish the post. I was practically freaking out at the idea of a post with garbled footnotes going out to over ten thousand subscribers, but fortunately nothing happened, and I was able to intervene.</p><p>By the end of the year, I&#8217;d probably give roughly a 10% chance of AIs completing the task autonomously, though they might need to do it very, very slowly. In the median case, I think AI agents will get much further in the task &#8212; porting most of the footnotes correctly, adding content to an IDE and attaching the images, and so on. But I think it&#8217;ll be very slow, there&#8217;ll be missed steps, and more likely than not it&#8217;ll get stuck on something that&#8217;s trivial for most computer-literate humans.</p><p>Looking further out, I think there&#8217;s a 50-50 chance that AI agents can complete this task by around mid-2028.</p><h2>III. What this all means for my job, and perhaps yours too</h2><p>Let&#8217;s take stock of what we&#8217;ve seen so far. Based on the three tasks, today&#8217;s AI seems unable to fully generate complex interactive web simulations, is okay at writing articles, and very bad at publishing them. But I also think that we&#8217;re only a few years away from AIs that can solve these tasks, and the tasks that are hard for me might not be the ones that are hard for the AIs. It depends on how amenable those tasks are to things like RL, and on how economically valuable they are.</p><p>Looking at the forecasts, the last task that I expect to fall comes in early 2029, but that doesn&#8217;t necessarily mean that AI will <em>replace</em> me by 2029. The obvious reason is that these three tasks don&#8217;t comprehensively represent my job, and I wasn&#8217;t trying to pick out the hardest-to-automate tasks. So even if AI is able to solve them, there&#8217;s a good chance that I&#8217;ll just work on other things instead. The bottleneck shifts to something new.</p><p>For example, we at Epoch have mostly prepared for podcast episodes by brainstorming questions in advance. Now let&#8217;s say that Claude becomes better than us at coming up with good podcast questions in advance. Then the bottleneck shifts from writing questions to being able to understand them and ask good follow-ups during the recording itself, anticipating on the fly what kinds of things humans might be curious about.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!B1sc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed908753-3ee8-4618-9f55-1c29af2fd4b8_1600x900.png" width="1456" height="819" alt=""><figcaption class="image-caption"><em>The bottleneck is shifting from us Epochians preparing good questions in advance, to being able to understand them and ask good follow-ups during real conversation. But there may still be some time before Humanity&#8217;s Last Podcast. (Source: <a href="https://epoch.ai/epoch-after-hours/daniel-litt-ai-math-capabilities-could-be-jagged-for-a-long-time">Epoch After Hours</a>)</em></figcaption></figure></div><p>So if we want to know when AI will take my job, rather than just &#8220;do a bunch of tasks&#8221;, we also need to know how far these bottlenecks can shift. My &#8220;strong opinion, loosely held&#8221; is that they can shift surprisingly far. One reason is that Moravec&#8217;s Paradox really messes with our intuitions. AI capabilities are very spiky, and if we don&#8217;t know what this spikiness looks like, it becomes hard to predict what kinds of things will stump future AIs. This is doubly difficult because we often don&#8217;t know what bottlenecks exist until we actually encounter them. The last bottlenecks will eventually fall. But if we forecast job automation by saying &#8220;job X is just doing Y&#8221; and predicting when AI can do Y, we&#8217;ll likely produce overly aggressive timelines.</p><p>In practice, I imagine the path towards automating my job probably looks something like this: in the next year or two, my day-to-day work will look pretty similar to today at a high level. But if we narrow down to subtasks, we&#8217;ll see a lot more AI use, like how GPT-5.2 is now essentially my default search engine instead of Google. 
Between 2027 and 2029, AI will continue to get much better at coding, writing, and agentic computer use, until eventually it&#8217;s able to do large chunks of what I do today. I&#8217;ll increasingly shift my attention to things that are hard to stuff into AI contexts: managing multiple AIs, dealing with bizarre AI weaknesses, and doing more high-level ideation and other fuzzy tasks. This continues for a handful more years before it&#8217;s finally game over for my job.</p><h3>What about you and your job?</h3><p>For the most part, I&#8217;ve framed this post around AI progress on my job, and how I personally expect to be using AIs. But what about everyone else and their jobs?</p><p>First of all, while I probably use AI more than 99% of people on Earth, there are some people who use AI much more than I do. I suspect this is especially true for coders based in the Bay Area. As a particularly extreme example, Boris Cherny (the creator of Claude Code) puts my puny Claude + Chrome extension setup to shame:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ernJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88b63b7-ff78-4efb-9fc5-d32ad9897d75_1194x974.png" width="1194" height="974" alt=""><figcaption class="image-caption"><em>(<a href="https://x.com/bcherny/status/2007179833990885678">Image source</a>)</em></figcaption></figure></div><p>This can strongly influence just what kinds of tasks AI systems can or cannot do. Some AI users might be able to elicit AI capabilities a lot better than I can &#8212; which in turn affects how AI changes their work, and how far AI is from taking their jobs.</p><p>There&#8217;s clearly also a ton of variation across different occupations, which you can see from resources like <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf">How People Use ChatGPT</a> and the <a href="https://www.anthropic.com/economic-index">Anthropic Economic Index</a>, or even just by asking your friends. So unless you and I work on similar things, I can&#8217;t say much about AI timelines for your specific job.</p><p>The solution is to have lots of people from various AI-exposed fields do a similar analysis to the one I&#8217;ve done here &#8212; economists, lawyers, mathematicians, and so on. The procedure is simple: pick out three work tasks that you do regularly (and ideally spend a large fraction of your time on), and then spend a bit of time getting AI to attempt each task. 
Document how much progress the AI makes, look at the bottlenecks, and make your forecasts for when they&#8217;ll be overcome.</p><p>I&#8217;m hoping that this&#8217;ll give a lot more concrete evidence about how AI can actually be used, and where it&#8217;s bottlenecked, in truly real-world tasks, especially in your own personal situation. So it might be worthwhile trying to get AI to automate your job &#8212; you might learn things about both your life decisions and the future of labor automation. That&#8217;s something that even a million-dollar benchmark may not be able to do.</p><p><em>I&#8217;d like to thank JS Denain, David Owen, Greg Burnham, Luke Emberson, Markov Grey, Jaime Sevilla, and Lynette Bye for helpful feedback and support.</em></p><div><hr></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I don&#8217;t know of an official source for how much this cost, but we can do a ballpark estimate: The full <a href="https://arxiv.org/abs/2510.04374v1">benchmark</a> has 1320 tasks, each of which needs 7 hours of expert time to complete (on average). Assuming that experts are paid around $75 per hour, that adds up to around $700,000, just accounting for the costs of measuring human performance. But this ignores costs from building the benchmark, operations, and multiple rounds of expert review, which I think are likely to raise the costs into the millions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Hat tip to J&#233;r&#233;my and Tangui for the inspiration behind this question.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>So far, the code for both the model and the web implementation has been private or shared with only a handful of collaborators, so I think the chances of test set leakage are quite low. 
Note also that this task isn&#8217;t 100% the same as the one that our web dev team had to deal with &#8212; it&#8217;s easier in some ways (e.g. there&#8217;s a completed paper and webpage which the AI can reference), but harder in others (e.g. it doesn&#8217;t have the benefit of iterating many times with the people who designed the economic model).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Funnily enough, this trajectory probably seems more plausible to most economists, who are extremely skeptical of explosive economic growth. That said, even to them, the GWP growth rate should look quite odd &#8212; it starts off at around 5-6% without much automation, then goes down to 3% with more automation, and then goes back up to 5-6% around the time of full automation!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>To give a sense of where this number comes from, the current time horizon at a 50% success rate is around <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">6 hours</a> &#8212; Claude Opus 4.5&#8217;s time horizon is 5.3 hours, and GPT-5.2 (high)&#8217;s time horizon is <a href="https://x.com/METR_Evals/status/2019169900317798857">6.6 hours</a>. Each year we might see roughly two doublings of the time horizon, so by the end of the year this might be about a day long.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>I gave it links to several of our most recent Gradient Updates articles, and asked it to look at them to get a sense of the formatting and style. I also asked it not to look at any other Gradient Updates, so that it wouldn&#8217;t look at the already-published version of the target post &#8212; there&#8217;s probably a smarter way of doing this, but I think it should do the trick for the sake of this post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Here&#8217;s a rough justification: according to the Bureau of Labor Statistics, there are <a href="https://www.bls.gov/ooh/media-and-communication/writers-and-authors.htm">135,400</a> &#8220;writers and authors&#8221; jobs, <a href="https://www.bls.gov/ooh/Media-and-Communication/Technical-writers.htm">56,400</a> &#8220;technical writer&#8221; jobs, <a href="https://www.bls.gov/ooh/media-and-communication/editors.htm">115,800</a> &#8220;editors&#8221;, and <a href="https://www.bls.gov/ooh/media-and-communication/reporters-correspondents-and-broadcast-news-analysts.htm">49,300</a> &#8220;news analysts, reporters, and journalists&#8221;. 
In total that&#8217;s around 350,000 writers, and if we eyeball the numbers in these links the median pay is probably in the ballpark of $70,000.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>If you know of a better way to do this, <em>please</em> let me know.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Can AI companies become profitable?]]></title><description><![CDATA[Lessons from GPT-5&#8217;s economics]]></description><link>https://epochai.substack.com/p/can-ai-companies-become-profitable</link><guid isPermaLink="false">https://epochai.substack.com/p/can-ai-companies-become-profitable</guid><dc:creator><![CDATA[Jaime Sevilla]]></dc:creator><pubDate>Wed, 28 Jan 2026 23:50:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aqoY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c851485-87e2-4efa-ad20-02255014ecd8_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post was written in collaboration with <a href="https://www.exponentialview.co/">Exponential View</a>. It is part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p>Originally posted on <a href="https://epoch.ai/gradient-updates/can-ai-companies-become-profitable">Epoch AI</a>.</p><p><em><strong>Update (March 6, 2026):</strong> We&#8217;ve revised our estimates based on new information and feedback from people familiar with the matter. In particular, we&#8217;ve 1) increased our estimate of inference costs given new information, and 2) lowered our estimate of sales and marketing spending after excluding inference compute costs associated with serving free users. This article reflects these updated figures.</em></p><div><hr></div><p>Are AI models profitable? If you ask <a href="https://www.axios.com/2025/08/15/sam-altman-gpt5-launch-chatgpt-future">Sam Altman</a> and <a href="https://open.substack.com/pub/cheekypint/p/a-cheeky-pint-with-anthropic-ceo?r=5erk95&amp;selection=97f244b2-8e86-48a6-a4d4-ea366c90bc83&amp;utm_campaign=post-share-selection&amp;utm_medium=web&amp;aspectRatio=instagram&amp;textColor=%23ffffff&amp;bgImage=true">Dario Amodei</a>, the answer seems to be yes &#8212; it just doesn&#8217;t appear that way on the surface.</p><p>Here&#8217;s the idea: running each AI model generates enough revenue to cover its own R&amp;D costs. But that surplus gets outweighed by the costs of developing the <em>next</em> big model. So, despite making money on each model, companies can lose money each year.</p><p>This is big if true. In fast-growing tech sectors, investors typically accept losses today in exchange for big profits down the line. So if AI models are already covering their own costs, that would paint a healthy financial outlook for AI companies.</p><p>But we can&#8217;t take Altman and Amodei at their word &#8212; you&#8217;d expect CEOs to paint a rosy picture of their company&#8217;s finances. 
And even if they&#8217;re right, we don&#8217;t know just <em>how</em> profitable models are.</p><p>To shed light on this, we looked into a notable case study: using public reporting on OpenAI&#8217;s finances,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> we made an educated guess on the profits from running GPT-5, and whether that was enough to recoup its R&amp;D costs. Here&#8217;s what we found:</p><ul><li><p><strong>Whether GPT-5 was profitable to run depends on which profit margin you&#8217;re talking about. </strong>If we subtract the cost of compute from revenue to calculate the gross margin (on an accounting basis),<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> it seems to be about 30% &#8212; lower than the norm for software companies (where 60-80% is typical) but still higher than many industries.</p></li><li><p>But if you also subtract other operating costs, including salaries and marketing, <strong>then OpenAI just barely broke even</strong>, even without including R&amp;D. And they likely ended up at a loss after accounting for their revenue-sharing deal with Microsoft.</p></li><li><p><strong>Moreover, OpenAI likely failed to recoup the costs of developing GPT-5 during its 4-month lifetime. </strong>Even using gross profit, GPT-5&#8217;s tenure was too short to bring in enough revenue to offset OpenAI&#8217;s R&amp;D costs in the four months prior to launch. So if GPT-5 is at all representative, then at least for now, developing and running AI models is loss-making.</p></li></ul><p>This doesn&#8217;t necessarily mean that models like GPT-5 are a bad investment. Even an unprofitable model demonstrates progress, which attracts customers and helps labs raise money to train future models &#8212; and that next generation may earn far more. What&#8217;s more, the R&amp;D that went into GPT-5 likely informs future models like GPT-6. 
So these labs might have a much better financial outlook than it initially seems.</p><p>Let&#8217;s dig into the details.</p><h2>Part I: How profitable is running AI models?</h2><p>To answer this question, we consider a case study which we call the &#8220;GPT-5 bundle&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> This includes all of OpenAI&#8217;s offerings available during GPT-5&#8217;s lifetime as the flagship model &#8212; GPT-5 and GPT-5.1, GPT-4o, ChatGPT, the API, and so on.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> We then estimate the revenue and costs of running the bundle.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Revenue is relatively straightforward: since the bundle includes all of OpenAI&#8217;s models, this is just <a href="https://epoch.ai/data/ai-companies#explore-the-data">their total revenue</a> over GPT-5&#8217;s lifetime, from August to December last year.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> This works out to $6 billion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>At first glance, $6 billion sounds healthy, until you set it against the costs of running the GPT-5 bundle. These costs come from four main sources:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><ol><li><p><strong>Inference compute</strong>: <strong>$4 billion</strong>. This is based on <a href="https://www.theinformation.com/articles/openai-boost-revenue-forecasts-predicts-112-billion-cash-burn-2030">public</a> <a href="https://www.theinformation.com/articles/openai-spend-100-billion-backup-servers-ai-breakthroughs?rc=spkbjw">estimates</a> of OpenAI&#8217;s total inference compute spend in 2025, together with the assumption that the allocation of compute during GPT-5&#8217;s tenure was proportional to the fraction of the year&#8217;s revenue raised in that period.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> This cost covers serving both paid and free users.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p></li><li><p><strong>Staff compensation: $1 billion</strong>, which we can back out from <a href="https://epoch.ai/data/ai-companies?dataView=staff&amp;yAxis=Staff+count#explore-the-data">OpenAI staff counts</a>, reports on <a href="https://www.wsj.com/tech/ai/openai-is-paying-employees-more-than-any-major-tech-startup-in-history-23472527">stock compensation</a>, and things like <a href="https://h1bgrader.com/h1b-sponsors/openai-opco-llc-60d97wl6k8/salaries/2025">H1B filings</a>. 
One big uncertainty with this: how much of the stock compensation goes toward running models, rather than R&amp;D? We assume 40%, matching the fraction of <em><a href="https://www.theinformation.com/articles/openai-spend-100-billion-backup-servers-ai-breakthroughs">compute</a></em> that goes to inference. Whether staffing follows the same split is uncertain, but it&#8217;s our best guess.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p></li><li><p><strong>Sales and marketing (S&amp;M): $0.5 billion</strong>. This includes expenses such as their Super Bowl ad and business sales campaigns, promotions and discounts, and so on. Notably, it excludes the compute costs of serving free users.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a></p></li><li><p><strong>Legal, office, and administrative costs: $0.2 billion</strong>, assuming these grew between 1.6&#215; and 2&#215; relative to their <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026">2024 expenses</a>. This accounts for <a href="https://aibusiness.com/responsible-ai/openai-signs-partnership-with-uk-government-plans-office-expansion">office expansions</a>, <a href="https://www.reuters.com/world/asia-pacific/openai-open-office-seoul-amid-growing-demand-chatgpt-2025-05-26/">new office setups</a>, and rising <a href="https://www.reuters.com/world/asia-pacific/openai-open-office-seoul-amid-growing-demand-chatgpt-2025-05-26/">administrative costs</a> with their growing workforce.</p></li></ol><p>So what are the profits? One option is to look at gross profits. This only considers the direct cost of running a model, which in this case is just the inference compute cost of $4 billion. Since the revenue was $6 billion, this leads to a profit of around $2 billion, or a gross profit margin of around 30%.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> This is lower than for other software businesses (typically <a href="https://www.key.com/content/dam/kco/documents/businesses___institutions/2024_kbcm_sapphire_saas_survey.pdf">70-80%</a>) but high enough to eventually build a business on.</p>
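<p>As a sanity check, here&#8217;s the margin arithmetic in a minimal sketch (all figures are the rough estimates from this section, in billions of dollars, not reported accounting data):</p><pre><code># Rough estimates from the text, in $ billions (August to December)
revenue = 6.0
inference = 4.0        # inference compute
staff = 1.0            # compensation attributed to running models
sales_marketing = 0.5  # sales and marketing
admin = 0.2            # legal, office, and administrative

gross_margin = (revenue - inference) / revenue
operating_profit = revenue - (inference + staff + sales_marketing + admin)

print(f"gross margin: {gross_margin:.0%}")            # 33%, i.e. "around 30%"
print(f"operating profit: ${operating_profit:.1f}B")  # ~$0.3B: roughly break-even
</code></pre>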
<p>On the other hand, if we add up all four cost types, we get close to $6 billion. That&#8217;s close to the revenue, so in operating profit terms, we think the GPT-5 bundle just about broke even.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a></p><p>Stress-testing the analysis with more aggressive or conservative assumptions doesn&#8217;t change the picture much:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!sHkA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea97ebf-f57b-4855-b0a3-6aff602b280f_1256x404.png" width="1256" height="404" alt=""><figcaption class="image-caption"><em>Confidence intervals are obtained from a Monte Carlo analysis. Margins are rounded to the nearest 5%.</em></figcaption></figure></div><p>And there&#8217;s one more hiccup: OpenAI signed a deal with Microsoft to hand over about <a href="https://www.theinformation.com/articles/openai-gain-50-billion-cutting-revenue-share-microsoft-partners?rc=spkbjw">20%</a> of their $6 billion revenue,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> which leaves them at a loss.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a> This doesn&#8217;t mean that the revenue deal is entirely harmful to OpenAI &#8212; for example, Microsoft also shares revenue back to OpenAI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a> And the deal probably shouldn&#8217;t significantly affect how we see model profitability &#8212; it has more to do with OpenAI&#8217;s economic structure than with anything fundamental to AI models. But the fact that OpenAI and Microsoft <a href="https://techcrunch.com/2025/03/05/u-k-s-competition-authority-says-microsofts-openai-partnership-doesnt-quality-for-investigation/">have been</a> <a href="https://blogs.microsoft.com/blog/2025/10/28/the-next-chapter-of-the-microsoft-openai-partnership/">renegotiating</a> this deal suggests it&#8217;s a real drag on OpenAI&#8217;s path to profitability.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!teMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2c55a78-fa4f-47c7-97b5-18628bfe3b0f_5120x5120.png" width="1456" height="1456" alt=""></figure></div>
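<p>For readers who want to reproduce the flavor of that stress test, here&#8217;s a minimal Monte Carlo sketch. The uncertainty ranges are my own illustrative assumptions, not the distributions behind the figure above:</p><pre><code>import random

random.seed(0)
margins = []
for _ in range(100_000):
    # Illustrative ranges in $ billions, centered on the point estimates above
    revenue = random.uniform(5.5, 6.5)
    inference = random.uniform(3.0, 5.0)
    other_opex = random.uniform(1.2, 2.2)  # staff, sales and marketing, admin
    margins.append((revenue - inference - other_opex) / revenue)

margins.sort()
lo, mid, hi = margins[2_500], margins[50_000], margins[97_500]
print(f"operating margin: {mid:.0%} (95% interval: {lo:.0%} to {hi:.0%})")
</code></pre><p>Under these assumptions the operating margin straddles zero, matching the &#8220;just about broke even&#8221; picture above.</p>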
<p>In short, running AI models is likely profitable in the sense of having decent gross margins. But OpenAI&#8217;s operating margin, which includes marketing and staffing, is likely close to zero. For a fast-growing company, though, operating margins can be misleading &#8212; S&amp;M costs typically grow sublinearly with revenue, so gross margins are arguably a better proxy for long-run profitability.</p><p>So our numbers don&#8217;t necessarily contradict Altman and Amodei yet. But so far we&#8217;ve only seen half the story &#8212; we still need to account for R&amp;D costs, which we&#8217;ll turn to now.</p><h2>Part II: Are models profitable over their lifecycle?</h2><p>Let&#8217;s say we buy the argument that we should look at gross margins. On those terms, it was profitable to run the GPT-5 bundle. But was it profitable enough to recoup the costs of developing it?</p><p>In theory, yes &#8212; you just have to keep running a model, and sooner or later it&#8217;ll earn enough revenue to recoup these costs. But in practice, models might have too short a lifetime to make enough revenue. For example, they could be outcompeted by products from rival labs, forcing them to be replaced.</p><p>So let&#8217;s go back to the GPT-5 bundle. We&#8217;ve already estimated its gross profits at around $2 billion. How do these compare to its R&amp;D costs?</p><p>Estimating this turns out to be a finicky business. We estimate that OpenAI spent $15 billion on R&amp;D in 2025,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a> but there&#8217;s no conceptually clean way to attribute some fraction of this to the GPT-5 bundle. We&#8217;d need to make several arbitrary choices: should we count the R&amp;D effort that went into earlier reasoning models, like o1 and o3? 
Or what if experiments failed, and didn&#8217;t directly change how GPT-5 was trained? Depending on how you answer these questions, the development cost could vary significantly.</p><p>But we can still do an illustrative calculation: let&#8217;s conservatively assume that OpenAI started R&amp;D on GPT-5 after o3&#8217;s <a href="https://openai.com/index/introducing-o3-and-o4-mini/">release</a> last April. Then there&#8217;d still be four months between then and GPT-5&#8217;s release in August,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a> during which OpenAI spent around $5 billion on R&amp;D.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a> But that&#8217;s <em>still</em> higher than the $2 billion of gross profits. In other words, <strong>OpenAI spent more on R&amp;D in the four months preceding GPT-5, than it made in gross profits during GPT-5&#8217;s four-month tenure.</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mrWO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mrWO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 424w, https://substackcdn.com/image/fetch/$s_!mrWO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 848w, https://substackcdn.com/image/fetch/$s_!mrWO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 1272w, https://substackcdn.com/image/fetch/$s_!mrWO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mrWO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png" width="1456" height="1820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:454685,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/186119616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mrWO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 424w, https://substackcdn.com/image/fetch/$s_!mrWO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 848w, https://substackcdn.com/image/fetch/$s_!mrWO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 1272w, https://substackcdn.com/image/fetch/$s_!mrWO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a736f8b-7579-4706-bb1d-37d479827003_4096x5120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So in practice, it seems like model tenures might indeed be too short to recoup R&amp;D costs. 
Indeed, GPT-5&#8217;s short tenure was driven by external competition &#8212; <a href="https://www.wired.com/story/openai-gpt-launch-gemini-code-red/">Gemini 3 Pro</a> had arguably surpassed the GPT-5 base model within three months.</p><p>One way to think about this is to treat frontier models like <strong>rapidly-depreciating infrastructure</strong>: their value must be extracted before competitors or successors render them obsolete. So to evaluate AI products, we need to look at both inference profit margins and the time it takes for users to migrate to something better. In the case of the GPT-5 bundle, we find that it&#8217;s decidedly unprofitable over its full lifecycle, even from a gross margin perspective.</p><h2>Part III: Will AI models become profitable?</h2><p>So the finances of the GPT-5 bundle are less rosy than Altman and Amodei suggest. And while we don&#8217;t have as much direct evidence on models from other labs, they&#8217;re plausibly in a similar boat &#8212; for instance, Anthropic has <a href="https://www.theinformation.com/articles/anthropic-lowers-profit-margin-projection-revenue-skyrockets">reported</a> similar gross margins to OpenAI. So it&#8217;s worth thinking about what it means if the GPT-5 bundle is at all representative of other models.</p><p>The most crucial point is that these model lifecycle losses aren&#8217;t necessarily cause for alarm. AI models don&#8217;t need to be profitable <em>today</em>, as long as companies can convince investors that they will be in the future. That&#8217;s standard for fast-growing tech companies.</p><p>Early on, investors value growth over profit, believing that once a company has captured the market, it&#8217;ll eventually figure out how to make the business profitable. The archetypal example of this is Uber, which accumulated a <a href="https://www.sec.gov/Archives/edgar/data/1543151/000154315123000010/uber-20221231.htm">$32.5 billion deficit</a> over 14 years of net losses before its first profitable year in 2023. By that measure, OpenAI is thriving: revenues are tripling annually, and <a href="https://epoch.ai/gradient-updates/openai-is-projecting-unprecedented-revenue-growth">projections</a> show continued growth. If that trajectory holds, profitability looks very likely.</p><p>And there are even reasons to be <em>really</em> bullish about AI&#8217;s long-run profitability &#8212; most notably, the sheer scale of value that AI could create. <a href="https://www.wired.com/story/sam-altman-says-the-gpt-5-haters-got-it-all-wrong/">Many</a> <a href="https://www.darioamodei.com/essay/machines-of-loving-grace#:~:text=However%2C%20I%20do%20think%20in%20the%20long%20run%20AI%20will%20become%20so%20broadly%20effective%20and%20so%20cheap%20that%20this%20will%20no%20longer%20apply.%20At%20that%20point%20our%20current%20economic%20setup%20will%20no%20longer%20make%20sense%2C%20and%20there%20will%20be%20a%20need%20for%20a%20broader%20societal%20conversation%20about%20how%20the%20economy%20should%20be%20organized.">higher</a>-<a href="https://www.youtube.com/watch?v=PqVbypvxDto&amp;t=2309s">ups</a> <a href="https://x.com/elonmusk/status/1980765809338147193">at</a> AI companies expect AI systems to outcompete humans across virtually all economically valuable tasks. If you truly believe that in your heart of hearts, that means potentially capturing <a href="https://epoch.ai/epoch-after-hours/ai-in-2030#:~:text=So%20I%20think%20one,trying%20to%20capture%20that.">trillions of dollars</a> from labor automation. 
The resulting revenue growth could dwarf development costs even with thin margins and short model lifespans.</p><p>That&#8217;s a big leap, and some investors won&#8217;t buy the vision. Or they might doubt that massive revenue growth automatically means huge profits &#8212; what if R&amp;D costs scale up like revenue? These investors might pay special attention to the profit margins of current AI, and want a more concrete picture of how AI companies could be profitable in the near term.</p><p>There&#8217;s an answer for these investors, too. Even if you doubt that AI will become good enough to spark the intelligence explosion or <a href="https://www.darioamodei.com/essay/machines-of-loving-grace#1-biology-and-health">double human lifespans</a>, there are still ways that AI companies could turn a profit. For example, OpenAI is now <a href="https://x.com/openai/status/2012223373489614951">rolling out ads</a> to some ChatGPT users, which could add between <a href="https://www.theinformation.com/articles/openais-international-conundrum">$2 billion and $15 billion</a> in yearly revenue even without any user growth.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a> They&#8217;re moving beyond individual consumers and increasingly <a href="https://www.youtube.com/watch?v=tUVSuFT301U">leaning on enterprise adoption</a>. Algorithmic innovations mean that running models could get many times <a href="https://epoch.ai/data-insights/llm-inference-price-trends">cheaper</a> each year, <a href="https://www.exponentialview.co/p/ai-is-ready-is-your-company">possibly even faster than that</a>. And there&#8217;s still a lot of <a href="https://www.exponentialview.co/p/can-openai-reach-100-billion-by-2027">room to grow</a> their user base and usage intensity &#8212; for example, ChatGPT has close to <a href="https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption">a billion users</a>, compared to around <a href="https://www.itu.int/itu-d/reports/statistics/2025/10/15/ff25-internet-use/">six billion</a> internet users. Combined, these could add many <a href="https://www.exponentialview.co/p/can-openai-reach-100-billion-by-2027">tens of billions of dollars in revenue</a>.</p><p>It won&#8217;t necessarily be easy for AI companies to do this, especially because individual labs will need to come face-to-face with AI&#8217;s &#8220;depreciating infrastructure&#8221; problem. In practice, the &#8220;state-of-the-art&#8221; is often <a href="https://www.exponentialview.co/p/openais-narrow-100-billion-path">challenged</a> within months of a model&#8217;s release, and it&#8217;s hard to make a profit from the latest GPT if Claude and Gemini keep drawing users away.</p><p>But this inter-lab competition doesn&#8217;t stop all AI models from being profitable. Profits are often high in oligopolies because consumers have limited alternatives to switch to. 
One lab could also pull ahead because it has some kind of algorithmic &#8220;secret sauce&#8221;, or simply more compute.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a> Or it could develop <a href="https://epoch.ai/gradient-updates/the-huge-potential-implications-of-long-context-inference">continual learning</a> techniques that make it <a href="https://epoch.ai/epoch-after-hours/luis-garicano-not-so-simple-macroeconomics-of-ai#:~:text=So%20I%20would%20think%20that%20that%20layer%20remains%20quite%20competitive%2C%20with%20one%20caveat%2C%20which%20is%20the%20introduction%20of%20switching%20costs%20through%20memory.%20If%20the%20system%20starts%20to%20remember%20you%20and%20starts%20to%20know%20who%20you%20are%2C%20then%20switching%20systems%20is%20going%20to%20be%20costly.">harder for consumers to switch</a> between model providers.</p><p>Competition can also be sidestepped. Companies could carve out their own niches, and we&#8217;ve already seen that to some degree: Anthropic is <a href="https://www.businessinsider.com/anthropic-ceo-dario-amodei-drags-openai-and-google-code-red-2025-12">pursuing</a> something akin to a &#8220;code is all you need&#8221; mission, Google DeepMind wants to &#8220;<a href="https://www.technologyreview.com/2016/03/31/161234/how-google-plans-to-solve-artificial-intelligence/">solve intelligence</a>&#8221; and use that to solve everything from cancer to climate change, and Meta strives to make <a href="https://www.meta.com/superintelligence/">AI friends too cheap to meter</a>. This lets individual companies keep earning revenue for longer.</p><p>So will AI models (and hence AI companies) become profitable? We think it&#8217;s very possible. While our analysis of the GPT-5 bundle is more conservative than Altman and Amodei hint at, what matters more is the trend: compute costs are falling, enterprise deals are stickier, and models can stay relevant longer than the GPT-5 cycle suggests.</p><div><hr></div><p><em>We&#8217;d like to thank JS Denain, Josh You, David Owen, Yafah Edelman, Ricardo Pimentel, Marija Gavrilov, Caroline Falkman Olsson, Lynette Bye, Jay Tate, Dwarkesh Patel, Juan Garc&#237;a, Charles Dillon, Brendan Halstead, Isabel Johnson, Markov Gray and Stefania Guerra for their feedback and support on this post. Special thanks to Azeem Azhar for initiating this collaboration and vital input, and Benjamin Todd for in-depth feedback and discussion.</em></p><p><em>Epoch AI has commercial relationships with multiple AI companies, including OpenAI and Anthropic. We disclose these relationships and associated conflicts of interest on <a href="https://epoch.ai/our-funding">our funding page</a>. 
The views and analysis presented here are independent and not reviewed or endorsed by these companies.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Our main sources of information include claims by OpenAI and their staff, and reporting by <a href="https://www.theinformation.com/">The Information</a>, <a href="https://www.cnbc.com/">CNBC</a> and the <a href="https://www.wsj.com/">Wall Street Journal</a>. We&#8217;ve linked our primary sources throughout the document.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Technically, gross margins should also account for staff costs that were essential to delivering the product, such as customer service. But these are likely a small fraction of salaries, which are in turn dominated by compute costs &#8212; so it won&#8217;t affect our analysis much, as we&#8217;ll see.</p><p>Similarly, OpenAI&#8217;s revenue sharing agreement with Microsoft should, in theory, be counted as cost of goods sold and included in the gross margin per usual GAAP accounting. But the revenue sharing reflects how the profits of the operation are divided between OpenAI and Microsoft rather than an actual cost, so we have excluded it from the analysis.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We focus on OpenAI models because we have the most financial data available on them.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Should we include Sora 2 in this bundle? You could argue that we shouldn&#8217;t, because it runs on its own platform and is heavily subsidized to kickstart a <a href="https://www.bloomberg.com/news/newsletters/2025-12-05/enthusiasm-for-openai-s-sora-fades-after-initial-creative-burst">new social network</a>, making its economics quite different. However, we find that it&#8217;s likely a rounding error for revenues, since people <a href="https://www.bloomberg.com/news/newsletters/2025-12-05/enthusiasm-for-openai-s-sora-fades-after-initial-creative-burst">don&#8217;t use it much</a>. In particular, the Sora app had close to <a href="https://sherwood.news/tech/openais-sora-2-started-off-scorching-hot-things-have-slowed-down-since/">9 million downloads</a> by December, compared to around <a href="https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption">900 million weekly active users</a> of ChatGPT.</p><p>Now, while it likely didn&#8217;t make much revenue, it might have been costly to serve &#8212; apparently making <a href="https://openai.com/index/sora-2/">TikTok-esque</a> <a href="https://www.nytimes.com/2025/10/19/opinion/ai-sora-slop.html">AI short-form videos</a> using Sora 2 cost OpenAI several hundred million dollars. Here&#8217;s a rough estimate: in November (when app downloads <a href="https://sherwood.news/tech/openais-sora-2-started-off-scorching-hot-things-have-slowed-down-since/">peaked</a>), Sora 2 had &#8220;<a href="https://sequoiacap.com/podcast/openai-sora-2-team-how-generative-video-will-unlock-creativity-and-world-models/#:~:text=I%20think%20we%20have%20almost%20seven%20million%20generations%20happening%20a%20day">almost seven million generations happening a day</a>&#8221;. Assuming generations were proportional to <a href="https://www.similarweb.com/app/google/com.openai.sora/#overview">weekly active users</a> over time, this would mean 330 million videos in total. The API cost is <a href="https://web.archive.org/web/20260117001407/https://openai.com/api/pricing/">$0.1/s</a>, so if the average video was 10s long, and assuming the API compute profit margin was 20%, this adds up to 330 million &#215; $0.1 &#215; 10 / 1.2 &#8776; $275 million. This is significant, but it&#8217;s minor compared to OpenAI&#8217;s overall inference compute spend.</p>
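<p>The same back-of-envelope as code, taking the 330 million video count and the 10s average length as the assumptions they are:</p><pre><code># Rough Sora 2 serving-cost estimate; all inputs are assumptions from above.
videos = 330e6            # estimated total generations
price_per_second = 0.10   # $ per second of video at the API list price
avg_seconds = 10          # assumed average clip length
api_margin = 0.20         # assumed profit margin baked into the API price

serving_cost = videos * price_per_second * avg_seconds / (1 + api_margin)
print(f"~${serving_cost / 1e6:.0f} million")  # ~$275 million
</code></pre>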
</div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Ideally we&#8217;d have only looked at a single model, but we only have data on costs and revenues at the company level, not at the release level, so we do the next best thing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>For the purposes of this post, we assume that GPT-5&#8217;s lifetime started when GPT-5 was released (<a href="https://openai.com/index/introducing-gpt-5/">Aug 7th</a>) and ended when GPT-5.2 was released (<a href="https://openai.com/index/introducing-gpt-5-2/">Dec 11th</a>). That might seem a bit odd &#8212; after all, isn&#8217;t GPT-5.2 based on GPT-5? We thought so too, but GPT-5.2 has a new knowledge cutoff, and is <a href="https://azure.microsoft.com/en-us/blog/introducing-gpt-5-2-in-microsoft-foundry-the-new-standard-for-enterprise-ai/#:~:text=The%20GPT%2D5.2%20series%20is%20built%20on%20new%20architecture">apparently</a> &#8220;built on a new architecture&#8221;, so it might have a different base model from the other models under the GPT-5 moniker.</p><p>Admittedly, we don&#8217;t know for sure that GPT-5.2 uses a different base model, but it&#8217;s a convenient way to bound the timeframe of our analysis. 
And it shouldn&#8217;t matter much for our estimates of profit margins, because we&#8217;re simply comparing revenues and costs over the same time period.</p><p>Also note that GPT-5 and GPT-5.1 are still available through ChatGPT and OpenAI&#8217;s API, so their useful life hasn&#8217;t strictly ended. We assume for simplicity that usage has been largely displaced by GPT-5.2.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In July, OpenAI had its first month with over <a href="https://www.cnbc.com/2025/08/20/openai-compute-ai.html">$1 billion</a> in revenue, and it closed the year with an annualized revenue of over <a href="https://openai.com/index/a-business-that-scales-with-the-value-of-intelligence/">$20 billion</a> ($1.7 billion per month). If this grew exponentially, the average revenue over the four months of GPT-5&#8217;s tenure would&#8217;ve been close to $1.5 billion, giving a total of $6 billion during the period.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>One category of loss we are ignoring is the <a href="https://www.theinformation.com/articles/openais-first-half-results-4-3-billion-sales-2-5-billion-cash-burn">remeasurement of convertible interest rights</a>, which purportedly amounted to half of OpenAI&#8217;s net loss in the first half of 2025. As with the revenue sharing deal, this only affects how profits would be divided between investors, not the product margins, so it is not relevant to the overall profitability of AI.</p><p>Another category of expense we didn&#8217;t account for is the cost of acquisitions. We consider these not especially relevant to the overall profitability of operating OpenAI&#8217;s bundle of products, though they could be relevant to its R&amp;D.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Last year, OpenAI earned about <a href="https://www.theinformation.com/articles/openai-boost-revenue-forecasts-predicts-112-billion-cash-burn-2030">$13.1 billion</a> in full year revenue, compared to $6.1 billion for the GPT-5 bundle. At the same time, it was reported they achieved a <a href="https://www.theinformation.com/articles/openai-boost-revenue-forecasts-predicts-112-billion-cash-burn-2030">33% gross margin</a> on inference compute, implying they spent around $8.8 billion running all models last year, so if we assume revenue and inference compute are proportional throughout the year, they spent 6.1 billion / 13.1 billion &#215; 8.8 billion &#8776; $4.1 billion.</p><p>This is then increased by an additional $200 million from other sources of IT spending, including e.g. servers and networking equipment. 
The total is then still around $4.1 billion + $0.2 billion = $4.3 billion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The Information <a href="https://www.theinformation.com/articles/openai-boost-revenue-forecasts-predicts-112-billion-cash-burn-2030">reported</a> that OpenAI spent around $4.5bn serving paid users in 2025, out of over $8bn spent on inference, implying they spent around $4bn serving free users. That is, they spent 40% to 50% of their compute serving free users.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p><a href="https://h1bgrader.com/h1b-sponsors/openai-opco-llc-60d97wl6k8/salaries/2025">H1B filings</a> suggest an average base salary of $310,000 in 2025, ranging from $150,000 to $685,000. This seems broadly consistent with data from <a href="https://www.levels.fyi/en-gb/companies/openai/salaries">levels.fyi</a>, which reports salaries ranging from $144,275 to $1,274,139 as we&#8217;re writing this. Overall, let&#8217;s go with an average of $310,000 plus around <a href="https://www.bbgbroker.com/cost-of-employee-benefits-2022/">40%</a> <a href="https://web.mit.edu/e-club/hadzima/how-much-does-an-employee-cost.html">in benefits</a>. We also know that OpenAI&#8217;s staff counts surged from <a href="https://calv.info/openai-reflections#footnote-fnref-1">3,000 in mid-2025</a> to <a href="https://www.wsj.com/tech/ai/openai-is-paying-employees-more-than-any-major-tech-startup-in-history-23472527?gaa_at=eafs&amp;gaa_n=AWEtsqe_DtK3xu-WaMFy0UhvovnVcFhPcrhxbGZ63SUUds2O7rDMoNsaWhaI2G4yq5k%3D&amp;gaa_ts=6974ee2a&amp;gaa_sig=w8SmjoETgCEEqEpi_mznXf95P9mcVjOP6AKTUrMiooJctsgLb1eCoGtHJUeAFisybVP-t4MEvAiDqvcLisinmQ%3D%3D">4,000 by the end of 2025</a>. We smoothly interpolate between these to get an average staff count of 3,500 employees during GPT-5&#8217;s lifetime.</p><p>Then the base salary comes to: 3,500 employees &#215; $310,000 base salary &#215; 1.4 benefits &#215; 40% share of employees working on serving GPT-5 &#215; 127 / 365 period serving &#8776; $0.2 billion (the 127 comes from the number of days in GPT-5&#8217;s lifetime).</p><p>We then need to account for stock compensation. In 2025, OpenAI awarded <a href="https://www.wsj.com/tech/ai/openai-is-paying-employees-more-than-any-major-tech-startup-in-history-23472527">$6 billion to employees</a> in stock compensation. Assuming it was awarded proportionally to staff count over the year, and given the exponential increase of staff counts, that would indicate that over 42% of the stock was awarded during GPT-5&#8217;s lifetime. Assuming 40% goes to operations as before, that results in $6 billion &#215; 42% &#215; 40% &#8776; $1 billion stock expense for operating the GPT-5 bundle. The total staff compensation would then be around $1.2 billion.</p>
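<p>As code, with each assumption spelled out:</p><pre><code># Staff-cost attribution for serving the GPT-5 bundle; inputs as above.
employees = 3500          # interpolated average headcount over the tenure
base_salary = 310_000     # average base salary from H1B filings
benefits_load = 1.4       # assumed ~40% benefits on top of base
serving_share = 0.40      # assumed share of staff working on serving
tenure_days = 127         # Aug 7 to Dec 11

cash = employees * base_salary * benefits_load * serving_share * tenure_days / 365
stock = 6.0e9 * 0.42 * serving_share  # $6bn stock comp, ~42% awarded in tenure
total = cash + stock

print(f"cash ~${cash / 1e9:.1f}bn, stock ~${stock / 1e9:.1f}bn, total ~${total / 1e9:.1f}bn")
</code></pre>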
</div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>It&#8217;s debatable whether the <a href="https://www.cnbc.com/2025/09/06/ai-talent-war-tech-giants-pay-talent-millions-of-dollars.html">very high compensation packages</a> for technical staff will continue as the industry matures.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>This is the line item we are most uncertain about. The Information reported that <a href="https://www.theinformation.com/articles/openais-first-half-results-4-3-billion-sales-2-5-billion-cash-burn">OpenAI had spent $2bn in sales and marketing during the first half of 2025</a>, &#8220;nearly doubling what it spent in all of 2024.&#8221; However, they previously reported only <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026">$300mn in S&amp;M spending during 2024</a>. Our explanation of the discrepancy is that the 2025 H1 figure includes the cost of inference for free users, OpenAI&#8217;s primary way of advertising their products, while the 2024 figure does not.</p><p>To estimate S&amp;M spending excluding compute, we assume it grew proportionally to revenue from its 2024 level. This results in an estimate of $300mn 2024 S&amp;M budget &#215; $6.1bn GPT-5 tenure revenue / <a href="https://www.theinformation.com/articles/openai-boost-revenue-forecasts-predicts-112-billion-cash-burn-2030">$4bn 2024 revenue</a> &#8776; $500mn.</p><p>It&#8217;s possible this could have grown much more given OpenAI&#8217;s growing focus on B2B, which could be especially S&amp;M-expensive and include hefty commissions for closed deals.</p>
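<p>Or, as a one-line scaling in code:</p><pre><code># S&amp;M estimate: scale the reported 2024 figure by revenue growth.
sm_2024 = 0.3         # $bn, reported 2024 S&amp;M spending
revenue_2024 = 4.0    # $bn, 2024 revenue
revenue_tenure = 6.1  # $bn, revenue during the GPT-5 bundle's tenure

sm_tenure = sm_2024 * revenue_tenure / revenue_2024
print(f"~${sm_tenure:.1f}bn")  # ~$0.5bn, assuming S&amp;M scales with revenue
</code></pre>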
</div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>This corresponds to around 10% of revenue during the period, which <a href="https://www.benchmarkit.ai/2025benchmarks">isn&#8217;t</a> <a href="https://www.saas-capital.com/blog-posts/spending-benchmarks-for-private-b2b-saas-companies/">unusual</a> compared to other large software companies. For example, <a href="https://www.adobe.com/cc-shared/assets/investor-relations/pdfs/adbe-2024-annual-report.pdf">Adobe</a>, <a href="https://investors.intuit.com/sec-filings/all-sec-filings/content/0000896878-25-000035/intu-20250731.htm">Intuit</a>, <a href="https://www.sec.gov/Archives/edgar/data/1108524/000110852425000006/crm-20250131.htm">Salesforce</a> and <a href="https://www.sec.gov/Archives/edgar/data/1373715/000137371525000010/now-20241231.htm">ServiceNow</a> all spent around 27% to 35% of their 2024-2025 revenue on S&amp;M. That said, there are certainly examples with lower spending: <a href="https://www.microsoft.com/investor/reports/ar25/index.html">Microsoft</a> and <a href="https://d18rn0p25nwr6d.cloudfront.net/CIK-0001341439/7455eba6-bb80-41d3-96b7-12111eae648c.pdf">Oracle</a> spend 9 to 15% of their revenue on marketing, though these are relatively mature firms, and younger firms may spend higher fractions on S&amp;M.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>In comparison, Anthropic projected in mid-December a gross margin of <a href="https://www.theinformation.com/articles/anthropic-lowers-profit-margin-projection-revenue-skyrockets">around 40%</a>, suggesting gross margins in the 30-40% range might be representative of the industry.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>How does this compare to previous years? <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026">The Information</a> reported that in 2024 OpenAI made $4 billion in revenue, and spent $2.4 billion in inference compute and hosting, $700 million in employee salaries, $600 million in G&amp;A, and $300 million in S&amp;M. This implies a gross margin of 40% and an operating margin of 0% (excluding stock compensation).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>In broad strokes, we perform a sensitivity analysis by considering a range of possible values for each cost component, then sampling from each to consider a range of plausible scenarios (a Monte Carlo analysis). The largest uncertainties that feed into this analysis are how much staff compensation goes to inference instead of R&amp;D, S&amp;M spending in the second half of 2025, and revenue during GPT-5&#8217;s tenure.</p>
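<p>Schematically, the procedure looks like this (the ranges below are illustrative stand-ins, not the distributions we actually used):</p><pre><code># Minimal Monte Carlo sketch of the sensitivity analysis; the uniform ranges
# are illustrative placeholders, not the post's actual priors.
import random

def one_scenario() -> float:
    revenue = random.uniform(5.5, 6.5)  # $bn, GPT-5 tenure revenue
    compute = random.uniform(4.0, 4.6)  # $bn, inference compute plus IT
    staff = random.uniform(0.8, 1.6)    # $bn, compensation share for serving
    sm = random.uniform(0.3, 1.0)       # $bn, uncertain H2-2025 S&amp;M
    return (revenue - compute - staff - sm) / revenue  # operating margin

samples = sorted(one_scenario() for _ in range(100_000))
lo, median, hi = samples[5_000], samples[50_000], samples[95_000]
print(f"operating margin, 90% interval: {lo:.0%} to {hi:.0%} (median {median:.0%})")
</code></pre>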
</div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Two more caveats to add: first, this 20% rate isn&#8217;t publicly confirmed by OpenAI or Microsoft, at least to our knowledge. Second, the revenue sharing agreement is also <a href="https://blogs.microsoft.com/blog/2025/10/28/the-next-chapter-of-the-microsoft-openai-partnership/">more</a> <a href="https://openai.com/index/next-chapter-of-microsoft-openai-partnership/">complex</a> than just this number. Microsoft put a lot of money and compute into OpenAI, and in return it gets a significant ownership stake, special rights to use OpenAI&#8217;s technology, and some of OpenAI&#8217;s revenue. There also isn&#8217;t a single well-defined &#8220;end date&#8221;: some rights are set to last into the early 2030s, while other parts (including revenue sharing) continue until an independent panel confirms OpenAI has reached &#8220;AGI&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>Strictly speaking, a revenue share agreement is often seen as an expense that would impact gross margins. But we&#8217;re more interested in the unit economics that generalize across models, rather than those that are unique to OpenAI&#8217;s financial situation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>The deal was signed in <a href="https://news.microsoft.com/source/2019/07/22/openai-forms-exclusive-computing-partnership-with-microsoft-to-build-new-azure-ai-supercomputing-technologies/">2019</a>, a year before GPT-3 was released, and at this time it may have been an effective way to access compute resources and get commercial distribution. This could&#8217;ve been important for OpenAI to develop GPT-5 in the first place.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>OpenAI&#8217;s main R&amp;D spending is on compute, salaries and data. In 2025, they spent <a href="https://www.theinformation.com/articles/openai-spend-100-billion-backup-servers-ai-breakthroughs">$9 billion</a> on R&amp;D AI compute, and about <a href="https://www.theinformation.com/articles/anthropic-openai-developing-ai-co-workers?rc=spkbjw">$1 billion</a> on data (which includes paying for human experts and <a href="https://epochai.substack.com/p/an-faq-on-reinforcement-learning">RL environments</a>). We can estimate salary payouts in the same way we did in the previous section on inference, except we consider 60% of staff compensation rather than 40%, resulting in an expense of $4.6 billion. Finally, we add about $400 million in offices and administrative expenses, and $600 million in other compute costs (including e.g. networking costs). This adds up to about $16 billion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>In fact, we could be substantially lowballing the R&amp;D costs. GPT-5 has been in the works for a long time &#8212; for example, early reasoning models like <a href="https://openai.com/o1/">o1</a> probably helped develop GPT-5&#8217;s reasoning abilities. GPT-5.1 was probably being developed between August and November, covering a good chunk of the GPT-5 bundle&#8217;s tenure. But there&#8217;s a countervailing consideration: some of the R&amp;D costs for GPT-5 probably help develop <em>future</em> models like &#8220;GPT-6&#8221;. 
So it&#8217;s hard to say what the exact numbers are, but we&#8217;re pretty confident that our overall point still stands.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>Because OpenAI&#8217;s expenses are growing exponentially, we can&#8217;t just estimate the share of R&amp;D spending in this period as one third of the annual total. Assuming a 2.3&#215; annual growth rate in R&amp;D expenses &#8212; comparable to the <a href="https://epoch.ai/data/ai-companies?dataView=compute&amp;yAxis=Annual+compute+spend+%28USD%29#explore-the-data">increase in OpenAI&#8217;s R&amp;D compute spending from 2024 to 2025</a> &#8212; the costs incurred between April 16 and August 7 would account for approximately 35% of total yearly R&amp;D expenses.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>Are our results sensitive to whether we try to account for the costs and revenue of GPT-5.2? In December 2025, OpenAI made around $1.7bn in revenue, corresponding to about $600mn in gross profit, while in February 2026, they made around $2.1bn in revenue, or $700mn in gross profit. So in the three months that GPT-5.2 has been the default model, OpenAI has made about $2bn in gross profit. But of course, GPT-5.2 likely took some non-trivial effort to develop: GPT-5.2 has a <a href="https://developers.openai.com/api/docs/models/gpt-5.2">different knowledge cutoff</a>, a <a href="https://www.lesswrong.com/posts/gJyGQrGrojNnCQZgD?commentId=xPZ6Jxgd9mxtmxyPv">distinct knowledge profile</a>, and was the alleged result of a <a href="https://www.theverge.com/report/838857/openai-gpt-5-2-release-date-code-red-google-response">&#8220;code red&#8221; effort by OpenAI to respond to competitors</a>. If in the four months between April and August OpenAI spent around $5.3bn on R&amp;D, in each subsequent month they likely spent at least another $1bn. So if we think GPT-5.2 took at least two additional months&#8217; worth of R&amp;D spending to develop, that would put the expense of developing GPT-5.2 at over $7bn, while the gross profit of the GPT-5+GPT-5.1+5.2 bundle would currently be at around $4bn. Still not enough to break even, though depending on how tight this is as a lower bound, it might get there with three more months of operations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>OpenAI was approaching <a href="https://www.theinformation.com/newsletters/ai-agenda/chatgpt-nears-900-million-weekly-active-users-gemini-catching">900 million weekly active users</a> in December last year. For ads, they project a revenue of <a href="https://www.theinformation.com/articles/openais-international-conundrum">$2 per free user for 2026, and up to $15 per free user for 2030</a>. 
Combining these numbers gives our estimate of around $2 billion to $15 billion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>For the investors who are willing to entertain more extreme scenarios, an even stronger effect is when &#8220;<a href="https://situational-awareness.ai/the-free-world-must-prevail/#:~:text=If%20there%20is%20a%20rapid%20intelligence%20explosion%2C%20it%E2%80%99s%20plausible%20a%20lead%20of%20mere%20months%20could%20be%20decisive">intelligence</a> <a href="https://ai-2027.com/#narrative-2027-08-31">explosion</a>&#8221; dynamics kick in &#8212; if OpenAI pulls ahead at the right time, they could use their better AIs to accelerate their own research, <a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments">amplifying</a> a small edge into a huge lead. This might sound like science fiction to a lot of readers, but representatives from some AI companies have publicly set these as goals. For instance, Sam Altman <a href="https://www.youtube.com/watch?v=ngDCxlZcecw&amp;t=2110s">claims</a> that one of OpenAI&#8217;s goals is to have a &#8220;true automated AI researcher&#8221; by March 2028.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How well did forecasters predict 2025 AI progress?]]></title><description><![CDATA[Mostly right about benchmarks, mixed results on real-world impacts]]></description><link>https://epochai.substack.com/p/how-well-did-forecasters-predict</link><guid isPermaLink="false">https://epochai.substack.com/p/how-well-did-forecasters-predict</guid><dc:creator><![CDATA[Anson Ho]]></dc:creator><pubDate>Fri, 16 Jan 2026 19:11:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ooY5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02e4ee45-0f87-4f52-a6ea-ca2b06dacee2_836x755.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post was written in collaboration between the AI Futures Project, the AI Digest, and Epoch AI. It analyzes the results of the <a href="https://ai2025.org/">2025 AI Digest survey</a>. You can take the 2026 AI forecasting survey <a href="https://forecast2026.ai/">here</a>.</em></p><p><em>This post is also part of Epoch AI&#8217;s <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/how-well-did-forecasters-predict-2025-ai-progress">Epoch AI</a>.</em></p><div><hr></div><p>Every other AI paper I read seems to start with some version of &#8220;AI progress has been fast&#8221;. And sure, that&#8217;s obviously true &#8212; a year ago there was no GPT-5, no DeepSeek-R1, and not even Claude 3.7 Sonnet! But few people seem to say <em>exactly</em> how fast things have been, and whether people saw it coming. 
So when the <a href="https://theaidigest.org/">AI Digest</a> released a <a href="https://ai2025.org/">survey</a> for people to forecast AI progress over the last year, I was excited.</p><p>The survey helps track something akin to an &#8220;<a href="https://ai-2027.com/">AI 2027</a> worldview&#8221;, where automating AI R&amp;D leads to a surge in AI capabilities and hence a range of risks to humanity, especially from handing power off to AI systems. You can see this in the question topics: about half of the survey is about forecasting performance on benchmarks related to AI R&amp;D. The other half looks at real-world impacts &#8212; think AI-enabled cyberattacks, humans losing control of AI, and more &#8220;mundane&#8221; things like revenue and public perception. If this worldview is even partially right (and I think it clears that bar), these questions capture some of the most consequential topics about the future of AI.</p><p>So how did forecasters do on these questions?</p><h2>Demographics: Junior, short-ish timelines, high risk of AI catastrophe</h2><p>To answer this question, we first need to know who these forecasters were. They were primarily recruited through the AI Digest&#8217;s outreach on X and its monthly newsletter. In total there were 421 of them, some with serious forecasting credentials. For example, there&#8217;s regular top forecaster <a href="https://x.com/peterwildeford/status/1873760001464189437">Peter Wildeford</a> (who came in 14th), as well as AI forecasting researcher <a href="https://x.com/ajeya_cotra/status/1867813307073409333">Ajeya Cotra</a> (who came in 3rd).</p><p>The first thing to know about the forecasters is that they were pretty junior. 73% claimed to have some professional or academic experience in AI, of whom 80% claimed to have five years of experience or less.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qzNB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafcd16-6388-455a-b068-1422c6456f60_1024x1280.png" width="1024" height="1280" loading="lazy"></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edafcd16-6388-455a-b068-1422c6456f60_1024x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qzNB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafcd16-6388-455a-b068-1422c6456f60_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!qzNB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafcd16-6388-455a-b068-1422c6456f60_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!qzNB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafcd16-6388-455a-b068-1422c6456f60_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!qzNB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafcd16-6388-455a-b068-1422c6456f60_1024x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The next thing to know is that they had very short timelines to &#8220;<a href="https://arxiv.org/pdf/2401.02843">High-Level Machine Intelligence (HLMI)</a>&#8221; &#8212;  that is, &#8220;unaided machines that can accomplish every task better and more cheaply than human workers&#8221;, seemingly including robotics. 
Half of them expected HLMI to be developed by 2030, and 90% of them expected it by 2040.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qzFP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0b001ba-5c82-4c4a-9bda-cf079835ec28_1024x1280.png" width="1024" height="1280" loading="lazy"></figure></div>
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Thirdly, respondents were pretty &#8220;doom-y&#8221;: three-quarters of them gave over a 10% chance that HLMI would lead to &#8220;extremely bad&#8221; outcomes like human extinction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H2W0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H2W0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!H2W0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!H2W0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!H2W0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H2W0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png" width="1024" height="1280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!H2W0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!H2W0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!H2W0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!H2W0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c301a4-5b89-48ee-907b-287c6555f283_1024x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So to oversimplify, we can think of these forecasters as &#8220;junior AI professionals who expect HLMI by 2040, and think it might lead to horrible consequences&#8221; &#8212; fairly common within the AI safety community. And it&#8217;s worth thinking about what this means: if they&#8217;re right about AI progress and its impacts over the next fifteen-ish years, it could make even the Industrial Revolution seem like a footnote in a history book.</p><p>The million-dollar question: were they right about AI progress over the <em>last</em> year?</p><h2>Benchmarks related to AI R&amp;D: The median forecast was on the money (for the most part)</h2><p>The first half of the survey asked participants to forecast performance on five separate benchmarks, each of which is related to AI R&amp;D. 
<p><em>[Chart: median forecasts versus final results for the five benchmarks. Note: You can see the reasoning for the resolutions in the original survey <a href="https://ai2025.org/">link</a>. Edit: The OSWorld performance was updated from 66.3% to 72.6% after this post was written.]</em></p><p>For the most part, the median forecast looks pretty good &#8212; forecasters were basically right on RE-Bench and FrontierMath, and fairly close on OSWorld and SWE-Bench Verified. The one exception is Cybench, where the median forecast underestimated progress by 20 percentage points.</p><p>So what went wrong with Cybench? Looking at the few dozen respondents who shared rationales for their forecasts, I&#8217;d guess two main reasons were at play. The first is that many didn&#8217;t extrapolate existing benchmark trends, instead gesturing at how hard they expected the benchmark to be (though it&#8217;s not clear to me how that intuition translates into a quantitative forecast). The second is that they overrated how reluctant labs would be to report improvements on Cybench &#8212; many respondents thought high scores on a cybersecurity benchmark would signal strong cyberattack abilities and invite extra scrutiny, giving labs an incentive not to report them.</p><p>Another interesting case, though the forecasters weren&#8217;t as far off, is SWE-Bench Verified, where unlike Cybench, forecasters somewhat <em>over</em>estimated progress &#8212; the median forecast was 88% versus the final 81%. So why did forecasters overshoot? Some of them argued that SWE-Bench&#8217;s coding tasks are especially amenable to reasoning models. Some argued that progress would be fast because many labs optimize their models specifically to do well on the benchmark &#8212; Anthropic often uses it as the <a href="https://www.anthropic.com/news/claude-opus-4-5">first benchmark</a> in new model announcements, and some people even have the <a href="https://job-boards.greenhouse.io/warp/jobs/5542429004">job</a> of hill-climbing it. Finally, some pointed out that the original state of the art was 72%, and appealed to how benchmarks typically saturate quickly once scores start climbing the &#8220;steep part of the s-curve&#8221;. This led to forecasts close to the benchmark&#8217;s saturation point of roughly 90%.</p>
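<p>As a rough illustration of that s-curve reasoning, here is a minimal sketch (my own, not from the survey; all numbers are invented) of saturation-aware extrapolation: fit a line to the logit of score-over-ceiling, then project it forward, so the extrapolation approaches, but never exceeds, the assumed saturation point:</p><pre><code>import numpy as np

# Toy saturation-aware extrapolation (illustrative numbers only).
ceiling = 0.90                         # assumed saturation point, e.g. ~10% flawed tasks
years = np.array([0.0, 0.5, 1.0])      # time of each past observation
scores = np.array([0.30, 0.55, 0.72])  # observed benchmark scores

logits = np.log(scores / (ceiling - scores))  # logit of score relative to the ceiling
slope, intercept = np.polyfit(years, logits, 1)

t = 2.0  # one year past the last observation
pred = ceiling / (1 + np.exp(-(slope * t + intercept)))
print(f"extrapolated score at t={t}: {pred:.2f}")  # approaches 0.90, never exceeds it
</code></pre>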
<p>Broadly speaking, I think all of these arguments are right, so it&#8217;s slightly puzzling to me why SWE-Bench Verified scores didn&#8217;t end up closer to 90%. I doubt flawed benchmark questions capped scores at 80%, because <a href="https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate">separate analyses</a> suggest the error rate is likely between 5% and 10%. Furthermore, at Epoch we&#8217;ve run many different models on the benchmark, and if we look at the fraction of problems that have ever been solved by <em>any</em> model, we get 87%. So maybe it&#8217;s just that models still fall short on some problems, or it might even just be random chance. In any case, I don&#8217;t think forecasters as a whole should update much from overshooting.</p>
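<p>For what it&#8217;s worth, that &#8220;ever solved by <em>any</em> model&#8221; figure is just a union across runs. A sketch of the computation, on an invented pass/fail matrix rather than our real results, looks like this:</p><pre><code>import numpy as np

# Hypothetical pass/fail matrix: rows are models, columns are benchmark problems.
# (Invented data; the real computation runs over many models and 500 problems.)
results = np.array([
    [1, 1, 0, 0, 1],  # model A
    [1, 0, 1, 0, 1],  # model B
    [0, 1, 1, 0, 1],  # model C
], dtype=bool)

best_single = results.mean(axis=1).max()  # best individual model's solve rate
ever_solved = results.any(axis=0).mean()  # fraction solved by at least one model
print(f"best single model: {best_single:.0%}, solved by any model: {ever_solved:.0%}")
</code></pre>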
srcset="https://substackcdn.com/image/fetch/$s_!cOjU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3ed1211-c067-4c02-b3a2-2b0da639861c_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!cOjU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3ed1211-c067-4c02-b3a2-2b0da639861c_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!cOjU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3ed1211-c067-4c02-b3a2-2b0da639861c_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!cOjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3ed1211-c067-4c02-b3a2-2b0da639861c_1024x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You might be thinking, &#8220;of course people with longer timelines expect slower benchmark improvements, isn&#8217;t that common sense?&#8221;. I actually don&#8217;t think so &#8212; you could have longer timelines not because you expect slower benchmark improvements, but because you think benchmarks are less meaningful a proxy of progress towards HLMI. 
For example, this is a crucial difference between the timelines of Eli Lifland (who <a href="https://ai-2027.com/">co-authored AI 2027</a>) and Mechanize co-founder Ege Erdil, who&#8217;s <a href="https://www.dwarkesh.com/p/ege-tamay">publicly argued</a> <a href="https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines">for multi-decade AI timelines</a>.</p><p><em>[Screenshot of a tweet (<a href="https://x.com/eli_lifland/status/1916524442844598749">source</a>). Note that at the time of writing the tweet, Eli had a median AGI timeline of around 2031, which still counts as having short timelines, even if he doesn&#8217;t fall into the &#8220;HLMI-by-2030&#8221; group of respondents.]</em></p><p>Indeed, in the case of this survey, those with 2030-or-earlier timelines performed fairly similarly to those with post-2030 timelines (those with longer timelines did a tad better, but the difference is not statistically significant).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><h2>OpenAI preparedness scores: mixed results on risks</h2><p>The second half of the survey increasingly turns towards broader societal impacts beyond AI R&amp;D, which, as we just alluded to, may or may not follow from substantial benchmark improvements. The first part of this was to look at potential AI-enabled risks. The results were as follows:</p>
<p><em>[Table: forecasts and resolutions for OpenAI preparedness risk levels. Note: You can see the reasoning for the resolutions in the original survey <a href="https://ai2025.org/">link</a>.]</em></p><p>The survey captured potential risks by looking at OpenAI&#8217;s 2023 <a href="https://cdn.openai.com/openai-preparedness-framework-beta.pdf">preparedness framework</a>, which describes how they &#8220;track, evaluate, forecast, and protect against catastrophic risks posed by increasingly powerful models.&#8221; This is split into four risk areas &#8212; besides Chemical, Biological, Radiological, and Nuclear (CBRN) risks, there&#8217;s also cybersecurity, persuading humans, and model autonomy (which allows models to do things like avoid being shut down by human operators, and self-improve). Each risk area then has four risk levels &#8212; low, medium, high, and critical &#8212; and the survey asked how likely each risk area was to increase its risk level.</p>
<p>From the table above, it looks like the median forecaster correctly surmised that risk levels would increase in cybersecurity and model autonomy, from &#8220;low&#8221; to &#8220;medium&#8221; risk. This perhaps isn&#8217;t so surprising, because these areas are heavily related to coding and computer use, which forecasters expected AIs to get much better at (as we saw in the last section).</p><p>But they underestimated CBRN risks, which went from &#8220;medium&#8221; to &#8220;high&#8221; &#8212; so what went wrong? One small part of this may be that, like with Cybench, respondents overrated OpenAI&#8217;s incentive to underreport increases in these risks. But this can&#8217;t be the full story &#8212; the same argument would apply to all the other risk areas as well. The bigger reason seems to be that they specifically underestimated how much LLMs could provide uplift on biological capabilities. Consider the conditions needed to go from a &#8220;medium&#8221; to a &#8220;high&#8221; on CBRN risks:</p><blockquote><p>&#8220;Model enables an expert to develop a novel threat vector OR model provides meaningfully improved assistance that enables anyone with basic training in a relevant field (e.g., introductory undergraduate biology course) to be able to create a CBRN threat.&#8221;</p></blockquote><p>Many forecasters argued it&#8217;d be hard for models to develop &#8220;truly novel threat vectors&#8221;, which is right as far as I&#8217;m aware. But they also didn&#8217;t expect models to be able to provide meaningful &#8220;uplift&#8221; to non-experts in creating CBRN threats, which turned out wrong &#8212; <a href="https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf#page=12.33">GPT-5.2</a> seems to have met this bar. Interestingly, this is in line with a <a href="https://www.governance.ai/research-paper/forecasting-llm-enabled-biorisk-and-the-efficacy-of-safeguards">study</a> from earlier this year, which found superforecasters and biology experts drastically underestimating uplift.</p><p>Putting these results together suggests a mixed record forecasting OpenAI&#8217;s risk levels (the final risk area, &#8220;persuasion&#8221;, was removed from OpenAI&#8217;s risk framework in <a href="https://openai.com/index/updating-our-preparedness-framework/">April</a>, so the question was resolved inconclusively).</p><h2>AI&#8217;s prominence: underestimated revenue, and overestimated public perception</h2><p>This brings us to the final part of the survey, which looks at the real-world impacts of AI, namely annualized revenue and public concern about AI.</p>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08cd6b16-d506-4f11-a94d-e1268f0e0f24_2288x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214829,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/184793687?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd6b16-d506-4f11-a94d-e1268f0e0f24_2288x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qN1Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd6b16-d506-4f11-a94d-e1268f0e0f24_2288x836.png 424w, https://substackcdn.com/image/fetch/$s_!qN1Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd6b16-d506-4f11-a94d-e1268f0e0f24_2288x836.png 848w, https://substackcdn.com/image/fetch/$s_!qN1Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd6b16-d506-4f11-a94d-e1268f0e0f24_2288x836.png 1272w, https://substackcdn.com/image/fetch/$s_!qN1Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd6b16-d506-4f11-a94d-e1268f0e0f24_2288x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Note: You can see the reasoning for the resolutions in the original survey <a href="https://ai2025.org/">link</a>.</em></figcaption></figure></div><p>Like with the preparedness framework questions, there was a mixed bag: forecasters substantially underestimated revenue (by two-fold), and overestimated public concern about AI.</p><h3>Frontier lab revenues</h3><p>For me, the most striking 
<p>But I doubt this really explains the difference. Eli Lifland <a href="https://x.com/eli_lifland/status/2009397564282687554">provides</a> one reason: the survey itself gave numbers for annualized revenue, so anyone reasoning &#8220;revenue will grow by N times&#8221; from those provided figures would arrive at the same forecast regardless of the terminology confusion.</p><p>Another reason comes from the forecasters&#8217; rationales on these questions. Out of the 33 provided rationales, nine explicitly referenced annualized revenue, whereas only two were clearly about full-year revenue. The remaining 22 were pretty vague, like claiming to &#8220;extrapolate previous trends&#8221; without saying what exactly the trend is. So I&#8217;m not totally sure, but I think this weakly suggests that most respondents correctly focused on annualized revenue.</p><p>One final possibility is confusion about the baseline annualized revenue figure provided in the survey, i.e. $4.7 billion. This number relied on OpenAI revenue figures from August rather than December 2024, so forecasters may have anchored on an already-outdated starting point. If they had had access to a more up-to-date baseline, their forecasts would&#8217;ve been less far off.</p><p>But I don&#8217;t think this is the full story either. Suppose the forecasters had an updated figure, which we can get with the benefit of hindsight: at the end of 2024, OpenAI&#8217;s annualized revenue was around <a href="https://www.reuters.com/business/media-telecom/openais-annualized-revenue-hits-10-billion-up-55-billion-december-2024-2025-06-09/">$5.5 billion</a>, and Anthropic&#8217;s was around <a href="https://www.theinformation.com/articles/anthropic-projects-soaring-growth-to-34-5-billion-in-2027-revenue?rc=spkbjw">$0.9 billion</a>, for a total of $6.4 billion in annualized revenue. So what would their forecasts have been? From reading the rationales, many respondents seemed to be extrapolating &#8220;2-3&#215; per year&#8221; growth rates, which on the high end would&#8217;ve led to a forecast of $19.2 billion. That&#8217;s a lot closer to the $30.4 billion resolution, but still falls short.</p><p>If I&#8217;m right, then the reason these forecasters were off is mostly that OpenAI and Anthropic revenues grew like crazy. And I really mean <em>crazy</em> &#8212; annualized revenue increased from around $6.4 billion to $30.4 billion, which is around 4.8&#215; in a year!</p>
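<p>The arithmetic behind this is worth making explicit. A quick check, using the figures above:</p><pre><code># End-of-2024 annualized revenue (OpenAI $5.5B + Anthropic $0.9B) and the
# end-of-2025 resolution, in $B, as cited in the post.
baseline_2024 = 5.5 + 0.9   # = 6.4
resolution_2025 = 30.4

# What the common "2-3x per year" extrapolations would have implied:
for growth in (2.0, 3.0):
    print(f"{growth:.0f}x growth implies ${baseline_2024 * growth:.1f}B")  # $12.8B, $19.2B

# The growth rate actually implied by the resolution:
print(f"actual implied growth: {resolution_2025 / baseline_2024:.2f}x")  # ~4.75, i.e. ~4.8x
</code></pre>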
<h3>Public attention on AI</h3><p>The last question of the survey was about public concern about AI, specifically: &#8220;What percentage of Americans will identify computers/technology advancement as the US&#8217;s most important problem by December 31st 2025?&#8221;</p><p>The median forecaster ended up overestimating somewhat. Depending on what time period you average over, the answer was around 0.625%, or around 0.4% or 0.45% looking just at the end of last year. In contrast, the median forecast was around 2%, so there was some degree of overestimation of public concern.</p><p>The overestimation was more egregious for the handful of people who gave substantially higher forecasts, say north of 5%. Some of these respondents thought that public attention would really spike because of job-market impacts and increasing concern about AI safety, which hasn&#8217;t really panned out.</p><p>Maybe it&#8217;s a mistake for me to read too much into this &#8212; I suspect people didn&#8217;t put much thought into the exact numbers, and instead just chose fairly round numbers like 1% or 2%, thinking that public attention would increase but remain small. But overall, I&#8217;d say this looks a bit like what economists frequently accuse me of &#8212; being a &#8220;technologist&#8221; who overestimates the importance of their technology.</p><h2>Takeaways from the survey</h2><p>If I had to summarize the results in a single sentence, I&#8217;d say that the forecasters were mostly right about benchmark scores, but had mixed results on the risks in OpenAI&#8217;s preparedness framework and on real-world impacts. Most notably, they underestimated revenue and CBRN risks, and overestimated public concern about AI. We can see this in the following image:</p>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02e4ee45-0f87-4f52-a6ea-ca2b06dacee2_836x755.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:836,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ooY5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02e4ee45-0f87-4f52-a6ea-ca2b06dacee2_836x755.png 424w, https://substackcdn.com/image/fetch/$s_!ooY5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02e4ee45-0f87-4f52-a6ea-ca2b06dacee2_836x755.png 848w, https://substackcdn.com/image/fetch/$s_!ooY5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02e4ee45-0f87-4f52-a6ea-ca2b06dacee2_836x755.png 1272w, https://substackcdn.com/image/fetch/$s_!ooY5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02e4ee45-0f87-4f52-a6ea-ca2b06dacee2_836x755.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>For the most updated version of this, see the &#8220;Summary&#8221; tab of the <a href="https://ai2025.org/">original forecasting survey</a>.</em></figcaption></figure></div><p>To me, these results really bring an important question back to center stage: &#8220;how much do benchmarks reflect broader societal impacts?&#8221;. Survey respondents did pretty well forecasting improvements on AI R&amp;D benchmarks, but we should always be clear about what these benchmarks really tell us. 
For example, while RE-Bench is designed to track AI R&amp;D, the tasks <a href="https://metr.org/AI_R_D_Evaluation_Report.pdf">tend</a> to be very &#8220;clean&#8221;, with nice feedback loops and few tasks occurring in parallel, which is quite unlike real-world AI research with millions of lines of code. It&#8217;s a useful benchmark, but doing well on it doesn&#8217;t imply, for example, that AI R&amp;D is automatable.</p><p>It&#8217;s also a good time for researchers and forecasters to figure out how we can improve our forecasts of real-world impacts. One relatively obvious answer is to gather more data about these impacts &#8212; the forecasters in this survey probably did worse on real-world impacts in part because they had less high-quality data to extrapolate. So if we want to improve our future forecasts, we should probably gather data on a range of real-world indicators, especially as AI&#8217;s impacts continue expanding beyond the Silicon Valley sphere.</p><p>So what does this all mean for AI progress and its societal impacts? Dwarkesh <a href="https://open.substack.com/pub/dwarkesh/p/thoughts-on-ai-progress-dec-2025?r=5fx8fq&amp;selection=dd356fbe-43e5-475c-83ad-d99732910ca5&amp;utm_campaign=post-share-selection&amp;utm_medium=web&amp;aspectRatio=instagram&amp;bgColor=%23f3c016&amp;textColor=%23ffffff">writes</a>, &#8220;Models keep getting more impressive at the rate the short timelines people predict, but more useful at the rate the long timelines people predict.&#8221; I think the results of this survey largely support this, though maybe the point about revenue is a counterpoint &#8212; the annualized revenue numbers seem to suggest they&#8217;re getting more useful faster than both the short and long timelines people predict!</p><p>Another way to look at it is this: the survey respondents were probably more bullish on rapid near-term AI progress than almost anyone else in the world, and yet they substantially underestimated AI revenue! So perhaps, despite all the hype, AI is an even bigger deal than we thought.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://epochai.substack.com/p/how-well-did-forecasters-predict?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://epochai.substack.com/p/how-well-did-forecasters-predict?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p><em>I&#8217;d like to thank Eli Lifland, Adam Binksmith, JS Denain, Lynette Bye, Josh You, Tao Lin, and Greg Burnham for their feedback and support on this post.</em></p><h2>Appendix</h2><p>In the main post I focused on the main high-level takeaways from the forecasting survey, for a general audience. 
<h3>Benchmarks relating to AI R&amp;D</h3><p>The survey focused on the following five benchmarks:</p><ol><li><p><strong>RE-Bench</strong>: Seven open-ended machine learning problems, like &#8220;fine-tune GPT-2&#8221;.</p></li><li><p><strong>SWE-Bench Verified:</strong> Resolving small real-world GitHub issues in Python.</p></li><li><p><strong>Cybench:</strong> Professional cybersecurity competition tasks, like exploiting vulnerabilities in web apps.</p></li><li><p><strong>OSWorld:</strong> Simple computer-use tasks, like duplicating PowerPoint slides.</p></li><li><p><strong>FrontierMath:</strong> Extremely challenging math problems with automatically verifiable answers.</p></li></ol><p>The first three of these seem pretty clearly related to AI R&amp;D, involving tasks with lots of coding. OSWorld and FrontierMath are more debatable, but they do capture skills that are probably quite important for it, namely computer use and deep technical reasoning. Performance on FrontierMath and OSWorld is also quite correlated with the first three benchmarks, perhaps because they capture the same <a href="https://epoch.ai/benchmarks/eci">latent</a> &#8220;<a href="https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks">capability</a>&#8221; factor across different models.</p><h4>RE-Bench</h4><p>Research Engineering Bench (<a href="https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/">RE-Bench</a>) is a benchmark created by METR to measure automation of AI R&amp;D. It consists of seven challenging open-ended machine learning problems, things like &#8220;optimizing a GPU kernel&#8221; and &#8220;fine-tuning GPT-2&#8221;. Each task has some starting code which achieves a certain baseline performance, and the goal is to improve performance as far as possible.</p><p>The trouble is, the most natural measures of performance look quite different across these problems. To compare them apples-to-apples, the METR researchers normalize each task&#8217;s metric so that 0 is the performance of the starting reference code and 1 corresponds to a baseline set by human experts, then average across the tasks.</p>
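<p>In code, that normalization amounts to something like the following minimal sketch (the function name and task numbers are mine, invented for illustration):</p><pre><code>def normalized_score(raw, reference, human):
    """Rescale a raw metric so 0 = reference-code performance, 1 = human-expert baseline."""
    return (raw - reference) / (human - reference)

# Invented per-task triples: (raw model score, reference-code score, human score).
tasks = [(0.80, 0.50, 0.90), (120, 100, 150), (0.95, 0.60, 0.85)]

# Normalize each task, then average across tasks.
overall = sum(normalized_score(*t) for t in tasks) / len(tasks)
print(f"overall RE-Bench-style score: {overall:.2f}")  # scores above 1 beat the human baseline
</code></pre>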
<p>At the end of 2024, the highest AI score was 0.61, held by Claude 3.5 Sonnet. The median forecast was 1.1, which is very close to the final resolution of 1.13. That said, this final resolution is based on scores from Gemini 3 Pro &#8212; it&#8217;s plausible that Claude Opus 4.5 or GPT-5.2 would&#8217;ve done better had they been evaluated, making the forecast an underestimate. Note that this evaluation of Gemini only considers 5 of the 7 RE-Bench tasks.</p>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Looking at the available rationales reveals a couple of interesting takeaways. One observation is that some estimates seem to exceed what is theoretically possible. We can estimate the theoretical maximum score based on the &#8220;more info&#8221; section of the <a href="https://ai2025.org/">original survey</a>. Each task has its own maximum, namely the following: 1.36, 2.53, 1.25, 1.81, 2.51, 1.8, and 1.48. If we then take the average of this we get 1.82, but some forecasted higher than this. This may be because they disagree with the maximum bound, but from reading the rationales, I suspect it&#8217;s because they weren&#8217;t considering this bound at all.</p><p>We can also look at the factors influencing how aggressive and conservative the forecasts were. Those with more aggressive forecasts (e.g. with a score of 1.2 or higher) tended to think that RE-Bench-esque tasks would be heavily optimized for, are amenable to reasoning model improvements, and have fast feedback loops. Those who forecasted below 0.9 seemed to not provide very substantive rationales (e.g. saying things like &#8220;this was a naive guess&#8221;). And those with forecasts in between these two ranges tended to anchor a bit more to previous rates of improvement &#8212; that said, it&#8217;s not always clear what data they were using to determine these rates, because different extrapolations seemed to yield different forecasts.</p><h4>SWE-Bench Verified</h4><p>The <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a> benchmark consists of 500 software engineering problems drawn from real GitHub issues. 
<p>We can also look at the factors influencing how aggressive or conservative the forecasts were. Those with more aggressive forecasts (e.g. a score of 1.2 or higher) tended to think that RE-Bench-like tasks would be heavily optimized for, are amenable to reasoning-model improvements, and have fast feedback loops. Those who forecasted below 0.9 tended not to provide very substantive rationales (saying things like &#8220;this was a naive guess&#8221;). And those with forecasts in between these two ranges tended to anchor a bit more on previous rates of improvement &#8212; that said, it&#8217;s not always clear what data they were using to determine these rates, because different extrapolations seemed to yield different forecasts.</p><h4>SWE-Bench Verified</h4><p>The <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a> benchmark consists of 500 software engineering problems drawn from real GitHub issues. These are Python-specific and verified by human annotators to remove errors.</p>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Since this was discussed in depth in the main post, I won&#8217;t elaborate much on this here. However I&#8217;ll add that some of the rationales for slower forecasts seemed to be reasoning from a lower baseline than o3&#8217;s 71% (instead they were a 55% baseline, from the Amazon Q Developer Agent). This is likely because the reference numbers provided in the survey were <a href="https://x.com/aidigest_/status/1871471620247720289">updated</a> several weeks after the original release. This would tend to lower the observed forecasts, but it seems this wasn&#8217;t enough to prevent the overall forecast from overshooting the observed final performance of around 80%.</p><h4>Cybench</h4><p><a href="https://cybench.github.io/">Cybench</a> is a benchmark with cybersecurity tasks. 
These were taken from professional-level &#8220;Capture the Flag&#8221; competitions, where competitors attempt to identify specific vulnerabilities in computer programs.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ubc5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbf19d6-f657-4637-8ecf-8e7406af5579_940x460.png" width="940" height="460" alt=""></figure></div>
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The scores are determined using the &#8220;unguided&#8221; evaluation setting. This is where the AI model is expected to perform the whole task end-to-end; without guidance on which subtasks to complete, which serve as intermediate checkpoints.</p><h4>OSWorld</h4><p>OSWorld is a notable benchmark consisting of 361 computer-use tasks. These tend to focus on simple realistic tasks in Linux-based environments, like adding page numbers to a document.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Deh-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Deh-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!Deh-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!Deh-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!Deh-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Deh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png" width="940" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Deh-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!Deh-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!Deh-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!Deh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e15a19-0cfc-4f28-b4dc-cc975c0b3188_940x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At the time of forecasting, the state-of-the-art performance was 24.5%, and the median forecast was 60%. This is fairly close to the actual performance of 66.3%, though it looks as though forecasters were quite uncertain about what performance would be achieved.</p><p>I think a lot of this boils down to different intuitions on how hard OSWorld tasks are. Those with aggressive forecasts tended to think that the benchmark tasks were like &#8220;engineering problems&#8221;, where bottlenecks to higher performance would probably be resolved just by spending more researcher time and effort. 
Those with conservative forecasts tended to think that OSWorld required multimodal skills, which language models tend to struggle with. I think the latter argument is overstated &#8212; much of OSWorld can be done <a href="https://epoch.ai/blog/what-does-osworld-tell-us-about-ais-ability-to-use-computers">without a graphical user interface</a>, though this likely wasn&#8217;t widely known when the forecasts were being made.</p><p>One final issue is that these performance numbers are harder to interpret than they might initially seem. Greg Burnham succinctly <a href="https://epochai.substack.com/p/osworld-ai-computer-use-capabilities">describes</a> one such issue: &#8220;Contrary to standard practice, OSWorld is updated continuously. A major update in July changed most tasks, and 10% of task instructions have been updated since then. Furthermore, 10% of tasks rely on live data from the web. This makes it hard to compare results over time.&#8221;</p><h4>FrontierMath</h4><p>FrontierMath is a benchmark with several hundred extremely challenging math problems. Each one has an automatically verifiable answer, such as a number, a matrix, or a SymPy object.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ZPjL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465a8439-9b06-4f01-9ec4-b6262cf344ef_940x460.png" width="940" height="460" alt=""></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!ZPjL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465a8439-9b06-4f01-9ec4-b6262cf344ef_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!ZPjL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465a8439-9b06-4f01-9ec4-b6262cf344ef_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!ZPjL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465a8439-9b06-4f01-9ec4-b6262cf344ef_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!ZPjL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465a8439-9b06-4f01-9ec4-b6262cf344ef_940x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here it seems like the median forecast was very good &#8212; 40%, compared to the true answer of 40.7%. More interesting to me is why this number wasn&#8217;t substantially higher.</p><p>For instance, in May I <a href="https://epoch.ai/gradient-updates/is-ai-already-superhuman-on-frontiermath">wrote</a> about how I expected FrontierMath performance to clearly exceed 50% by the end of the year. In fact, I think a forecast like 75% wasn&#8217;t unreasonable &#8212; the benchmark problems are roughly such that 25% are at &#8220;undergrad level&#8221;, 50% are at &#8220;graduate level&#8221;, and the remaining 25% is &#8220;research level&#8221;. So if we expect the models to be able to do most of the &#8220;graduate level&#8221; problems, we might expect a score close to 75%. And there are other reasons too, like running simulations of benchmark progress given the limited data available at the time of forecasting. The only problem is that this turned out to be wrong.</p><p>In retrospect, I think I may have made two mistakes. The first is that these difficulty tiers don&#8217;t very cleanly delineate what problems AI systems are or aren&#8217;t able to successfully solve. 
<p>In retrospect, I think I may have made two mistakes. The first is that these difficulty tiers don&#8217;t cleanly delineate which problems AI systems can and can&#8217;t solve. This is especially because the benchmark problems can be &#8220;<a href="https://epoch.ai/gradient-updates/beyond-benchmark-scores-analysing-o3-mini-math-reasoning">cheesed</a>&#8221;, where the answer is guessed using an informal argument.</p><p>My second mistake was to update too much on o3&#8217;s release, which suddenly brought state-of-the-art performance from 2% to around 25% within a month of FrontierMath&#8217;s release. In fact, (slight) overupdating seems to have been a common issue: the median prior to o3&#8217;s release was around 30% below the resolution value, and the median after o3 was around 10% too high.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Cx21!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5186d0f3-bfb4-40a1-a5cf-b68c5cac163d_752x561.png" width="752" height="561" alt="">
<figcaption class="image-caption">Source: <em><a href="https://theaidigest.org/ai2025-analysis-may#5-what-will-be-the-best-performance-on-frontiermath-by-december-31st-2025">AI 2025 Forecasts - May Update</a></em></figcaption></figure></div><p>One more small thing that makes this tricky is that the ~25% figure provided at the start of forecasting is based on OpenAI&#8217;s internal evaluations, whereas the final resolution was based on Epoch AI&#8217;s evaluations.
The latter is often somewhat lower &#8212; for example, in the case of o3, the highest score from Epoch&#8217;s numbers is closer to 18.7%.</p><h2>OpenAI&#8217;s preparedness scores, frontier lab revenue, public attention on AI</h2><p>Since these were discussed quite thoroughly in the main article, I won&#8217;t add much more detail beyond sharing the main graphs.</p><h3>OpenAI preparedness scores</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xZ45!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd62a3c4-b975-4b70-a306-852ad79a2237_940x460.png" width="940" height="460" alt=""></figure></div>
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Vo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Vo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!1Vo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!1Vo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!1Vo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Vo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png" width="940" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Vo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!1Vo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!1Vo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!1Vo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f4945c-7fa2-4b49-95d8-2201b08f514d_940x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CEoN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CEoN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!CEoN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!CEoN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!CEoN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CEoN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png" width="940" height="460" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/afd86659-ef0f-4609-8015-69173f1038f5_940x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CEoN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!CEoN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!CEoN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!CEoN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd86659-ef0f-4609-8015-69173f1038f5_940x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 
13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zTpy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zTpy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!zTpy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!zTpy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!zTpy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zTpy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png" width="940" height="460" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/211fea19-7966-4f41-a474-a0a501257754_940x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zTpy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!zTpy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!zTpy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F211fea19-7966-4f41-a474-a0a501257754_940x460.png 1272w, 
<h3>Frontier lab revenue</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!t-l_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7241a1cc-01a5-4164-8189-003e320ca47d_940x460.png" width="940" height="460" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7241a1cc-01a5-4164-8189-003e320ca47d_940x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t-l_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7241a1cc-01a5-4164-8189-003e320ca47d_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!t-l_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7241a1cc-01a5-4164-8189-003e320ca47d_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!t-l_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7241a1cc-01a5-4164-8189-003e320ca47d_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!t-l_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7241a1cc-01a5-4164-8189-003e320ca47d_940x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Public attention on AI</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NnIA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!NnIA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!NnIA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!NnIA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!NnIA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NnIA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png" width="940" height="460" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/868e10f6-adbf-4d02-b836-154926269ca7_940x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NnIA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 424w, https://substackcdn.com/image/fetch/$s_!NnIA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 848w, https://substackcdn.com/image/fetch/$s_!NnIA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 1272w, https://substackcdn.com/image/fetch/$s_!NnIA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F868e10f6-adbf-4d02-b836-154926269ca7_940x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 
13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Data on confidence intervals</h2><p>In filling out the survey, participants were optionally allowed to add 80% credible intervals for their forecasts. We present some of the data on this in the table below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h2-l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h2-l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png 424w, https://substackcdn.com/image/fetch/$s_!h2-l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png 848w, https://substackcdn.com/image/fetch/$s_!h2-l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!h2-l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h2-l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png" width="1456" height="890" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:334028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/184793687?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h2-l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e0c65c-ec33-4663-8286-9bcf514a6870_2288x1398.png 424w, 
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Users with 2030-or-earlier timelines had a mean score of -0.014 on the AI R&amp;D-related benchmarks, compared to +0.059 for those with post-2030 timelines. The medians for the same groups were +0.023 and +0.088 respectively. So the post-2030 group did slightly better, but this difference isn&#8217;t statistically significant based on a t-test (p = 0.39).
You can also just see that these differences are small by looking at the <a href="https://ai2025.org/">leaderboard</a>, where scores go from around -6 to 2.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[An FAQ on Reinforcement Learning Environments]]></title><description><![CDATA[We interviewed 18 people across RL environment startups, neolabs, and frontier labs about the state of the field and where it's headed.]]></description><link>https://epochai.substack.com/p/an-faq-on-reinforcement-learning</link><guid isPermaLink="false">https://epochai.substack.com/p/an-faq-on-reinforcement-learning</guid><dc:creator><![CDATA[JSD]]></dc:creator><pubDate>Mon, 12 Jan 2026 20:27:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bv6y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21b4870f-4a72-47e3-be91-36ee75329c81_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is a collaboration between guest author <a href="https://x.com/chrisbarber">Chris Barber</a> and <a href="https://jsdena.in/">JS Denain</a> from Epoch AI. This post is part of our <a href="https://epoch.ai/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes about big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/state-of-rl-envs">Epoch AI</a>.</em></p><div><hr></div><p>Reinforcement learning (RL) environments have become central to how frontier AI labs train their models. In September 2025, The Information <a href="https://www.theinformation.com/articles/anthropic-openai-developing-ai-co-workers?rc=dp0mql">reported</a> that Anthropic had discussed spending over $1 billion on RL environments over the following year. As Andrej Karpathy put it in his <a href="https://karpathy.bearblog.dev/year-in-review-2025/">2025 year-in-review</a>: by training LLMs on a wide range of verifiable tasks across different environments, &#8220;the LLMs spontaneously develop strategies that look like &#8216;reasoning&#8217; to humans.&#8221;</p><p>This wave of RL for capabilities started with OpenAI&#8217;s o1, which was trained on math and coding problems with verifiable answers. Since then, labs have expanded the range of tasks they train on, all the while scaling the amount of compute spent on RL training.</p><p>Without diverse, high-quality environments and tasks to train on, throwing more compute at RL <a href="https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/">risks wasting much of it</a>. As a result, creating those tasks and environments has become a key bottleneck for scaling capabilities, and a growing market that remains largely behind closed doors.</p><p>To understand the emerging industry of building environments and tasks that labs use to RL-train their models, we interviewed<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> 18 people across RL environment startups, <a href="https://www.theinformation.com/articles/investors-chase-neolabs-outflank-openai-anthropic">neolabs</a>, and frontier labs. 
We asked them what RL environments and tasks look like, how labs use them, what makes a good one, and where the field is headed.</p><p><strong>Main takeaways:</strong></p><ul><li><p><strong>Enterprise workflows are a major growth area</strong>. Math and coding tasks came first, but we&#8217;re now seeing significant growth in enterprise workflows: tasks like navigating Salesforce, filing reports, or manipulating spreadsheets.</p></li><li><p><strong>Reward hacking is a top concern</strong>. Interviewees consistently cited robustness against reward hacking as a key quality criterion. Models find ways to game graders, and preventing this requires extensive iteration on both environments and tasks.</p></li><li><p><strong>Scaling without sacrificing quality is hard</strong>. A major challenge is scaling the quantity of environments and tasks without sacrificing quality. The hard parts are management (coordinating a growing number of task builders) and maintaining good quality assessment processes.</p></li></ul><h2><strong>What are RL environments and tasks?</strong></h2><p>In modern reinforcement learning for language models, the model is given a task to accomplish and a set of actions it can take. The model attempts the task, and a grader (typically automated, such as a unit test or an LLM judging against a rubric) assigns a score to its attempts. These scored attempts are then used to update the model&#8217;s weights, reinforcing successful strategies.</p><p>The RL environment is defined by the set of actions the model can take (running code, thinking out loud, clicking buttons, searching documents) and the surrounding context that determines the effect of these actions (environment variables, file systems, the state of a simulated application). In practice, the environment is often delivered as a Docker container, but not always<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>Each task consists of a prompt instructing the model to achieve an objective, and a grader that determines whether (or to what extent) the objective was met. Terminology in this space isn&#8217;t fully standardized, and the boundary between &#8220;environment&#8221; and &#8220;task&#8221; is somewhat fuzzy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
In this piece we discuss both environments and tasks, since they&#8217;re often built and sold together.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/21b4870f-4a72-47e3-be91-36ee75329c81_1280x720.png" width="1280" height="720" alt=""></figure></div>
<p>Here are some examples of environments and the kinds of tasks they could support:</p><ul><li><p>A git repository, with tasks like fixing a bug so that unit tests pass, similar to benchmarks like <a href="https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate">SWE-bench Verified</a>. The task specifies a git repository at a specific commit with a failing test suite; the environment provides the operating system and tools to interact with the repo; the grader runs the tests and checks that they pass.</p></li><li><p>An Airbnb clone, with tasks like finding the cheapest two-bedroom listing in a given city for specific dates. The environment is a simulated website with realistic listings, prices, and filters; the agent sees a structured representation of the page (like a DOM or accessibility tree) and outputs actions like clicking elements or typing into fields. The grader verifies the final answer.</p></li><li><p>A Bloomberg terminal clone, with tasks like finding the 5-year compound annual growth rate for a list of companies. The environment simulates the terminal&#8217;s interface and data; the grader checks whether the returned figures match the correct values.</p></li><li><p>An Excel clone, with tasks like creating a pivot table showing revenue by region from a raw dataset. The environment provides a spreadsheet application with realistic functionality; the grader compares the output against a reference solution.</p></li></ul><p>For computer use environments like the Excel clone, a single environment might support hundreds of different tasks. For coding environments, it&#8217;s more common (but not always the case) for each environment to contain just one task, since setting up a repo state is relatively cheap<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p>
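<p>As a concrete illustration of the first example above, here is a minimal sketch of a repository-fixing task plus its grader. The schema and every name in it (the <code>Task</code> fields, the repo, the test paths) are hypothetical, for illustration only, not any lab&#8217;s or vendor&#8217;s actual format:</p><pre><code class="language-python"># Illustrative sketch of a SWE-bench-style task and grader.
# The schema, repo, and test names here are hypothetical, not a vendor API.
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str            # instruction given to the model
    repo_url: str          # which repository the environment provides
    commit: str            # starting state: a specific commit
    fail_to_pass: list     # tests that must go from failing to passing
    pass_to_pass: list     # tests that must keep passing

def grade(task: Task, workdir: str) -> float:
    """Run the required tests in the agent's working copy; binary reward."""
    required = task.fail_to_pass + task.pass_to_pass
    result = subprocess.run(["pytest", "-q", *required],
                            cwd=workdir, capture_output=True, text=True)
    return 1.0 if result.returncode == 0 else 0.0

task = Task(
    prompt="Fix the off-by-one bug in pagination so all tests pass.",
    repo_url="https://github.com/example/webapp",  # hypothetical repo
    commit="3f2a1c9",
    fail_to_pass=["tests/test_pagination.py::test_last_page"],
    pass_to_pass=["tests/test_pagination.py::test_first_page"],
)
</code></pre><p>In practice the repository checkout, tooling, and test runner above would all live inside the delivered container, and the grader would run outside the agent&#8217;s reach.</p>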
<h2><strong>How are RL environments used by labs?</strong></h2><p>Each environment and task can be used in three main ways: for reinforcement learning, for benchmarking, or for supervised fine-tuning on trajectories that solve the task.</p><p>Reinforcement learning remains the primary use case. As one RL environment startup employee put it: &#8220;RL is the main use. We have some requests for creating envs for benchmarking. I&#8217;d say perhaps 10-20x more the former vs the latter.&#8221; One difference is that benchmarks are typically built for single-turn evaluation, whereas there&#8217;s growing interest in RL environments that capture multi-turn interactions between agent and user.</p><p>Environments are also used to generate data for supervised learning, by using successful RL trajectories as training examples during midtraining. One interviewee noted: &#8220;While it might not drive purchasing today, a well-designed environment can be used as an effective mechanism for synthetic data generation. I feel this will be increasingly important as env development matures and designers target this use-case.&#8221;</p><p>An interviewee noted that supervised fine-tuning (SFT) had been growing, especially for interleaved thinking and tool calling. With SFT, you can choose a single good trajectory and train on that, whereas RL requires multiple trajectories with enough differentiation between them to provide a learning signal. This makes SFT more practical when it&#8217;s relatively easy to produce good trajectories but hard to get a reliable grader or enough variation between attempts.</p><p>Usage patterns also vary by lab. One RL environment startup founder noted that trailing labs tend to be more interested in SFT than RL: &#8220;It&#8217;s quite hard to get your really large scale RL pipelines and it&#8217;s easy to throw stuff into your SFT mix.&#8221;</p><p>RL environment companies don&#8217;t always have full visibility into how their products are ultimately used, and usage likely changes over time as labs&#8217; training pipelines evolve. As one founder put it when asked how customers use the environments they build: &#8220;To what extent are they using it to generate gold trajectories<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> versus doing RL? I think that would be pretty hard to answer.&#8221;</p><h2><strong>Which companies build RL environments?</strong></h2><p>A growing number of <strong>specialized startups</strong> focus specifically on building RL environments. They cover a range of domains, from software engineering tasks to computer use to math and finance. Chris compiled a list of startups in this space <a href="https://pavlovslist.com/">here</a>.</p><p><strong>Traditional data providers</strong> like Mercor, Surge, Handshake, and Turing, who used to primarily provide human-labeled data, now also sell RL environments. Part of what you pay for is their QA processes and supervision infrastructure, but as one founder put it, the main value add is &#8220;they have the guys&#8221;: if you need to scale up task creation quickly, they can staff a project faster than you could hire in-house.</p><p><strong>In-house teams</strong> at model developers are also building environments (<em>cf.</em> job postings by <a href="https://x.ai/careers/open-roles?dept=4025087007">xAI</a> or <a href="https://www.anthropic.com/careers/jobs/5061517008">Anthropic</a>). This includes both frontier labs and neolabs like Cursor, who can leverage user data to build training tasks. One founder noted there&#8217;s been &#8220;substantially more in-housing&#8221; recently, as labs build out their own data teams. 
Reasons to do things in-house include avoiding margins paid to vendors, keeping training priorities confidential, and having direct expertise in some domains like AI research.</p><p>Finally, <strong>product companies</strong> are natural partners for building environments around their own software. If you&#8217;re Salesforce or Slack, you understand your product&#8217;s interface and edge cases better than anyone. We&#8217;re seeing a wave of partnerships between labs and product companies: Benchling and Anthropic for biology workflows, OpenAI with Shopify and Stripe for shopping, OpenAI&#8217;s recent health integrations. While we don&#8217;t know the exact details of how these partnerships work, it seems likely that the product companies are often at least collaborating on environment or task creation. In some cases, though, companies are hostile to agent traffic: Amazon has <a href="https://www.reuters.com/business/retail-consumer/perplexity-receives-legal-threat-amazon-over-agentic-ai-shopping-tool-2025-11-04/">sued</a> Perplexity over its agentic shopping tool and blocked most AI agents from accessing its site.</p><h2><strong>How much do environments and tasks cost?</strong></h2><p>Costs are highly variable, depending on the domain, complexity, and number of tasks.</p><p>Contract sizes are often six to seven figures per quarter. For example, one RL environment founder noted that contracts are often seven figures per quarter or more. A neolab researcher mentioned seeing contracts in the $300k-$500k range, adding that it varies a lot depending on the number of tasks.</p><p>Environment costs depend on fidelity.<a href="https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science"> SemiAnalysis reported</a> that website replicas (&#8220;UI gyms&#8221;) cost around $20k each. But higher-quality replicas of complex products like Slack can cost significantly more: one interviewee mentioned figures around $300k for these.</p><p>Task costs also vary, but multiple interviewees agreed on a ballpark of $200-$2,000 per task. As one RL environment founder put it: &#8220;I&#8217;ve seen $200 to $2000 mostly. $20k per task would be rare but possible.&#8221; A neolab researcher confirmed this range aligns with their experience. The $20k figure comes up for especially complex software engineering tasks, but it&#8217;s rare. To put task costs in perspective:<a href="https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute"> Mechanize estimates</a> that around $2,400 in compute is spent per task during RL training. This suggests that cheaper, lower-quality tasks might end up wasting that compute.</p><p>Exclusivity affects pricing significantly. Environments and tasks can be sold exclusively to one customer or non-exclusively to multiple labs. Two RL environment founders independently agreed that exclusive deals are roughly 4-5x more expensive than non-exclusive ones.</p><p>Overall spending in this space is growing rapidly. As mentioned above, The Information <a href="https://www.theinformation.com/articles/anthropic-openai-developing-ai-co-workers?rc=dp0mql">reported</a> that Anthropic had discussed spending over $1 billion on RL environments over the following year. However, this is still a fraction of compute costs. OpenAI&#8217;s R&amp;D compute spend in 2026 is <a href="https://www.theinformation.com/articles/openai-spend-100-billion-backup-servers-ai-breakthroughs?rc=1hiqbr">projected</a> to be around $19 billion, roughly double the 2025 figure. 
Even accounting for the fact that Anthropic is smaller than OpenAI, compute spending will still dwarf RL environment spending.</p><h2><strong>What domains do RL environments cover?</strong></h2><p>Originally, the main domains were mathematics and coding<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. As one neolab researcher put it: &#8220;Coding and in some sense math kickstarted all the RL environments explorations. So code and math are most plentiful.&#8221;</p><p>Mathematics tasks are relatively easy to produce, since you don&#8217;t need to build a complex environment, just tasks with verifiable answers. However, emphasis on math may be declining more recently. One interviewee noted that &#8220;math might be shrinking,&#8221; and an RL environment founder observed that math tasks are easy to create but don&#8217;t transfer as well to other capabilities.</p><p>Coding remains a major source of demand. It&#8217;s a huge market, and there&#8217;s already been demonstrated uplift from training on code tasks. We&#8217;re also seeing coding environments go beyond SWE-bench-style tasks. One RL environment founder noted: &#8220;I&#8217;ve seen code environments go from more simple PASS_TO_PASS and FAIL_TO_PASS<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> type SWE-bench Verified tasks more towards being productionized. So how does a SWE actually work in an environment? They have GitHub, they have Linear, they have a code IDE.&#8221; A key challenge is that many tasks don&#8217;t have clean verification criteria beyond &#8220;passes unit tests.&#8221; As one neolab researcher noted: &#8220;There&#8217;s probably many solutions that pass the unit tests. But there&#8217;s some nicer solutions that have better trade-offs in terms of your engineering design decisions.&#8221; This poses a challenge on the grading front.</p><p>One of the major growth areas is enterprise workflows: tasks like filing an expense report, creating a pivot table in a spreadsheet, generating slides from a brief, or navigating a CRM to update customer records. One RL environment founder told us: &#8220;I think enterprise workflows are going to explode this year. I think labs index very heavily on what&#8217;s valuable and what is quantifiable, and enterprise workflows are perfect for that. By enterprise workflows I mean things that are sometimes computer use (e.g. a specific ERP<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> that has no backend) or sometimes things that involve APIs that can be exposed to an agent and used without computer use.&#8221;</p><p>Environments for enterprise workflows can take various forms: MCP-style tool integrations, Playwright-style browser interactions, or screenshot-based computer use. Many rely on clones of websites and apps like Slack or SAP. One lab researcher cautioned: &#8220;There&#8217;s a lot of good reasons to use clones of websites, but what everyone does is vibe code a buggy website which isn&#8217;t useful. There&#8217;s a large amount of useless bad environments out there for that reason.&#8221;</p><p><a href="https://www.primeintellect.ai/blog/scaling-environments-program">Prime Intellect&#8217;s environment bounties</a> give a sense of some of the domains covered by RL environments and how they&#8217;re built. 
They include environments based on prominent benchmarks, environments for interacting with MCP servers, AI research tasks like<a href="https://docs.google.com/spreadsheets/d/13UDfRDjgIZXsMI2s9-Lmn8KSMMsgk2_zsfju6cx_pNU/edit?gid=650541192#gid=650541192&amp;range=A110"> nanoGPT Speedrunning</a>, and tasks based on questions sourced from textbooks.</p><p>Overall, the main directions in the near future are coding and enterprise workflows. Across both, there&#8217;s growing interest in longer-horizon tasks. As one RL environment founder put it: &#8220;Long horizon is I think the future direction. Having agents do full end-to-end tasks that involve navigating through multiple tabs, browsers, and then submitting something that involves multi-hop steps.&#8221;</p><p>A few other directions came up in our interviews: training for multi-turn user interactions rather than just single-turn performance, environments that optimize for multiple goals at once, and environments with tooling allowing researchers to inspect and change trajectories to provide feedback.</p><h2><strong>What are the top priorities and challenges?</strong></h2><p>Interviewees consistently cited robustness against reward hacking as the most important quality criterion. As one neolab researcher put it: &#8220;Reward hacking is a big issue. The model might cheat by searching up a solution, or if you&#8217;re not careful with how you script the repo, by checking out future commits<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. It needs to be robust. That&#8217;s the minimum.&#8221; Another framed it similarly: &#8220;Soundness matters most: high reward must mean the task was actually solved, not hacked.&#8221; Creating robust graders rarely works on the first pass: as one RL environment founder noted, &#8220;It takes many many iterations to check against reward hacking.&#8221; Notably, this remains a top priority even though models have become<a href="https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf#page=103"> better at not reward hacking over the past year</a>.</p><p>Difficulty calibration also matters. Tasks need to be challenging but not impossible, since if the pass rate is 0% or 100%, the model won&#8217;t learn<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. Moreover, as one RL environment founder noted: &#8220;You don&#8217;t want it to be zero percent because then there&#8217;s a possibility that the task is infeasible, unless you&#8217;ve had another annotator do a blind solve and they succeed.&#8221; Multiple interviewees mentioned wanting a minimum pass rate of around 2-3%, or at least one success out of 64 or 128 rollouts.</p><p>Beyond individual task difficulty, the overall distribution matters. As one neolab researcher put it: &#8220;A very important feature of the RL environment is a smooth gradient: diversity in the difficulty of the tasks.&#8221; One person noted you might want a mix: some tasks at 0%, some at 5%, and some at 30%. After some training steps, the 0% tasks become learnable. Once a task reaches around 70% pass rate, you might discard it and move on to harder tasks.</p>
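<p>A minimal sketch of how these difficulty criteria could be operationalized, using the rough thresholds quoted above (at least one success out of 64 rollouts, retire around 70%); the function and task names are our own illustrative assumptions:</p><pre><code class="language-python">def keep_task(successes: int, rollouts: int = 64,
              min_rate: float = 1 / 64, max_rate: float = 0.70) -> bool:
    """True if the task currently sits in the learnable band."""
    rate = successes / rollouts
    # A 0% pass rate may mean the task is infeasible (unless a blind
    # human solve succeeded); past ~70% it is nearly mastered and can
    # be retired in favor of harder tasks.
    return rate >= min_rate and max_rate >= rate

# Hypothetical tasks with their success counts over 64 rollouts:
batch = {"fix-pagination": 3, "mystery-bug": 0, "hello-world": 60}
kept = [name for name, wins in batch.items() if keep_task(wins)]
print(kept)   # ['fix-pagination']
</code></pre>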
<p>Separately, tasks should also be compositional. As one neolab researcher noted: &#8220;The skills should not be disparate tasks, they should be similar to each other and leverage common skills.&#8221;</p><p>Scaling while maintaining quality is the core operational bottleneck. As<a href="https://kevinlu.ai/the-only-important-technology-is-the-internet"> Kevin Lu has argued</a>, scaling RL environments is one of the key challenges for continued AI progress. But scaling task creation while maintaining quality is very hard. One RL environment founder noted: &#8220;Maintaining quality while scaling is the number one bottleneck that people see. Finding the experts isn&#8217;t that hard, but managing them and doing quality control is hard.&#8221; A neolab researcher emphasized the management challenge: &#8220;It&#8217;s not easy to find people to oversee this data construction, the RL environment construction process. The contractors, you need to motivate them. Sure, you&#8217;re paying them money. But how do you make sure they&#8217;re not just using LLMs? How do you make sure they&#8217;re actually verified? Motivating the contractors and doing the quality control is the grunt work.&#8221; One RL environment founder noted that their constraint on making more revenue is simply the difficulty of scaling up task creation at the required quality level.</p><p>The skills required are a mix of domain expertise, prompting ability, and product sense. Building environments mostly involves engineering skills, but creating good tasks requires something different. As one RL environment founder put it: &#8220;Domain knowledge and expert level prompting is more important than ML skills for creating tasks.&#8221; A neolab researcher added that product sense matters too: &#8220;You want to know how people are actually using these tools.&#8221; Having a lot of experience interacting with frontier models is key, as one neolab researcher noted: &#8220;You don&#8217;t necessarily need to be an AI researcher, but perhaps a very heavy Claude Code user, a prompt whisperer like Riley Goodside, can be better at figuring out what the frontier is than an AI researcher.&#8221; Another put it simply: &#8220;The people who are probably best at this are the people who create benchmarks that actually get used.&#8221;</p><p>RL environments have quickly become a major input to frontier AI training. Key challenges include preventing reward hacking and scaling production without sacrificing quality, while demand is growing for enterprise workflows and longer-horizon tasks. This is a fast-moving space, and we expect this picture to look quite different in a year.</p>
<p><em>Thank you to all the interviewees, as well as to Jaime Sevilla, Josh You, and Lynette Bye for comments and editing help on this piece.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>We had calls with 9 different people, got text/email input from an additional 9 people, and sanity checks but no input from 4 more people.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For example, the environments on Prime Intellect&#8217;s environment hub don&#8217;t rely on Docker, instead using a more lightweight approach based on their verifiers library. In some domains, for example most mathematics problems, there isn&#8217;t even a need for an environment since the model does not have to use tools.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In traditional RL terminology, the &#8220;environment&#8221; encompasses the transition dynamics (how actions affect state), the reward function, and the distribution over starting states. A &#8220;task&#8221; in that framing is simply a specific starting state. In practice, the way people use these terms varies.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Here again, the terminology isn&#8217;t fully standardized, and varies across domains: some people use &#8220;environment&#8221; to refer to a single Docker container with one task; others use it to mean a collection of containers with many tasks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Gold trajectories are human-verified examples of successfully completing a task, which can be used as training data for supervised fine-tuning.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In a sense, RLHF and Constitutional AI came before math and coding environments (and, broadening still, physics engines like MuJoCo came even earlier). They both involve training language models with RL, although typically there's been less emphasis on tool use and environments. 
In practice, people tend to think of them as separate categories, and the actors involved are often different.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In SWE-bench Verified, PASS_TO_PASS refers to tests that should continue passing after the model&#8217;s edits, while FAIL_TO_PASS refers to tests that were failing before and should pass after the fix.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>ERP stands for Enterprise Resource Planning, software used by businesses to manage day-to-day activities like accounting, procurement, and project management. Examples include SAP and Oracle.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>See <a href="https://x.com/tmkadamcz/status/1963996138044096969?s=46">here</a> for an example of the model looking at future commits on SWE-bench Verified.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>RL algorithms like <a href="https://arxiv.org/abs/2501.12948">GRPO</a> improve by comparing different attempts at the same task. If one attempt succeeds and another fails, the model learns to favor the successful strategy. But if every attempt gets the same score (all passing or all failing), there&#8217;s no signal about which approaches work better.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[How far can decentralized training over the internet scale?]]></title><description><![CDATA[Decentralized AI training is growing fast, and while it likely won&#8217;t catch up to frontier models this decade, even narrowing the gap could have major implications for AI policy.]]></description><link>https://epochai.substack.com/p/how-far-can-decentralized-training</link><guid isPermaLink="false">https://epochai.substack.com/p/how-far-can-decentralized-training</guid><dc:creator><![CDATA[Jaime Sevilla]]></dc:creator><pubDate>Mon, 29 Dec 2025 17:06:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sMDM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48476ea1-d58f-4d9a-91c6-d440a8328621_1024x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of our <a href="https://epochai.substack.com/s/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes about big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><div><hr></div><p>Previously, I <a href="https://epoch.ai/blog/could-decentralized-training-solve-ais-power-problem">discussed</a> decentralized training in the context of hyperscalers. Microsoft, Google and other giants are building interconnected gigawatt scale datacenters, which could be used to train models at an unprecedented computational scale. 
This decentralization could sidestep the difficulty of securing 10 GW of power in a single location by splitting one massive run into ten more manageable gigawatt-scale blocks.</p><p>But when people think of decentralized training, they don&#8217;t first think of gigantic datacenters, owned by the same company, training models across large distances. Instead, <a href="https://www.galaxy.com/insights/research/decentralized-ai-training">they</a> imagine <a href="https://blog.pluralis.ai/p/decentralized-ai-looms">thousands</a> of small datacenters, or individual consumers, <a href="https://www.primeintellect.ai/blog/our-approach-to-decentralized-training">pooling</a> their spare compute over the internet to orchestrate a training run larger than any single actor could manage alone.</p><p>Many companies are pursuing this vision: <a href="https://pluralis.ai/">Pluralis Research</a>, <a href="https://www.primeintellect.ai/">Prime Intellect</a> and <a href="https://nousresearch.com/">Nous Research</a> have already successfully trained models decentrally at scale. But in practice, training decentrally over the internet has lagged far behind more centralized training. Even their largest models (Pluralis&#8217; 8B Protocol Model, Prime Intellect&#8217;s INTELLECT-1, and Nous&#8217; Consilience 40B) have been trained with 1,000x less compute than <a href="https://epoch.ai/data/ai-models">today&#8217;s frontier models</a> (such as xAI&#8217;s Grok 4).</p><p>And, importantly, many proposals for monitoring and regulating AI <a href="https://jack-clark.net/2024/06/03/import-ai-375-gpt-2-five-years-later-decentralized-training-new-ways-of-thinking-about-consciousness-and-ai/">depend</a> on decentralized training over the internet continuing to lag far behind the frontier. The compute is easy to <a href="https://www.transformernews.ai/p/decentralized-training-policy-implications">track</a> when it&#8217;s centralized, either in a few massive data centers or many smaller (but still large) data centers connected via a dedicated network of fiber optic cables. As long as it only makes economic sense for frontier models to be trained this way by a handful of the biggest companies, governments can relatively easily regulate them.</p><p>However, if that assumption &#8211; that training over the internet across thousands of computers isn&#8217;t feasible &#8211; breaks, then regulation might be left scrambling to catch up.</p><p>In this article, I review and critically examine the state of the field of decentralized training over the internet. I conclude that while decentralized training is growing fast (around 20x per year) and technically feasible at frontier scale, it&#8217;s unlikely that decentralized developers will amass frontier amounts of compute this decade.</p><h2>Is decentralized training over the internet feasible?</h2><p>Decentralized training over the internet is a strictly harder engineering task than centralized development.</p><p>The three big additional challenges that decentralized training poses are: low-bandwidth communication, managing a network of heterogeneous devices with inconsistent availability, and trust and coordination issues. While important, I believe the latter two won&#8217;t ultimately impede training at scale.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Low bandwidth poses a more important problem. Typical internet upload bandwidths for consumers are around 60 Mbps. 
If using naive data parallelism to distribute training across nodes with this much bandwidth, it would take 5,000 years to train a DeepSeek v3 style model with 671B parameters.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>But Pluralis, Prime Intellect and Nous Research have already trained (and post-trained) models with billions of parameters. The key is a family of techniques that reduce bandwidth requirements and enable different kinds of decentralized training: data parallelism, model parallelism, and RL training.</p>
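<p>For a rough sense of the arithmetic behind numbers like the 5,000-year figure above, here is a back-of-the-envelope calculation; the bandwidth, gradient precision, communication overhead, and step count are our own assumptions, not necessarily the footnote&#8217;s exact ones:</p><pre><code class="language-python"># Back-of-the-envelope sync time for naive data parallelism.
# Assumptions (ours): 60 Mbps upload, 671B parameters, 32-bit gradients,
# and a factor-2 overhead for sending plus receiving in the all-reduce.
params = 671e9
bytes_per_value = 4                  # fp32 gradients
upload_bytes_per_s = 60e6 / 8        # 60 Mbps
comm_factor = 2                      # send + receive

sync_s = comm_factor * params * bytes_per_value / upload_bytes_per_s
print(f"one synchronization: {sync_s / 86400:.1f} days")   # ~8.3 days

steps = 250_000                      # assumed number of optimizer steps
print(f"syncing every batch: {steps * sync_s / (86400 * 365):,.0f} years")
# ~5,700 years: the same order of magnitude as the figure above.
</code></pre>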
<div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/4afdc73f-ce39-4dbc-a75f-018a70d4b543_1024x1280.png" width="1024" height="1280" alt=""></figure></div><h3>Decentralized data parallelism</h3><p>When using data parallelism to distribute training across multiple data centers, we have the centers communicate once every batch to synchronize gradient updates.</p><p>When training over the internet, that communication cost would be prohibitive. For instance, with typical upload speeds reaching around 60 Mbps, the largest model we could train while keeping synchronization times under ten minutes would be a rather small 600M parameter model with 32-bit gradients.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>One solution is to reduce the frequency with which the nodes (each one a data center or an individual computer) talk to each other. 
So instead of synchronizing weights with every batch, we let each node process multiple batches independently (these are called &#8220;<em>inner steps</em>&#8221;), then aggregate the cumulative gradient updates each node has taken in the synchronization step (an &#8220;<em>outer step</em>&#8221;).</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/bd879a68-a0db-4b8f-a533-f9331e7045d8_1316x403.png" width="1316" height="403" alt=""><figcaption class="image-caption">Figure from <a href="https://arxiv.org/abs/2311.08105">Douillard et al. (2023)</a>. Each node holds a replica of the model, and trains independently for a number of inner steps before synchronizing across nodes.</figcaption></figure></div>
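<p>A minimal sketch of this inner/outer loop (ours; actual DiLoCo uses an AdamW inner optimizer and Nesterov momentum on the outer step, where this toy version uses plain SGD for both, on a made-up quadratic loss):</p><pre><code class="language-python">import numpy as np

class Node:
    """Stand-in for one participant, with the gradient of a private toy loss."""
    def __init__(self, target):
        self.target = target
    def gradient(self, w):
        return w - self.target          # grad of 0.5*||w - target||^2

def diloco_round(weights, nodes, inner_steps=100, inner_lr=0.1, outer_lr=0.7):
    """One outer step: local training on each node, then one aggregation."""
    deltas = []
    for node in nodes:                  # in a real system, in parallel
        local = weights.copy()
        for _ in range(inner_steps):    # inner steps: no communication
            local = local - inner_lr * node.gradient(local)
        deltas.append(weights - local)  # each node's cumulative update
    outer_grad = np.mean(deltas, axis=0)    # the only communication round
    return weights - outer_lr * outer_grad  # outer (SGD) step

w = np.zeros(4)
nodes = [Node(np.full(4, t)) for t in (1.0, 2.0, 3.0)]
for _ in range(10):
    w = diloco_round(w, nodes)
print(w)   # converges towards the average target, [2. 2. 2. 2.]
</code></pre><p>Communication happens once per outer step instead of once per batch, which is where the up-to-500x bandwidth saving discussed below comes from.</p>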
<p>This process was demonstrated in the DiLoCo paper by <a href="https://arxiv.org/abs/2311.08105">Douillard et al. (2023)</a>, building on previous work, such as that by Stich<a href="https://arxiv.org/abs/1805.09767"> (2018)</a>. This process is not equivalent to the usual training process, and as such it can harm performance. However, the DiLoCo authors find that they can conduct training with up to 500 inner steps with only a small reduction in performance, for a 500x reduction in bandwidth requirements.</p><p>One key weakness of DiLoCo worth stressing is that performance degrades as the number of nodes increases. In particular, increasing from 1 to 8 nodes is equivalent to a 1.5x decrease in training compute (see tables 4 and 11 of <a href="https://arxiv.org/abs/2503.09799v1">Charles et al, 2025</a>). Naively, this suggests that scaling to 10,000 nodes would require 6x as much FLOP as centralized training for the same performance. This is not trivial, though it can be compensated for by training for longer, or with larger clusters.</p><p>To further reduce the strain on the bandwidth, we can reduce the size of the gradients. The main way this is done in practice is <em>quantization</em>, where the gradients are communicated at a lower bit precision. By reducing the gradients to 8 bits, as in the <a href="https://arxiv.org/abs/2412.01152">INTELLECT-1</a> run, or to 4 bits, as in the <a href="https://arxiv.org/abs/2501.18512">Streaming DiLoCo</a> paper, we can reduce bandwidth requirements by 2-4x compared to typical 32-bit gradient synchronization. Some have even experimented with 1-bit gradient updates (<a href="https://arxiv.org/abs/2102.02888">Tang et al, 2021</a>; <a href="https://arxiv.org/abs/2202.06009">Lu et al, 2022</a>), though their efficacy is yet to be shown at scale.</p>
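<p>To illustrate the basic idea of gradient quantization, here is a simple 8-bit round-trip (a toy uniform scheme of our own; production systems use more sophisticated quantizers):</p><pre><code class="language-python">import numpy as np

def quantize_8bit(grad):
    """Map float32 gradients onto uint8 plus the offset/scale to decode."""
    lo, hi = grad.min(), grad.max()
    scale = (hi - lo) / 255 if hi > lo else 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)
    return q, lo, scale              # 1 byte per value, plus two floats

def dequantize_8bit(q, lo, scale):
    return q.astype(np.float32) * scale + lo

g = np.random.randn(1_000_000).astype(np.float32)
q, lo, scale = quantize_8bit(g)
print(q.nbytes / g.nbytes)                              # 0.25: a 4x saving
print(np.abs(dequantize_8bit(q, lo, scale) - g).max())  # small rounding error
</code></pre>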
<p>Another type of compression is <em>sparsification</em>, in which only the most significant (top-k sparsification) or a random subset of the gradients (random sparsification) are communicated in each round. The <a href="https://arxiv.org/abs/2411.19870">decoupled momentum (DeMo) optimizer</a> by Nous Research is a variant of this, in which larger updates are prioritized, while smaller gradients are accumulated until they reach a threshold of significance.</p><p>Both quantization and sparsification degrade the quality of the gradients communicated. To fight this, one technique often used is <em>error feedback accumulation</em>: the difference between the gradients computed and communicated is stored, to be communicated in future updates.</p><p>These methods can be combined. <a href="https://proceedings.mlr.press/v202/wang23t.html">CocktailSGD</a> experiments with finetuning models with up to 20B parameters by combining quantization and sparsification. <a href="https://arxiv.org/abs/2505.23725">MuLoCo</a> further combines these techniques with DiLoCo and a Muon optimizer. <a href="https://templarresearch.substack.com/p/checkpoint-one">SparseLoCo</a> takes a similar approach &#8211; and it is now being applied at scale as part of the <a href="https://www.tplr.ai/dashboard">Covenant72B</a> decentralized training run by Covenant AI.</p><p>We can optimize DiLoCo further. In the Streaming DiLoCo paper, <a href="https://arxiv.org/abs/2501.18512">Douillard et al (2025)</a> explain how to sequence synchronization on different subsets of parameters while training. This enables overlapping computation with communication, which, together with 4-bit gradient quantization, reduces bandwidth strain by 100x compared to naive data parallelism.</p><h3>Decentralized model parallelism</h3><p>While most decentralized training today leverages data parallelism, others &#8212; notably Pluralis Research &#8212; have experimented with <em>model parallelism</em>.</p><p>In model parallelism, the parameters of the model are themselves split between nodes. This has one major advantage: since each node doesn&#8217;t need to hold the whole model, nodes with little memory, such as consumer GPUs, can be part of the network without constraining the model size.</p><p>For instance, the SWARM parallelism paper by<a href="https://arxiv.org/abs/2301.11913"> Ryabinin et al (2023)</a> proposes decentralized pipeline parallelism, where nodes process subsets of model layers, propagating activations and gradients between them. The paper notes the <em>square-cube law</em>: as we increase the model dimension, communication scales linearly, but compute scales <em>quadratically</em>. This means that larger models suffer less from the communication overhead, making pipeline parallelism a practical choice in low-bandwidth setups.</p><p>Pluralis&#8217; <a href="https://arxiv.org/abs/2506.01260">Protocol Models</a> further optimize pipeline parallelism by constraining each layer&#8217;s activations to a small subspace, for up to a 100x reduction in size. This allows them to train an 8B Llama-like model in a decentralized fashion, as well as a 7.5B model <a href="https://dashboard.pluralis.ai/">through a volunteer network</a>.</p>
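<p>To put rough numbers on the square-cube law mentioned above (our simplified illustration, using the common estimate of roughly 24*d^2 transformer-block FLOPs per token and ignoring attention over the context):</p><pre><code class="language-python"># FLOPs per token for one transformer block grow roughly as 24*d^2
# (a standard estimate: ~2 FLOPs per parameter, ~12*d^2 parameters),
# while the activations handed to the next pipeline stage grow as d.
def flops_per_token(d):
    return 24 * d * d

def values_communicated_per_token(d):
    return d

for d in (1024, 4096, 16384):
    ratio = flops_per_token(d) / values_communicated_per_token(d)
    print(d, f"{ratio:,.0f} FLOPs per activation value sent")
# 1024 -> 24,576; 4096 -> 98,304; 16384 -> 393,216. Wider models do
# more compute per byte communicated, which favors pipeline parallelism
# over slow links.
</code></pre>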
<p>Other forms of model parallelism have been experimented with. Prime Intellect used pipeline parallelism for <a href="https://www.primeintellect.ai/blog/synthetic-2">SYNTHETIC-2</a> &#8211; a decentralized synthetic data generation project. The learning@home paradigm by <a href="https://arxiv.org/abs/2002.04013">Ryabinin and Gusev (2020)</a> and the SPES protocol by <a href="https://openreview.net/forum?id=x3qnrhfhX0">Zhang et al (2025)</a> split the different experts of a mixture-of-experts (MoE) model across different nodes for training and inference. <a href="https://openreview.net/forum?id=x3qnrhfhX0">Pluralis</a> has also experimented with decentralized context parallelism in <a href="https://openreview.net/forum?id=x3qnrhfhX0">Ramasinghe et al (2025)</a>, in which different nodes process different parts of a sequence, and need to coordinate to compute the attention context.</p><h3>Decentralized RL training</h3><p>Since last year, we have seen a rise in the prominence of RL post-training methods, often called reasoning training.</p><p>In a typical RL training scheme (such as <a href="https://arxiv.org/abs/2402.03300">GRPO</a>) we generate many reasoning traces with the latest model checkpoint. Each trace is automatically scored relative to the other traces in its group. These scores are used to update the model weights, increasing the log-likelihood of the highest-scoring traces relative to the worse ones.</p><p>In asynchronous RL, we allow the trace generation to be done with an outdated version of the model, often 2 to 5 steps behind the current checkpoint. This has the advantage that we can overlap communicating the updates with generating the traces, mitigating communication bottlenecks.</p>
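<p>A sketch of the group-relative scoring at the core of GRPO-style methods (our simplification; the full objective adds importance ratios and clipping):</p><pre><code class="language-python">import numpy as np

def group_advantages(scores):
    """Score each trace relative to the other traces in its group."""
    scores = np.asarray(scores, dtype=np.float32)
    centered = scores - scores.mean()
    std = scores.std()
    return centered / std if std > 0 else centered

# Eight rollouts of one task, scored 1 (solved) or 0 (failed) by the grader:
print(group_advantages([1, 0, 0, 1, 0, 0, 0, 1]))
# Successful traces get a positive advantage (their likelihood is pushed
# up), failures a negative one. If all rollouts score the same, every
# advantage is zero and the group yields no training signal.
</code></pre>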
<p>The largest decentralized RL training run to date is <a href="https://arxiv.org/abs/2505.07291">INTELLECT-2</a> by Prime Intellect, which coordinated over 800 nodes to generate traces and post-train a <a href="https://qwenlm.github.io/blog/qwq-32b/">QwQ-32B</a> base model using asynchronous RL.</p>
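<p>As a minimal illustration of the group-relative scoring described above, the sketch below computes GRPO-style advantages and drops groups whose traces all receive the same score, since they carry no training signal (a filtering step discussed further below). It is our own simplification, not Prime Intellect&#8217;s code:</p><pre><code>import torch

def group_advantages(rewards):
    """GRPO-style advantages: score each trace relative to its group."""
    std = rewards.std()
    if std.item() == 0.0:
        return None                    # identical rewards: no signal
    return (rewards - rewards.mean()) / (std + 1e-6)

# Each tensor holds automatic scores for one group of reasoning traces
# sampled from the same prompt (illustrative numbers).
groups = [torch.tensor([1.0, 0.0, 0.0, 1.0]),
          torch.tensor([0.0, 0.0, 0.0, 0.0])]   # all-zero group is dropped

advantages = [a for g in groups if (a := group_advantages(g)) is not None]
# The surviving advantages weight the policy update, raising the
# log-likelihood of higher-scoring traces relative to lower-scoring ones.
</code></pre>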
<p>In synchronous RL, the burden of compute is tilted more towards inference than in traditional pretraining. The ratio of compute in FLOP is the same as in pretraining &#8212; a forward pass to compute the activations is matched by twice as much compute in the backward pass to compute the gradients &#8212; but trace generation is memory-bound, so it takes more GPU-hours than the backward pass. In asynchronous RL, because the forward pass must be repeated with up-to-date model weights, the amount of inference compute is even higher.</p><p>INTELLECT-2 employs <em>Online Advantage Filtering</em> to shift the burden even more towards inference: groups of reasoning traces with no advantage are discarded, since they generate no training signal. This allowed for a decentralized inference network that used 4x more compute than the training nodes.</p><p>More recently, Prime Intellect unveiled their <a href="https://www.primeintellect.ai/blog/intellect-3">INTELLECT-3</a> run. This was a larger-scale finetuning and asynchronous RL post-training run, albeit one run in a centralized setting. The success of this run paves the way for larger decentralized RL training runs in the future.</p><h2>Putting it all together: decentralized internet training at frontier scale is likely feasible</h2><p>My overall impression is that all the techniques we have discussed could be combined to train much larger decentralized models than exist today.</p><p>Take, for example, INTELLECT-1, one of the largest decentralized pretraining runs to date: a 10B parameter model trained on 1T tokens, amounting to 6e22 FLOP. It essentially leveraged only two basic decentralized training techniques to manage bandwidth: DiLoCo with 100 inner steps per synchronization, and 8-bit quantization.</p><p>If we further quantize to 4 bits and apply 25% sparsification as in <a href="https://arxiv.org/abs/2508.15706">SparseLoCo</a>, we would theoretically be able to train a model 8x larger, which per <a href="https://epoch.ai/publications/chinchilla-scaling-a-replication-attempt">Chinchilla scaling</a> would allow for training runs with about 64x more compute. Templar is currently using a similar strategy to train a <a href="https://templarresearch.substack.com/p/checkpoint-one">72B parameter model</a> in a decentralized fashion.</p><p>Going further, during the INTELLECT-1 training run, the contributor nodes ran independently for about 38 minutes before synchronizing for 7 minutes. Applying <a href="https://arxiv.org/abs/2501.18512">streaming DiLoCo</a> could hide a large part of that communication. With an optimized interleaving schedule, this could allow for training a 2x larger model, i.e., 4x more compute.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>As we alluded to, many of these techniques moderately harm performance. While not negligible, the performance reduction looks modest in practice. For example, based on the evaluation results reported by developers, INTELLECT-1 would achieve an <a href="https://epoch.ai/benchmarks/eci">Epoch Capability Score (ECI)</a> around 98. This is similar to, for example, Qwen2.5-Coder 1.5B, which was trained with a comparable amount of compute and released around the same time as INTELLECT-1.</p><p>My conclusion is that there exists plenty of room to experiment with bandwidth reduction techniques, such that much larger training runs are technically feasible over the internet.</p>
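<p>The bandwidth arithmetic behind these estimates is simple enough to sketch. The helper below is our own back-of-the-envelope, using the same illustrative quantities as this section (update size in bits, uplink plus downlink, and a 60 Mbps connection as in the footnotes):</p><pre><code>def sync_minutes(params, bits=8.0, density=1.0, mbps=60.0):
    """Minutes to send and receive one (compressed) pseudo-gradient.

    Ignores sparse-index overhead and any overlap of communication
    with computation, so it is an optimistic lower bound.
    """
    bits_total = params * bits * density * 2      # uplink + downlink
    return bits_total / (mbps * 1e6) / 60

# INTELLECT-1-like synchronization: 10B parameters at 8 bits.
print(sync_minutes(10e9, bits=8))                 # ~44 minutes per sync
# SparseLoCo-style 4-bit plus 25% density is 8x cheaper, so a model 8x
# larger syncs in the same time; 8x the parameters means ~64x the
# training compute under Chinchilla scaling.
print(sync_minutes(80e9, bits=4, density=0.25))   # ~44 minutes again
</code></pre>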
<p>These techniques compromise somewhat on quality compared to centralized training, and many have not yet been tested at scale or in combination with each other, so not all of them will pan out. But there are enough avenues that I am optimistic bandwidth won&#8217;t limit the scale of decentralized training anytime soon.</p><h2>Can decentralized developers amass the necessary compute?</h2><p>Even if technically feasible, reaching the frontier of compute requires an astounding amount of resources.</p><p>The largest decentralized pretraining runs to date are INTELLECT-1, Protocol Model 8B, and Consilience 40B. These span the 6e22-6e23 FLOP range &#8212; about 1,000x less compute than we estimate was used to train the largest models today, such as <a href="https://epoch.ai/trends">Grok 4</a>.</p><p>To train them, decentralized networks have been set up, such as <a href="https://app.primeintellect.ai/intelligence/">Prime Intellect&#8217;s platform</a>, the <a href="https://psyche.network/">Psyche Network by Nous Research</a>, and the <a href="https://dashboard.pluralis.ai/">Pluralis dashboard</a>. The largest such active network we&#8217;ve found is Covenant AI&#8217;s Templar, which is currently achieving an effective throughput of 9e17 FLOP/s. This is about 300x smaller than frontier AI datacenters today, which have a theoretical training throughput of about 3e20 effective FLOP/s (e.g., the Microsoft Fairwater Atlanta site, assuming 30% MFU and 8-bit training).</p><p>While these networks are still relatively small, the scale of decentralized training runs has grown astonishingly fast. Since 2020, we have seen a 600,000x increase in the computational scale of decentralized training projects, for an implied growth rate of about 20x/year.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!sMDM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48476ea1-d58f-4d9a-91c6-d440a8328621_1024x1280.png" width="1024" height="1280" alt=""></figure></div>
<p>This is a rate of growth that dwarfs even that of frontier AI training, which has <a href="https://epoch.ai/trends#compute">recently been growing at about 5x/year</a>. If both trends held, it would take about five and a half years for decentralized training to catch up to the scale of centralized training.</p>
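<p>The catch-up arithmetic is worth making explicit. Under the growth rates above and the roughly 1,000x compute gap noted earlier, the crossover time works out as follows (the exact figure depends on the assumed gap and rates):</p><pre><code>import math

gap = 1000           # decentralized runs trail frontier runs by ~1,000x
decentralized = 20   # growth in decentralized training compute per year
frontier = 5         # growth in frontier training compute per year

# Decentralized training closes the gap by (20/5) = 4x per year.
years = math.log(gap) / math.log(decentralized / frontier)
print(f"{years:.1f}")   # ~5.0 years to catch up, if both trends held
</code></pre>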
<p>But can this growth be maintained? We look into three reference classes that inform the largest scale decentralized training could reach: the largest volunteer computing project to date, the scale of the largest cryptocurrency networks, and an estimate of today&#8217;s spare compute capacity.</p><p>Starting with the last one: today, there are about <a href="https://epoch.ai/data/ai-chip-sales">15.7M H100-equivalents of AI compute across NVIDIA, TPU, and Trainium devices</a>. In comparison, the largest datacenters today, such as <a href="https://epoch.ai/data/data-centers/satellite-explorer/AnthropicAmazonProjectRainierNewCarlisleIndiana/2025-09-26">Anthropic&#8217;s New Carlisle</a> site, host about 300k H100-equivalents, i.e., less than 2% of the total AI compute stock. A significant fraction of total compute could be idle for long periods, making it available for decentralized internet-based training.</p><p>So in theory, there exists a massive amount of AI compute which could be leveraged for decentralized AI training at a larger scale than has been achieved to date.</p><p>However, leveraging all of it is likely unfeasible. Decentralized training projects will only be able to access a fraction of existing computational resources, and determining exactly how much is complicated.</p><p>One way to estimate this is by comparison to the largest decentralized computing projects. The largest collaborative computational project to date is <a href="https://foldingathome.org/">folding@home</a>, a volunteer computing network for simulating protein dynamics relevant to drug discovery. At its peak, folding@home attained a <a href="https://www.techspot.com/news/84832-foldinghome-project-passes-24-exaflops-more-than-top.html">throughput of 2.43e18 FLOP/s</a> &#8211; more than the Top 500 supercomputers combined at the time. At that rate, a decentralized network would be able to conduct a 2.43e18 FLOP/s x 100 days &#8776; 2e25 FLOP training run. Not enough to reach the frontier today, but similar in scale to the previous generation of frontier AI models, including Llama 3, GPT-4, Gemini 1.0 Ultra, and Claude 3 Opus.</p><p>Alternatively, we could compare to Bitcoin, whose decentralized hashing network encompasses $30&#8239;billion of existing infrastructure.
For AI, that would be about enough for a gigawatt-scale site, which could be used to train models at the 3e27 FLOP scale, still enough for a frontier-scale run today.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VJXd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb467ff-fcff-40b9-9be2-6e8d99e95811_1031x1280.png" width="1031" height="1280" alt=""></figure></div>
<p>The folding@home and Bitcoin reference classes suggest that today&#8217;s largest decentralized training networks (such as Prime Intellect&#8217;s) could be expanded 30-3,000x in scale, enough to train models on 50-5,000x more compute than today when combined with longer training durations. This suggests the current fast growth of decentralized training runs could last another 3 to 6 years.</p><h2>Conclusion</h2><p>Decentralized training over the internet has captured the imagination of many developers, due to its potential to leverage large amounts of idle compute worldwide.</p><p>Technical feasibility is an important concern. But while bandwidth poses a serious technical challenge, enough techniques are being experimented with that I am optimistic about the prospects of much larger decentralized models being trained.</p><p>Frontier companies are unlikely to pursue decentralized training over the internet, since they can afford to build more efficient data centers. So it is likely to remain the domain of smaller companies focused on decentralized networks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Hence, my main uncertainty is how much compute such decentralized networks will be able to muster. They have grown remarkably quickly so far, though they are still far from the frontier. Looking at past volunteer computing projects such as folding@home, and at the scale of decentralized computing networks such as Bitcoin, I see room for decentralized training networks to grow 30 to 3,000x in scale in the coming years. Given that headroom, even at the current rate of growth we won&#8217;t see decentralized training runs catch up to the frontier of training in scale this decade. And even if they did, the performance loss from the bandwidth reduction techniques would set them back compared to centralized training.</p><p>But decentralized training could still be a very important part of AI. To the extent that decentralized networks remain associated with open weights, they could lead to larger open models that trail the frontier.
And while it&#8217;s unlikely that small decentralized training projects will amass frontier-scale amounts of compute, thanks to <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">compute efficiency advances</a> and the increasing <a href="https://epoch.ai/data/machine-learning-hardware">efficiency of hardware</a>, I expect decentralized training runs not to trail that far behind the frontier.</p><p>One practical implication is that decentralized training projects put a limit on the scale of models that can be affected by regulation, at least insofar as enforcing such regulations relies on training happening in large data centers.</p><p><em>Thank you to Arthur Douillard, Max Ryabinin, Eugene Belilovsky, Sami Jaghouar, Fares Obeid, Josh You, and Aaron Scher for comments and suggestions. Jaeho Lee and Venkat Somala collected data for the charts. Lynette Bye edited the piece.</em></p><p><em>As part of the research that went into this piece we investigated over 100 papers related to decentralized training. You can find an annotated database of these papers <a href="https://bit.ly/4pe5DJm">here</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Managing a heterogeneous, unreliable network requires significant engineering resources, but it is ultimately feasible (see for example the work on <a href="https://arxiv.org/abs/2301.11913">SWARM</a> or <a href="https://arxiv.org/abs/2206.01288">Tasklets</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>DeepSeek v3 was trained on 14.8T tokens using a batch size of 63M tokens, for a total of 14.8T / 63M &#8776; 235k updates.
At 32-bit precision, each update takes 2 x 671B parameters x 32 bits per parameter / 60 Mbps &#8776; 8 days, and the whole training would take 235k updates x 8 days / update &#8776; 5,000 years.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>60 Mbps * 10 minutes / 2 (uplink and downlink) / 32 bits per parameter &#8776; 600M parameters.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>With a proper streaming schedule, we could train the same model with 38 minutes / 7 minutes &#8776; 5x less bandwidth while still completely hiding the communication time behind computation. Naively, this suggests we could train a model with 5x more parameters. However, a larger model would be trained with a larger cluster, shortening the computation time per outer step. The communication time grows linearly with model size. The computation time grows with the batch size and the model size, and decreases with the computational power. The batch size grows with the <a href="https://arxiv.org/abs/2505.13738">0.3 power of the dataset size</a>, which itself grows proportionally to the model size under Chinchilla scaling. And the computational power grows proportionally to the square of the model size under Chinchilla scaling. So if we increase the model size by a factor <em>n</em>, the computation time decreases by a factor</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \n\\frac{n \\,(\\text{model size}) \\cdot n^{0.3} \\,(\\text{batch size})}{n^{2} \\,(\\text{cluster throughput})}\n= \\frac{n^{1.3}}{n^{2}}\n= \\frac{1}{n^{0.7}}\n&quot;,&quot;id&quot;:&quot;FUJYWHPVBL&quot;}" data-component-name="LatexBlockToDOM"></div><p>and the synchronization time grows by a factor <em>n</em>. The synchronization time can be almost fully masked while it&#8217;s below the computation time, so we can grow the model at the same bandwidth until</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;38\\ \\text{minutes} \\times \\frac{1}{n^{0.7}}\n= 7\\ \\text{minutes} \\times n&quot;,&quot;id&quot;:&quot;MLEQLTUWEG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Solving for <em>n</em>, this allows for training a model</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;n = \\left(\\frac{38\\ \\text{minutes}}{7\\ \\text{minutes}}\\right)^{\\frac{1}{1.7}} \\approx 2.7&quot;,&quot;id&quot;:&quot;MIYGOWCCNW&quot;}" data-component-name="LatexBlockToDOM"></div><p>times larger.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://www.sec.gov/Archives/edgar/data/1878848/000187884824000043/nov2024monthlyupdatevdc.htm">IREN (Iris Energy) &#8212; &#8220;November 2024 Monthly Investor Update&#8221;</a> filed on the SEC&#8217;s EDGAR system says: &#8220;1 EH/s = $30m cost to deliver&#8221; (including mining hardware + infrastructure capex) and lists the assumptions (fleet efficiency of 15 J/TH, $18.9/TH ASIC pricing, and $750k/MW infrastructure capex).
<a href="https://ycharts.com/indicators/bitcoin_network_hash_rate">YCharts</a> shows a network hash rate of 942.95 EH/s. Combining the two numbers, Bitcoin infrastructure is valued at about $30bn.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Why benchmarking is hard]]></title><description><![CDATA[Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.]]></description><link>https://epochai.substack.com/p/why-benchmarking-is-hard</link><guid isPermaLink="false">https://epochai.substack.com/p/why-benchmarking-is-hard</guid><dc:creator><![CDATA[Florian Brand]]></dc:creator><pubDate>Tue, 23 Dec 2025 22:04:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2hBX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8531e764-b5d9-44ba-9d88-71c9672e1100_1024x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of our <a href="https://epochai.substack.com/s/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes about big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><div><hr></div><p>Benchmarks play a crucial role in the AI landscape: they inform everyone, from AI researchers to the general public, about the current state of capabilities and the overall rate of progress. Third-party organizations, such as Epoch AI, independently run and collate benchmark results on pages like <a href="https://epoch.ai/benchmarks">the benchmarking hub</a>.</p><p>However, benchmarking isn&#8217;t easy: at each stage of the benchmarking pipeline, there are many moving parts and degrees of freedom that can affect the final result, which makes it hard to compare any two evaluation scores.
Moreover, each stage can introduce bugs or mistakes that make the results invalid or costly to obtain.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2hBX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8531e764-b5d9-44ba-9d88-71c9672e1100_1024x1280.png" width="1024" height="1280" alt=""></figure></div>
<p>In this post, we dive into the different steps of the benchmarking process, which we split into two main parts:</p><ul><li><p>Benchmark Setup: all the steps related to how the benchmark is run; for example, the prompt that describes the task to the model and instructs it to answer each question, or the methodology used to score each sample.</p></li><li><p>Model Access: all the settings involved in accessing the evaluated model itself; for example, which API provider to call the model from.</p></li></ul><h3>Main takeaways</h3><p>Differences in how the benchmark is set up and how the model is accessed make it hard to compare any two scores on the same benchmark. On the benchmark-setup side, scaffolds have a huge impact on agentic benchmarks, especially for weaker models. On the model-access side, models are commonly reached through API providers, and bugs and instabilities in the chosen provider are the biggest source of evaluation errors, particularly affecting newer models.</p><h2>The Benchmark Setup</h2><p>Even though everyone uses the same name for a benchmark, this does not mean that everyone runs the same version (<a href="https://epoch.ai/blog/what-does-osworld-tell-us-about-ais-ability-to-use-computers#osworld-task-instructions-are-continually-updated-making-through-time-comparisons-of-uncertain-value">if versioning exists</a>), nor that they implement the benchmark in the same way. To illustrate this, we use the well-established <a href="https://github.com/idavidrein/gpqa">GPQA-Diamond benchmark</a> as an example throughout this blog post.</p><h3>Prompts &amp; Sampling Parameters</h3><p>GPQA-Diamond&#8217;s initial release was accompanied by an <a href="https://github.com/idavidrein/gpqa">analysis repository</a> containing code to evaluate models on the benchmark. However, this repo is not used to run the benchmark these days. Instead, practitioners typically re-implement GPQA-Diamond&#8217;s evaluation as part of a larger standardized infrastructure for running multiple models on multiple benchmarks.</p><p>Compared to many other benchmarks, GPQA-Diamond is relatively simple to (re-)implement, with few moving parts.
Indeed, it just consists of the following:</p><ul><li><p>A template which instructs the model to answer the question</p></li><li><p>The question and four possible answers to said question</p></li></ul><p>As the question and answers are fixed (and re-used through all implementations), the component that can be different is the prompt template. Evaluators must also decide on the sampling parameters that models will be run with, in particular the temperature and potential <code>top_p</code> or <code>top_k</code> parameters. Let&#8217;s look at the prompt templates and sampling parameters used by some popular benchmarking libraries:</p><ul><li><p><a href="https://github.com/EleutherAI/lm-evaluation-harness/blob/01e412089bdb7590cb75d43310ac0b67c72938fa/lm_eval/tasks/gpqa/cot_zeroshot/_gpqa_cot_zeroshot_yaml#L9">EleutherAI/lm-evaluation-harness</a>: <code>What is the correct answer to this question:{{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nLet&#8217;s think step by step:</code></p><ul><li><p>No system message</p></li><li><p>Default temperature is 0.0 for API-based models</p></li><li><p><em>Note:</em> There are different <a href="https://github.com/EleutherAI/lm-evaluation-harness/blob/01e412089bdb7590cb75d43310ac0b67c72938fa/lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml#L9C15-L9C176">templates and setups</a> depending on the evaluation and model type.</p></li></ul></li><li><p><a href="https://github.com/openai/simple-evals">OpenAI/simple-evals</a>: <code>Answer the following multiple choice question. The last line of your response should be of the following format: &#8216;Answer: $LETTER&#8217; (without quotes) where LETTER is one of ABCD. Think step by step before answering.\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}</code></p><ul><li><p>System message set depending on the model: <code>You are a helpful assistant.</code> for GPT-3.5 until GPT-4.1; <a href="https://github.com/openai/simple-evals/blob/ee3b0318d8d1d9d72755a4120879be65f7c07e9e/sampler/chat_completion_sampler.py#L10">this</a> message for ChatGPT-based models and <a href="https://github.com/openai/simple-evals/blob/ee3b0318d8d1d9d72755a4120879be65f7c07e9e/sampler/claude_sampler.py#L9">this</a> lengthy system prompt for Claude 3 Opus and Claude 3.7 Sonnet. No system message for reasoning models.</p></li><li><p>Default temperature is 0.5</p></li></ul></li><li><p><a href="https://github.com/openai/gpt-oss/blob/48db88d8e29f48493fe75f084a8c9bd900a2b92f/gpt_oss/evals/gpqa_eval.py#L16">OpenAI/gpt-oss</a>: <code>{Question}\n\n(A) {A}\n(B) {B}\n(C) {C}\n(D) {D}\n\nExpress your final answer as the corresponding option &#8216;A&#8217;, &#8216;B&#8217;, &#8216;C&#8217;, or &#8216;D&#8217;.</code></p><ul><li><p>No system message set</p></li><li><p>Default temperature is 1.0 (when invoking the script via command line)</p></li></ul></li><li><p><a href="https://github.com/groq/openbench/blob/f8e643bc3593e3e47e8fade3121b49e00f222bc5/src/openbench/utils/text.py#L136">groq/openbench</a>: <code>Answer the following multiple choice question. 
The last line of your response should be of the following format: &#8216;Answer: $LETTER&#8217; (without quotes) where LETTER is one of ABCD.\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}</code></p><ul><li><p><a href="https://github.com/groq/openbench/blob/f8e643bc3593e3e47e8fade3121b49e00f222bc5/src/openbench/utils/text.py#L122">System message</a>: <code>You are a helpful assistant.</code></p></li><li><p>Default temperature is 0.5</p></li></ul></li></ul><p>So, basically everyone is doing their own thing with implementations of even simple evaluations. Fortunately, <a href="https://epoch.ai/data-insights/self-reported-gpqa">the scores reported by AI developers on GPQA-Diamond match the results of independent runs</a>. We also tested the effect of using different prompts and temperatures with gpt-oss on GPQA-Diamond (with high reasoning effort): while we found differences between the average scores across settings (ranging from 74% to 80%), these differences were not statistically significant given the small size of GPQA-Diamond (only 198 questions). Even with the same implementation, we observe variance in this range (<em>cf.</em> GPT-OSS evaluation in the Appendix).</p><p>While the prompt typically has a small impact for modern reasoning models on simple benchmarks like GPQA-Diamond, this <a href="https://arxiv.org/pdf/2311.12022#page=10">was not always the case</a>. Moreover, as we will now see, changing the prompt can have a large impact when it comes to more complex, agentic evaluations.</p><h3>Scaffolds continue to have an outsized impact</h3><p>As agentic evals, such as <a href="https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate">SWE-bench Verified</a> or <a href="https://www.remotelabor.ai/">RLI</a>, become more common, one component becomes increasingly important: The scaffold, i.e., the software that operates the agent, usually a CLI such as <a href="https://claude.com/product/claude-code">Claude Code</a>, <a href="https://github.com/OpenHands/OpenHands">OpenHands</a>, etc.</p><p>In particular, the scaffold includes all the components mentioned above, i.e., sampling parameters and prompt templates. Even more importantly, it also gives the agent a set of tools, i.e., access to specialized applications or capabilities. The exact tools and prompts change drastically depending on the chosen scaffold.</p><p>On SWE-bench Verified, a popular agentic coding benchmark, simply switching the scaffold makes up to an 11% difference for GPT-5 and up to a 15% difference for Kimi K2 Thinking. We cover the effect of the scaffold <a href="https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate#scaffolds-matter-as-much-as-models">in our SWE-bench Verified review</a>. 
The choice of scaffold has the single biggest impact on overall performance.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!uMgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f81da4-59a7-4d8a-a1bb-83a251063732_1200x1500.png" width="1200" height="1500" alt=""></figure></div>
<p>Customizing the harness for each model risks hill-climbing on the evaluation and makes direct comparisons between models difficult. There are therefore two ways to run agentic benchmarks: if you care about comparing models, a standardized scaffold (like mini-SWE-agent) is usually enough; assessing <a href="https://x.com/sayashk/status/1996334941832089732">frontier capabilities</a>, on the other hand, requires using leading products like Claude Code. Both choices come with obvious trade-offs, especially in terms of reproducibility, and require additional engineering effort.</p><h3>Execution Environment</h3><p>The LLM, especially in agentic evaluations, has to operate in an execution context or environment. Concretely, this means things like the virtual machine for computer-use evals, or the Docker container for coding evals. Creating and maintaining these environments <a href="https://epoch.ai/blog/swebench-docker">is hard</a>: OpenAI only ran 477 of the 500 problems in SWE-bench Verified in their <a href="https://openai.com/index/o3-o4-mini-system-card/">o3 and o4-mini</a> evaluations due to infrastructure challenges.</p><p>Sometimes the environments <a href="https://github.com/SWE-bench/SWE-bench/issues/465">contain critical errors</a>, which means that agents are able to hack the eval (or are unable to fulfill a task due to bugs). This is especially true for evaluations which allow the agent to search the web for information: in the worst case, the agent is able to find the original dataset or sites which re-host parts of a problem. Because the open web is so vast, a banlist of domains needs to be continuously maintained.</p><p>The impact of the environment is moderate compared to the scaffold: typically, it has little effect on the results, unless the agent is able to hack a large percentage of samples or the environment is outright broken.</p><h3>Scoring</h3><p>The final step of every evaluation is to score the given sample. To extract and score the answer, implementations of GPQA-Diamond typically use regular-expression parsing, which can be tweaked to catch all possible edge cases and model answers.</p>
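<p>As an illustration, a regex-based extraction step for the &#8220;Answer: $LETTER&#8221; format used by several of the prompt templates above might look like this (a simplified sketch, not any particular library&#8217;s parser):</p><pre><code>import re

# Capture a final "Answer: C"-style line, tolerating markdown bold and
# lowercase letters; returns None when no choice can be extracted.
ANSWER_RE = re.compile(r"Answer:\s*\**\s*([ABCD])\b", re.IGNORECASE)

def extract_choice(response: str):
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None   # last match wins

def score(response: str, gold: str) -> bool:
    return extract_choice(response) == gold

assert score("Let's think step by step... Answer: C", gold="C")
assert extract_choice("The answer is unclear.") is None
</code></pre><p>Small details here, such as taking the last match rather than the first, or tolerating formatting around the letter, are exactly the kind of degrees of freedom that make two implementations of the same benchmark disagree.</p>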
<p>Coding benchmarks such as SWE-bench Verified are scored by running a test suite against the model-generated code.</p><p>Other evaluations, such as SimpleQA, use a second LLM to extract the answer and grade its correctness, while evals like tau-bench use a second LLM to simulate a user that the evaluated model interacts with. The choice of this second LLM can have a <a href="https://arxiv.org/abs/2511.21140">sizable impact</a> on the benchmark score.</p><h2>Model Access</h2><p>While the choices made when implementing the benchmark are within the control of the evaluator, they are not the only source of variance, as the models themselves have to be accessed in some way.</p><h3>API &amp; SDK</h3><p>The prompt first gets sent through the SDK and API endpoints that are called during the evaluation run. Using the standard OpenAI ChatCompletions API makes it easy to use many different models, as virtually all services and providers offer an OpenAI-compatible endpoint.</p><p>However, this might leave performance on the table &#8211; OpenAI reports up to 3% improvements on SWE-bench Verified <a href="https://cookbook.openai.com/examples/responses_api/reasoning_items#:~:text=On%20a%20more%20rigorous%20benchmark%20like%20SWE%2Dbench%2C%20including%20reasoning%20items%20led%20to%20about%20a%203%25%20improvement%20for%20the%20same%20prompt%20and%20setup.">by using their Responses API</a>, while other providers, such as <a href="https://platform.claude.com/docs/en/api/openai-sdk#important-openai-compatibility-limitations">Anthropic</a>, only offer a subset of features in their OpenAI-compatible ChatCompletions endpoint. <a href="https://x.com/MiniMax__AI/status/1985375617622454566">Minimax</a> reports an astronomical 23 percentage point difference in performance on tau-bench when using their API implementation compared to the standard ChatCompletions API. Therefore, to avoid under-eliciting models&#8217; capabilities, it&#8217;s important to use the correct SDK.</p><h3>API Aggregator</h3><p>Of course, implementing all providers yourself is a tedious task, and there are solutions for this, such as <a href="https://github.com/BerriAI/litellm">LiteLLM</a> or <a href="https://inspect.aisi.org.uk/providers.html">Inspect AI</a>, which offer a unified interface to interact with any model from any provider. Similarly, services such as <a href="https://huggingface.co/docs/inference-providers/en/index">HuggingFace Inference Providers</a> or <a href="https://openrouter.ai/">OpenRouter</a> expose many models and providers behind a single API. Naturally, these services have to build a layer on top of the underlying APIs, <a href="https://x.com/xeophon_/status/1994296820713894026">which may result in new bugs</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><h3>Model Provider</h3><p>The biggest source of variance in evaluation results, however, is the provider of the model itself, especially for open models. The choice of provider affects both which sampling parameters are supported and the <a href="https://x.com/andonlabs/status/1989862276137119799">downstream performance</a>.</p><p>To test the effect of the API provider ourselves, we ran several popular open models on GPQA-Diamond and averaged their scores over 4 runs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. We retry each sample up to three times in case of errors and score (API) errors as failed samples<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p>
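<p>A sketch of that retry policy (our own minimal version; <code>call_api</code> stands in for whichever client function is actually used):</p><pre><code>import time

def run_sample(call_api, prompt, max_attempts=3):
    """Query the model, retrying transient API errors.

    Mirrors the policy described above: up to three tries per sample,
    and a sample whose attempts all error is scored as a failure.
    """
    for attempt in range(max_attempts):
        try:
            return call_api(prompt)
        except Exception:
            time.sleep(2 ** attempt)   # simple exponential backoff
    return None                        # scored as a failed sample
</code></pre>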
We retry each sample up to three times in case of errors and score (API) errors as failed samples<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lU9O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lU9O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!lU9O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!lU9O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!lU9O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lU9O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png" width="1024" height="1280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/baa54886-715a-4722-a82d-babdeb429da2_1024x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lU9O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!lU9O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!lU9O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!lU9O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaa54886-715a-4722-a82d-babdeb429da2_1024x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EdOF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EdOF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!EdOF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!EdOF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!EdOF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EdOF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png" width="1024" height="1280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!EdOF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!EdOF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!EdOF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!EdOF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b756026-9956-40f7-bea1-8a2d5dd95be4_1024x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We find differences in benchmark scores depending on the model provider for all tested models, though the effect size varies between models. More results can be found in the Appendix<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. We observed various sources of error:</p><ul><li><p>Some providers return RateLimitErrors, especially when accessing GLM-4.6 over OpenRouter. The providers mainly affected are Fireworks (via OpenRouter), GMICloud, Mancer, Parasail, DeepInfra, SiliconFlow. As these samples get scored as failure, the scores for the corresponding provider are affected substantially.</p></li><li><p>Some providers return empty or cut-off responses, although the <code>max_tokens</code> are not reached. This includes AtlasCloud, Mancer, Fireworks (via OpenRouter).</p></li><li><p>Some providers have lower <code>max_tokens</code> than advertised, resulting in cut-off responses, even though a higher limit was set via the API request. This affects SiliconFlow, Friendly and Cerebras.</p></li><li><p>Some providers have <code>max_tokens</code> limits which are lower than needed to evaluate the corresponding model. 
<p>Aside from these errors, we find that providers are noticeably worse at serving newer models (in our case GLM-4.6) than established models such as Qwen3. This is consistent with <a href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/">other model releases</a>, which are accompanied <a href="https://github.com/MoonshotAI/K2-Vendor-Verifier">by bugs</a> that are then <a href="https://blog.vllm.ai/2025/10/28/Kimi-K2-Accuracy.html">fixed over time</a>. However, independent benchmark organizations want to evaluate new models <a href="https://x.com/CFGeek/status/1985766672406626348">as soon as possible</a>.</p><p>The selection of an appropriate provider thus has the biggest impact on measured model performance. The model developer usually hosts their own API, which could serve as a reference. However, this hosted API often offers weaker data security guarantees than third-party APIs, so evaluators might <a href="https://x.com/EpochAIResearch/status/2003178198146891876">prefer not to use it</a> for private benchmarks.</p><p>In practice, this means that those running the benchmark have to experiment a lot, trying to replicate known scores with third-party APIs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> and hoping that the model is served correctly. This is a laborious and costly effort, and one of the main reasons evaluations (of open models) take a lot of time.</p>
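<p>One way to make that experimentation less ad hoc is to compare each provider&#8217;s mean score against a reference score with an explicit tolerance. A sketch under the assumption that the developer-reported number is trustworthy (the tolerance value here is arbitrary):</p><pre><code>def flag_providers(scores: dict[str, list[float]], reference: float,
                   tolerance: float = 0.03) -> list[str]:
    """Return providers whose mean score over repeated runs deviates from
    the reference (e.g. the developer-reported score) by more than the
    tolerance, in absolute points on a 0-1 scale."""
    flagged = []
    for provider, runs in scores.items():
        mean = sum(runs) / len(runs)
        if abs(mean - reference) > tolerance:
            flagged.append(provider)
    return flagged

# Illustrative numbers only:
print(flag_providers({"A": [0.71, 0.70], "B": [0.55, 0.58]}, reference=0.72))
# ['B']</code></pre>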
<h3>Model Deployment</h3><p>Going under the hood, there are multiple reasons why a model&#8217;s performance differs from the reported score or between providers. They range from software &#8211; a different inference engine (vLLM, SGLang, private forks), <a href="https://x.com/deepseek_ai/status/1990616182161068094?s=20">bugs in the reference implementation</a>, or <a href="https://adamkarvonen.github.io/machine_learning/2025/11/28/difr.html#:~:text=we%20found%20that%20at%20least%20two%20providers%20were%20using%20an%20older%20Llama%2D3.1%2D8B%20chat%20template%20that%20doesn%E2%80%99t%20match%20the%20latest%20version%20on%20HuggingFace">outdated chat templates</a> &#8211; down to the hardware used and the corresponding deployment. Even different parallelism setups, i.e., how a model is split up and distributed over multiple GPUs, have a <a href="https://youtube.com/watch?v=uaZ3yRdYg8A&amp;t=3096">measurable, but small, impact</a>.</p>
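<p>The chat-template issue in particular is straightforward to check: the template ships with the model&#8217;s tokenizer, so you can render a conversation locally and diff it against whatever a provider actually feeds the model. A sketch using the Hugging Face <code>transformers</code> library; the model ID is illustrative:</p><pre><code>from transformers import AutoTokenizer

# Illustrative model ID; any chat model with a bundled template works.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Render the canonical prompt string from the latest template on the Hub.
canonical = tok.apply_chat_template(messages, tokenize=False,
                                    add_generation_prompt=True)
print(canonical)
# If a provider logs or echoes back the raw prompt it used, any diff
# against `canonical` points at an outdated or modified chat template.</code></pre>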
<h2>Conclusion</h2><p>Running benchmarks isn&#8217;t easy, and many variables affect the headline numbers you see on a graph or in a table. Most of these variables don&#8217;t seem to matter in isolation. However, they can add up quickly over the whole stack, resulting in numbers that differ substantially from the scores reported by model developers.</p><p>Some of these variables, such as the prompts and scaffolds used, are in the control of those who create, re-implement, and run the benchmark, while others aren&#8217;t. This is especially true for model access through (third-party) APIs, which slows down lower-resource actors such as hobbyists or academics. Errors in evaluations also make it hard for others, from AI researchers to decision makers and the general public, to accurately assess the current progress of capabilities.</p><p><em>OpenRouter graciously sponsored the credits for the experiments on their platform. We maintained full editorial control over the output.</em></p><h2>Appendix</h2><p>We evaluated GPT-OSS 120B at all three reasoning levels, GLM-4.6, Qwen3 235B-A22B Instruct 2507, Kimi K2 Thinking, and Kimi K2 0905 Instruct on GPQA-Diamond while varying the API provider and keeping everything else constant.</p><p>We evaluated GLM-4.6 (as both the agent and the user model) on tau-bench Retail while varying the API provider (keeping the same provider for both agent and user) and keeping everything else the same.</p><p>We use <a href="https://github.com/groq/openbench/pull/291">this</a> implementation for GPQA-Diamond and <a href="https://github.com/groq/openbench/pull/294">this</a> re-implementation of tau-bench. We also make all the data for the analysis available under <a href="https://huggingface.co/datasets/xeophon/epoch-benchmark-sweep">this link</a>.</p><h4>GPT-OSS 120B</h4><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c533cba0-c039-4701-b80d-0f67d9219b91_1024x1281.png" alt="GPQA-Diamond scores for GPT-OSS 120B by API provider"></figure><h4>Kimi K2 Thinking</h4><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/4d7fe44e-958d-44eb-9b6f-20e588300ba3_1024x1280.png" alt="GPQA-Diamond scores for Kimi K2 Thinking by API provider"></figure><h4>Qwen3 235B-A22B Instruct 2507</h4><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/3deb60fa-c150-4fb3-bdf0-477b8b9e58c9_1024x1280.png" alt="GPQA-Diamond scores for Qwen3 235B-A22B Instruct 2507 by API provider"></figure><h4>GLM-4.6</h4><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/74964066-bac1-48a4-9eb6-50ff016af0fe_1024x1280.png" alt="GPQA-Diamond scores for GLM-4.6 by API provider"></figure>
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This particular bug has since then been (mostly) fixed.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Usually, benchmarks such as GPQA-Diamond are run between 8-32 times and then averaged. We opt for 4 runs for cost reasons. We use <a href="https://github.com/groq/openbench/pull/291">this</a> implementation for GPQA-Diamond and <a href="https://github.com/groq/openbench/pull/294">this</a> re-implementation of tau-bench.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Of course, this is not the policy we use when evaluating models on the <a href="https://epoch.ai/benchmarks">Epoch AI Benchmarking Hub</a>, where except in <a href="https://epoch.ai/benchmarks/frontiermath#:~:text=28%2Dprivate).-,Gemini%20models,-We%20faced%20difficulties">pathological cases</a> we ensure that there are no more errors before reporting a score. 
However, we believe this convention makes sense for the purpose of evaluating providers and how they complicate the benchmarking process.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>We also make all the data for the analysis available at <a href="https://huggingface.co/datasets/xeophon/epoch-benchmark-sweep">this link</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>While there are <a href="https://arxiv.org/abs/2511.20621">methods</a> to detect levels of quantization and possibly deployment bugs, they require a correctly set-up local deployment, knowledge of the provider&#8217;s inference engine, and exposed sampling parameters such as the seed (which not all providers offer).</p></div></div>]]></content:encoded></item><item><title><![CDATA[The changing drivers of LLM adoption]]></title><description><![CDATA[Public data as well as our original polling suggest LLM adoption is roughly on trend, but the underlying drivers are shifting.]]></description><link>https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption</link><guid isPermaLink="false">https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption</guid><dc:creator><![CDATA[JSD]]></dc:creator><pubDate>Sat, 20 Dec 2025 00:16:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HjMs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1416009-1664-49c7-a0f1-917724eafc58_2048x2560.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of our <a href="https://epochai.substack.com/s/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes about big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/the-changing-drivers-of-llm-adoption">Epoch AI</a>.</em></p><div><hr></div><p>In the world of AI, half a year is a very long time. Back in July, we <a href="https://epoch.ai/gradient-updates/after-the-chatgpt-moment-measuring-ais-adoption">saw</a> LLMs being adopted faster than almost any other technology in history. Five months later we&#8217;re still seeing rapid growth, but we&#8217;re also seeing early winds of change &#8212; both in who uses AI and how they do so.</p><p>Using the latest public data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and a <a href="https://epoch.ai/data/polling">poll</a> of US adults we conducted with Blue Rose Research, this post shares an updated picture of the state of LLM adoption.</p><h2>How quickly are consumers adopting LLMs?</h2><h3>More people are using LLMs &#8212; but they&#8217;re increasingly using different LLMs and different products, in different places</h3><p>Through the first half of 2025, ChatGPT&#8217;s user base grew at a remarkable pace, from under 400 million weekly active users in January to nearly 800 million by August &#8212; roughly 50 million new users per month.
Since then, growth may have slowed slightly, though it&#8217;s a bit soon to tell how much of this is noise versus a lasting trend change:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/a1416009-1664-49c7-a0f1-917724eafc58_2048x2560.png" alt="ChatGPT weekly active users over time"></figure>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Does this mean that LLM adoption has slowed down? Not necessarily &#8212; even if ChatGPT user growth is <a href="https://techcrunch.com/2025/12/05/chatgpts-user-growth-has-slowed-report-finds">slowing</a>, other LLMs have recently seen faster consumer growth. For example, between August and November, Gemini monthly active users <a href="https://www.theinformation.com/articles/chatgpt-nears-900-million-weekly-active-users-gemini-catching">increased</a> about 30%, compared to 5% for ChatGPT. And that&#8217;s not because Gemini has very few users &#8212; both Gemini and ChatGPT have hundreds of millions of monthly active users. That said, ChatGPT still maintains the larger userbase: in <a href="https://epoch.ai/data/polling#ai-services">our poll</a>, about 35% of registered voters reported having used ChatGPT over the past week, compared to 24% for Gemini.</p><p>Increasingly, this consumer growth is coming from countries besides the US. According to OpenAI&#8217;s &#8220;How People Use ChatGPT&#8221; <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf">report</a>, 30% of internet users in high-income countries were already using ChatGPT weekly by mid-2025<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. With less room to grow, we should expect slower user growth in the US compared to the global average<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. And there are hints that this is the case &#8212; for example, ChatGPT daily users in India grew <a href="https://www.reuters.com/world/india/with-freebies-openai-google-vie-indian-users-training-data-2025-12-17/">sevenfold</a> over the past year, and the number of daily Gemini users in India grew by 15% in November alone. As a result, India now has over twice the ChatGPT users of the US.</p><p>Finally, user growth could continue or even accelerate through a different route: integrating AI into widely-used products, rather than using standalone chatbot apps. 
One example is Google Search, which now includes an &#8220;AI mode&#8221; that&#8217;s <a href="https://blog.google/products/search/gemini-3-search-ai-mode/">powered by their state-of-the-art model</a>, Gemini 3, and was made available to the public in December. Another example is Meta, which has been pushing Meta AI across WhatsApp, Messenger, and Instagram &#8212; this integration likely explains where Meta AI&#8217;s <a href="https://www.cnbc.com/2025/05/28/zuckerberg-meta-ai-one-billion-monthly-users.html">alleged</a> 1 billion monthly active users come from<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. And if Meta succeeds in making AI a habitual part of messaging, Meta AI could reach an enormous user base &#8212; a possibility <a href="https://www.justice.gov/atr/media/1397596/dl">not lost on OpenAI</a>.</p><p>Overall, the prior trend in global user growth seems more or less on track. Growth might be slower in saturated markets like the US, but this is offset by faster growth elsewhere.</p><h3>Consumers are using LLMs much more intensely on AI apps, while web traffic has stagnated</h3><p>Even without new LLM users, we could still see an increase in adoption, because each user could use LLMs more <em>intensely</em>, such as by sending more messages. Indeed, between June 2024 and June 2025, the number of ChatGPT messages <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf">grew faster</a> than the number of weekly active users.</p><p>So what&#8217;s happened more recently? The answer depends on which metric you look at, and also on the specific LLM.</p><p>For example, one metric we can check is recent web traffic. According to <a href="https://x.com/Similarweb/status/1995136322097529180">SimilarWeb</a>, ChatGPT&#8217;s web traffic has been essentially flat since September 2025.
In contrast, Gemini&#8217;s traffic has grown somewhat over that same period, though it started from a smaller base, such that the combined web traffic across the two models has stagnated.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/d7030715-4ab7-458f-9052-a2b26c4bde0c_2048x2560.png" alt="ChatGPT and Gemini web traffic over time"></figure>
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But the story looks quite different over a longer period. Between November 2024 and November 2025, ChatGPT&#8217;s web traffic <a href="https://x.com/Similarweb/status/1998677459382710656">grew</a> by around 1.5&#215;. That&#8217;s <a href="https://x.com/Similarweb/status/1998677459382710656">faster than most websites</a>, but surprisingly slow compared to the roughly 3&#215; growth in weekly active users over a similar period. This discrepancy might be due to a shift from web use to faster-growing chatbot apps &#8212; SimilarWeb captures traffic from the former, but not the latter. Notably, the ChatGPT app was <a href="https://x.com/Similarweb/status/1991130907382620610">downloaded</a> 1.9 billion times between October 2024 and September 2025 &#8212; the most downloaded app of the year.</p><p>We also see people using models more intensely within these apps. As of November, ChatGPT users <a href="https://techcrunch.com/2025/12/05/chatgpts-user-growth-has-slowed-report-finds">spent</a> 17 minutes per day in the app, up about 6% from March. Gemini users averaged about 11 minutes per day, up roughly 120% from March, but again starting from a smaller base.</p><p>But while looking at web traffic and time use is helpful, it misses out on a crucial way by which LLMs could be used more intensely &#8212; processing more tokens <em>even with the same number of user sessions. </em>For example, reasoning models tend to have <a href="https://epoch.ai/data-insights/output-length">longer responses</a>, and some changes in <a href="https://www.anthropic.com/research/claude-character">model</a> <a href="https://techcrunch.com/2025/09/05/openai-reorganizes-research-team-behind-chatgpts-personality/">personality</a> could lead to more messages per session. So ideally, we&#8217;d also look at the number of messages sent and the tokens processed per user.  Alas, we don&#8217;t know how much this impacts user on average, because <a href="https://epoch.ai/data/polling#reasoning-frequency">only a small fraction of users</a> regularly use the &#8220;reasoning&#8221; functionalities of frontier models (e.g. 
ChatGPT&#8217;s &#8220;thinking&#8221; mode). Moreover, we don&#8217;t know how many of the total tokens processed by ChatGPT are driven by a small fraction of ChatGPT &#8220;power users&#8221;.</p><p>Putting everything together, there&#8217;s still only limited evidence on how intensely people use LLMs and how this has changed over time. But if we squint at the data we have, it seems likely that individuals are using LLMs much more intensely than before.</p><h3>AI company revenues have continued to grow incredibly fast, in line with previous trends</h3><p>So far we&#8217;ve looked at two dimensions of increasing adoption: more users, and more intense use. Combining these two lets us measure total LLM use, which we can capture using revenue. This metric has <a href="https://epoch.ai/gradient-updates/openai-is-projecting-unprecedented-revenue-growth">grown so fast</a> that it puts companies like Google and Uber to shame &#8212; so you&#8217;d have a really hard time arguing that LLM adoption has been slow.</p><p>To put things into perspective, OpenAI&#8217;s annualized revenue was $13 billion back in August, <a href="https://epoch.ai/data/ai-companies">growing</a> at about 4.3&#215; a year. On that trend, we&#8217;d expect annualized revenue of around $21 billion by the end of the year, which is essentially <a href="https://www.theinformation.com/articles/openais-organizational-problems-hurt-chatgpt">what we&#8217;ve seen</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. So under this lens, revenue (which roughly tracks adoption) has continued to follow extremely fast historical trends.</p>
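<p>As a quick sanity check on where that $21 billion figure comes from (roughly four months of compounding at 4.3&#215; per year):</p><pre><code># Annualized revenue in August, compounded at 4.3x/year for ~4 months.
aug_revenue_bn = 13
growth_per_year = 4.3
months = 4

eoy_revenue_bn = aug_revenue_bn * growth_per_year ** (months / 12)
print(round(eoy_revenue_bn, 1))  # ~21.1, i.e. about $21 billion</code></pre>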
<h2>How embedded is AI in daily tasks and jobs?</h2><p>Knowing the rate of adoption already tells us a lot, but to understand how AI is impacting society, we need to look further. We want to understand how AI is actually being used, and by whom.</p><h3>AI has entered the workplace beyond formal enterprise adoption</h3><p>The first step is to move beyond individual consumers to the workplace. This is important because an increasing share of OpenAI&#8217;s revenue has come from enterprise products like <a href="https://chatgpt.com/business/enterprise">ChatGPT Enterprise</a> and <a href="https://chatgpt.com/business/">ChatGPT Business</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>The obvious way to measure work-related AI adoption is to look at enterprise deployments. For example, the <a href="https://ramp.com/data/ai-index">Ramp AI Index</a> we <a href="https://epoch.ai/gradient-updates/after-the-chatgpt-moment-measuring-ais-adoption">cited</a> back in July measures the &#8220;share of U.S. businesses with paid subscriptions to AI models, platforms, and tools&#8221;.</p><p>But this misses a large share of actual work usage &#8212; many people use AI for their jobs even if their employers don&#8217;t provide them access. We can see this in our survey data. On the one hand, 36% of respondents who used AI in the past week said they had used it to assist with work. On the other hand, only about 18% of those same respondents said that their job provides access to an AI chatbot, and another 18% weren&#8217;t sure!</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/332c12b5-ddda-4e28-8584-911316e0decd_1280x800.png" alt="Poll results: AI use for work versus employer-provided AI access"></figure>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This gap suggests that a substantial share of work-related AI use is happening on workers&#8217; own initiative, using free tiers or personal subscriptions rather than employer-provided tools. While the proportion of respondents who reported using AI for work does vary across job categories, the effect size is lower than we expected: less than 10 percentage points between &#8220;Office job&#8221; workers (34.6%) and workers in &#8220;Manual, non customer-facing&#8221; jobs (25.8%).</p><h3>Most consumers use AI to seek information</h3><p>On paper, modern frontier LLMs are optimized for a wide range of tasks, such as coding and writing. But there&#8217;s one use case that stands out by far &#8212; looking up information. About 58% of AI users <a href="https://epoch.ai/data/polling#usage-category">in our survey</a> reported using it for this purpose in the past week. 
Writing assistance (32%) and advice (30%) came next, followed by technical tasks (18%), multimedia (17%), and self-expression (16%).</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/7fdf06c0-e5c5-4575-8d2b-ae17dc06efd8_2048x2560.png" alt=""></figure></div><p>This pattern aligns with OpenAI&#8217;s &#8220;<a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf">How People Use ChatGPT</a>&#8221; report, which found that &#8220;Practical Guidance,&#8221; &#8220;Seeking Information,&#8221; and &#8220;Writing&#8221; accounted for the largest share of ChatGPT messages<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. It also matches the results of another <a href="https://x.com/charlotteeffect/status/2001340374841987387">recent survey</a>, which found &#8220;General information or answering question&#8221; to be the most common use case by far. For most users, these tools function as an enhanced search engine and writing aid, rather than as autonomous agents or virtual companions.</p><p>This raises a question: how much does progress on frontier capabilities actually matter for typical users? The benchmarks that AI labs optimize for &#8212; <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a> (coding), <a href="https://epoch.ai/benchmarks/gpqa-diamond">GPQA Diamond</a> (advanced scientific reasoning), <a href="https://epoch.ai/frontiermath">FrontierMath</a> (hard math problems), <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">METR&#8217;s suite</a> and <a href="https://openai.com/index/gdpval/">GDPVal</a> (autonomous task completion) &#8212; seem quite distant from &#8220;seeking information (e.g. to look up a fact, as a replacement for Google, to get product recommendations),&#8221; which is how we described the category to poll respondents.
The closest benchmark to this dominant use case might be something like OpenAI&#8217;s <a href="https://openai.com/index/browsecomp/">BrowseComp</a>, which tests web browsing and information retrieval, but it usually gets less airtime.</p><p>In practice, <a href="https://epoch.ai/gradient-updates/benchmark-scores-general-capability-claudiness">different capabilities have been highly correlated</a>: models that score well on coding benchmarks also tend to score well on reasoning and information retrieval. So progress on frontier benchmarks may benefit typical users anyway, <em>if</em> the correlation in benchmark scores actually reflects a deep shared latent ability factor. On the other hand, <a href="https://epoch.ai/gradient-updates/benchmark-scores-general-capability-claudiness#:~:text=Is%20the%20%E2%80%9Cgeneral%20capability%E2%80%9D%20dimension%20deep%2C%20or%20contingent%3F">if benchmark scores are correlated for more contingent reasons</a>, there could be a real tradeoff between pushing frontier capabilities in specific domains and catering to the most common use cases.</p>
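<p>To make the correlation claim concrete: one simple check is the correlation matrix of model scores across benchmark categories. A hypothetical sketch with made-up numbers (a real analysis would use the benchmark results linked above):</p><pre><code>import pandas as pd

# Hypothetical scores (0-100) for four models across benchmark categories.
# These numbers are illustrative only, not real benchmark results.
scores = pd.DataFrame(
    {
        "coding":    [30, 45, 60, 72],
        "reasoning": [35, 50, 58, 70],
        "retrieval": [40, 48, 62, 69],
    },
    index=["model_a", "model_b", "model_c", "model_d"],
)

# High off-diagonal correlations are consistent with a shared "general
# capability" factor; low ones would point to more contingent links.
print(scores.corr())
</code></pre>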
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a80d25c-3484-4c3b-9eb8-cc60c5d04a84_2048x2560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/182110144?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a80d25c-3484-4c3b-9eb8-cc60c5d04a84_2048x2560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gSqP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a80d25c-3484-4c3b-9eb8-cc60c5d04a84_2048x2560.png 424w, https://substackcdn.com/image/fetch/$s_!gSqP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a80d25c-3484-4c3b-9eb8-cc60c5d04a84_2048x2560.png 848w, https://substackcdn.com/image/fetch/$s_!gSqP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a80d25c-3484-4c3b-9eb8-cc60c5d04a84_2048x2560.png 1272w, https://substackcdn.com/image/fetch/$s_!gSqP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a80d25c-3484-4c3b-9eb8-cc60c5d04a84_2048x2560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Age also matters: younger respondents (18-34) are more likely to use AI than those over 65, with the gap especially pronounced for ChatGPT.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!cmrW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cmrW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 424w, https://substackcdn.com/image/fetch/$s_!cmrW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 848w, https://substackcdn.com/image/fetch/$s_!cmrW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 1272w, https://substackcdn.com/image/fetch/$s_!cmrW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cmrW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png" width="1456" height="1820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156613,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/182110144?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cmrW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 424w, https://substackcdn.com/image/fetch/$s_!cmrW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 848w, https://substackcdn.com/image/fetch/$s_!cmrW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 1272w, https://substackcdn.com/image/fetch/$s_!cmrW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46cef60a-ebd8-4e38-98c5-3e5c8e2e62c5_2048x2560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How does this age gap compare to previous technologies? Among respondents 65 and older, about 34% reported using an AI service in the past week. According to<a href="https://www.pewresearch.org/chart/internet-use-by-age-2/"> Pew Research data</a>, that&#8217;s roughly comparable to internet use among the same age group around 2007, when about 35% of Americans over 65 were online.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> It&#8217;s also similar to smartphone ownership among over-65s around<a href="https://www.pewresearch.org/internet/2015/04/01/chapter-one-a-portrait-of-smartphone-ownership/"> 2015</a>&#8211;<a href="https://www.pewresearch.org/short-reads/2017/01/12/evolution-of-technology">2016</a>, when about 30&#8211;40% owned one. In other words, AI adoption among older adults today looks similar to where internet and smartphone adoption were roughly 18 and 10 years ago respectively &#8212; past the early enthusiast phase, but still with substantial room to grow.</p><p>Gender, by contrast, shows <a href="https://epoch.ai/data/polling#ai-services">relatively little difference</a> in our data: 59.8% of men had used at least one AI tool over the past week, compared with 54.4% of women. This is notable given that early ChatGPT adoption skewed heavily male; OpenAI researchers <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf#page=26">documented</a> a dramatic narrowing of this gender gap over time, and our polling also suggests it has largely closed for general usage.</p><h2>Conclusion</h2><p>In summary, consumer AI adoption remains strong, but the picture is more nuanced than headline user counts suggest. ChatGPT remains dominant and keeps acquiring users, but Gemini&#8217;s user growth has been faster over the last few months. Web traffic is plateauing, but usage has shifted to apps. OpenAI&#8217;s revenue seems to be on track, but consumer revenue is likely decreasing as a share.</p><p>Meanwhile, the way people use these tools is becoming clearer. Most consumers treat AI as a search-and-writing assistant, not as an autonomous agent completing tasks end-to-end or as a virtual partner. 
And a substantial share of workplace AI use is happening bottom-up, with workers adopting tools on their own rather than waiting for employer-provided access.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In particular, this includes data from <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf">OpenAI</a> and Google, as well as third-party analytics firms like <a href="https://sensortower.com/">SensorTower</a> and <a href="http://similarweb.com">SimilarWeb</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is shown in Figure 21 of the <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf">report</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Our polling data is consistent with this: as part of our polling partnership with Blue Rose Research, they asked US adults how often they use AI tools, across three waves from June through November. After reweighting to be representative of US registered voters, the share reporting they use AI tools &#8220;once or twice a week&#8221; or more frequently was essentially flat: 36.5% in June to July, 37.0% in mid-July to August, and 38.9% in November.</p><p>This is suggestive evidence that US user acquisition is slowing down substantially, though we&#8217;re cautious about reading too much into it. Notably, the growth in Gemini users we see in global data doesn&#8217;t clearly show up in our survey, which is somewhat puzzling and suggests the polling may not be capturing all the dynamics.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Unfortunately, it&#8217;s hard to be sure how many of these 1 billion users come from deliberate adoption rather than accidental one-time triggers. We might also be skeptical of this number for other reasons &#8212; for example, our US survey <a href="https://epoch-website--polling-preview-f9wnyczc.web.app/data/polling#ai-services">suggests</a> Meta AI still lags far behind ChatGPT and Gemini in use.
But it&#8217;s hard to be certain, because our survey only covers the US, whereas Meta&#8217;s <a href="https://josephlevine.substack.com/p/meta">strength</a> is in markets outside the US (where WhatsApp dominates).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Strictly speaking, OpenAI&#8217;s annualized revenue is <a href="https://www.theinformation.com/articles/openais-organizational-problems-hurt-chatgpt">$19 billion</a> as of mid-December. That&#8217;s slightly below the extrapolated $21 billion, but annualized revenue is quite a noisy metric, so these two numbers seem close enough to say that revenue growth has been roughly on trend.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>ChatGPT had <a href="https://www.theinformation.com/articles/openais-organizational-problems-hurt-chatgpt">35 million consumer subscribers</a> in July 2025. Assuming all subscriptions were ChatGPT Plus at $20/month (a lower bound, since some are ChatGPT Pro at $200/month), this implies consumer annualized revenue of at least $8.4 billion. Since total annualized revenue at the end of July <a href="https://epoch.ai/data/ai-companies#explore-the-data">was</a> $12 billion, consumer subscriptions were at least 70% of revenue, most likely more.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Unlike our poll, &#8220;How people use ChatGPT&#8221; found that &#8220;Practical guidance&#8221; accounts for a higher share of messages than &#8220;Seeking information&#8221;. We think this might be because &#8220;Practical guidance&#8221; queries involve more messages per session, as users are more likely to ask follow-up questions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The Pew data surveys U.S. adults, while our poll&#8217;s percentages apply to U.S. registered voters.
This likely doesn&#8217;t affect the comparison too much: the largest difference between registered voters and the general adult population is age composition, and here we&#8217;re looking specifically at the 65+ group.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Is almost everyone wrong about America’s AI power problem?]]></title><description><![CDATA[Why power is less of a bottleneck than you think.]]></description><link>https://epochai.substack.com/p/is-almost-everyone-wrong-about-americas</link><guid isPermaLink="false">https://epochai.substack.com/p/is-almost-everyone-wrong-about-americas</guid><dc:creator><![CDATA[Anson Ho]]></dc:creator><pubDate>Wed, 17 Dec 2025 16:22:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JLF4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c72ae49-7852-485f-aa64-de835edb80d8_1024x802.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is part of our <a href="https://epochai.substack.com/s/gradient-updates">Gradient Updates</a> newsletter, which shares more opinionated or informal takes about big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.</em></p><div><hr></div><p>In AI circles, there&#8217;s a common argument that goes: &#8220;The US is horrible at building power, but China&#8217;s great at it. And since power is so important for the AI race, China wins by default.&#8221;</p><p>This line of reasoning is everywhere. NVIDIA CEO Jensen Huang used it to <a href="https://www.ft.com/content/53295276-ba8d-4ec2-b0de-081e73b3ba43?utm_source=chatgpt.com">argue</a> that &#8220;China is going to win the AI race&#8221; last month. It features in <em><a href="https://situational-awareness.ai/the-free-world-must-prevail/">Situational Awareness</a>, </em>a series of essays about how the world&#8217;s in a fierce race to AGI, which received a <a href="https://x.com/IvankaTrump/status/1839002887600370145">seal of endorsement</a> from Ivanka Trump. There&#8217;s even an entire Dwarkesh podcast <a href="https://www.dwarkesh.com/p/casey-handmer">episode</a> called &#8220;China is killing the US on energy. Does that mean they&#8217;ll win AGI?&#8221;.</p><p>But we think this argument is overstated &#8212; power bottlenecks likely won&#8217;t dramatically or permanently impede the data center buildout in the US. Claims about America&#8217;s AI power predicament are partially based on a misunderstanding, and there are multiple promising approaches to meet America&#8217;s AI power demands. That means people are overrating the strength of the power bottleneck, and how much it impacts the &#8220;race to AGI&#8221;.</p><p>So why do we believe this, and how could everyone be wrong about America&#8217;s AI power problem?</p><h2>America&#8217;s AI power predicament &#8212; or not?</h2><p>You don&#8217;t have to look hard to find horror stories of US power infrastructure. 
There are transmission lines that need <a href="https://austinvernon.site/blog/sitingpowerlines.html">approval from 1,700 landowners</a>, electric truck charging depots that need to wait <a href="https://www.canarymedia.com/articles/ev-charging/a-big-barrier-to-californias-electric-truck-goals-a-backlogged-power-grid?utm_source=chatgpt.com">up to a decade</a> to connect to the grid, and weather events that plunge <a href="https://www.ferc.gov/news-events/news/final-report-february-2021-freeze-underscores-winterization-recommendations?utm_source=chatgpt.com">millions of Texans</a> into rolling blackouts.</p><p>And we can also just look at the numbers &#8212; over the last four decades, total US power supply has been flat as a board, whereas China&#8217;s has far surpassed it:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/6c72ae49-7852-485f-aa64-de835edb80d8_1024x802.png" alt=""><figcaption class="image-caption">Source: <a href="https://situational-awareness.ai/the-free-world-must-prevail/">Situational Awareness</a></figcaption></figure></div><p>So there&#8217;s a natural story here &#8212; regulations have killed the US&#8217; ability to build power and &#8220;win the AI race&#8221;. And there&#8217;s some truth to this; legal hurdles can impede power buildout.</p><p>But this story points the finger at the wrong thing. It assumes that US power capacity has stagnated because America <em>can&#8217;t</em> build &#8212; without first checking whether demand actually warranted more capacity!
Maybe Americans just haven&#8217;t <em>wanted</em> much more power!</p><p>Indeed, we see that <a href="https://fred.stlouisfed.org/series/APU000072610#">real electricity prices</a> have been basically stagnant over the last twenty years.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/84909a0d-b2d3-4373-9159-a9b5cd283b8c_1024x1280.png" alt=""><figcaption class="image-caption">Data from the Federal Reserve Bank of St Louis, dividing <a href="https://fred.stlouisfed.org/series/APU000072610#">average electricity prices</a> by the <a href="https://fred.stlouisfed.org/series/CPIAUCSL">Consumer Price Index</a> (with January 2025 = 100).</figcaption></figure></div>
<p>Moreover, as we&#8217;ll see, there&#8217;s actually excess capacity available on the grid. So this suggests that the problem wasn&#8217;t an inability to expand supply; rather, it was that demand remained stagnant for most of the last few decades.</p><p>You might be thinking, &#8220;there&#8217;s a big elephant in the room here &#8212; isn&#8217;t AI going to hugely increase US power demand?&#8221; And you&#8217;d be right. AI has driven recent spikes in power demand, and AI data centers in the US will need <a href="https://epoch.ai/blog/power-demands-of-frontier-ai-training">30 to 80 GW</a> of power by 2030.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/3d3c7ec6-0a58-4a2b-9c6a-1d96ccfdbeec_1600x949.png" alt=""></figure></div><p>Let&#8217;s assume that AI&#8217;s power demand grows aggressively, reaching 100 GW by 2030.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> That means the US would need to build out four times more power than <a href="https://www.caiso.com/Documents/SummerMarketPerformanceReportforSeptember2022.pdf?utm_source=chatgpt.com">California needs on average</a>. We haven&#8217;t seen this much relative growth in electricity demand <a href="https://www.eia.gov/totalenergy/data/annual/txt/ptb0809.html">since the 1980s</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p>
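<p>To make the magnitudes concrete, here&#8217;s the back-of-the-envelope version. The 25 GW figure for California&#8217;s average load and the roughly 475 GW figure for the whole US are our own rough assumptions, not numbers taken from the sources above:</p><pre><code># Back-of-the-envelope scale check (all figures are rough assumptions).
ai_demand_gw = 100           # assumed US AI power demand by 2030
california_avg_load_gw = 25  # rough average load on California's grid
us_avg_load_gw = 475         # rough average load for the whole US

print(ai_demand_gw / california_avg_load_gw)  # about 4 Californias
print(100 * ai_demand_gw / us_avg_load_gw)    # about a 21% jump in US demand
</code></pre>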
<p>But we think the US can likely meet this demand anyway. As long as there&#8217;s enough will to invest in AI scaling, there&#8217;ll be enough will to pay for the power. After all, power is a relatively small slice of the cost of running a data center &#8212; chips are far more expensive. If companies are willing to shell out for massive AI chip clusters, they won&#8217;t balk at paying a premium for electricity.</p><h2>Plucking fruit from the power tree</h2><p>This sudden spike in power demand after decades of stagnation means that the US is off to a standing start. But the flip side is that America hasn&#8217;t really <em>tried</em> to meet this kind of demand surge, so there&#8217;s still a bunch of relatively low-hanging fruit to be picked.</p><h3>Natural gas: Digging power out of the ground</h3><p>Probably the most salient way to supply power is to build out natural gas &#8212; it&#8217;s <a href="https://epoch.ai/blog/what-you-need-to-know-about-ai-data-centers">relatively cheap</a> and can be built fast (e.g. without immediately connecting to the electrical grid), which is why it powers <a href="https://epoch.ai/data/data-centers">frontier data centers</a> like <a href="https://cdn.prod.website-files.com/6855c1aa175582ee23e0aa19/68710273a71bb7c11049ba28_Crusoe_Impact_Report_2024.pdf#page=19">OpenAI&#8217;s Stargate Abilene</a> and <a href="https://www.datacenterdynamics.com/en/news/meta-announces-4-million-sq-ft-louisiana-data-center-campus/">Meta&#8217;s Hyperion</a>.</p><p>So how much natural gas could we build out? As a starting point, natural gas providers already aim to grow their production a lot. Consider the three largest gas turbine manufacturers, which collectively make up something like 85% of <a href="https://gasturbineworld.com/market-forecast/">market share</a>:</p><ul><li><p>GE Vernova <a href="https://www.gevernova.com/sites/default/files/gev_webcast_presentation_12092025.pdf">says</a> they&#8217;ll build 20 GW per year by 2026 and 24 GW per year by 2028.</p></li><li><p>Mitsubishi Heavy plans to <a href="https://www.bloomberg.com/news/articles/2025-08-31/mitsubishi-heavy-to-double-gas-turbine-capacity-as-demand-soars">double</a> their gas turbine production over two years, though sadly we don&#8217;t know what this means in terms of absolute power capacity.</p></li><li><p>Siemens <a href="https://p3.aprimocdn.net/siemensenergy/5fc093ca-4fd8-482c-8390-b39b004f66f3/251120_CMD-2025_Gas_Services_FINAL-pdf_Original%20file.pdf">plans to</a> boost their gas turbine production capacity from 17 GW per year in 2024, to 22 GW per year between 2025 and 2027, and over 30 GW a year by 2028.</p></li></ul><p>So it&#8217;s hard to say overall, but even if we just sum up the power contributions from GE Vernova and Siemens until 2030, that brings us beyond 200 GW. Accounting for Mitsubishi Heavy and other producers should bring us well beyond that. This power isn&#8217;t necessarily produced for US consumers (last year US consumption contributed 20% to 40% of the revenues of each company),<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> but it does strongly suggest that these companies would be <em>able</em> to supply the AI power demand for several years to come.</p>
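<p>Where does &#8220;beyond 200 GW&#8221; come from? Here&#8217;s the rough sum, holding each company&#8217;s announced rate flat until its next milestone and starting the clock in 2026 (both of which are our assumptions):</p><pre><code># Stated gas-turbine production rates in GW/year, per the announcements above.
# Holding each rate flat between announced milestones is our assumption.
ge_vernova = {2026: 20, 2027: 20, 2028: 24, 2029: 24, 2030: 24}
siemens = {2026: 22, 2027: 22, 2028: 30, 2029: 30, 2030: 30}

total_gw = sum(ge_vernova.values()) + sum(siemens.values())
print(total_gw)  # 246 GW from just these two manufacturers
</code></pre>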
src="https://substackcdn.com/image/fetch/$s_!5cjr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1da0d50-8088-40f5-9271-d4d81cf6b6e8_892x1158.png" width="892" height="1158" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1da0d50-8088-40f5-9271-d4d81cf6b6e8_892x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1158,&quot;width&quot;:892,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5cjr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1da0d50-8088-40f5-9271-d4d81cf6b6e8_892x1158.png 424w, https://substackcdn.com/image/fetch/$s_!5cjr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1da0d50-8088-40f5-9271-d4d81cf6b6e8_892x1158.png 848w, https://substackcdn.com/image/fetch/$s_!5cjr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1da0d50-8088-40f5-9271-d4d81cf6b6e8_892x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!5cjr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1da0d50-8088-40f5-9271-d4d81cf6b6e8_892x1158.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://rmi.org/gas-turbine-supply-constraints-threaten-grid-reliability-more-affordable-near-term-solutions-can-help/">Cohen, Fitch, and Shwisberg</a></figcaption></figure></div><p>Now we can&#8217;t quite declare victory yet, because of two issues. 
First, natural gas capacity doesn&#8217;t translate directly into data center capacity. So if there&#8217;s 100 GW of added natural gas capacity, grid planners might only expect around <a href="https://stanwichenergy.com/insights/understanding-effective-load-carrying-capability-elcc-how-renewable-reliability-impacts-costs-for-energy-users">80 GW</a> of that to be reliably available during peak demand.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>The second (and trickier) issue is that not all of this power capacity goes to AI &#8212; and we don&#8217;t know how much does. GE Vernova <a href="https://gasoutlook.com/analysis/costs-to-build-gas-plants-triple-says-ceo-of-nextera-energy/">states</a> that their backlog goes into 2028, and &#8220;only a small portion of those orders relate to data centers and AI&#8221;. But what&#8217;s a &#8220;small portion&#8221;? 3%? 30%? At the very least, AI probably makes up a pretty substantial portion of the most recent orders &#8212; <a href="https://p3.aprimocdn.net/siemensenergy/870b5cc4-8054-4705-9d6b-b331005a2187/Presentation-Q3-Website-English-%28final%29-pdf_Original%20file.pdf">65%</a> in the case of Siemens. And if people are willing to spend big on AI scaling, AI buyers could potentially outbid other use cases for many of these orders.</p><p>Even after we account for these two issues, it seems like natural gas could cover a big chunk of expected AI power demand. And planned natural gas buildouts are actually quite modest: companies are planning to spend <a href="https://www.gevernova.com/news/press-releases/ge-vernova-raises-multi-year-financial-outlook-initiates-dividend-authorizes-buyback">several billion dollars</a> on equipment for natural gas through 2028, which sounds like a lot but pales in comparison to the trillions needed for IT equipment. If companies are going to spend a lot on buying GPUs anyway, they&#8217;re going to be willing to pay a fraction of that amount for power. So there&#8217;s plenty of room to increase turbine production even more if they really wanted to.</p><p>This can mean taking steps that aren&#8217;t the most cost-effective by traditional power industry metrics, because time-to-power is what matters most for AI. For example, rather than using standard &#8220;combined-cycle&#8221; gas turbines, AI companies are turning to smaller &#8220;simple-cycle <a href="https://www.gevernova.com/content/dam/gepower-microsites/global/en_US/documents/avr/GEA34130%20AeroderivativeGT_Whitepaper_R5.pdf">aeroderivative gas turbines</a>&#8221;. Although these jet-engine-based turbines can cost more than their combined-cycle counterparts, they&#8217;re also much faster to build.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>But there&#8217;s perhaps an even better example of how gas turbines could be rapidly scaled up &#8212; Elon Musk&#8217;s xAI. Founded in <a href="https://en.wikipedia.org/wiki/XAI_(company)">2023</a>, xAI needed to catch up quickly with the likes of Anthropic and OpenAI. 
<p>But there's perhaps an even better example of how gas turbines can be rapidly scaled up: Elon Musk's xAI. Founded in <a href="https://en.wikipedia.org/wiki/XAI_(company)">2023</a>, xAI needed to catch up quickly with the likes of Anthropic and OpenAI. That meant building data centers <em>fast</em> and powering them in pretty out-of-the-box ways: <a href="https://www.offgridai.us/">renting gas generation</a> before connecting to the electrical grid, deploying natural gas turbines <a href="https://www.datacenterdynamics.com/en/news/elon-musk-xai-gas-turbines-memphis/">prior to receiving permits</a>, and even <a href="https://www.datacenterdynamics.com/en/news/xai-importing-power-plant-from-abroad-for-new-memphis-data-center-elon-musk-claims/">importing a disassembled gas plant from overseas</a>!</p>

<h3>Solar: Raining power down from the sky</h3>

<p>While there's still room for Elon Musk-esque "just get the power ASAP" approaches, we should also consider other sources of power. The best candidate is solar energy: it's more environmentally friendly, and its price has famously been dropping <a href="https://www.dwarkesh.com/p/casey-handmer">monstrously</a> <a href="https://ourworldindata.org/learning-curve">fast</a> compared to other energy sources:</p>

<figure><figcaption>[Chart: solar's price decline compared to other energy sources]</figcaption></figure>
srcset="https://substackcdn.com/image/fetch/$s_!Aoeb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f8e4a5-0e57-4520-8a8f-1dcb5b9de719_1242x1600.png 424w, https://substackcdn.com/image/fetch/$s_!Aoeb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f8e4a5-0e57-4520-8a8f-1dcb5b9de719_1242x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Aoeb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f8e4a5-0e57-4520-8a8f-1dcb5b9de719_1242x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Aoeb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f8e4a5-0e57-4520-8a8f-1dcb5b9de719_1242x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>More importantly, solar power can be scaled <em>a lot</em>. Solar installations have grown <a href="https://www.reuters.com/markets/commodities/world-solar-generation-set-eclipse-nuclear-first-time-maguire-2025-05-21/">roughly exponentially</a> for at least the last decade, and even if US <a href="https://seia.org/research-resources/solar-market-insight-report-q3-2025/">policy headwinds</a> stopped us from going beyond 2024&#8217;s rate of <a href="https://www.irena.org/-/media/Files/IRENA/Agency/Publication/2025/Jul/IRENA_DAT_RE_Statistics_2025.pdf">40 GW</a> per year, we&#8217;d end up with 200 GW over the next five years.</p><p>Once again, we shouldn&#8217;t jump the gun and declare victory just yet &#8212; not all of this solar capacity translates into data center capacity. In practice, we might have to cut this by a <a href="https://www.reuters.com/markets/commodities/world-solar-generation-set-eclipse-nuclear-first-time-maguire-2025-05-21/?utm_source=chatgpt.com">factor of five</a> to <a href="https://www.rand.org/pubs/research_reports/RRA3845-1.html">40 GW</a> over the next five years. 
<p>Importantly, the core bottlenecks are things like installation costs, permitting costs, and connecting solar power to the grid. It's much less about manufacturing: the US can basically manufacture or import as many solar panels and batteries as it could realistically want. For instance, US solar manufacturing capacity had reached <a href="https://seia.org/research-resources/solar-market-insight-report-q3-2025/">55 GW</a> by the second quarter of 2025, more than enough to supply installations for the full year. So if you could bypass these bottlenecks, you could unlock a lot of power.</p>

<p>This is exactly what was <a href="http://offgridai.us">proposed</a> in a report late last year: rather than waiting to connect renewables to the grid, simply move off-grid. The idea is to build many "microgrids", each powered by solar plus a bit of natural gas backup, and each individually powering a data center of roughly 100 MW to 1 GW. According to the authors, this could unlock <em>over 1,000 GW</em> of potential power for data centers in the US Southwest alone. And they think it could be built out fast:</p>

<blockquote><p>"Estimated time to operation for a large off-grid solar microgrid could be around 2 years (1-2 years for site acquisition and permitting plus 1-2 years for site buildout), though there's no obvious reason why this couldn't be done faster by very motivated and competent builders."</p></blockquote>

<p>That said, this approach is still pretty speculative, and it has a bunch of hiccups. For one, you need a lot of land: 2 GW of solar power needs as much space as all of Manhattan!<sup>7</sup> Moreover, while you might want your data center to run 24/7, you can't generate solar power at night. If you're connected to the grid this isn't too big a deal, because you can draw on spare grid capacity from other power sources, but going off-grid makes it more challenging. These reasons are probably why (to our knowledge) there are not yet plans for gigawatt-scale data centers that are mainly solar-powered.</p>

<p>But as with natural gas, these obstacles don't seem insurmountable. For one, a lot of US land gets a ton of sun: enough for over ten thousand Manhattan-sized solar arrays.<sup>8</sup> Moreover, batteries can substantially support off-grid approaches. US battery capacity grew by roughly <a href="https://www.eia.gov/todayinenergy/detail.php?id=64705">20 GW</a> from 2024 to 2025, and if that pace is sustained we'd reach 100 GW by 2030.<sup>9</sup> In many ways this is too conservative: batteries could also be imported, or repurposed from electric vehicles, which currently make up <a href="https://www.iea.org/reports/global-ev-outlook-2025/electric-vehicle-batteries">the vast majority of battery demand</a>. In fact, there's already some precedent for this: companies like Redwood Materials and Crusoe were <a href="https://www.crusoe.ai/resources/newsroom/crusoe-and-redwood-materials-power-the-future-of-ai">exploring</a> the approach earlier this year.</p>
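<p>To get a feel for the scale of such a microgrid, here is a rough sizing sketch for a 1 GW off-grid data center. The 30 square kilometers of land per GW of panels comes from footnote 7; the 25% solar capacity factor and the 12 hours of battery coverage are our own illustrative assumptions, not numbers from the report.</p>

<pre><code># Rough sizing of a solar-plus-battery microgrid for a 1 GW data center.
# Assumptions (illustrative): 25% solar capacity factor, ~12 dark hours
# bridged by batteries, 30 km^2 of land per GW of panels (footnote 7).
load_gw = 1.0
capacity_factor = 0.25
land_km2_per_gw = 30
battery_hours = 12

panels_gw = load_gw / capacity_factor   # panel capacity to meet average load
land_km2 = panels_gw * land_km2_per_gw
battery_gwh = load_gw * battery_hours

print(f"panels: ~{panels_gw:.0f} GW, land: ~{land_km2:.0f} km^2 "
      f"(~{land_km2 / 60:.0f} Manhattans), storage: ~{battery_gwh:.0f} GWh")
</code></pre>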
<h3>Demand response: Acquiring power "out of thin air"</h3>

<p>So far we've looked at two ways of building power, but there's one more wildcard that could make a huge difference. It's what Dean Ball cheekily <a href="https://www.hyperdimensional.co/p/out-of-thin-air">describes</a> as pulling "gigawatts out of thin air"; the more technical term is "demand response" (yet another name is "data center flexibility").</p>

<p>As crazy as it might sound, the US electricity grid is oversupplied most of the time. That's because power infrastructure is built to handle <em>peak</em> demand, like when everyone turns up their air conditioning during a heatwave.<sup>10</sup> Most of the time, though, there's abundant spare power in the grid, enough to run (say) a 1 GW data center.<sup>11</sup></p>

<p>To make this work, data centers would need to reduce their power use during peak demand. This can be tricky: for one, data centers would need to credibly commit to decreasing their power draw when necessary. It probably also involves deliberately curtailing AI workloads, which raises technical challenges, though not insurmountable ones. Even today, large training runs have to contend with <a href="https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling">spontaneous AI chip failures</a> that force the entire process to stop, and data center operators have managed these quite successfully.</p>

<p>If demand response actually works, it could unlock a lot of power. One <a href="https://nicholasinstitute.duke.edu/articles/significant-flexible-load-potential-us-grid-finds-duke-study">study</a> finds that if data centers could cut their power demand by just 0.25% across the year, the grid could accommodate around 76 GW of new data center capacity.<sup>12</sup> That's <a href="https://www.eia.gov/todayinenergy/detail.php?id=65104">75%</a> of the power of the entire US nuclear fleet, and it gets us a lot of the way to meeting AI's power demand. If you're willing to curtail even more, you get substantially more power: a 1.0% cut across the year yields 126 GW.</p>
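<p>The headline numbers are easier to parse as energy rather than time. Here is the footnote-12 arithmetic spelled out; the 76 GW and 85-hour figures are the study's, and the rest is unit conversion.</p>

<pre><code># Illustrative: a 0.25% annual energy curtailment is equivalent to ~22
# hours per year at full power, spread across ~85 hours of partial
# reductions (footnote 12), in exchange for ~76 GW of extra headroom.
hours_per_year = 8760
curtailment_share = 0.0025
unlocked_gw = 76

full_power_equiv_hours = curtailment_share * hours_per_year
print(f"~{full_power_equiv_hours:.0f} full-power-equivalent hours per year,")
print(f"spread over ~85 hours, unlocking ~{unlocked_gw} GW of capacity")
</code></pre>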
<p>This almost sounds too good to be true, but it's already gaining traction. Google has <a href="https://blog.google/inside-google/infrastructure/how-were-making-data-centers-more-flexible-to-benefit-power-grids/">agreed</a> to implement demand response in Indiana and Tennessee. The Electric Power Research Institute launched a <a href="https://dcflex.sf.epri.com/">project</a> last year to demonstrate it in real markets. And <a href="https://www.emeraldai.co/">Emerald AI</a> has been working with companies like Oracle and Nvidia to implement it in practice.</p>

<p>This doesn't mean it'll all be smooth sailing. The estimates in the study above might be too optimistic.<sup>13</sup> And a demand response <a href="https://www.pjm.com/-/media/DotCom/committees-groups/cifp-lla/2025/20250915/20250915-item-07---pjm-initial-proposal-and-alternatives-considered---pjm-presentation.pdf">proposal</a> by <a href="https://www.pjm.com/about-pjm">PJM</a> (the largest grid operator in the US) has met with strong <a href="https://www.pjm.com/-/media/DotCom/committees-groups/cifp-lla/postings/20250828-stakeholder-comments-cifp-lla.pdf">pushback</a> from big tech and data centers. But the pushback is mostly about that <em>specific</em> proposal,<sup>14</sup> and given that multiple actors have their own <a href="https://www.ercot.com/mktrules/issues/PGRR134#keydocs">proposals</a>, it doesn't seem like a big strike against demand response in general.</p>

<h3>Adding up the numbers</h3>

<p>Now let's put all the numbers together. Our best projections suggest that the US will need around 100 GW of power for AI by 2030. Looking at individual power sources, there's a lot of potential to meet this demand. Demand response unlocks tens of gigawatts, plausibly over 100 GW. Solar and natural gas <em>individually</em> contribute many tens of gigawatts more under existing projections, and there's likely substantial room to grow by going off-grid or simply scaling up equipment spending.</p>

<p>These different approaches can also complement each other. For example, traditional buildouts of solar power can take five to seven years to connect to the grid, which is too slow. But if data centers procure their own solar generation capacity and combine it with demand response, they could <a href="https://cdn.prod.website-files.com/60dbdcca2e4b1919e8894fa5/6930abf1be0f36db6fc27157_Whitepaper%20-%20With%20Appendix.pdf">cut that wait down to two years</a>.</p>

<p>And let's not forget that there are other power sources too. For example, next-generation geothermal power seems close to medium-scale commercial deployment: one pilot project is already <a href="https://www.worldoil.com/magazine/2025/november-2025/features/breakthroughs-in-geothermal-drilling-eavor-s-foak-closed-loop-commercial-project-in-germany/">online</a>, and several <a href="https://www.datacenterdynamics.com/en/news/geothermal-developer-fervo-raises-432m-with-backing-from-google/">startups</a> backed by AI "hyperscalers" promise hundreds of megawatts by 2028. By 2035, these buildouts could plausibly add <a href="https://www.mckinsey.com/industries/electric-power-and-natural-gas/our-insights/is-geothermal-energy-ready-to-make-its-mark-in-the-us-power-mix">40 GW</a> of geothermal power.</p>

<p>So if we combine all these approaches, it looks quite likely that the US will have enough power for AI scaling through 2030.</p>

<p>Strictly speaking, this isn't quite the whole story, because we don't just care about <em>total</em> power: to train AI systems, power often needs to be concentrated in a single location. And if current trends persist, that might mean truly massive training runs drawing <a href="https://epoch.ai/blog/power-demands-of-frontier-ai-training">4 to 16 GW</a> by 2030. However, this is a big if, because a lot of model development is now <a href="https://newsletter.semianalysis.com/p/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data">less focused</a> on performing a single large training run in one location.</p>
<p>But we doubt even this will be that much of a blocker. Frontier AI companies seem to be doing a pretty good job of quickly finding locations for <a href="https://epoch.ai/data/data-centers">gigawatt-scale data centers</a>. Consider how OpenAI and Oracle managed this for <a href="https://www.reuters.com/business/media-telecom/openai-oracle-softbank-plan-five-new-ai-data-centers-500-billion-stargate-2025-09-23/">multiple sites</a> within roughly a year, as part of their <a href="https://openai.com/index/announcing-the-stargate-project/">Stargate project</a>. And even if no single location fits the bill, companies could split <a href="https://epoch.ai/blog/could-decentralized-training-solve-ais-power-problem">training across multiple locations</a> to circumvent local power constraints.</p>

<h2>So is everyone wrong about this?</h2>

<p>Taking a step back, what we're saying is perhaps rather outrageous. Are we <em>really</em> saying that everyone is wrong? Do we <em>really</em> think we know better than all the frontier AI companies, all the AI policy researchers, and all the utility company CEOs? If these approaches are so good, why isn't everybody implementing them?</p>

<p>The thing is, they <em>are</em> implementing them! As we saw earlier, new frontier AI data centers <a href="https://www.latitudemedia.com/news/behind-the-meter-generation-is-picking-up-traction/">often use natural gas</a>, and companies like Google are now <a href="https://blog.google/inside-google/infrastructure/how-were-making-data-centers-more-flexible-to-benefit-power-grids/">implementing</a> demand response.</p>

<p>And to the extent that these approaches aren't yet being put into practice, remember that a lot of this is very new. The US saw stagnant power demand for a few decades, followed by a huge AI-shaped spike roughly since ChatGPT was released three years ago. Three years is ancient history by AI standards, but a blink of an eye for the power industry. So we shouldn't be surprised that many people have been thinking about power through an outdated lens, tailored to a world without several gigawatt-scale megaprojects spun up each year.</p>

<p>The pressures of AI competition have transformed the picture. Consider electrical transformers and transmission lines: connecting a new facility to the grid traditionally requires building these out, a process which can take years. But in an industry where multi-year waits are a death sentence, companies are increasingly turning to off-grid generation and to demand response programs that sidestep connecting new power to the grid.<sup>15</sup></p>
<p>Why, then, are <a href="https://openai.com/global-affairs/seizing-the-ai-opportunity/">so</a> <a href="https://ifp.org/future-of-ai-compute/">many</a> <a href="https://www.anthropic.com/news/build-ai-in-america">actors</a> <a href="https://www.energy.gov/sites/default/files/2025-10/403%20Large%20Loads%20Letter.pdf">complaining</a> about the size of the power bottleneck? We think the obvious reason is true: there are indeed challenges to accessing power for AI scaling! Overcoming them often means using out-of-the-box techniques for acquiring power, and it may end up costing more. So there are <a href="https://x.com/ShanuMathew93/status/1991686457438712037?s=20">incentives</a> to try to push the price of power down as much as possible.</p>

<p>The upshot: we doubt that these challenges alone are strong enough to impede AI scaling. Power bottlenecks will likely add months, but not years, to build-out times. Even if power prices tripled, they would still be substantially lower than the cost of AI compute: in current frontier data centers, the cost of AI chips is roughly <a href="https://docs.google.com/spreadsheets/d/1_hu9Nczi4ob4tyRsPH78e3fFw-xWMzgTTv8ZF76SHwM/edit?gid=0#gid=0">ten times higher</a> than that of power! If people are willing to build out huge data centers with loads of chips, then they're also willing to pay a premium to build out power.</p>
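<p>To see why even a tripling of power prices wouldn't change the calculus much, consider this sketch. The ten-to-one chip-to-power cost ratio is from the linked estimate; the rest is arithmetic.</p>

<pre><code># Illustrative: if AI chips cost ~10x as much as power in a frontier data
# center, even tripling the power price raises total costs only modestly.
chip_cost = 10.0   # arbitrary units
power_cost = 1.0

baseline = chip_cost + power_cost
with_tripled_power = chip_cost + 3 * power_cost
print(f"total cost increase: {with_tripled_power / baseline - 1:.0%}")  # ~18%
</code></pre>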
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/792ae2e0-5519-4b6d-966c-d70e1b927f5d_1024x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qep1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F792ae2e0-5519-4b6d-966c-d70e1b927f5d_1024x1280.png 424w, https://substackcdn.com/image/fetch/$s_!Qep1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F792ae2e0-5519-4b6d-966c-d70e1b927f5d_1024x1280.png 848w, https://substackcdn.com/image/fetch/$s_!Qep1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F792ae2e0-5519-4b6d-966c-d70e1b927f5d_1024x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!Qep1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F792ae2e0-5519-4b6d-966c-d70e1b927f5d_1024x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The question that matters the most, then, is whether people will be willing to spend on AI scaling. 
<p>This finally brings us back to the original argument: does China's power advantage mean that it "wins by default" (whatever that means)? Not necessarily. China can indeed build power faster than the US, but America's power bottleneck is much weaker than many people make it out to be. Moreover, the whole reason electrical power matters for AI is that it runs AI chips. And as things currently stand, China still has to overcome serious <a href="https://epoch.ai/gradient-updates/why-china-isnt-about-to-leap-ahead-of-the-west-on-compute">deficits in the quantity and quality of its AI chips</a> to gain a compute advantage over the US, and that would be true even with <em>unlimited</em> power capacity.</p>

<p>The upshot is this: power does pose a real challenge, but it matters a lot how big this challenge is. At the very least, we doubt that power bottlenecks will be strong enough to substantially slow down US AI scaling. For that, we're not holding our breath.</p>
<p><em>We'd like to thank Lynette Bye, Yann Riviere and Konstantin Pilz for their feedback and support on this post.</em></p>

<hr>

<p>1. The same is true for <a href="https://www.eia.gov/electricity/data/browser/#/topic/5?agg=0,1&amp;geo=g&amp;endsec=vg&amp;linechart=ELEC.SALES.US-ALL.A~~~~~&amp;columnchart=ELEC.SALES.US-ALL.A~ELEC.SALES.US-RES.A~ELEC.SALES.US-COM.A~ELEC.SALES.US-IND.A&amp;map=ELEC.SALES.US-ALL.A&amp;freq=A&amp;start=2003&amp;end=2023&amp;chartindexed=0&amp;ctype=columnchart&amp;ltype=pin&amp;columnvalues=0&amp;rtype=s&amp;pin=&amp;rse=0&amp;maptype=0">retail sales of electricity</a>.</p>

<p>2. This isn't a strict upper bound; for example, <a href="https://www.rand.org/pubs/research_reports/RRA3572-1.html">some estimates</a> are substantially higher, in the ballpark of 250 GW (and if building data centers costs <a href="https://epoch.ai/blog/what-you-need-to-know-about-ai-data-centers">$30 billion per GW</a>, that amounts to $7.5 trillion!). We're more skeptical that we'll reach these scales, but it's possible.</p>

<p>3. 100 GW would be around 10% of America's peak power capacity. Spread out over four to five years, that works out to a growth rate of 2% to 3% per year, which we haven't seen <a href="https://www.eia.gov/totalenergy/data/annual/txt/ptb0809.html">since roughly the 1980s</a>.</p>
<p>4. For example, US consumption contributed around <a href="https://www.gevernova.com/sites/default/files/gevernova_2024_form10-k.pdf">40%</a> of revenue for GE Vernova, about <a href="https://p3.aprimocdn.net/siemensenergy/6974f976-432a-4ed9-ba61-b24301167dc8/A-Siemens-Energy-Annual-Report-2024-pdf_Original%20file.pdf">20%</a> for Siemens, and roughly <a href="https://www.mhi.com/finance/library/annual/pdf/report_2024_financial.pdf">20%</a> for Mitsubishi Heavy.</p>

<p>5. This is captured more formally by the <a href="https://stanwichenergy.com/insights/understanding-effective-load-carrying-capability-elcc-how-renewable-reliability-impacts-costs-for-energy-users">Effective Load Carrying Capability (ELCC)</a>, which is around 80% for natural gas.</p>

<p>6. Good data is somewhat sparse, but there is suggestive evidence. For example, a <a href="https://www.eia.gov/analysis/studies/powerplants/capitalcost/pdf/capital_cost_AEO2025.pdf">report</a> from the Energy Information Administration shows a handful of examples where simple-cycle aeroderivative turbines cost over 50% more than their combined-cycle counterparts. Similarly, smaller simple-cycle gas turbines can <a href="https://www.gevernova.com/content/dam/gepower-new/global/en_US/downloads/gas-new-site/resources/GEA31885B-TM2500DLE-brochure.pdf">take</a> 18 to 24 months from order to operation, while larger combined-cycle gas turbines take 36 to 48 months.</p>

<p>7. Each GW of solar power needs around <a href="https://docs.nrel.gov/docs/fy13osti/56290.pdf#page=1.00&amp;gsr=0">30 square kilometers</a> of land, so for 2 GW you'd need around 60 square km, roughly the <a href="https://en.wikipedia.org/wiki/Manhattan">land area of Manhattan</a>.</p>

<p>8. There's around <a href="https://atb.nrel.gov/electricity/2024/2023/utility-scale_pv">1.5 million square km</a> of US land that receives over 5 kWh/m<sup>2</sup>/day, roughly the equivalent of 5 hours a day of full midday sun; for comparison, Los Angeles averages <a href="https://www.solarenergylocal.com/states/california/los-angeles/">5.3 kWh/m<sup>2</sup>/day</a>. Given that Manhattan's land area is around <a href="https://en.wikipedia.org/wiki/Manhattan">60 square km</a>, that's enough for 25,000 Manhattans.</p>
<p>9. This is consistent with <a href="https://publications.anl.gov/anlpubs/2024/03/187735.pdf">projections</a> from the US Department of Energy's Argonne National Laboratory, which expects US battery cell capacity to grow by around 930 GWh through 2030. If these are 10-hour systems, that would be enough to support around 90 GW of power.</p>

<p>10. A better example comes from the UK. During commercial breaks, many TV-watchers simultaneously boil water for another cup of tea or coffee, leading to a predictable <a href="https://en.wikipedia.org/wiki/TV_pickup">power spike</a> of around 200 to 400 MW.</p>

<p>11. This doesn't necessarily preclude training runs that need much more than 1 GW of power. For example, it's likely possible to train AI models across multiple data centers in a <a href="https://epoch.ai/blog/could-decentralized-training-solve-ais-power-problem">decentralized</a> fashion, even if each data center draws less than a gigawatt.</p>

<p>12. In practice, this doesn't quite mean cutting power 0.25% of the time (i.e., 22 hours per year): the reductions are usually partial rather than a full cut of the data center's load. Instead, the authors estimate the curtailment would be spread over 85 hours, still just under 1% of the total hours in a year.</p>

<p>13. In general, we worry that it's <a href="https://nicholasinstitute.duke.edu/sites/default/files/documents/rethinking-load-growth-webinar-presentation.pdf">hard</a> <a href="https://cdn.prod.website-files.com/60dbdcca2e4b1919e8894fa5/6930abf1be0f36db6fc27157_Whitepaper%20-%20With%20Appendix.pdf#page=44.08">to know</a> what happens when you apply something as novel as demand response at close to 100 GW scale, whereas building out that much power has far more historical precedent.</p>

<p>14. PJM proposed that data centers could avoid capacity charges by accepting priority curtailment during emergencies; if not enough volunteered, PJM would mandate participation.
Data centers objected that this singled them out for inferior service while existing customers faced no equivalent obligation, and utilities argued that PJM lacked the authority to create what amounts to a new class of retail service.</p>

<p>15. In the longer run, more substantial buildouts should be possible; for example, there's historical precedent for a much larger supply of <a href="https://www.energy.gov/sites/prod/files/2014/04/f15/LPTStudyUpdate-040914.pdf">transformers</a> and for greater investment in <a href="https://cleanenergygrid.org/wp-content/uploads/2024/07/GS_ACEG-Fewer-New-Miles-Report-July-2024.pdf">transmission lines</a>.</p>

<p>16. This in turn depends on the <a href="https://epoch.ai/gradient-updates/openai-is-projecting-unprecedented-revenue-growth">economic returns</a> to further compute scaling; if these are high enough, they would help justify substantial investments.</p>]]></content:encoded></item><item><title><![CDATA[Benchmark Scores = General Capability + Claudiness]]></title><description><![CDATA[Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?]]></description><link>https://epochai.substack.com/p/benchmark-scores-general-capability</link><guid isPermaLink="false">https://epochai.substack.com/p/benchmark-scores-general-capability</guid><dc:creator><![CDATA[Greg Burnham]]></dc:creator><pubDate>Thu, 20 Nov 2025 21:09:33 GMT</pubDate><content:encoded><![CDATA[<p>The Gemini 3 release included a <a href="https://storage.googleapis.com/gweb-uniblog-publish-prod/original_images/gemini_3_table_final_HLE_Tools_on.gif">massive table</a> showing that the model was state-of-the-art on nineteen diverse benchmarks. Such tables are commonplace by now, but they add up to an odd statistical situation: benchmarks ostensibly measure different things, yet since models tend to improve on many benchmarks at once, the dataset of benchmark scores is dominated by a single "General Capability" dimension.</p>

<p>In this post, I'll describe the statistics of this dataset, look into what's left when you factor out this dominant dimension (hint: it's "Claudiness"), and discuss how this relates to a key question about cross-task generalization.</p>

<h2>Benchmarking data is dominated by a single underlying dimension</h2>

<p>This is one of the lessons of our recent work on the <a href="https://epoch.ai/benchmarks/eci">Epoch Capabilities Index</a> (ECI), which combines thirty-nine benchmarks into a single capabilities score. If benchmarks were generally uncorrelated with each other, you'd expect to see large residuals: the benchmark scores predicted by a model's ECI number wouldn't match the model's actual benchmark scores. As it turns out, we see a very good match.
In other words, our nominally high-dimensional dataset is well approximated by just a single dimension.</p>
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To look beyond this dimension, we can do a <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> (PCA). This basically asks: if we make synthetic &#8220;components&#8221; by taking weighted sums of the different benchmark scores, what&#8217;s the most variance in the dataset we can account for with the fewest number of components?</p><p>When we do this on the raw data underlying ECI, the first component captures about half the variance of the dataset.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> The table below shows the weights on the different benchmarks in this component, accounting for 80% of the total weight. Note that the weights are all positive and not very dispersed. 
<p>When we do this on the raw data underlying ECI, the first component captures about half the variance of the dataset.<sup>1,2</sup> The table below shows this component's weights on the different benchmarks, covering 80% of the total weight. Note that the weights are all positive and not very dispersed. That is, PCA also finds a single "general capability" component.</p>

<figure><figcaption>[Table: first principal component's weights by benchmark]</figcaption></figure>
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Moving beyond the first principal component, the chart below shows the magnitudes of all the principal components, plotted against the size of components found in randomly generated data of the same shape (i.e., a <a href="https://en.wikipedia.org/wiki/Parallel_analysis">parallel analysis</a>). We see the single large component mentioned above, a second component that is borderline significant, and the rest having small sizes, consistent with noise.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I2V9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I2V9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 424w, https://substackcdn.com/image/fetch/$s_!I2V9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 848w, https://substackcdn.com/image/fetch/$s_!I2V9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 1272w, https://substackcdn.com/image/fetch/$s_!I2V9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I2V9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png" width="1023" height="1279" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1279,&quot;width&quot;:1023,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/179472410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I2V9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 424w, https://substackcdn.com/image/fetch/$s_!I2V9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 848w, https://substackcdn.com/image/fetch/$s_!I2V9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 1272w, https://substackcdn.com/image/fetch/$s_!I2V9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a0135cf-34f6-4ab9-9a6d-a54d906d4805_1023x1279.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Benchmarking data shows a smaller &#8220;Claudiness&#8221; dimension</h2><p>What is this second component? Here are the top weights, by absolute value, that this component assigns to the benchmarks, again accounting for over 80% of the total weight. By construction, this component is orthogonal to the main &#8220;general capability&#8221; component. 
When I first saw this, I said it looked something like, &#8220;good at agentic tasks, but bad at vision&#8230; and also bad at math?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FokV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FokV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 424w, https://substackcdn.com/image/fetch/$s_!FokV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 848w, https://substackcdn.com/image/fetch/$s_!FokV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 1272w, https://substackcdn.com/image/fetch/$s_!FokV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FokV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png" width="1450" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1450,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80424,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/179472410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FokV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 424w, https://substackcdn.com/image/fetch/$s_!FokV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 848w, https://substackcdn.com/image/fetch/$s_!FokV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 1272w, https://substackcdn.com/image/fetch/$s_!FokV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d17195f-eff6-461f-b73f-4382efea0117_1450x660.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But I showed it to a colleague and he just said, &#8220;it&#8217;s Claude&#8221;. He was right. Here are the top five models on this component, as well as the bottom five.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2wdx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2wdx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 424w, https://substackcdn.com/image/fetch/$s_!2wdx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 848w, https://substackcdn.com/image/fetch/$s_!2wdx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 1272w, https://substackcdn.com/image/fetch/$s_!2wdx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2wdx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png" width="1456" height="467" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:467,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76528,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/179472410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2wdx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 424w, https://substackcdn.com/image/fetch/$s_!2wdx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 848w, https://substackcdn.com/image/fetch/$s_!2wdx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 1272w, https://substackcdn.com/image/fetch/$s_!2wdx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f2d973-3675-4bc1-bf80-922134159cfa_1514x486.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I think this second component shows that benchmarks aren&#8217;t <em>entirely</em> &#8220;general capability&#8221; plus &#8220;noise&#8221;, even if that is a pretty good approximation. Even though this component isn&#8217;t so statistically significant, I think it&#8217;s fair to say that it aligns with the general public sense of Anthropic&#8217;s priorities, i.e. 
<h2>Is the &#8220;general capability&#8221; dimension deep, or contingent?</h2><p>The big question here is <em>why</em> a single dimension captures so much of the variance in benchmark scores. I can think of two possible reasons, corresponding to two possible worlds we may be in. I&#8217;ll call these worlds &#8220;deep&#8221; and &#8220;contingent&#8221;.</p><p>In the &#8220;deep&#8221; world, there is a single underlying ability that governs how well models do at superficially unrelated tasks. In this world, the only thing a model developer can do is make this ability go up. If they succeed, their model gets better at everything.</p><p>In the &#8220;contingent&#8221; world, there are many orthogonal abilities that models can have. These are orthogonal in the sense that model developers have to do completely unrelated work to get a model to improve on each ability. Still, in the world I&#8217;m imagining, customers demand models with many capabilities, and so developers put in the work to make this happen.</p><p>Which world more resembles our own? Sometimes in the history of AI, things have looked like the contingent world. AlphaGo was superhuman at Go, but it was nonsense to ask it to do anything else. At other times, things have looked like the deep world. When LLMs were picking up steam, next-token prediction on relatively uncurated web text was tearing through NLP tasks that had previously been dominated by specialized models.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>To a first approximation, benchmark scores look the same in both worlds. But the existence of the Claudiness dimension feels to me like a bit of evidence for the &#8220;contingent&#8221; world. Anthropic has focused on making models that are state-of-the-art at agentic coding. Without additional focused investment, the models turn out not to be exceptional at advanced math. There is surely some generalization across tasks, but perhaps this is a sign of its limits.</p><h2>A trillion dollar question</h2><p>The Claudiness dimension is not very strong evidence for the contingent view. Stronger evidence might be that model developers are investing heavily in collecting specialized data, like reinforcement learning (RL) environments for industry verticals. Even so, it&#8217;s possible that they&#8217;re doing that <em>and</em> that RL shows excellent cross-task generalization.</p><p>One way to test this would be to find an uncontaminated benchmark that measures something unusual, and see if it correlates with the &#8220;general capability&#8221; dimension. Unfortunately, we don&#8217;t know what counts as &#8220;unusual&#8221; for models because we don&#8217;t know what they saw in training. Also, I suspect there&#8217;s a selection effect where benchmarks that show top models scoring poorly tend to capture attention. Still, this seems worth pursuing.</p><p>Even if the explanation for what we see in benchmark data is that model developers are pursuing an &#8220;everything at once&#8221; strategy, they have the resources and the scalable architectures necessary to keep it going. In other words, they can keep making all the benchmarks go up so long as they can get the right training data plus enough compute to make use of it.</p><p>What does this mean for the future? 
I like how <a href="https://x.com/snewmanpv/status/1990193674161189009">Steve Newman put it recently</a>: <em>&#8220;how far can you get by simply putting an insane number of things in distribution&#8221; is one of the trillion dollar questions.</em></p><p>I doubt that there are in-principle limits to putting everything in distribution, but if we&#8217;re more in the contingent world then we shouldn&#8217;t expect much of a tailwind from generalization either. Every percentage point of improvement on every benchmark must be paid for. So I think we should expect capabilities to continue improving quite generally, but only so long as the flywheel of growth and investment keeps letting developers devote resources to actively making this happen.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Methodology: we filter our dataset to benchmarks created in 2023 and beyond, and models with at least 8 benchmark scores. We combine different reasoning settings for the same model, taking the max scores. We use k-nearest neighbors to impute missing data, transforming [0-1] scores by a logit first, weighting by distance, and using 5 neighbors. We then do PCA. Data and code can be found <a href="https://colab.research.google.com/drive/1sV1TmAxclpb6rwrRSQf-g-vGyP6uE6ml#scrollTo=_ZS8B7Ba5d4K">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This main finding accords with <a href="https://arxiv.org/abs/2405.10938">previous work</a>, although we now have a much larger dataset of benchmarks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Even now there are some prominent specialized models, like Cursor&#8217;s Tab or OpenAI&#8217;s Codex series. But it seems fair to characterize the current landscape as dominated by models that at least try to &#8220;do it all&#8221;.</p></div></div>]]></content:encoded></item><item><title><![CDATA[The software intelligence explosion debate needs experiments]]></title><description><![CDATA[The existing debate rests on data and assumptions that are shakier than most people realize. 
To make progress, we need better evidence, and experiments are the best way to get it on the margin.]]></description><link>https://epochai.substack.com/p/the-software-intelligence-explosion</link><guid isPermaLink="false">https://epochai.substack.com/p/the-software-intelligence-explosion</guid><dc:creator><![CDATA[Anson Ho]]></dc:creator><pubDate>Sat, 15 Nov 2025 01:07:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SbjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f248a1d-1d9f-4bab-923a-4b7e4b0ccc7e_1024x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments">Epoch AI</a>.</em></p><div><hr></div><p>Suppose you had a million AIs, each surpassing humanity&#8217;s best AI researchers. If they all worked on advancing AI, how much would AI progress accelerate?</p><p>This might sound like science fiction, but it may be the most consequential question about the future of AI. The problem is that the experts disagree wildly on the answer.</p><p><a href="https://ai-2027.com/research/takeoff-forecast">Some</a> <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion">foresee</a> <a href="https://intelligence.org/files/IEM.pdf">a</a> positive feedback loop. These AIs are smart enough to find new algorithms to make smarter AIs, which make <em>even</em> smarter AIs, and so on. Very soon, we could see <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">multiple years of AI progress</a> compressed into a single year just through software advances &#8212; a &#8220;software intelligence explosion&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p><a href="https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d">Others</a> <a href="https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines">agree</a> that AI progress would speed up, but think that something will block the explosive feedback loop. For example, increasing difficulty in finding new algorithms might bottleneck AI self-improvement, or software improvements might depend heavily on physical resources like compute, which can&#8217;t be scaled as easily.</p><p>And we really need to know who&#8217;s right. If a software intelligence explosion is likely, economic and military power might become much <a href="https://situational-awareness.ai/from-agi-to-superintelligence/">more</a> <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">concentrated</a> in a few companies or nations. 
They might also focus more on automating AI research rather than driving <a href="https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d">broad economic impacts</a> &#8212; so AI progress could rapidly accelerate before most of the world knows what&#8217;s happening.</p><p>So why do the experts disagree so much, and how can we figure out who&#8217;s right?</p><h2>Flawed data</h2><p>The core reason for this disagreement is a lack of strong evidence &#8212; thus far, empirical work on the software intelligence explosion has used flawed data and models.</p><p>Prior work has used the &#8220;<a href="https://en.wikipedia.org/wiki/Jones_model">Jones model</a>&#8221;, which tells us how R&amp;D inputs translate into outputs. For example, the R&amp;D input might be the number of AI researchers, and the R&amp;D output could measure how efficiently an AI algorithm uses training compute (&#8220;software efficiency&#8221;).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>The core parameter in this model is the &#8220;returns to R&amp;D&#8221; &#8212; in our example, this tells us how software efficiency changes when we double the number of researchers. If it also doubles, the returns are just 1. If it quadruples, then the returns are 2. In general, if doubling the inputs increases the output by 2<sup>r</sup>, then the returns are <em>r</em>.</p><p>This parameter dictates the long-run dynamics of the software intelligence explosion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> We especially care if we have increasing returns (i.e.<em> r &gt; 1</em>) &#8212; this is when you get more than what you put in, resulting in the positive feedback loop where smarter AIs recursively build smarter AIs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>So far, so good. But the issues show up when we apply this in practice. 
Consider our estimates of these returns in several domains of AI software:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!SbjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f248a1d-1d9f-4bab-923a-4b7e4b0ccc7e_1024x1280.png" width="1024" height="1280" alt=""><figcaption class="image-caption"><em>Estimates of the returns to software R&amp;D for three different domains of AI: (1) computer vision, (2) reinforcement learning, and (3) language models. Technical details are provided in the appendix.</em></figcaption></figure></div><p>Here we see a lot of probability on compounding returns. But the uncertainty is very large, and our range of estimates straddles the key threshold of <em>r = 1</em>. So at face value, the software intelligence explosion seems quite plausible &#8212; at the very least, it&#8217;s hard to rule out.</p><p>But if we dig into the details, we see a number of big problems. To estimate <em>r</em>, we need to collect data on R&amp;D inputs and outputs, but our measures of these are very imperfect.</p><p>For example, to estimate the returns in language models, we first looked at the number of published papers in natural language processing. This proxies for the number of researchers, accounting for how some researchers might be more productive than others (and hence publish more papers).</p><p>But this ignores all the R&amp;D effort that spills over from other domains of AI. For example, language models built on the <a href="https://arxiv.org/abs/1706.03762">Transformer</a> use &#8220;residual connections&#8221;, but this innovation was heavily inspired by work in <a href="https://arxiv.org/abs/1512.03385">computer vision</a>.</p><p>So we also considered a second measure of R&amp;D inputs, which is a function of human labor and AI compute at AI labs. Unfortunately, this is flawed for a similar reason &#8212; it ignores the R&amp;D effort that spills over from researchers outside of a collection of AI companies.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> For instance, OpenAI&#8217;s work is often <a href="https://www.latent.space/p/noam-brown#:~:text=Noam%20%5B01%3A04,inspiration%20for%20us">inspired</a> by academic research.</p><p>So these &#8220;spillover effects&#8221; cause a mismatch between R&amp;D inputs and what we <em>measure</em> them to be. 
But it&#8217;s not clear how best to adjust for this &#8212; we could include papers from computer vision and other domains of AI, but many of these papers will be unrelated to language model improvements! We could also try to account for academic R&amp;D inputs, but we don&#8217;t know how much academic research compute there was over time. In general, we don&#8217;t have good data on the inputs and need to rely on proxies and best guesses.</p><p>What does this mean for our estimates of the returns to R&amp;D? Possibly this means that the &#8220;true&#8221; returns are higher than what we measure. In particular, both the number of researchers and the supply of compute probably grow more slowly in academia than at AI labs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> So if we broaden our metric to include academia, we&#8217;d measure slower R&amp;D input growth. This means fewer input doublings for the same number of output doublings, so the estimated returns to R&amp;D become higher.</p><p>There are similar issues with the R&amp;D outputs. Ideally we&#8217;d like to have a fine-grained measure of software quality over time, but unfortunately we only have an average growth rate. This forces us to make strong assumptions to get estimates &#8212; for instance, we assume that software progress has proceeded at a constant rate. This is plausibly <a href="https://www.lesswrong.com/posts/ssNSCaug5p3xnNDSd/how-fast-is-algorithmic-progress-in-ai-inference">consistent with</a> <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">existing empirical evidence</a>, but we think the evidence on this is quite weak. In fact, the <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models">rise of reasoning models</a> might&#8217;ve marked an acceleration in algorithmic progress.</p><p>In general, there are a <a href="https://arxiv.org/abs/2405.10494">whole host of data-related issues</a> that we need to contend with when estimating these returns, which might bias our estimates upward or downward.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><h2>Flawed models</h2><p>Even without data issues, we might still end up with misleading estimates. The problem is that we&#8217;re taking the Jones model way beyond what has been empirically observed.</p><p>The model was originally written for the whole economy, where the main input, researchers, and the main output, productivity, are growing at a few percent a year. But we&#8217;re trying to apply it to a situation where the inputs are growing by orders of magnitude in a few years &#8212; i.e. the software intelligence explosion, where you get many more AI workers that become increasingly intelligent. So we shouldn&#8217;t be too surprised if the Jones law of motion is pretty badly misspecified for thinking about the software intelligence explosion.</p><p>One issue is that finding software improvements might require running <a href="https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d">experiments close to the size of frontier training runs</a>. 
This can make a huge difference &#8212; for instance, in <a href="https://arxiv.org/abs/2507.23181">previous empirical work</a>, such compute requirements could prevent a software intelligence explosion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> But some <a href="https://epoch.ai/blog/do-the-returns-to-software-rnd-point-towards-a-singularity">previous work</a> neglects this dynamic.</p><p>Another issue is that the Jones model <a href="https://epoch.ai/epoch-after-hours/economics-of-ai">doesn&#8217;t include a hard limit on how much research can be parallelized</a> at a given time.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> So if you scale up the number of AI researchers really fast, the model predicts that software progress should go to infinity essentially immediately. But this seems implausible &#8212; you couldn&#8217;t get infinite software progress in five minutes even if you had infinitely many Alec Radfords doing AI research. You still need to run experiments and wait for them to finish, and some software innovations might need to happen in sequence. So if we naively apply the Jones model, we might vastly overstate how much software progress we see when AI R&amp;D is automated.</p><p>Finally, when most people think of an intelligence explosion, they think of AIs getting smarter and smarter, but most models of the intelligence explosion consider making AI more and more numerous. It&#8217;s not clear that this is what people actually care about, as having infinite copies of GPT-2 would arguably have little impact on the world. The difficulty with modeling &#8220;intelligence&#8221; increases is that it&#8217;s not clear how intelligence translates into effective research input. Is being twice as smart as good as having twice as many researchers? We have fairly minimal data on this front, and adjusting for it has unclear implications. On the one hand, the returns to intelligence in R&amp;D might be extremely high, effectively increasing <em>r</em>. On the other hand, if we changed the output of R&amp;D to intelligence, this has plausibly grown much slower than effective compute, lowering <em>r</em>.</p><p>Part of the issue here is that we don&#8217;t know enough about the nature of algorithmic progress to know which model is right. This can lead us to over- or underestimate the impact of things like scaling up R&amp;D investments.</p><h2>To make progress, we need experiments</h2><p>Most attempts to get a handle on the software intelligence explosion so far have relied on theoretical modelling, conceptual arguments, or statistics on messy real-world data. In part, this is because few people have been thinking about the software intelligence explosion in rigorous terms. And for those who have, these approaches better fit their skill sets, or are cheaper to implement.</p><p>But we think we&#8217;ve squeezed out much of the value from these approaches given existing data, and on the margin there&#8217;s a better approach &#8212; experiments. Experiments give us much higher-quality data that more directly tests the hypotheses and removes confounding factors. 
They also let us test whether the key assumptions behind our models of the software intelligence explosion actually hold.</p><p>Here are some examples that we think are especially important:</p><p><strong>Studying how much software progress is due to data</strong>. If we learn that <a href="https://www.beren.io/2025-08-02-Most-Algorithmic-Progress-is-Data-Progress/">most software improvement</a> comes from increases in data quality, that could reduce our credence in a software intelligence explosion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> However, existing approaches are based on high-level statistical analyses of actual papers, where we can&#8217;t control which algorithms or training data are used. So it&#8217;s hard to say which innovations are most important, or how important data is. In contrast, experiments would offer a much higher degree of control.</p><p><strong>Understanding how much compute is a bottleneck to software progress. </strong>In the software intelligence explosion, we can get lots of software progress even if we don&#8217;t use a lot more compute. So we want to know how hard it is to find new algorithms that push the state of the art, even without using much compute to run experiments. However, there isn&#8217;t much real-world data that tells us how easy this is. So we want to run experiments to see whether this is possible &#8212; this might mean randomly allocating compute budgets to different researchers to see how that impacts the amount of software innovation they&#8217;re able to achieve. We might also want to perform scaling experiments to understand how well algorithms that work in small training runs transfer to large ones &#8212; if this transfer happens easily, compute might not be as big of a bottleneck as we think.</p><p>Importantly, just doing experiments isn&#8217;t going to be a silver bullet. We might still work with flawed models. Experiments might also be impractical &#8212; for example, if we want to test how well software innovations apply at frontier training scales, we need enough compute to do a frontier training run! That dramatically restricts the number of actors that can study this kind of question. So we&#8217;re skeptical that experiments will &#8220;solve&#8221; questions about the software intelligence explosion.</p><p>And there&#8217;s still some role for other approaches. For example, we might want to know which input most bottlenecks software progress &#8212; is it intelligence, compute, engineers, or something else? Besides experimental approaches, we could try to see whether software progress mostly comes from large labs that are compute-rich, or from the same set of researchers over and over.</p><p>But on the margin, we think experiments are the most promising avenue for improving our understanding of the software intelligence explosion. 
And if the software intelligence explosion is actually real, it might be especially important to gather evidence on these questions as soon as possible.</p><p><em>We&#8217;d like to thank Greg Burnham, JS Denain, Jaime Sevilla, Lynette Bye and Phil Trammell for feedback on this post.</em></p><h2>Technical appendix</h2><p>In the main post we glossed over many of the technical details to make our results widely accessible. But if you&#8217;re one of the more technically-minded readers of our newsletter, you&#8217;ll also be interested in our modeling and data choices, so we specify these here. Note that you might want to <a href="https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments">read this on our website</a>, since mathematical symbols will be rendered more nicely.</p><h3>Estimating the returns to R&amp;D using the Jones model</h3><p>Let A denote some measure of software quality, which is improved via a Jones-style idea production function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dA}{dt} = A^{1-\\beta} F(L,K)^\\lambda; \\beta, \\lambda > 0&quot;,&quot;id&quot;:&quot;JUVHPBOOUL&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>A<sup>1-&#946;</sup></em> represents ideas possibly getting harder to find over time, while <em>F(L,K)</em> denotes the effective research input given experimental compute <em>K</em> and cognitive labor <em>L</em>. Suppose research is automated in such a way that <em>F(L,K)</em> shifts to be proportional to the effective number of simulated AI researchers, which is given by <em>Ac</em> for some constant <em>c</em>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dA}{dt} \\propto A^{\\lambda + 1 - \\beta} c&quot;,&quot;id&quot;:&quot;PFMXTOVQLU&quot;}" data-component-name="LatexBlockToDOM"></div><p>This model says software progress growth is hyperbolic if &#955;/&#946; &gt; 1. In the terminology of the post, <em>r = </em>&#955;/&#946; is the &#8220;returns to AI research.&#8221; Therefore, we have a software-only intelligence explosion if the returns to AI software research are sufficiently high.</p>
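<p>To make the threshold concrete, here is a minimal numerical sketch (not part of our analysis) that integrates the automated law of motion above with illustrative constants. With <em>r</em> = &#955;/&#946; below 1, software quality grows only polynomially; above 1, it blows up in finite time.</p><pre><code># Euler-integrate dA/dt = c * A**(1 + lam - beta); all constants illustrative.
def simulate(lam, beta, c=1.0, A0=1.0, dt=1e-4, steps=50_000, A_cap=1e12):
    A = A0
    for i in range(steps):
        A = min(A + dt * c * A ** (1 + lam - beta), A_cap)
        if A == A_cap:  # numerical stand-in for an "explosion"
            return i * dt, A
    return steps * dt, A

for lam, beta in [(0.5, 1.0), (1.5, 1.0)]:
    t, A = simulate(lam, beta)
    print(f"r = {lam / beta:.1f}: A = {A:.3g} at t = {t:.2f}")

# r = 0.5 ends near A = 12 after t = 5 (polynomial growth); r = 1.5 hits the
# cap shortly after t = 2, the exact analytic blow-up time for these constants.
</code></pre>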
<p>In the following sections, we describe how we estimate the returns to AI research for frontier language models. The key to measuring this is getting some input measure of <em>F(L,K)</em>.</p><h3>Approach 1: The number of papers</h3><p>One approach is to proxy for <em>F(L,K)</em> with the number of papers that are produced.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> This is the same approach as used in <a href="https://epoch.ai/blog/do-the-returns-to-software-rnd-point-towards-a-singularity">previous work</a>.</p><p>We do this in domains of AI for which we have estimates of the relevant rates of software progress &#8212; namely <a href="https://epoch.ai/blog/revisiting-algorithmic-progress">computer vision</a>, <a href="https://arxiv.org/abs/2102.04881">reinforcement learning</a>, and <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">language modeling</a>. We use these estimates to define the &#8220;software quality&#8221;,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> which we use as our output metric.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a></p><p>We get data on the number of papers in different years from <a href="https://openalex.org/">OpenAlex</a>, which we identify using the following &#8220;concept&#8221; codes (a sketch of such a query follows the list):</p><ul><li><p><strong>Computer vision</strong>: Computer vision (C31972630) and Deep learning (C108583219)</p></li><li><p><strong>Reinforcement learning</strong>: Reinforcement learning (C97541855)</p></li><li><p><strong>Language models</strong>: Natural language processing (C204321447) and Deep learning (C108583219)</p></li></ul>
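<p>For illustration, annual paper counts for one of these concepts can be pulled from the OpenAlex API in a few lines. This is a sketch of such a query rather than the exact pipeline behind the table below, and it assumes the current shape of the OpenAlex works endpoint:</p><pre><code>import requests

# Group works tagged with the reinforcement learning concept (C97541855)
# by publication year; combine concept filters analogously for the others.
resp = requests.get(
    "https://api.openalex.org/works",
    params={"filter": "concepts.id:C97541855", "group_by": "publication_year"},
    timeout=30,
)
resp.raise_for_status()
counts = {g["key"]: g["count"] for g in resp.json()["group_by"]}
for year in sorted(counts):
    print(year, counts[year])
</code></pre>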
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1259caa9-aed2-409c-904e-86e83a3def17_1624x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:445,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/178906633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1259caa9-aed2-409c-904e-86e83a3def17_1624x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j9sJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1259caa9-aed2-409c-904e-86e83a3def17_1624x496.png 424w, https://substackcdn.com/image/fetch/$s_!j9sJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1259caa9-aed2-409c-904e-86e83a3def17_1624x496.png 848w, https://substackcdn.com/image/fetch/$s_!j9sJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1259caa9-aed2-409c-904e-86e83a3def17_1624x496.png 1272w, https://substackcdn.com/image/fetch/$s_!j9sJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1259caa9-aed2-409c-904e-86e83a3def17_1624x496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Estimates of key parameters in the Jones model, proxying R&amp;D inputs with the number of papers in certain domains. We provide median estimates, and show 90% credible intervals in brackets. 
Code for this analysis can be found in this <a href="https://github.com/parkerwhitfill/epoch_RRD/">github repository</a>.</em></figcaption></figure></div><p>We then estimate the returns to R&amp;D using Bayesian inference.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> Like in computer vision and reinforcement learning, this yields central estimates of <em>r </em>that exceed 1, and in fact the central estimate is higher in language models. Moreover, the 90% credible interval exceeds one as well &#8212; pointing towards a software intelligence explosion! On the other hand, all of these estimates have very large uncertainties, so the confidence intervals across these different domains overlap quite a lot.</p><h3>Approach 2: Some function of cognitive labor and experimental compute</h3><p>In reality, we might think that papers are a poor proxy for R&amp;D inputs. For example, AI R&amp;D involves lots of people training models, but the &#8220;number of published papers&#8221; doesn&#8217;t directly model these people at all.</p><p>So we consider an alternative approach that treats the R&amp;D input as some function of training compute and AI researcher labor at OpenAI from 2022-2025.</p><p>The next approach will try to directly measure <em>F(L,K)</em> instead of proxying it. Suppose that software progress is growing at a constant rate, which is consistent with the empirical evidence we have so far (see <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">here</a>, and <a href="https://www.lesswrong.com/posts/ssNSCaug5p3xnNDSd/how-fast-is-algorithmic-progress-in-ai-inference">here</a>). That implies the Jones equation is at a steady state, i.e.,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_A = \\frac{\\lambda}{\\beta} [\\epsilon_K g_K + (1 - \\epsilon_K) g_L]&quot;,&quot;id&quot;:&quot;LXIXYYWMGN&quot;}" data-component-name="LatexBlockToDOM"></div><p>In this equation, <em>g<sub>A</sub></em>, <em>g<sub>K</sub></em>, and <em>g<sub>L</sub></em> denote the growth rates of <em>A</em>, <em>K</em>, and <em>L</em> respectively. <em>&#1013;<sub>K</sub></em> denotes the elasticity of <em>F</em> with respect to <em>K</em> &#8212; so if you increase <em>K</em> by 1%, <em>F</em> increases by <em>&#1013;<sub>K</sub></em> percent. Re-arranging this equation results in the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\lambda}{\\beta} = \\frac{g_A}{\\epsilon_K g_K + (1-\\epsilon_K) g_L}&quot;,&quot;id&quot;:&quot;NBTWHMKRMY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Intuitively, this equation says that &#955;/&#946; is equal to the ratio of research output divided by research inputs. We can now estimate the terms on the right-hand side by considering frontier AI labs after November 2022, when ChatGPT was first released. 
Note that we give growth rates as natural-log (base <em>e</em>) rates.</p><ul><li><p><em>g<sub>K</sub></em> &#8776; 1.3 from <a href="https://epoch.ai/data/ai-companies">data</a> on OpenAI&#8217;s annual compute spend</p></li><li><p><em>g<sub>L</sub></em> &#8776; 0.85 from <a href="https://epoch.ai/data/ai-companies">data</a> on OpenAI&#8217;s staff counts over time</p></li><li><p><em>&#1013;<sub>K</sub></em> &#8776; 0.67 based on the fact that if markets are competitive, the elasticity of output with respect to capital should equal the compute share, i.e. the share of research money spent on compute instead of researchers. We guess the compute share is about 0.59 to 0.75.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> As a central estimate, we take the average of 0.67.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a></p></li><li><p><em>g<sub>A</sub></em> &#8776; 1.1 based on <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">estimates</a> of training compute efficiency improvements in language models.</p></li></ul><p>The nice thing about this approach is that it doesn&#8217;t rely on the weak proxy of the &#8220;number of papers&#8221;. But it instead requires an additional assumption: that software efficiency has grown at a constant rate.</p><h3>Compute bottlenecks</h3><p>One issue with the Jones model, pointed out in <a href="https://arxiv.org/abs/2507.23181">Whitfill and Wu 2025</a> and in <a href="https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d">Erdil and Barnett 2025</a>, is that there may be compute bottlenecks.</p><p>Rigorously, suppose <em>F(L,K)</em> is a production function in which both inputs are necessary for sustained growth. Then even if <em>L</em> is increasing due to more simulated AIs, if experimental compute <em>K</em> is not growing quickly, <em>F(L,K)</em> becomes proportional to experimental compute rather than to the number of simulated AIs, which breaks the argument for an intelligence explosion no matter the value of &#955;/&#946;.</p><p>One baseline case to consider is Cobb-Douglas:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F(L,K) = L^{1 - \\epsilon_K} K^{\\epsilon_K}&quot;,&quot;id&quot;:&quot;KAEWQQQQKY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the lack of <em>K</em> doesn&#8217;t fully bottleneck progress, as you can always still increase <em>L</em> to get more output, but there is some penalty for increasing only <em>L</em> instead of both <em>L</em> and <em>K</em>. Under this assumption, the key intelligence explosion parameter becomes (1 - <em>&#1013;<sub>K</sub></em>) &#955;/&#946;.</p><p>Given our estimate that <em>&#1013;<sub>K</sub> </em>&#8776; 2/3, that means all our estimates of &#955;/&#946; should be cut by a factor of three, which puts them all below 1. The short calculation below makes both numbers concrete.</p>
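<p>As a quick sanity check, here is the arithmetic with the central estimates above (a sketch; the credible intervals around these numbers are much wider):</p><pre><code># Steady-state estimate: lambda/beta = g_A / (eps_K*g_K + (1 - eps_K)*g_L)
g_K, g_L, eps_K, g_A = 1.3, 0.85, 0.67, 1.1

r = g_A / (eps_K * g_K + (1 - eps_K) * g_L)
print(f"central estimate of r:        {r:.2f}")                # ~0.96

# Cobb-Douglas compute-bottleneck adjustment: (1 - eps_K) * lambda/beta
print(f"bottleneck-adjusted estimate: {(1 - eps_K) * r:.2f}")  # ~0.32
</code></pre><p>On these central values the unadjusted Approach 2 estimate lands just below 1, and the Cobb-Douglas adjustment pushes it far below.</p>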
<h3>Parallelizability in the Jones model</h3><p>To address the issue of parallelizability in the Jones model, Phil Trammell <a href="https://drive.google.com/file/d/1IM3plm5m5FaGaR-NLdhtIjSMa2b0Gs-y/view">proposes an alternative model</a> where the impact of R&amp;D inputs is mediated by the current level of software quality. This is shown in the equation below, where <em>A</em> is software quality and <em>R</em> is research input:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MopU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5b1a2e3-1ced-4ebd-afdd-97cf8940d9af_750x270.png" width="750" height="270" alt="Trammell&#8217;s alternative law of motion, combining A and R in a CES aggregate"></figure></div><p>Here the CES function between <em>A</em> and <em>R</em> on the right-hand side prevents you from immediately increasing <em>A</em> by a ton just by increasing <em>R</em> by a ton, because &#961; &lt; 0. But while this deals with the parallelizability issue, it unfortunately also makes things harder to estimate!</p>
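<p>The saturation effect is easy to see numerically. The snippet below is an illustrative CES aggregate with negative &#961;, not Trammell&#8217;s exact specification or parameters: no matter how large <em>R</em> gets, the aggregate stays below a ceiling set by <em>A</em>.</p><pre><code># CES aggregate of software quality A and research input R with rho = -1;
# the share parameter and rho are illustrative.
def ces(A, R, share=0.5, rho=-1.0):
    return (share * A**rho + (1 - share) * R**rho) ** (1 / rho)

A = 1.0
for R in [1.0, 10.0, 100.0, 1e6]:
    print(f"R = {R:9g}: aggregate = {ces(A, R):.4f}")
# Output approaches share**(1/rho) * A = 2.0; scaling R alone saturates.
</code></pre>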
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here the CES function between <em>A</em> and <em>R</em> on the right-hand side prevents you from just being able to immediately increase <em>A</em> by a ton just by increasing <em>R</em> by a ton, because &#961; &lt; 0. But while this deals with the parallelizability issue, it unfortunately also makes things harder to estimate!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is sometimes called the &#8220;<a href="https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines">software-only singularity</a>&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>We describe the full model in the technical appendix.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Things may be more complicated in the short-run &#8212; this parameter primarily determines the long-run asymptotic dynamics of AI R&amp;D, rather than the short-run.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>If <em>r &lt; 1</em>, this feedback loop fizzles out. If <em>r = 1</em>, we&#8217;re in the special case where growth in inputs results in proportional growth in outputs &#8211; but that also means that we don&#8217;t get a massive &#8220;intelligence explosion&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This doesn&#8217;t mean that AIs only improve AIs when there&#8217;s a software intelligence explosion. 
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is sometimes called the &#8220;<a href="https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines">software-only singularity</a>&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>We describe the full model in the technical appendix.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Things may be more complicated in the short run; this parameter primarily determines the long-run asymptotic dynamics of AI R&amp;D.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>If <em>r &lt; 1</em>, this feedback loop fizzles out. If <em>r = 1</em>, we&#8217;re in the special case where growth in inputs yields proportional growth in outputs, which also means we don&#8217;t get a massive &#8220;intelligence explosion&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This doesn&#8217;t mean that AIs only improve AIs when there&#8217;s a software intelligence explosion. AIs improving AIs happens regardless; the question is whether these improvements happen fast enough to lead to an &#8220;explosion&#8221;, as opposed to &#8220;fizzling out&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Technical details on data and the model are provided in the technical appendix.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>One counterargument is that algorithmic innovations proposed outside of major AI companies may tend to be much less useful, because it&#8217;s not clear that they&#8217;ll work at the compute scale of frontier AI models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>To get a sense of the numbers, we know that OpenAI has <a href="https://epoch.ai/data/ai-companies">more than doubled</a> their compute stock in the last year, and AI companies are <a href="https://epoch.ai/data/ai-companies">almost doubling</a> their staff headcounts each year. Relatedly, we also know that private sector companies own an<a href="https://epoch.ai/data-insights/ai-supercomputers-performance-share-by-sector"> increasing share of GPU clusters</a>, and the academic share of notable machine learning models has <a href="https://arxiv.org/abs/2401.02452">decreased over time</a>. So if we looked at the aggregate of OpenAI and academia, we&#8217;d measure slower growth in both labor and compute, and thus in overall R&amp;D inputs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Moreover, the input and output metrics are themselves often extremely uncertain. For instance, existing <a href="https://epoch.ai/trends">estimates of software progress</a> are themselves observational. Prior estimates of the returns to R&amp;D are also unrealistic: for example, the exponent on research inputs in the Jones law of motion is greater than 1, so doubling researchers could more than double software efficiency! Part of the issue here is that these individual exponents cannot be estimated well due to data constraints, but the hope is that the overall returns can still be estimated reasonably well.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>In particular, <a href="https://arxiv.org/abs/2507.23181">Whitfill and Wu 2025</a> estimate whether compute and labor are complements or substitutes. If you don&#8217;t account for the dynamic where frontier scale is necessary, you find that compute and labor are substitutes. 
But if you do account for this, you find that they&#8217;re very strong complements!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Yet another issue is that the model proxies research inputs with compute and labor alone; there is no explicit term for data. In practice, developing better capabilities also depends on data.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>On the other hand, AI systems could improve data quality through things like verification and filtering, generating more diverse problems, or helping build high-quality RL environments. So even if most software improvements boil down to data quality, we could conceivably see a software intelligence explosion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>If <em>A</em> is inference compute efficiency, then this equation holds exactly. If <em>A</em> is training compute efficiency, then it holds only if the effective number of AI researchers (effective, because they are getting smarter) is linear in training compute.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>This data is taken from <a href="https://openalex.org/">OpenAlex</a>, based on papers grouped into both &#8220;Natural Language Processing&#8221; and &#8220;Deep learning&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Software quality is tricky to define, but it can be defined with reference to the algorithms of previous years. For example, suppose we choose 2024 as our start year, where software quality is defined as 1. Compared to models in 2024, models in 2025 might need around <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">a third</a> of the training compute to achieve the same performance. We would then say that algorithms in 2025 have three times higher software quality than algorithms in 2024. Note that this <a href="https://arxiv.org/abs/2508.11033">dubiously assumes</a> that all software (algorithms and data) in a particular year has the same quality.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Strictly speaking, these works provide a single growth rate of the R&amp;D output metric, rather than the output metric itself. Other domains like computer chess have much higher quality R&amp;D output data, since we can look at chess engine Elo scores. 
This gives a time series rather than a single average trend.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>This approach is detailed in <a href="https://arxiv.org/abs/2405.10494">Erdil, Besiroglu and Ho 2024</a>. Following the paper, we select relatively uninformative priors, under which exponential growth is the typical outcome, to avoid biasing results. We need this approach rather than something like maximum likelihood estimation because we only have a single average trend over time, rather than an actual time series.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>We take this estimate from OpenAI&#8217;s <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026?rc=spkbjw">2024 budget</a>, which shows $700 million spent on salaries and $1 billion on research compute, giving a na&#239;ve capital share of 59%. If we do a more sophisticated analysis that counts only spend on research staff and includes <a href="https://epoch.ai/data-insights/openai-compute-spend">equity and a broader set of R&amp;D</a>, we get about 75%.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>Note this term is not necessarily constant over time, so consider it a single snapshot.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Less than 70% of FrontierMath is within reach for today’s models]]></title><description><![CDATA[57% of problems have been solved at least once]]></description><link>https://epochai.substack.com/p/less-than-70-of-frontiermath-is-within</link><guid isPermaLink="false">https://epochai.substack.com/p/less-than-70-of-frontiermath-is-within</guid><dc:creator><![CDATA[Greg Burnham]]></dc:creator><pubDate>Fri, 17 Oct 2025 16:53:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gGhh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46792651-ef3b-4baf-bc9f-6fdf3e55658c_1024x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Originally posted on <a href="https://epoch.ai/gradient-updates/less-than-70-percent-of-frontiermath-is-within-reach-for-todays-models">Epoch AI</a>.</em></p><div><hr></div><p>The best we have seen a model perform on a single run of FrontierMath is <a href="https://epoch.ai/blog/deep-think-math">29%</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> If you want to use a model to solve FrontierMath-style problems, that&#8217;s the number to consider.</p><p>But there&#8217;s another way to gauge state-of-the-art performance: how many FrontierMath problems have been solved by any model, on any run, even once? This tells us more about what is &#8220;within reach&#8221; for today&#8217;s models. 
It&#8217;s also more forward-looking: if today&#8217;s models can generate the right ideas to solve a problem <em>at all</em>, that makes it more likely that tomorrow&#8217;s models will be able to solve the problem reliably.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> In other words, we can get a view of the future that&#8217;s a bit more concrete than just extrapolating accuracy trends.</p><p>To make matters more interesting, there&#8217;s some <a href="https://arxiv.org/pdf/2407.21787">empirical evidence</a> that if you run an LLM on a benchmark N times, the percentage of problems correctly solved at least once (known as pass@N) increases proportionally to log(N). If that&#8217;s true in general, then, since log(N) is unbounded, we should expect pass@N to approach 100% as the models are given more tries. Could FrontierMath&#8217;s saturation already be so clearly within sight?</p>
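<p>As a concrete reference for the metric, here is one way to compute an empirical pass@N curve from a grid of run results, using the standard unbiased estimator of Chen et al. (2021). This is a sketch of the metric itself, with toy input; it is not Epoch&#8217;s evaluation code.</p><pre><code>import numpy as np
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k (Chen et al., 2021): chance that at least one of k runs,
    # drawn without replacement from n runs of which c were correct, succeeds.
    if n - c &lt; k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_n_curve(results: np.ndarray) -> dict:
    # results[i, j] = True if run i solved problem j
    n_runs, _ = results.shape
    correct = results.sum(axis=0)
    return {k: float(np.mean([pass_at_k(n_runs, int(c), k) for c in correct]))
            for k in (1, 2, 4, 8, 16, 32) if k &lt;= n_runs}

# Toy example: 32 runs on four problems with very different solve rates.
rng = np.random.default_rng(0)
demo = rng.random((32, 4)) &lt; np.array([0.0, 0.05, 0.3, 0.9])
print(pass_at_n_curve(demo))
</code></pre>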
<p>To investigate this, I conducted 32 runs of GPT-5 (medium).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> In short: FrontierMath isn&#8217;t dead yet. Pass@N on these runs shows <em>sub</em>-logarithmic growth and appears to cap out below 50%.</p><p>That&#8217;s just one model, though: we have a bunch of other model runs sitting around that seem just as relevant for understanding what&#8217;s likely to be reliably solvable soon. Throwing them all together gives a <em>pass@the-kitchen-sink</em> of 57%: the share of problems solved by any model on any run. Moving into guesswork territory, we estimate that even repeatedly running all of these models would cap out below 70%.</p><p>Going forward, it will be interesting to see how much progress on FrontierMath accuracy comes from shoring up reliability on this 57% of problems vs. solving problems models haven&#8217;t solved before.</p><h2>GPT-5&#8217;s pass@N caps out below 50%</h2><p>The chart below shows pass@N across our 32 runs of GPT-5 on our scaffold. Note how the pass@N curve is concave down compared to the straight line of a logarithmic fit.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!gGhh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46792651-ef3b-4baf-bc9f-6fdf3e55658c_1024x1280.png" width="1024" height="1280" alt="Pass@N over 32 runs of GPT-5 on FrontierMath, concave down relative to a logarithmic fit"></figure><p>Below is the same data in a table. Each doubling of N increases pass@N, but by a smaller amount than the previous doubling. At the extremes, doubling N from one to two increases pass@N by 5.4 percentage points, but doubling N from sixteen to thirty-two increases pass@N by only 1.5 points.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!peIA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6d8cea0-31eb-467b-9e26-e59bc220a9d1_2256x980.png" width="1456" height="632" alt="Table of pass@N across 32 runs of GPT-5: each doubling of N yields a smaller gain than the last"></figure><p>How much more would pass@N increase with additional doublings? On average, the gains from doubling decreased by about 1 percentage point with each doubling. If we extrapolate that, we would see a gain of 0.5 points as N goes from thirty-two to sixty-four, and that would be it. 
Maybe there&#8217;s a bit of a longer tail, but as a rough round-number guess, it seems likely that pass@N here would cap out below 50%.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p>
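<p>Spelled out, the extrapolation looks like the sketch below. Only the first (5.4-point) and last (1.5-point) observed gains are quoted above; the steady one-point decline and the projected tail are rough assumptions.</p><pre><code># Gains per doubling of N shrink by roughly 1 percentage point per doubling.
# Last observed gain: +1.5 pts going from N=16 to N=32.
gain, decline = 1.5, 1.0
tail = 0.0
while gain - decline > 0:
    gain -= decline
    tail += gain
print(f"Projected gain beyond N=32: {tail:.1f} pts")  # 0.5 pts, then nothing
</code></pre>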
<p>To sanity check this, I randomly drew ten problems that were not solved in any of these runs, and sampled each of them 100 more times, for 132 total, i.e. about two more doublings. None of these problems were solved even once. This is consistent with the observed data: we wouldn&#8217;t expect a draw of ten problems to hit one of the few problems that may be within reach of two more doublings. But it at least rules out the possibility that there&#8217;s a gold mine of problems that GPT-5 can get right if only it&#8217;s given 100 more chances.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><h2>Pass@the-kitchen-sink likely caps out below 70%</h2><p>What about models other than GPT-5, or scaffolds other than Epoch&#8217;s? It&#8217;s a bit messy, but we can at least look at the assortment of FrontierMath runs Epoch has accumulated over time. We call this <em>pass@the-kitchen-sink</em>, and at the moment our kitchen sink is filled from the following six buckets:</p><ul><li><p>32 runs of GPT-5, described above</p></li><li><p>52 runs of various models from various developers, from Epoch&#8217;s <a href="https://epoch.ai/benchmarks/frontiermath">benchmarking hub</a></p></li><li><p>16 runs of ChatGPT Agent <a href="https://x.com/EpochAIResearch/status/1945905802998423867">conducted by OpenAI and graded by Epoch</a></p></li><li><p>1 run of Gemini 2.5 Deep Think, <a href="https://epoch.ai/blog/deep-think-math">evaluated manually</a></p></li><li><p>20 runs of o4-mini conducted a few months ago as part of an open-ended exploration</p></li><li><p>6 miscellaneous runs conducted while experimenting with o4-mini, Gemini 2.5 Pro, and Grok 4</p></li></ul><p>All together, this gives a pass@the-kitchen-sink of 57%, or 165/290 problems solved.</p><p>Of these problems, 140 are solved in at least two buckets. This suggests that models do not have very different skill profiles, even ones from different developers: if one model can solve it, another probably can too. The table below shows how many problems are solved uniquely within a given bucket, as well as the bucket&#8217;s overall pass@N rate. Note in particular that our 32 runs of GPT-5 solved only a single problem that we hadn&#8217;t seen solved in some other run before!</p><figure><img src="https://substackcdn.com/image/fetch/$s_!QSxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab7042f0-e1d7-4bd8-94a8-442a13b53c64_1692x664.png" width="1456" height="571" alt="Problems solved uniquely within each bucket of runs, and each bucket&#8217;s overall pass@N"></figure>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The one notable deviation from this is ChatGPT Agent, which solved 14 problems (5%) that no other model solved. There&#8217;s a salient explanation for this: it&#8217;s the only model here with access to a web search tool. While FrontierMath doesn&#8217;t contain pure &#8220;look-up&#8221; problems, it does have some problems that are meant to be solved by appropriately adapting somewhat obscure knowledge, so we expect web search to help.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>Since we have 16 runs of ChatGPT Agent, we can look at how Pass@N scales as above. Doing this, we see a similar sub-logarithmic pattern: the returns to doubling N decrease by about 1% (absolute) with each doubling. 
Simple extrapolation suggests there might be another 6 points to gain from additional doublings.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!gn0o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e31062-78a0-4d51-9cf9-ca78d2122f0c_2250x864.png" width="1456" height="559" alt="Pass@N across 16 runs of ChatGPT Agent, also sub-logarithmic"></figure>
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Where would pass@the-kitchen-sink cap out?</p><p>We have clean estimates for GPT-5 and ChatGPT Agent. Conservatively assuming the gains from additional runs of these would be disjoint, that would add 7% in the limit, for a running tally of 64% total.</p><p>We don&#8217;t have as good a way to predict what all the other models would add, if re-run repeatedly. We can at least note that, across all our existing runs for them, they solve 46% of problems, but only 13 problems (5%) that neither GPT-5 nor ChatGPT Agent solved. This suggests fairly low marginal diversity, though we can&#8217;t rule out that one of these models would break into new territory if re-run repeatedly.</p><p>For the sake of a bottom-line number, I&#8217;ll assume that this pool has as much juice left to squeeze as ChatGPT Agent, the larger of our previous two estimates. Again conservatively assuming that these gains would be disjoint, this gives an all-in cap of 70%.</p><p>In the original release of FrontierMath, we estimated that 7% of problems were erroneous, either due to fatal ambiguities in the problem statement or issues in the answer key or verification script. We collect errors as we find them and will update the benchmark periodically, so this number will decrease over time. Currently we are only aware of uncorrected errors in 2% of problems. 
Conservatively taking the error rate to be 10%, we are thus left with 20% of FrontierMath problems that are likely out of reach for current models.</p>
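<p>As a rough tally of the estimate above (the disjointness assumptions are the ones flagged in the text):</p><pre><code>observed     = 57   # % of problems solved at least once ("kitchen sink")
gpt5_agent   = 7    # assumed limit gains from re-running GPT-5 + ChatGPT Agent
other_models = 6    # assumed headroom of the remaining pool (same as ChatGPT Agent)
error_rate   = 10   # conservative share of erroneous problems

cap = observed + gpt5_agent + other_models   # all-in cap: 70%
out_of_reach = 100 - cap - error_rate        # likely out of reach today: 20%
print(cap, out_of_reach)
</code></pre>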
<p>Returning to the 57% of solved problems, we observe the expected pattern across the FrontierMath difficulty tiers: lower tiers have higher solve rates.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!c6Fm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36c604e-eb4f-493f-a660-5795684c69da_1025x1280.png" width="1025" height="1280" alt="Solve rates by FrontierMath difficulty tier: lower tiers have higher solve rates"></figure><h2>This gives us something to watch as models improve on FrontierMath</h2><p>Epoch recently published a report on <a href="https://epoch.ai/blog/what-will-ai-look-like-in-2030">how we expect AI to look in 2030</a>, which included the following projection of FrontierMath performance.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!q-TH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png" width="1025" height="1280" alt="Projected FrontierMath performance through 2030"></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1025,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://epochai.substack.com/i/176416854?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q-TH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png 424w, https://substackcdn.com/image/fetch/$s_!q-TH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png 848w, https://substackcdn.com/image/fetch/$s_!q-TH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!q-TH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a123725-77c5-406c-b684-e901e39701e9_1025x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While a performance of 70% seems high compared to today&#8217;s SOTA, our projections show it arriving in the first half of next year. The investigations in this post make this seem believable: we have already observed <em>some</em> model solving almost that many problems. 
If additional training can simply shore up the performance observed in our pass@the-kitchen-sink, it will already be most of the way there.</p><p>By the same token, this gives us something more precise to look at if and when a model does hit this performance level. In particular, we can see how much of the gain comes from problems that we have already observed being solved at least once, vs. problems that haven&#8217;t been solved at all before.</p><p>At the extremes, the interpretation of progress is very different. If gains come entirely from previously solved problems, then this represents purely improved reliability. If, however, much of the gain comes from problems that no model has solved before, then we can interpret it as a meaningful advance in capability.</p><p><em>Thanks to OpenAI for API credits used for the GPT-5 experiments. Thanks to Daniel Litt for feedback on this post.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In this post, by FrontierMath, I always mean the private problems in Tiers 1&#8211;3.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This reasoning is only valid for benchmarks where random guessing scores &#8776;0%, but FrontierMath is such a benchmark: answers are typically large integers or complicated symbolic real numbers, so we can sample repeatedly without worrying that models are getting correct answers just by luck.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In the run currently shown on our <a href="https://epoch.ai/benchmarks/frontiermath">benchmarking hub</a>, GPT-5 scored 25% but exceeded our scaffold&#8217;s 100K token budget on 3% of problems. I tried giving it a 10x token budget, and it scored 29% while never exceeding the budget. For the experiments here, I likewise used our standard scaffold with 10x the standard token budget. 
We&#8217;ll be increasing the general token budget in our scaffold going forward.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>I intended to conduct these runs with GPT-5 (high), but due to a bug in our evaluations infrastructure, they were in fact conducted using GPT-5 (medium). On our standard scaffold, GPT-5 (high)&#8217;s now-correct score on FrontierMath Tiers 1&#8211;3 is 27%, compared to 25% for GPT-5 (medium). I expect the overall conclusions of this article would not change if GPT-5 (high) were used instead. In particular, the fact (discussed later) that all 32 runs of GPT-5 (medium) solved only <em>one</em> problem that no other model run has solved suggests that there just isn&#8217;t that much juice for GPT-5 to squeeze. Still, it is entirely possible that GPT-5 (high)&#8217;s asymptote is a few percentage points higher than GPT-5 (medium)&#8217;s.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>That said, we can&#8217;t rule out that this is simply a very slow-growing function that is nonetheless unbounded, e.g. c&#183;log(N) for small c, or log(log(N)).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Of course, these experiments in general lack the power to tell the difference between p(solve) = 0 and p(solve) = small. If we crudely model a problem as requiring k steps, each of which GPT-5 performs correctly with probability p, then p(solve) = p<sup>k</sup>. For a problem with k = 10 and p = &#189;, GPT-5 has a &#8776;50% chance of solving it at least once in 709 samples. I would still want to count this problem as &#8220;within reach&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>These problems were also chosen from among those that no <em>other</em> model run solved. 
These runs are discussed further on in the post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>We plan to experiment with adding web search.</p></div></div>]]></content:encoded></item><item><title><![CDATA[OpenAI is projecting unprecedented revenue growth]]></title><description><![CDATA[No company has gone from $10B to $100B as fast as OpenAI projects to do]]></description><link>https://epochai.substack.com/p/openai-is-projecting-unprecedented</link><guid isPermaLink="false">https://epochai.substack.com/p/openai-is-projecting-unprecedented</guid><dc:creator><![CDATA[Greg Burnham]]></dc:creator><pubDate>Wed, 15 Oct 2025 16:05:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JegC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F140f9e18-7326-4efb-af0a-48df34bd911a_1025x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Epoch&#8217;s new <a href="https://epoch.ai/data/ai-companies">AI companies database</a> shows the remarkable level and pace of growth of OpenAI&#8217;s revenue. It first exceeded $1 billion in 2023 and will exceed $10 billion in 2025. This is impressive, but not unprecedented: a few other companies have matched this growth rate historically.</p><p>OpenAI&#8217;s projections, however, are a different story. According to <em>The Information</em>, in Q3 2025 <a href="https://www.theinformation.com/articles/openai-says-business-will-burn-115-billion-2029">OpenAI projected</a> its 2028 revenue to be $100 billion. I couldn&#8217;t find any examples of a company growing its revenue from around $10 billion to $100 billion in such a short period of time.</p><p>What happens if OpenAI falls short of these projections? At a minimum, it would likely have to scale back its <a href="https://www.theinformation.com/articles/openais-350-billion-computing-cost-problem">plans</a> for large compute build-outs. The recently announced deals with Nvidia, AMD, and Broadcom imply expenditures of roughly $1.3 trillion within the next decade, and some of this is presumably expected to be financed by revenue, or by debt raised against revenue.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>But the second-order effects of a miss could be larger. This is because investors and other companies are <a href="https://peterwildeford.substack.com/i/174301735/the-economy-is-increasingly-a-leveraged-bet-on-agi">increasingly betting big</a> on OpenAI being highly valuable. As <a href="https://www.noahpinion.blog/p/americas-future-could-hinge-on-whether">Noah Smith recently wrote</a>, these bets depend not only on this value being realized, but on it being realized <em>fast enough</em> to cover the debt used to finance the bets. Failing to deliver value as fast as investors expected is all it took to turn several historical technology booms into busts.</p><p>We don&#8217;t know how sensitive the AI financing system is to OpenAI&#8217;s specific projections. 
It&#8217;s possible that there is a big margin of error and the tenor of the deals is, &#8220;Even if we miss our revenue projections, we all still get rich.&#8221; But it seems just as possible that more is at stake.</p><p>One thing we can say: OpenAI hitting its projections would be a historic achievement.</p><h2>OpenAI&#8217;s revenue grew very quickly from $1B to $10B</h2><p>I found four instances of US companies in the past fifty years growing their revenue from less than $1 billion to over $10 billion over the course of three years.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> It&#8217;s a somewhat eclectic group.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!NJ45!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F760a639b-0d0d-4e12-86f1-4007cfb4fc04_1024x1280.png" alt="Chart: US companies that grew revenue from under $1 billion to over $10 billion within three years"></figure></div><p>The one company to clearly cross these thresholds faster than OpenAI was the vaccine maker Moderna, whose revenue spiked in 2021 due to the pandemic. Another demand-driven entry is Cheniere Energy (2015&#8211;2018), which rode the US natural gas boom. Uber (2014&#8211;2017) and Google (2002&#8211;2005) are familiar tech startup success stories.</p><p>Being a pure software company, Google may be the most apt comparison to OpenAI. Of these four, only Google went on to hit $100 billion in revenue. The others have yet to pass $40 billion.</p><p>OpenAI&#8217;s recent growth does look fast compared to the others. It is currently growing around 3&#215; per year, compared to Google&#8217;s 2&#215; per year in the year when it first hit $10 billion. So, apart from Moderna, OpenAI may be the <em>fastest</em>-growing company to hit $10 billion. Can it keep up the pace?</p><h2>No company has grown its revenue from $10 billion to $100 billion in three years</h2><p>OpenAI expects to make <a href="https://finance.yahoo.com/news/openai-cfo-we-will-more-than-triple-our-revenue-this-year-213816957.html">$13 billion</a> in revenue in 2025 and <a href="https://www.theinformation.com/articles/openai-says-business-will-burn-115-billion-2029">projects</a> to grow its revenue by 2.3&#215; in 2026, 2&#215; in 2027, and 1.6&#215; in 2028. I couldn&#8217;t find any historical examples of companies growing at such a fast pace from such a high level.</p>
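<p><em>As a rough sanity check, here is a minimal Python sketch of the compounding these projections imply. The 2025 base and the annual multiples are the reported figures above; everything else is illustrative.</em></p><pre><code># Compound OpenAI's reported projections forward from its
# expected 2025 revenue (figures from the reporting cited above).
revenue = 13e9  # expected 2025 revenue, USD
for year, multiple in [(2026, 2.3), (2027, 2.0), (2028, 1.6)]:
    revenue *= multiple
    print(f"{year}: ${revenue / 1e9:.0f}B")

# 2026: $30B
# 2027: $60B
# 2028: $96B, roughly the $100 billion projected for 2028
</code></pre>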
<p>The following chart shows the seven US companies I could find in the past fifty years that grew their revenue from $10B to $100B in less than a decade.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_qnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8924e6e-1773-48c6-8a16-1ca9ad00d6eb_1025x1280.png" alt="Chart: US companies that grew revenue from $10 billion to $100 billion in under a decade"></figure></div><p>The fastest companies to cross these thresholds were Tesla and Meta in seven years, followed by Nvidia, Apple, and Amazon in eight, Walmart in nine, and Google in ten. The median annual revenue growth rate of these companies during this high-growth period is 1.3&#215;. From this perspective, OpenAI is targeting an unprecedented performance.</p><p>However, the chart also shows the arbitrariness of the thresholds. Yes, it took Nvidia eight years to go from $10 billion to $100 billion, but almost all of that growth came in just two banner years toward the end of the period, both a bit over 2&#215; per year, starting from a base of about $28 billion. This is the most optimistic reference point for OpenAI.</p><p>I also looked for any company with revenue over $10 billion (in 2024 USD) that more than doubled its revenue in a single year. As just mentioned, Nvidia did this twice. I only found two other examples. Gilead Sciences went from $11 billion in 2013 to $25 billion in 2014, driven by the success of a new hepatitis C drug. Cheniere Energy grew its revenue from $16 billion in 2021 to $33 billion in 2022, riding another natural gas price spike. Both companies&#8217; revenues plateaued and declined over the subsequent two years.</p><p>I focused on recent US companies to control for factors like market access, political economy, and the availability of capital. Looking beyond the US, the only stand-out example of rapid revenue growth that I could find was ByteDance, the parent company of TikTok. In 2024 USD, its revenue first crossed $10B in 2019 and first crossed $100B in 2024, five years later.</p><h2>Can OpenAI do it?</h2><p>The outside view says: probably not. Companies usually don&#8217;t do this. Does a closer look reveal anything different?</p>
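<p><em>To make the outside view concrete, here is a small Python sketch of how long a tenfold revenue increase takes at a sustained growth rate, using the figures above:</em></p><pre><code>import math

# Years needed to grow revenue 10x at a constant annual multiple.
def years_to_10x(multiple):
    return math.log(10) / math.log(multiple)

print(f"{years_to_10x(1.3):.1f} years")  # 8.8 years at the cohort's median rate

# OpenAI's projected 2026-2028 multiples, combined into one annual rate:
projected = (2.3 * 2.0 * 1.6) ** (1 / 3)  # geometric mean, ~1.9x per year
print(f"{years_to_10x(projected):.1f} years")  # 3.5 years
</code></pre>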
<p>The strongest argument in OpenAI&#8217;s favor is probably, as mentioned above, that its revenue is growing very quickly right now: 3&#215; per year. All data points since 2024, even recent ones, are consistent with this trend. Given that, its growth could slow somewhat while still hitting the target.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!00bw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe472328f-732e-40ca-a75a-e3ce7318cce9_1025x1280.png" alt="Chart: OpenAI's revenue data points and its projected growth trend"></figure></div><p>There also appears to be a trend toward companies hitting $100 billion in revenue faster. Of our seven examples above, six hit $100 billion in the last fifteen years. If there is some general dynamic by which this happens faster over time, it could be a tailwind for OpenAI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Where might this revenue come from? It probably won&#8217;t be ChatGPT subscriptions alone, which currently account for <a href="https://www.ft.com/content/b81d5fb6-26e9-417a-a0cc-6b6689b70c98">about 75%</a> of OpenAI&#8217;s revenue. OpenAI projects only $50 billion of its 2028 revenue to come from ChatGPT directly. As a reference point, that is the cost equivalent of 210 million of today&#8217;s &#8220;Plus&#8221; subscriptions. The most recent figures, from April 2025, show <a href="https://finance.yahoo.com/news/chatgpt-crosses-20m-paid-users-140519166.html">20 million</a> paid subscriptions of all kinds.</p><p>OpenAI also seems likely to make a play for the advertising and shopping revenue that currently flows through Google, Amazon, and Meta. Its main tool here will be the much larger user base of ChatGPT&#8217;s free tier. Winning market share in this relatively mature sector is at least plausible.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Or maybe that&#8217;s thinking small. The promise of AI, after all, is broad productivity.
For a very rough estimate, let&#8217;s imagine that:</p><ul><li><p>OpenAI&#8217;s current and near-term products could provide an average 10% productivity uplift for all remote-compatible work tasks, and</p></li><li><p>OpenAI can capture 10% of the resulting value.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li></ul><p>We previously estimated that <a href="https://epoch.ai/gradient-updates/consequences-of-automating-remote-work">about a third of US work tasks</a> are remote-compatible, and that share is probably lower globally. So, assuming these tasks account for around 20% of the roughly $110 trillion of global GDP, the total potential revenue from this near-term productivity uplift might be around $220 billion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p>
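<p><em>In Python, the arithmetic behind that $220 billion figure looks like this; all of the inputs are the stated assumptions, not measurements.</em></p><pre><code># Back-of-the-envelope version of the productivity-uplift estimate above.
global_gdp = 110e12    # rough global GDP, USD
remote_share = 0.20    # assumed share of GDP from remote-compatible tasks
uplift = 0.10          # assumed productivity uplift on those tasks
capture = 0.10         # assumed share of the created value OpenAI captures

value_created = global_gdp * remote_share * uplift  # ~$2.2 trillion
revenue = value_created * capture
print(f"${revenue / 1e9:.0f}B")  # $220B
</code></pre>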
<p>There will be competition for this revenue, but perhaps OpenAI can maintain its current relatively high market share. Maybe AI itself can lower the hurdles to adoption, helping people quickly figure out how to use it to be more productive.</p><p>Still, reaching such heights in just three years would be an extraordinary achievement. History suggests that it doesn&#8217;t happen quite so fast.</p><h2>What if not?</h2><p>Maybe if OpenAI &#8220;merely&#8221; gets to $50 billion in 2028&#8212;an impressive feat in its own right&#8212;it will only have had to slow its data center build-outs, and will otherwise remain financially healthy. But the more stakeholders take big bets on OpenAI, the riskier such misses become. We&#8217;ll be paying close attention to OpenAI&#8217;s next few revenue prints.</p><p><em>Thanks to Steven Kryger and Shanu Mathew for feedback on this post.</em></p><p class="cta-caption">Thanks for reading! Subscribe to receive the latest from Epoch AI.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The deals total 26 GW of chips, and a 1 GW data center costs roughly $50 billion to build, all-in. Precise timeframes of the deals are not available, but all plans discussed publicly cite dates within the next decade.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I don&#8217;t have access to a comprehensive database of company revenue. My methodology for this post relied in large part on queries to GPT-5 Pro, Grok 4 Heavy, and Gemini 2.5 Deep Think, combined with manual spot checks. If you know of examples of companies achieving faster revenue growth off larger bases than what I discuss, please tell me! <a href="mailto:greg@epoch.ai">greg@epoch.ai</a> or <a href="https://x.com/GregHBurnham">@GregHBurnham</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>All analysis in this post excludes cases of revenue growth driven primarily by mergers and acquisitions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The first year shown for each company is either the first year when the company&#8217;s revenue exceeded $10 billion, or the year prior to that, if the company&#8217;s revenue in this prior year was closer to $10 billion than the company&#8217;s revenue in the subsequent year. The range of revenues in this &#8220;Year 0&#8221; is $9&#8211;13 billion, with OpenAI&#8217;s toward the upper end of the range. Shifting the rest of the cohort one year earlier doesn&#8217;t change the main observation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Perhaps the better way to think about the $100 billion milestone is as a percent of GDP: in that case, as GDP grows, the milestone gets easier to achieve.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Of course, if the tech giants face fierce competition in their core businesses, they may have to scale back their own AI investments. That could cause its own set of problems, depending on what other companies are levered to those anticipated expenditures.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>These numbers are speculative. <a href="https://economics.mit.edu/sites/default/files/inline-files/draft_copilot_experiments.pdf">Uplift estimates vary widely</a> and include <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">negative effects</a>. One estimate of the <a href="https://www.wsj.com/opinion/ais-overlooked-97-billion-contribution-to-the-economy-users-service-da6e8f55">consumer surplus of AI</a> puts it at about 10&#215; revenue.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The eventual total addressable market for AI is much larger.
The question here is rather how much can be captured in the next three years, given current AI capabilities and diffusion.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[How many digital workers could OpenAI deploy?]]></title><description><![CDATA[OpenAI has the inference compute to deploy millions of digital workers, but only on a narrow set of tasks &#8211; for now.]]></description><link>https://epochai.substack.com/p/how-many-digital-workers-could-openai</link><guid isPermaLink="false">https://epochai.substack.com/p/how-many-digital-workers-could-openai</guid><dc:creator><![CDATA[JSD]]></dc:creator><pubDate>Sat, 04 Oct 2025 17:50:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sVsF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88cdcde-9b31-4206-96c7-664d554cad04_1768x1554.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The core argument for how AI could drive <a href="https://epoch.ai/blog/explosive-growth-from-ai-a-review-of-the-arguments">explosive economic growth</a> is that you can dramatically scale up the number of AI &#8220;digital workers&#8221;. The idea is that growth is constrained by labor, so rapidly expanding the workforce would hugely accelerate growth rates.</p><p>This is where AI comes in. While you can&#8217;t double the human population each year, you can double the number of AI chips &#8211; as we&#8217;ve seen with <a href="https://epoch.ai/data-insights/nvidia-chip-production">NVIDIA</a> and OpenAI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> So if AI can fully substitute for human workers, the workforce could grow many times faster than today. As a result, the economy could grow many times faster too.</p><p>If this framing is right, then to know how far we are from explosive growth, we need to answer three questions. First, how many AI &#8220;digital workers&#8221; can be deployed today? Second, how far is AI from fully substituting for human workers? And third, how are both of these changing over time?</p><p>In this post, we&#8217;ll take a stab at the first question: On the tasks that AIs are able to perform today, how many &#8220;human-equivalent digital workers&#8221; could frontier AI labs deploy to work on them?</p><p>Based on a speculative back-of-the-envelope calculation, we estimate that companies like OpenAI have the hardware to deploy on the order of 7 million digital workers, with a wide 90% confidence interval of 400,000 to around 300 million.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> This doesn&#8217;t mean that OpenAI could do the jobs of 7 million human employees <em>today</em>, because AIs can&#8217;t fully substitute for humans. 
But as AI progress continues, AIs will be able to perform an increasing fraction of the tasks that humans currently do.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wSkL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F330b8137-045c-44cf-ae9d-3f7ae1b11f4e_1268x404.png" alt="Chart: estimated number of human-equivalent digital workers that frontier AI labs could deploy"></figure></div><h2>Estimating the number of digital workers that frontier labs can deploy</h2><p>But where does this 7 million number come from? There&#8217;s no &#8220;AI employee count&#8221; webpage from the Bureau of Labor Statistics. So what we need to do is somehow figure out how much &#8220;work&#8221; is being done by running AIs at frontier labs.</p><p>Our approach is to compare the number of tokens that AIs process daily to the number of effective tokens that a human worker &#8220;processes&#8221; over a day when they read, think, and write. Then when you take the ratio of these two, you get the number of digital workers <em>on the tasks that AIs can currently solve.</em></p><p>The first step is thus to figure out how many tokens frontier labs produce each day, and for concreteness let&#8217;s consider OpenAI&#8217;s GPT-5.</p><p>One approach is to look at OpenAI&#8217;s inference compute budget.
Based on reporting from <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026?rc=9mzoog">The</a> <a href="https://www.theinformation.com/articles/openai-discusses-building-first-data-center-storage?rc=9mzoog">Information</a>, we think they currently have around 480,000 H100 equivalents for running models,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> which is enough for around 10<sup>25</sup> FLOP each day.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> It probably takes on the order of 100 to 600 billion FLOP for GPT-5 to generate each token,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> so that works out to around 10 to 100 trillion tokens per day.</p><p>For another approach, we also look directly at usage statistics: ChatGPT likely handles around 4 billion <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf#page=37.10">messages each day</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> If API usage contributes an additional quarter on top of ChatGPT&#8217;s tokens, and the mean message is four thousand tokens long, then that implies GPT-5 generates around 20 trillion tokens per day.</p><p>For our final calculations we simply take a mixture of the predictions of these two approaches (with confidence intervals to capture uncertainty), with a median of around 19 trillion tokens a day. This is comparable to the 35 trillion tokens per day generated by Google&#8217;s AI models, so we think this is probably in the right ballpark.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>Now that we know the number of tokens, the next step is to convert these token numbers to &#8220;human-equivalent digital workers&#8221;. Essentially, how many effective tokens does a human employee &#8220;process&#8221; in a day?</p><p>We could look at the number of words spoken or the number of words written. But this would likely be an underestimate, since humans do a lot more thinking than they do speaking or writing. So we instead anchor to human thinking speeds of around <a href="https://bounded-regret.ghost.io/what-will-gpt-2030-look-like/#3-throughput-and-parallel-copies">380 words per minute</a>. If we convert this to tokens, and assume that each human works for 8 hours a day, that implies that each human worker processes around 240 thousand tokens a day.</p><p>Given this number, it&#8217;s really tempting to say &#8220;GPT-5 outputs 19 trillion tokens a day and each human processes around 240 thousand tokens, so there are 19 trillion / 240 thousand = 80 million digital workers&#8221;. But this assumes that each human token is worth the same as each GPT-5 token, which might not be true.</p>
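<p><em>Here is a point-estimate Python sketch of the token arithmetic so far. The real analysis uses probability distributions, and the tokens-per-word ratio below is our own rough assumption, chosen to reproduce the 240 thousand figure.</em></p><pre><code># Approach 1: inference compute budget.
flop_per_day = 1e25                  # estimated daily inference FLOP
for flop_per_token in (600e9, 100e9):
    print(f"{flop_per_day / flop_per_token / 1e12:.0f}T tokens/day")  # ~17T to 100T

# Approach 2: usage statistics.
messages_per_day = 4e9
tokens_per_message = 4_000
api_multiplier = 1.25                # API adds ~a quarter on top of ChatGPT
print(f"{messages_per_day * tokens_per_message * api_multiplier / 1e12:.0f}T")  # 20T

# Human side: thinking speed over an 8-hour workday.
words_per_minute = 380
tokens_per_word = 4 / 3              # rough words-to-tokens conversion (assumed)
human_tokens = words_per_minute * 60 * 8 * tokens_per_word  # ~240 thousand

# Naive 1:1 token conversion, using the ~19T/day median from the text.
print(f"{19e12 / human_tokens / 1e6:.0f} million digital workers")  # ~80 million
</code></pre>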
<p>So we can consider an alternative approach that relies on a <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">study by METR</a>, which looks at whether AIs can perform a (software-related) task that takes humans a specified amount of time to perform. For example, one task might involve debugging a particular program, which typically takes human coders an hour to complete. The idea is then to look at how many tokens GPT-5 needs to do the same task and multiply that by the number of hours in a work day, representing the tokens needed to match a human over a day. In the case of the METR tasks, GPT-5 typically uses 100 thousand to 1 million tokens, so over an eight-hour workday that&#8217;s 800 thousand to 8 million tokens. And if we divide 19 trillion by this, we end up with 2.4 million to 24 million digital workers.</p><p>Combining these two approaches suggests a median estimate of 7 million digital workers, but with very wide confidence intervals.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!sVsF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88cdcde-9b31-4206-96c7-664d554cad04_1768x1554.png" alt="Chart: combined estimate of deployable digital workers, with wide confidence intervals"></figure></div><h2>What do these numbers tell us?</h2><p>7 million digital workers is a lot &#8211; for comparison, the world&#8217;s largest employer has <a href="https://en.wikipedia.org/wiki/List_of_largest_employers">2.1 million employees</a>. This estimate is quite different from prior approaches. For instance, <a href="https://www.openphilanthropy.org/research/what-a-compute-centric-framework-says-about-takeoff-speeds/">some</a> <a href="https://epoch.ai/gate">works</a> estimate the number of digital workers by dividing a compute stock by the <a href="https://www.openphilanthropy.org/research/how-much-computational-power-does-it-take-to-match-the-human-brain/">computational power of a human brain</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> Given that OpenAI has enough compute to do 10<sup>25</sup> FLOP per day, and the human brain needs 10<sup>15</sup> FLOP/s to run, that amounts to around 100,000 workers &#8211; around 70&#215; smaller than our estimate!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p>
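<p><em>The same point-estimate treatment works for the METR-anchored and brain-FLOP conversions; all inputs are the figures stated above.</em></p><pre><code>tokens_per_day = 19e12  # median GPT-5 output tokens per day

# METR-anchored: tokens GPT-5 spends on a one-human-hour task,
# scaled to an eight-hour workday.
for tokens_per_task in (1e6, 100e3):
    print(f"{tokens_per_day / (tokens_per_task * 8) / 1e6:.1f}M workers")
# 2.4M workers, then 23.8M workers

# Brain-FLOP approach: inference compute divided by an estimate of
# the brain's compute, run around the clock.
workers = 1e25 / (1e15 * 86_400)
print(f"{workers / 1e3:.0f}k workers")  # ~116k, i.e. around 100,000
</code></pre>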
<p>Of course, as we alluded to earlier, it&#8217;d be wrong to conclude that AIs can now do the jobs of 7 million people, because current AI systems aren&#8217;t able to do all the tasks that humans can.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> Humans and AIs have different skill profiles, and currently humans are able to do many tasks that AIs can&#8217;t do regardless of how much inference compute they use. But as AI progress continues, AIs will substitute for humans on a wider and wider range of tasks, and our calculation approach becomes more directly meaningful &#8211; at least as a lower bound on AI&#8217;s economic impact, since AIs will also be capable of tasks that <a href="https://tecunningham.github.io/posts/2025-09-19-transformative-AI-notes.html#we-dont-have-a-standard-model-of-ai:~:text=wages%20high.%E2%86%A9%EF%B8%8E-,AI%20scientists%20will%20be%20unlike%20human%20scientists,-Will%20efficiency%20curves">no human could perform</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><p>In particular, suppose AIs became good enough to automate a decent chunk of the tasks in the world economy, including the jobs of software engineers and remote workers. Even if we were stuck with the same number of digital workers, the impacts could be pretty notable. On the tasks that AIs can perform, they are typically <a href="https://arxiv.org/pdf/2503.14499#page=21.44">much cheaper than humans</a>. For example, suppose a software engineer takes an hour to debug some code &#8211; then they might be paid $50 to $100. But if GPT-5 is able to do the same task, it would instead cost $1 to $10.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> Such a substantial cost difference would provide a strong incentive to use AIs instead of humans on the automated tasks, potentially impacting the jobs of tens of millions of workers.</p><p>Moreover, if millions of digital workers can be deployed in parallel, learn from their real-world interactions and <a href="https://bounded-regret.ghost.io/what-will-gpt-2030-look-like/#:~:text=4.%20Knowledge%20Sharing">share the resulting knowledge</a>, this would allow AI systems to accumulate knowledge much more quickly than humans. So there&#8217;s the potential for AIs to learn missing skills very quickly.</p>
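<p><em>That cost gap is simple to reproduce from the numbers above and in footnote 13; the API price is the cited ~$10 per million output tokens.</em></p><pre><code># Cost for GPT-5 to do a task that takes a human engineer an hour
# (and roughly $50 to $100 in wages).
price_per_million_tokens = 10  # USD, cited API output price
for tokens in (100e3, 1e6):
    print(f"${tokens / 1e6 * price_per_million_tokens:.0f}")  # $1, then $10
</code></pre>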
<p>On the other hand, one could argue that these 7 million workers are still peanuts compared to the overall economy &#8211; the world&#8217;s working population is still two orders of magnitude larger, so this wouldn&#8217;t be enough to drive explosive growth. With current compute stocks and automation costs, this is true even if we had AI that could fully substitute for humans.</p><p>But <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion">some</a> <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">have</a> also raised the possibility of a more extreme scenario &#8211; if these digital workers were primarily concentrated in AI R&amp;D, that could substantially accelerate AI progress, further increasing the number of digital AI researchers, and so on. While 7 million workers is small on a global scale, it&#8217;s likely still substantially higher than the world&#8217;s current population of <em>AI researchers</em>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a></p><p>So the implications of our estimate could change pretty substantially depending on several factors. We need to know how quickly AI automates more tasks. We need to know how much more compute there&#8217;ll be to run digital workers over the next few years, and whether researchers can use this compute more efficiently at inference. And we also need to know how likely we are to see dynamics like explosive economic growth and a software intelligence explosion. These are questions that we&#8217;ll investigate in forthcoming issues.</p><p><em>We would like to thank Lynette Bye for her feedback on this post.</em></p><p class="cta-caption">Thanks for reading! Subscribe to receive the latest from Epoch AI.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For example, OpenAI&#8217;s compute spending grew from <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026?rc=spkbjw">$6 billion</a> mid-2024 to <a href="https://www.theinformation.com/articles/openai-forecast-shows-shift-from-microsoft-to-softbank?rc=spkbjw">$13 billion</a> mid-2025 &#8211; but this growth rate might also <a href="https://epoch.ai/gradient-updates/compute-scaling-will-slow-down-due-to-increasing-lead-times">slow down</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>You can find the full code for our analysis <a href="https://squigglehub.org/models/jsd/ai-simulated-workers">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Specifically, based on reporting from <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026?rc=9mzoog">The</a> <a href="https://www.theinformation.com/articles/openai-discusses-building-first-data-center-storage?rc=9mzoog">Information</a> we estimate that they currently have around 1.1 million H100 equivalents in total compute stock, and we place a 90% credible interval on this from around 800 thousand to 1.4 million H100 equivalents. Around <a href="https://epoch.ai/data/ai-companies">44%</a> of this goes to inference compute, as opposed to training or experiments.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This depends on several assumptions. First, we assume that inference runs on FP4 on hardware that supports it (e.g. GB200s) and on FP8 otherwise (e.g. H100s). Second, we assume an inference utilization of around 5% to 30%, in part based on our estimates of Llama 3.3&#8217;s and DeepSeek-R1&#8217;s utilizations.
The full calculation can be found in the <a href="https://squigglehub.org/models/jsd/ai-simulated-workers">code</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>We estimate that GPT-5 has between 50 billion and 300 billion active parameters, so at 2 FLOP per active parameter, we get 100 billion to 600 billion FLOP per token.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In June 2025, <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf#page=2.10">ChatGPT daily message counts</a> were 2.627 billion, compared to 451 million one year earlier. This corresponds to a growth rate of 2627 / 451 = 5.8&#215; per year, such that we expect around 4 billion daily messages by September 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In July 2025, Google&#8217;s AI models <a href="https://blog.google/inside-google/message-ceo/alphabet-earnings-q2-2025/#search">reportedly processed 980 trillion tokens</a> per month, or around 35 trillion tokens per day.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>We combine these two approaches in part because we&#8217;re uncertain about how many &#8220;human tokens&#8221; equal one GPT-5 token. This can vary a lot depending on the task &#8211; for example, GPT-5 does translation with far fewer tokens than a human would need based on thinking time. So we don&#8217;t want to anchor too much on a 1:1 ratio or tasks like those from the METR study.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This is true at a coarse level, where some inference compute budget is divided by some runtime compute cost. But in practice things can be more complicated, e.g. in the <a href="https://epoch.ai/gate">GATE model</a> the runtime compute requirements can change substantially depending on the task.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Of course, both our approach and this compute-based approach are highly uncertain &#8211; e.g.
because it&#8217;s hard to pin down just how many FLOP/s the brain is doing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>We&#8217;ve also glossed over details like how fast AI systems are run.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>One dimension of these capability improvements is on <a href="https://epochai.substack.com/p/the-huge-potential-implications-of">long-context tasks</a>, which could impact our estimate of the number of digital workers. For example, long context decoding (but not prefill) is typically bottlenecked by memory bandwidth. In our analysis, this would translate into lower compute utilizations, but a more sophisticated approach could also directly model labs&#8217; total memory bandwidth stock, as <a href="https://ai-2027.com/research/compute-forecast">AI Futures attempted in the AI 2027 report</a>. This bottleneck may persist even with <a href="https://epoch.ai/data-insights/llm-inference-price-trends">improvements in inference efficiency</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>This is based on the METR task example we gave earlier. If GPT-5 takes 100,000 to 1 million tokens to perform a task that takes humans an hour, then at <a href="https://web.archive.org/web/20250920151257/https://openai.com/api/pricing/">$10 per million output tokens</a> this would add up to $1 to $10.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>This is especially true if we restrict ourselves to research that occurs on frontier models. For example, if we sum the <a href="https://epoch.ai/data/ai-companies">number of employees at frontier AI labs</a>, we likely get on the order of 10,000 employees. This is already multiple orders of magnitude lower than our estimate, not to mention that the actual number is likely lower, since not all employees are researchers.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Why GPT-5 used less training compute than GPT-4.5 (but GPT-6 probably won’t)]]></title><description><![CDATA[OpenAI focused on scaling post-training on a smaller model]]></description><link>https://epochai.substack.com/p/why-gpt-5-used-less-training-compute</link><guid isPermaLink="false">https://epochai.substack.com/p/why-gpt-5-used-less-training-compute</guid><dc:creator><![CDATA[Yafah Edelman]]></dc:creator><pubDate>Fri, 26 Sep 2025 20:31:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qAwq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74ce5a28-866a-4830-89ec-15fbc60893ab_2048x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Out of all the GPT models, GPT-5 is the odd one out.
]]></content:encoded></item><item><title><![CDATA[Why GPT-5 used less training compute than GPT-4.5 (but GPT-6 probably won’t)]]></title><description><![CDATA[OpenAI focused on scaling post-training on a smaller model]]></description><link>https://epochai.substack.com/p/why-gpt-5-used-less-training-compute</link><guid isPermaLink="false">https://epochai.substack.com/p/why-gpt-5-used-less-training-compute</guid><dc:creator><![CDATA[Yafah Edelman]]></dc:creator><pubDate>Fri, 26 Sep 2025 20:31:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qAwq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74ce5a28-866a-4830-89ec-15fbc60893ab_2048x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Out of all the GPT models, GPT-5 is the odd one out. Unlike all previous versions of GPT, it was likely <a href="https://x.com/EpochAIResearch/status/1953883611389702169">trained on less compute</a> than its immediate predecessor, GPT-4.5.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qAwq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74ce5a28-866a-4830-89ec-15fbc60893ab_2048x1536.png" width="1456" height="1092" alt=""><figcaption class="image-caption"><em>While the exact numbers are uncertain, GPT-4.5 very likely used more training compute than GPT-5.</em></figcaption></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>While the exact numbers are uncertain, GPT-4.5 very likely used more training compute than GPT-5.</em></figcaption></figure></div><p>But this leads to a puzzle: Models trained with more compute tend to be better, so why did OpenAI train GPT-5 with less compute than GPT-4.5? And what will this mean for future OpenAI models?</p><p>In this post, we&#8217;ll argue that the answers to these questions are the following:</p><ul><li><p><strong>GPT-5 used less training compute than GPT-4.5 because OpenAI focused on scaling post-training. </strong>New post-training techniques made it possible to outperform GPT-4.5 with less training compute, but these methods likely weren&#8217;t yet mature enough to be applied at GPT-4.5&#8217;s compute scale. Doing so would&#8217;ve taken more time (and compute), which OpenAI likely chose not to do due to strong market pressures.</p></li><li><p><strong>OpenAI&#8217;s next flagship model (&#8220;GPT-6&#8221;) will probably be trained on more compute than GPT-4.5: </strong>When OpenAI figures out how to productively scale post-training, they&#8217;ll likely shift back using more training compute.</p></li></ul><p>Importantly, when we say &#8220;training compute&#8221;, we&#8217;re focusing on the compute to perform the final training run of a model. It&#8217;s likely that the total compute for developing GPT-5 was higher than for GPT-4.5, if we also account for the compute for running experiments. This is because OpenAI&#8217;s (projected) R&amp;D compute spend has grown from <a href="https://www.theinformation.com/articles/openai-projections-imply-losses-tripling-to-14-billion-in-2026?rc=spkbjw">~$5 billion</a> in 2024 to <a href="https://www.theinformation.com/articles/openai-spend-100-billion-backup-servers-ai-breakthroughs?rc=spkbjw">~$9 billion in 2025</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Now let&#8217;s consider the arguments in turn.</p><h2>GPT-5 used less training compute than GPT-4.5 because OpenAI focused on scaling post-training</h2><p>Why did GPT-5 use less training compute than GPT-4.5? 
We believe this is a combination of two factors. First, OpenAI decided to prioritize scaling post-training, which had better returns on the margin. Second, they couldn&#8217;t readily scale post-training compute to GPT-4.5 levels at the time: scaling post-training on a model with as much pre-training as GPT-4.5 would&#8217;ve run into timing and experimental compute constraints.</p><p>Until recently, most LLMs were trained with <a href="https://epoch.ai/blog/ai-capabilities-can-be-significantly-improved-without-expensive-retraining">100&#215; more pre-training than post-training compute</a>. However, around September 2024, researchers developed the novel techniques behind &#8220;reasoning models&#8221;, which help scale post-training compute effectively. Researchers could now triple post-training compute in a way that was at least as useful as tripling pre-training compute. In fact, these reasoning techniques make it possible to reduce pre-training compute by <a href="https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models">roughly 10&#215;</a> while getting the same performance!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>This means that, rather than spending around $200 million on pre-training and $2 million on post-training GPT-4.5,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> new post-training techniques made it possible for that $2 million in post-training to achieve the same overall performance with only $20 million in pre-training. That&#8217;s roughly a ten-fold decrease in training costs, though it doesn&#8217;t imply that total model development costs were lower, due to increases in the compute needed to run experiments. The upshot is that OpenAI was likely able to train a model with less compute than GPT-4.5, while still <a href="https://openai.com/index/introducing-gpt-4-5/">outperforming</a> it on many useful tasks like <a href="https://www.nber.org/system/files/working_papers/w34255/w34255.pdf">coding and search</a>.</p>
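<p>To make this arithmetic concrete, here&#8217;s a minimal sketch using the illustrative dollar figures above &#8211; rough, round numbers for illustration, not OpenAI&#8217;s actual budgets:</p><pre><code># Illustrative training budgets from the paragraph above (rough estimates only).
old_pretrain, old_posttrain = 200e6, 2e6  # GPT-4.5-style split, roughly 100:1
new_pretrain, new_posttrain = 20e6, 2e6   # ~10x less pre-training, same performance

ratio = (old_pretrain + old_posttrain) / (new_pretrain + new_posttrain)
print(f"{ratio:.1f}x cheaper")  # 9.2x cheaper, i.e. roughly a ten-fold decrease
</code></pre>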
<p>However, while this shows that OpenAI could&#8217;ve outperformed GPT-4.5 with less training compute, it doesn&#8217;t fully explain why they chose this strategy in practice. For example, why not just post-train GPT-4.5? And why not post-train a smaller model on enough data to reach GPT-4.5&#8217;s level of training compute?</p><p>The core reason is that scaling post-training in this way is challenging. It requires lots of testing and experimentation, which takes time and compute, especially when performed on larger, newer models.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> It also requires a significant amount of high-quality post-training data, which takes time to design and collect.</p><p>Crucially, OpenAI faced major time constraints due to market pressures. Fierce competition from rival AI labs threatened their <a href="https://epoch.ai/data-insights/ai-companies-revenue">revenue</a> &#8211; e.g. Anthropic&#8217;s models had been <a href="https://epoch.ai/benchmarks/swe-bench-verified">consistently outperforming</a> OpenAI&#8217;s models at coding. And there was added pressure because many had expected OpenAI to release a model called &#8220;GPT-5&#8221; as early as <a href="https://www.ft.com/content/dd9ba2f6-f509-42f0-8e97-4271c7b84ded">November 2023</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Given these constraints, we believe that OpenAI scaled post-training on a smaller model as much as they could. Scaling further would&#8217;ve required either more experiments than they had the compute or time for, or post-training data that they didn&#8217;t have. Post-training a GPT-4.5-sized model, let alone starting a larger <a href="https://epoch.ai/data-insights/longest-training-run">multi-month</a> pre-training run and doing post-training on top, would&#8217;ve taken too much time or too much experiment compute.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>The result of this effort to scale post-training was GPT-5, a new state-of-the-art model that OpenAI was able to release by August.</p><h2>GPT-6 will probably be trained on more compute than GPT-4.5</h2><p>What does this mean for training compute trends moving forward? Our best guess is that future iterations of GPT will be trained on more compute.</p><p>To see why, consider the bigger picture. Training GPT-5 with less compute than GPT-4.5 is part of a broader trend, in which the training compute of state-of-the-art models has grown more slowly than one might&#8217;ve expected a year ago.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> Since post-training was just a small portion of training compute and scaling it yielded huge returns, AI labs focused their limited training compute on scaling post-training rather than pre-training.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>In fact, these reasoning post-training techniques have scaled <a href="https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale">much faster than pre-training compute</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> At this rate, tripling post-training compute will soon be akin to tripling the entire compute budget &#8211; so current growth rates likely <a href="https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale">can&#8217;t be sustained</a> for much more than a year.</p>
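<p>A back-of-the-envelope sketch shows why the runway is short. Assume post-training starts at ~1% of the training budget (the ~100:1 split above), grows 10&#215; every 4 months (footnote 10), and overall training compute grows around 5&#215; per year (footnote 8) &#8211; all rough figures from this post, not measured constants:</p><pre><code>import math

# Rough assumptions drawn from this post's estimates (not measured constants):
post_share = 0.01      # post-training starts at ~1% of the training budget
post_growth = 10 ** 3  # 10x every 4 months -> ~1000x per year
budget_growth = 5      # overall training budgets grow ~5x per year

# The post-training share of the budget grows ~200x per year; how long
# until it reaches half the budget, i.e. as large as everything else?
years = math.log(0.5 / post_share) / math.log(post_growth / budget_growth)
print(f"~{years:.1f} years")  # ~0.7 years: this pace can't last much past a year
</code></pre>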
<p>That means this broader trend is likely to end &#8211; we may see a reversion to the original trend of training compute growth. If this is right, GPT-6 is likely to need much more training compute than GPT-5, and probably more than GPT-4.5. Not to mention, OpenAI plans to significantly expand its compute stock, with many more <a href="https://x.com/sama/status/1947057625780396512?t=OwQjfzYJTdYzy8vRCXaJHg">GPUs brought online by the end of the year</a>,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> and major clusters like <a href="https://openai.com/index/five-new-stargate-sites/">Stargate Abilene</a> coming online in phases.</p><p>On the other hand, it&#8217;s possible that the bottlenecks to scaling training compute are more severe than we anticipate, e.g. due to limits in the <a href="https://www.theverge.com/2024/12/13/24320811/what-ilya-sutskever-sees-openai-model-data-training">availability of</a> <a href="https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data">high-quality pre-training data</a> and <a href="https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/">post-training RL environments</a>. It&#8217;s also just hard to pin down <a href="https://epoch.ai/gradient-updates/three-issues-undermining-compute-based-ai-policies">how to measure &#8220;training compute&#8221;</a>. For instance, it&#8217;s plausible that a larger model could be used to generate synthetic data for a smaller one &#8211; so should the total &#8220;training compute&#8221; for the smaller model include the compute for training the larger model?</p><p>Overall, we think we&#8217;ll likely revert to the trend of training compute growth &#8211; <a href="https://epoch.ai/gradient-updates/compute-scaling-will-slow-down-due-to-increasing-lead-times">though perhaps not for long</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> Moreover, while training compute scaling has slowed, infrastructure buildout has continued. This means more compute is available relative to the size of training runs &#8211; so training compute scaling could come back with a vengeance.</p><p><em>We&#8217;d like to thank Josh You, Lynette Bye, and David Owen for their feedback.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>OpenAI has historically scaled up training compute by around 100&#215; with each (integer) generation of GPT, and <a href="https://x.com/EpochAIResearch/status/1953883611389702169">GPT-5 is an exception to this trend</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There are several things to note about these numbers. First, R&amp;D compute accounts for both running experiments and the final training run. Second, these are estimates based on The Information&#8217;s projections of OpenAI&#8217;s spending, and are thus quite uncertain. Third, the 2024 R&amp;D compute cost is reported as two separate numbers: $3 billion in the &#8220;compute to train models&#8221;, and $1 billion in &#8220;research compute amortization&#8221;. It&#8217;s not entirely clear what this means, but GPUs are typically amortized over several years.
For simplicity we assume amortization over two years, so the $1 billion of annual amortization corresponds to roughly $2 billion of research hardware, leading to $2 billion + $3 billion = $5 billion in total R&amp;D compute. Our point stands even if we assume amortization over four years.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The exact number here should be taken as a rough estimate that illustrates the magnitude of the savings, in part because it was estimated using limited data, and in part because the effect size depends on the specific benchmark.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For illustration, the cost of training Grok 4 was <a href="https://epoch.ai/data-insights/grok-4-training-resources">around $500 million</a>, and we expect GPT-4.5 to be in a similar ballpark (but slightly lower).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>One reason this is harder for larger models is that more time and compute are needed to run the model in post-training environments, a crucial part of post-training.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>GPT-4.5 was likely also an attempt to develop a model called &#8220;GPT-5&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>One more reason that OpenAI likely chose to focus on scaling post-training without increasing pre-training compute for GPT-5 is to reduce inference costs. But it&#8217;s not clear how much of a role this played &#8211; for example, if OpenAI were worried about inference speeds or costs, they could&#8217;ve distilled a larger model into a smaller one, making it both faster and cheaper to run.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>For instance, in the 2.5 years since GPT-4 was released, most state-of-the-art models, such as Claude 3.7 Sonnet and o3, used less than 3&#215; more training compute than it did. This seems slow given other trends &#8211; the training compute of the most compute-intensive models has been expanding at 5&#215; per year. The reason for this discrepancy is that models trained with the most compute aren&#8217;t necessarily state-of-the-art in performance. For example, while models like Llama 4 Behemoth exceeded this 3&#215; threshold, they didn&#8217;t clearly outperform the best models from competing labs on benchmarks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>There are also other practical factors, which we&#8217;ve alluded to, that contribute to this trend.
For instance, it&#8217;s also easier to scale novel reasoning techniques on smaller models, so labs have temporarily focused on releasing relatively small models rather than scaling pre-training further.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>One estimate is that this post-training compute can be <a href="https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale">scaled up 10&#215; every 4 months</a>. Scaling probably doesn&#8217;t go even faster because of bottlenecks, such as how quickly environments can be built for RL post-training.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>While Sam Altman <a href="https://x.com/sama/status/1947057625780396512?t=OwQjfzYJTdYzy8vRCXaJHg">tweeted</a> that OpenAI would have &#8220;well over 1 million GPUs brought online&#8221; by the end of the year, these are unlikely to be purely top-end GPUs. For example, the figure plausibly includes a mix of A100s, H100s, and Blackwell GPUs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>For example, this could happen through <a href="https://tmychow.substack.com/p/pre-training-isnt-dead-its-just-resting">further pre-training scaling</a> or <a href="https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale">RL post-training scaling</a>.</p></div></div>]]></content:encoded></item></channel></rss>