Benchmark Scores = General Capability + Claudiness

Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?

Nov 20, 2025

The Gemini 3 release included a massive table showing how the model was state-of-the-art on nineteen diverse benchmarks. Such tables are commonplace by now, but they add up to an odd statistical situation. Benchmarks ostensibly measure different things, but since models tend to improve on many benchmarks at once, the dataset of benchmark scores is dominated by a single “General Capability” dimension.

In this post, I’ll describe the statistics of this dataset, look into what’s left when you factor out this dominant dimension (hint: it’s “Claudiness”), and discuss how this relates to a key question about cross-task generalization.

Benchmarking data is dominated by a single underlying dimension

This is one of the lessons of our recent work on the Epoch Capabilities Index (ECI), which combines thirty-nine benchmarks into a single capabilities score. If benchmarks were generally uncorrelated with each other, you’d expect to see large residuals: the benchmark scores predicted by a model’s ECI number wouldn’t match the model’s actual benchmark scores. As it turns out, we see a very good match. In other words, our nominally high-dimensional dataset is well-approximated by just a single dimension.

To look beyond this dimension, we can do a Principal Component Analysis (PCA). This basically asks: if we make synthetic “components” by taking weighted sums of the different benchmark scores, how much of the variance in the dataset can we account for with as few components as possible?

When we do this on the raw data underlying ECI, the first component captures about half the variance of the dataset.[1][2] The table below shows the weights on the different benchmarks in this component, accounting for 80% of the total weight. Note that the weights are all positive and not very dispersed. That is, PCA also finds a single “general capability” component.
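To make the PCA step concrete, here is a minimal sketch on synthetic data, not the ECI dataset. It assumes 60 hypothetical models and 10 hypothetical benchmarks whose scores are driven by one latent ability plus noise; the `pca` helper, the loadings, and the noise scale are all illustrative choices. It reports the share of variance captured by the first component and the weights that component puts on each benchmark.

```python
import numpy as np

def pca(scores):
    """PCA via SVD on column-standardized scores (rows = models, columns = benchmarks)."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return explained, vt  # rows of vt are the component weight vectors

# Synthetic stand-in (assumption): 60 hypothetical models, 10 hypothetical benchmarks.
rng = np.random.default_rng(0)
ability = rng.normal(size=(60, 1))               # one latent ability per model
loadings = rng.uniform(0.6, 1.0, size=(1, 10))   # every benchmark loads positively on it
scores = ability @ loadings + rng.normal(scale=0.7, size=(60, 10))

explained, components = pca(scores)
first = components[0]
first = first if first.sum() > 0 else -first     # the sign of a component is arbitrary

print("share of variance, first component:", round(float(explained[0]), 2))
print("weights on first component:", np.round(first, 2))
```

When one latent ability drives every benchmark, the first component's weights all share a sign and are of similar size, which is the pattern described above.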

Moving beyond the first principal component, the chart below shows the magnitudes of all the principal components, plotted against the size of components found in randomly generated data of the same shape (i.e., a parallel analysis). We see the single large component mentioned above, a second component that is borderline significant, and the rest having small sizes, consistent with noise.
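A parallel analysis of this kind can be sketched as follows. This is an illustration on synthetic data under stated assumptions: component size is measured by eigenvalues of the correlation matrix, the noise baseline is Gaussian data of the same shape, and the 95th percentile is used as the cutoff. The function names and the cutoff are choices made for this sketch and may differ from the analysis behind the chart.

```python
import numpy as np

def eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def parallel_analysis(data, n_draws=200, quantile=0.95, seed=0):
    """Compare observed eigenvalues with those of pure-noise data of the same shape."""
    rng = np.random.default_rng(seed)
    observed = eigenvalues(data)
    null = np.array([eigenvalues(rng.normal(size=data.shape)) for _ in range(n_draws)])
    threshold = np.quantile(null, quantile, axis=0)  # per-rank noise cutoff (assumed 95%)
    return observed, threshold

# Synthetic stand-in (assumption): one shared ability plus independent noise.
rng = np.random.default_rng(1)
ability = rng.normal(size=(60, 1))
scores = ability @ rng.uniform(0.6, 1.0, size=(1, 10)) + rng.normal(scale=0.7, size=(60, 10))

observed, threshold = parallel_analysis(scores)
exceeds = observed > threshold
# Number of leading components that stand above the noise baseline.
n_retained = len(exceeds) if exceeds.all() else int(np.argmin(exceeds))

print("observed (top 3): ", np.round(observed[:3], 2))
print("threshold (top 3):", np.round(threshold[:3], 2))
print("components above noise:", n_retained)
```

A component counts as larger than noise only if it exceeds what random data of the same shape produces at the same rank.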

Benchmarking data shows a smaller “Claudiness” dimension

What is this second component? Here are the top weights, by absolute value, that this component assigns to the benchmarks, again accounting for over 80% of the total weight. By construction, this component is orthogonal to the main “general capability” component. When I first saw this, I said it looked something like, “good at agentic tasks, but bad at vision… and also bad at math?”

But I showed it to a colleague and he just said, “it’s Claude”. He was right. Here are the top five models on this component, as well as the bottom five.
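To illustrate how a developer-specific profile can surface as a second component, here is a sketch on synthetic data. Everything in it is hypothetical: three made-up developers (`lab_a`, `lab_b`, `lab_c`), a shared ability, and an assumed tilt in which `lab_a` models do better on the first five benchmarks and worse on the last five. It then ranks models by their score on the second component.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_lab, n_benchmarks = 20, 10
labs = np.repeat(["lab_a", "lab_b", "lab_c"], n_per_lab)   # hypothetical developers

ability = rng.normal(size=(labs.size, 1))
general = ability @ np.full((1, n_benchmarks), 0.8)
# Assumed tilt: lab_a is stronger on benchmarks 0-4 and weaker on benchmarks 5-9.
profile = np.array([1.0] * 5 + [-1.0] * 5)
tilt = np.where(labs == "lab_a", 1.0, 0.0)[:, None] * profile
scores = general + tilt + rng.normal(scale=0.4, size=(labs.size, n_benchmarks))

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
first, second = vt[0], vt[1]
# Component signs are arbitrary; orient so the first five benchmarks get positive weight.
second = second if second[:5].sum() > 0 else -second

model_scores = z @ second            # each model's position along the second component
order = np.argsort(model_scores)

print("dot product of first and second component:", round(float(first @ second), 8))
print("top five:   ", labs[order[-5:]])
print("bottom five:", labs[order[:5]])
```

The dot product is zero up to rounding because principal components are orthogonal by construction. In this toy setup the top of the ranking is filled by the developer with the assumed tilt, which mirrors the pattern in the real data without explaining its cause.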

I think this second component shows that benchmarks aren’t entirely “general capability” plus “noise”, even if that is a pretty good approximation. Even though this component is only borderline statistically significant, I think it’s fair to say that it aligns with the general public sense of Anthropic’s priorities, i.e. they seem to be making Claude like this on purpose. This updated my thinking a bit on a broader question, as I’ll explain next.

Is the “general capability” dimension deep, or contingent?

The big question here is why a single dimension captures so much of the variance in benchmark scores. I can think of two possible reasons, corresponding to two possible worlds we may be in. I’ll call these worlds “deep” and “contingent”.

In the “deep” world, there is a single underlying ability that governs how well models do at superficially unrelated tasks. In this world, the only thing a model developer can do is make this ability go up. If they succeed, their model gets better at everything.

In the “contingent” world, there are many orthogonal abilities that models can have. These are orthogonal in the sense that model developers have to do completely unrelated work to get a model to improve on each ability. Still, in the world I’m imagining, customers demand models with many capabilities, and so developers put in the work to make this happen.

Which world more resembles our own? Sometimes in the history of AI, things have looked like the contingent world. AlphaGo was superhuman at Go but it was nonsense to ask it to do anything else. At other times, things have looked like the deep world. When LLMs were picking up steam, next-token prediction on relatively uncurated web text was tearing through NLP tasks that had previously been dominated by specialized models.[3]

To a first approximation, benchmark scores look the same in both worlds. But the existence of the Claudiness dimension feels to me like a bit of evidence for the “contingent” world. Anthropic has focused on making models that are state-of-the-art at agentic coding. Without additional focused investment, the models turn out not to be exceptional at advanced math. There is surely some generalization across tasks, but perhaps this is a sign of its limits.

A trillion dollar question

The Claudiness dimension is not very strong evidence for the contingent view. Stronger evidence might be how model developers are investing heavily in collecting specialized data, like reinforcement learning (RL) environments for industry verticals. Even so, it’s possible that developers are collecting this data and that RL nonetheless generalizes well across tasks.

One way to test this would be to find an uncontaminated benchmark that measures something unusual, and see if it correlates with the “General Capability” dimension. Unfortunately, we don’t know what counts as “unusual” for models because we don’t know what they saw in training. Also, I suspect there’s a selection effect where benchmarks that show top models scoring poorly tend to capture attention. Still, this seems worth pursuing.
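The correlation check itself is simple to sketch. The example below uses synthetic data and two hypothetical held-out benchmarks, one that tracks the shared ability and one that measures an unrelated skill. The helper name `general_capability_scores` and all numbers are assumptions for illustration. It does not address the harder problems raised above, namely contamination and knowing what counts as unusual.

```python
import numpy as np

def general_capability_scores(scores):
    """Per-model score on the first principal component of standardized benchmark data."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    first = vt[0] if vt[0].sum() > 0 else -vt[0]   # orient so higher means more capable
    return z @ first

# Synthetic stand-in (assumption): 60 hypothetical models, 10 existing benchmarks.
rng = np.random.default_rng(3)
ability = rng.normal(size=60)
scores = np.outer(ability, np.full(10, 0.8)) + rng.normal(scale=0.6, size=(60, 10))
general = general_capability_scores(scores)

# Two hypothetical held-out benchmarks.
tracks_ability = 0.8 * ability + rng.normal(scale=0.6, size=60)
unrelated_skill = rng.normal(size=60)

r_tracks = float(np.corrcoef(general, tracks_ability)[0, 1])
r_unrelated = float(np.corrcoef(general, unrelated_skill)[0, 1])
print(f"benchmark that tracks ability: r = {r_tracks:.2f}")
print(f"benchmark for unrelated skill: r = {r_unrelated:.2f}")
```

A high correlation on a benchmark that is truly out of distribution would point toward the “deep” world, and a low one toward the “contingent” world.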

Even if the explanation for what we see in benchmark data is that model developers are pursuing an “everything at once” strategy, they have the resources and the scalable architectures necessary to keep it going. In other words, they can keep making all the benchmarks go up so long as they can get the right training data plus enough compute to make use of it.

What does this mean for the future? I like how Steve Newman put it recently: “how far can you get by simply putting an insane number of things in distribution” is one of the trillion dollar questions.

I doubt that there are in-principle limits to putting everything in distribution, but if we’re more in the contingent world then we shouldn’t expect much of a tailwind from generalization either. Every percentage point of improvement on every benchmark must be paid for. In that world, I think we should expect to see capabilities continue to improve quite generally, but only so long as the flywheel of growth and investment continues to allow developers to devote resources to actively making this happen.

[1] Methodology: we filter our dataset to benchmarks created in 2023 or later and to models with at least 8 benchmark scores. We combine different reasoning settings for the same model, taking the maximum scores. We impute missing data with k-nearest neighbors, first applying a logit transform to scores in [0, 1], then using 5 neighbors weighted by distance. We then do PCA. Data and code can be found here.

[2] This main finding accords with previous work, although we now have a much larger dataset of benchmarks.

[3] Even now there are some prominent specialized models, like Cursor’s Tab or OpenAI’s Codex series. But it seems fair to characterize the current landscape as dominated by models that at least try to “do it all”.
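The pipeline in the methodology note above (filter, logit, impute, PCA) can be sketched as follows. This is not the published code: the data is synthetic, the `logit` clipping value is an assumption, and `knn_impute` is a small hand-written stand-in for a library k-nearest-neighbors imputer. It measures distance on the benchmarks two models share and weights neighbors by inverse distance. Merging reasoning settings is omitted.

```python
import numpy as np

def logit(p, eps=1e-3):
    """Map scores in [0, 1] to the real line, clipping to avoid infinities at 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def knn_impute(x, k=5):
    """Fill NaNs with a distance-weighted mean over the k nearest rows that have the value."""
    filled = x.copy()
    observed = ~np.isnan(x)
    for i, j in zip(*np.where(~observed)):
        donors = np.where(observed[:, j])[0]          # rows that have benchmark j
        shared = observed[donors] & observed[i]       # benchmarks both rows have
        sq_diff = np.where(shared, (x[donors] - x[i]) ** 2, 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.sqrt(sq_diff.sum(axis=1) / np.maximum(n_shared, 1))
        dist[n_shared == 0] = np.inf                  # no overlap, so never a close neighbor
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / np.maximum(dist[nearest], 1e-12)
        filled[i, j] = np.sum(weights * x[donors[nearest], j]) / np.sum(weights)
    return filled

# Synthetic stand-in (assumption): 40 hypothetical models, 10 benchmarks, ~15% missing.
rng = np.random.default_rng(4)
ability = rng.normal(size=(40, 1))
latent = ability @ np.ones((1, 10)) + rng.normal(scale=0.5, size=(40, 10))
raw = 1 / (1 + np.exp(-latent))                       # scores in (0, 1)
raw[rng.random(raw.shape) < 0.15] = np.nan

keep = (~np.isnan(raw)).sum(axis=1) >= 8              # models with at least 8 scores
complete = knn_impute(logit(raw[keep]), k=5)

z = (complete - complete.mean(axis=0)) / complete.std(axis=0)
singular_values = np.linalg.svd(z, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

print("models kept:", int(keep.sum()), "of", len(raw))
print("share of variance, first component:", round(float(explained[0]), 2))
```

Applying the logit before imputing means neighbors are averaged on an unbounded scale, so imputed values cannot fall outside [0, 1] when mapped back.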

Kenny Easwaran

Nov 21, 2025

Are you sure it’s a Claude-iness dimension and not a negative of a ChatGPT-iness dimension? I don’t know how many companies’ models are in the dataset, but if it’s more than two, it’s quite striking that the top 5 are all one company and the bottom 5 are all another. It would be interesting to see which others are on one side or the other!

Thomas DeWitt

Fascinating! I suppose the Claudiness finding could cut both ways on the generalization aspect? It shows that Anthropic is not focusing on math, which seems consistent with what they have said, so generalization may be imperfect. But nonetheless Claude is showing significant improvements in math (e.g., Math L5 or FrontierMath), suggesting Anthropic’s programming focus is generalizing at least somewhat to math.

Epoch AI
