Models that are good at math benchmarks tend to be good at coding and reasoning benchmarks too, pointing to a common factor driving AI capabilities.
We find that AI benchmark scores are nearly as correlated across domains (0.68) as within them (0.79).
We looked at 15 benchmarks from our AI Benchmarking Hub, filtering to pairs with 5+ overlapping models. Our findings: rank correlations are strong for most pairs. The median correlation within the same category (0.79) is only modestly stronger than between categories (0.68).
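The pairwise comparison described above can be sketched as follows. This is a minimal illustration with made-up scores (not Epoch's actual data or code): compute a Spearman rank correlation for every benchmark pair that shares at least 5 models, then take the median. The benchmark and model names are hypothetical.

```python
from statistics import median

# Hypothetical per-model scores on three benchmarks (illustrative only).
scores = {
    "math_bench":   {"m1": 0.90, "m2": 0.70, "m3": 0.50, "m4": 0.30, "m5": 0.10, "m6": 0.20},
    "code_bench":   {"m1": 0.80, "m2": 0.75, "m3": 0.40, "m4": 0.35, "m5": 0.20},
    "reason_bench": {"m1": 0.85, "m2": 0.55, "m3": 0.60, "m4": 0.30, "m5": 0.15},
}

def rank(values):
    """Ascending ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def pairwise_correlations(scores, min_overlap=5):
    """Rank correlation for each benchmark pair with enough shared models."""
    names = sorted(scores)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = sorted(set(scores[a]) & set(scores[b]))
            if len(common) >= min_overlap:  # filter: 5+ overlapping models
                xs = [scores[a][m] for m in common]
                ys = [scores[b][m] for m in common]
                out[(a, b)] = spearman(xs, ys)
    return out

corrs = pairwise_correlations(scores)
print(corrs)
print("median rank correlation:", median(corrs.values()))
```

In the full analysis one would additionally tag each pair as within- or between-category and take the two medians separately; the filter on overlapping models matters because a correlation over very few shared models is noisy.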
Note that benchmarks spanning wide time ranges with sparse data will show inflated correlations: models separated by years differ on almost any task. This confound helps explain some of the high correlations for benchmarks that do discriminate among current models, like METR Time Horizons and FrontierMath.
This common ranking across benchmarks motivates the single capability scale behind our Epoch Capabilities Index. As expected (since these benchmarks are inputs to ECI), ECI correlates strongly with nearly all benchmarks.
Check out our full analysis at the Data Insight page!


Please forgive the question, but is this an update compared to https://epochai.substack.com/p/benchmark-scores-general-capability? (Have you added new data since then?)