Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.
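To make these moving parts concrete, here is a minimal sketch, assuming a hypothetical `run_eval` hook and made-up scaffold and provider names (none of this is Epoch's actual harness): it scores the same model on the same benchmark under every scaffold/provider combination and reports the resulting spread, which is exactly the quantity that moves when these components change.

```python
from itertools import product
from statistics import mean
from typing import Callable

# Assumed, illustrative configuration names -- not real scaffold or provider
# identifiers.
SCAFFOLDS = ["minimal_prompt", "react_agent", "native_tool_calls"]
PROVIDERS = ["provider_a", "provider_b", "provider_c"]

# Hook that actually runs one evaluation:
# (model, benchmark, scaffold, provider) -> accuracy in [0, 1].
# Wiring this up to a real eval harness is left to the reader.
RunEval = Callable[[str, str, str, str], float]

def score_matrix(
    model: str, benchmark: str, run_eval: RunEval
) -> dict[tuple[str, str], float]:
    """Score the same (model, benchmark) pair under every scaffold/provider combination."""
    return {
        (scaffold, provider): run_eval(model, benchmark, scaffold, provider)
        for scaffold, provider in product(SCAFFOLDS, PROVIDERS)
    }

def report_spread(scores: dict[tuple[str, str], float]) -> None:
    """Print per-configuration scores and the max-min spread across configurations."""
    for (scaffold, provider), acc in sorted(scores.items()):
        print(f"{scaffold:>18} + {provider:<12} -> {acc:.1%}")
    values = list(scores.values())
    print(f"spread: {max(values) - min(values):.1%}  (mean {mean(values):.1%})")
```

The point of the sketch is only that "the score" is really a function of the whole configuration tuple, not of the model alone; any harness that holds the model fixed and varies the other axes will surface the same kind of spread.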