Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.
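To make these moving parts concrete, here is a minimal sketch, assuming a hypothetical `run_eval` hook and made-up scaffold and provider names (none of this is Epoch's actual harness): it scores the same model on the same benchmark under every scaffold/provider combination and reports the resulting spread, which is exactly the quantity that moves when these components change.

```python
from itertools import product
from statistics import mean
from typing import Callable

# Assumed, illustrative configuration names -- not real scaffold or provider
# identifiers.
SCAFFOLDS = ["minimal_prompt", "react_agent", "native_tool_calls"]
PROVIDERS = ["provider_a", "provider_b", "provider_c"]

# Hook that actually runs one evaluation:
# (model, benchmark, scaffold, provider) -> accuracy in [0, 1].
# Wiring this up to a real eval harness is left to the reader.
RunEval = Callable[[str, str, str, str], float]

def score_matrix(
    model: str, benchmark: str, run_eval: RunEval
) -> dict[tuple[str, str], float]:
    """Score the same (model, benchmark) pair under every scaffold/provider combination."""
    return {
        (scaffold, provider): run_eval(model, benchmark, scaffold, provider)
        for scaffold, provider in product(SCAFFOLDS, PROVIDERS)
    }

def report_spread(scores: dict[tuple[str, str], float]) -> None:
    """Print per-configuration scores and the max-min spread across configurations."""
    for (scaffold, provider), acc in sorted(scores.items()):
        print(f"{scaffold:>18} + {provider:<12} -> {acc:.1%}")
    values = list(scores.values())
    print(f"spread: {max(values) - min(values):.1%}  (mean {mean(values):.1%})")
```

The point of the sketch is only that "the score" is really a function of the whole configuration tuple, not of the model alone; any harness that holds the model fixed and varies the other axes will surface the same kind of spread.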