3 Comments
Lucas

As a first-time reader, I liked this a lot! Especially the way you pull back the curtain on how messy evals actually are. Same benchmark, same model, and the score still moves by double digits just from scaffolds and providers. Insane spread for GLM.

Celeste 🌱

This is just wild! Good work as always, Epoch.

Semantic Fidelity Lab

The hard part isn’t just variance; it’s drift across layers. When prompts, tools, and providers all compress meaning differently, scores stop being comparable in a principled way. The benchmark ends up measuring the system, not the model.