Discussion about this post

Lucas

As a first-time reader, I liked this a lot! Especially the way you pull back the curtain on how messy evals actually are. Same benchmark, same model, and the score still moves by double digits just from scaffolds and providers. Insane spread for GLM.

Celeste 🌱

This is just wild! Good work as always, Epoch.
