Discussion about this post

Liam Laverty:

Interesting, I like the idea of abstract benchmarking. I recently built an app that gets language models to paint iteratively, rather than one-shotting an image generation. It isn't intentionally a benchmark, but I think it touches on some of what you're saying here, including spatial reasoning and the compounding opportunity for errors across many short-running tasks (as opposed to a single long-running task). It has the benefit of producing a nice visual analogue of the struggles the models are having at the frontier of their capabilities. The results are a mix of objective and subjective.

Take a look if you're interested: https://www.etive-mor.com/blog/can-a-language-model-paint/
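For readers curious what "painting iteratively" might look like in code, here is a minimal hypothetical sketch (not the linked app's actual implementation): the model emits one small draw command per step, the canvas is updated, and the next step sees the result, which is where per-step errors get the chance to compound. The model call is stubbed with a placeholder function.

```python
# Hypothetical sketch of an iterative-painting loop, assuming the model
# returns one draw command per step. stub_model stands in for a real
# language-model call; a real loop would serialize the canvas into the prompt.

def stub_model(canvas, step):
    # Placeholder: deterministic commands instead of a model's output.
    return {"op": "rect", "x": step, "y": step, "w": 2, "h": 2, "colour": step + 1}

def apply_command(canvas, cmd):
    # Apply a single draw command to the canvas grid, clipping at the edges.
    if cmd["op"] == "rect":
        for dy in range(cmd["h"]):
            for dx in range(cmd["w"]):
                y, x = cmd["y"] + dy, cmd["x"] + dx
                if 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):
                    canvas[y][x] = cmd["colour"]
    return canvas

def paint(size=8, steps=3, model=stub_model):
    # Each step sees the canvas produced by all previous steps, so a
    # mistake early on propagates into everything drawn afterwards.
    canvas = [[0] * size for _ in range(size)]
    for step in range(steps):
        cmd = model(canvas, step)
        canvas = apply_command(canvas, cmd)
    return canvas
```

The contrast with one-shot image generation is that each command is cheap to evaluate in isolation, but the final picture depends on the whole trajectory, so failures are visible and localizable.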

