7 Comments
Parker Whitfill

FYI, there is a typo: the passage "It’s entirely possible that we see no improvement. It’s perfectly plausible that AlphaProof gets 2-4 problems and LLMs get 1-2 problems. This would be consistent with no progress over prior capabilities, or could just be due to the problems being unusually hard for AI systems" appears twice.

Greg Burnham

Thanks! Should be fixed now.

Steeven

I guess, what’s the point of the IMO if you’re not looking at whether the LLM solves the IMO problem set, but only at whether it solves the problems in a particular way? I also think it’s possible that mere grinding might be enough to solve open problems. Even if the LLM isn’t creative in a way that would be particularly impressive, simply getting faster at grinding might not be a dead end.

Greg Burnham

I think it’s a really interesting question how far grinding will get toward solving open problems. You probably need formalization for that, or verifiable outputs like AlphaEvolve. I don’t think it’s too likely to be the best approach in general, but I also wouldn’t be surprised if some results fall this way. I also think human interest here will be limited, though. Grinding is easier to forgive if you, say, make money at the end of it. But mathematicians really want to understand math, and grinding doesn’t help with that.

Chris L

Do AI systems really show a "lack of creativity"?

My intuition is that AI systems are actually typically far more creative than humans, and that the problem is more subtle: it's the filtering of solutions that is hard.

"But maths solutions are much more verifiable than other tasks" - complete solutions are, but filtering partial solutions is much harder.

Greg Burnham

I think this lack of creativity is pretty well-supported, though it’s a fuzzy concept.

For instance, see this prior article (search for “creativity”): https://epoch.ai/gradient-updates/beyond-benchmark-scores-analysing-o3-mini-math-reasoning

And, for some more concrete examples, an article from my blog: https://lemmata.substack.com/p/who-needs-insight-when-youve-got

Chris L

I think you're wrong (low confidence as it's nearly 11pm here and I only skimmed the articles), but I can't say any more as discussing capabilities with Epoch employees seems potentially risky.
