FYI there is a typo where "It’s entirely possible that we see no improvement. It’s perfectly plausible that AlphaProof gets 2-4 problems and LLMs get 1-2 problems. This would be consistent with no progress over prior capabilities, or could just be due to the problems being unusually hard for AI systems" is repeated twice.
Thanks! Should be fixed now.
I guess what’s the point of the IMO if you’re not exactly looking at the IMO problem set, but only at whether the LLM can solve the problems in a particular way? I also think it’s possible that mere grinding might be enough to solve open problems, and even if the LLM isn’t creative in a way that would be particularly impressive, it might not be a dead end for it to simply get faster at grinding.
I think it’s a really interesting question how far grinding will get in solving open problems. You probably need formalization for that, or verifiable outputs like AlphaEvolve. I don’t think it’s too likely to be the best approach in general, but I also wouldn’t be surprised if some results fall this way. That said, I think human interest here will be limited. Grinding is easier to forgive if you, say, make money at the end of it. But mathematicians really want to understand math, and grinding doesn’t help with that.
Do AI systems really show a "lack of creativity"?
My intuition is that AI systems are actually typically far more creative than humans, and that the problem is more subtle: it's the filtering of solutions that is hard.
"But maths solutions are much more verifiable than other tasks" - complete solutions are, but filtering partial solutions is much harder.
I think this lack of creativity is pretty well-supported, though it’s a fuzzy concept.
But for instance, see this prior article, specifically search for “creativity”: https://epoch.ai/gradient-updates/beyond-benchmark-scores-analysing-o3-mini-math-reasoning
And, for some more concrete examples, an article from my blog: https://lemmata.substack.com/p/who-needs-insight-when-youve-got
I think you're wrong (low confidence as it's nearly 11pm here and I only skimmed the articles), but I can't say any more as discussing capabilities with Epoch employees seems potentially risky.