What will the IMO tell us about AI math capabilities?
Most discussion about AI and the IMO focuses on gold medals, but that's not the thing to pay most attention to.
This year’s International Mathematical Olympiad (IMO) will take place on July 15th and 16th in Sunshine Coast, Australia. It is the pinnacle of high school math competitions. Much like the Olympic Games, the stakes are national pride and personal glory.
AI model developers must be gauging their own chances for pride and glory. No AI system has yet achieved a score equivalent to an IMO gold medal, much less a perfect score. Could this be the year?
In this post, I’ll say what I think different results might mean for AI math capabilities. In particular, I think there are some important distinctions between results that might generate hype and results that will actually tell us something new.
Here are the key background facts I have in mind.
The baseline is high. Google’s AlphaProof1, a specialized system that outputs formal math proofs, solved 4/6 problems on the 2024 IMO. On the 2025 USAMO, a contest similar to the IMO, the best general-purpose LLMs can already solve 2/6 problems. Progress means doing better than that.
LLMs in particular struggle with proofs. The reinforcement learning (RL) that makes LLMs good at math relies heavily on problems with short-form answers, which can be checked automatically (see the short sketch after these background points). IMO problems are different: solutions are judged by whether they rigorously prove a given statement. AlphaProof does this by design, but LLMs have been less adept here.
All AI systems show a lack of creativity. The one very hard problem that AlphaProof solved on the 2024 IMO, it solved in a very brute-force way. This appears to be a consistent limitation for LLMs as well. Some IMO problems, however, seem to require what humans would recognize as creativity.
Medals are a fickle metric. Humans and AI systems find problems difficult for different reasons, but the score required for a gold medal depends only on the distribution of human scores. “AI wins gold” means something very different depending on whether problems play to AI strengths while being especially challenging for humans—or vice versa.
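To make the short-answer point concrete, here’s a minimal sketch (my own illustration, not any lab’s actual training code) of the kind of automatic reward that short-form answers allow. Proofs admit no such one-liner: grading them takes expert humans or a formal proof checker.

```python
# Minimal sketch: a "verifiable reward" for a short-answer math problem.
# The reference answer and model outputs below are made up for illustration.
from fractions import Fraction

def short_answer_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer equals the reference, else 0.0."""
    try:
        return float(Fraction(model_output.strip()) == Fraction(reference_answer))
    except ValueError:
        return 0.0

print(short_answer_reward("7/2", "3.5"))  # 1.0
print(short_answer_reward("4", "3.5"))    # 0.0
```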
Given that, here’s how I would frame the possible outcomes.
It’s entirely possible that we see no improvement. AlphaProof could get 2-4 problems and LLMs could get 1-2 problems. This would be consistent with no progress over prior capabilities, or could just be due to the problems being unusually hard for AI systems.
Even a gold medal from AlphaProof might tell us nothing. If we get unlucky, it may be that no IMO problem requires much creativity. The problems would still be difficult for humans, but AI systems could find brute-force solutions that would not indicate particularly interesting or novel capabilities. This might even be the modal outcome for AlphaProof: it got a bit unlucky last year.
A gold or silver medal from an LLM would mean they’re improving at proofs. LLMs have a ways to go to catch up with AlphaProof’s proof writing capabilities. If they get a gold or silver medal, they’ve probably closed some of that gap. But the relevance of this achievement would be limited: helpful for mathematicians, but not clearly more.
The most interesting result would be evidence of creative problem-solving. This is the big one. Not only would it indicate a new mathematical ability, but it would suggest material progress toward broader utility as well. For better or for worse, we’ll only be able to tell if this has happened by looking into the specific problems and judging how AI systems solved them.
Creative problem-solving gets at an open question: how far can AI systems generalize outside of their training distributions? The more they can, the more likely we are to see AI-driven breakthroughs in all sorts of fields. It’s hard to know what to expect. On the one hand, we have scant examples of this sort of creativity so far. On the other hand, some of the few examples we do have, like in game-playing, come from scaled-up RL. Model developers have been doing nothing if not scaling up RL. Will it pay off or stall out? The IMO could give us a sign, just not via the color of a medal.2
I’ll use the rest of the article to support these points. Let’s delve in!
IMO Background
The IMO spans two days, with contestants given 4.5 hours to solve three problems each day. Countries send teams of up to six contestants, all of whom take the test individually. Solutions take the form of natural language proofs and are graded by human experts. Each solution is scored on a 0-7 scale, with partial credit given sparingly. Some problems ask for a statement to be proven while others ask for a particular quantity or formula to be determined. Even for the latter problem type, solutions are graded based on the rigor of their proofs. A solution that only finds the correct final answer without proving that it is correct will receive at most 1 point.
Medals are awarded roughly to the top half of participants: gold to the top ≈8%, silver to the next ≈16%, and bronze to the next ≈24%. In 3 of the past 10 years, getting full credit on 4 problems was enough for a gold medal. In the other years, it took 5.
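To make the cutoff mechanics concrete, here’s a toy sketch with randomly generated scores (the real cutoffs are computed from each year’s actual score distribution):

```python
# Toy illustration of how medal cutoffs fall out of the human score distribution.
# The scores below are randomly generated for the example, not real IMO data.
import random

random.seed(0)
# ~600 contestants, six problems each scored 0-7, for a maximum total of 42.
scores = sorted(
    (sum(random.choices([0, 1, 7], weights=[5, 2, 3], k=6)) for _ in range(600)),
    reverse=True,
)

gold_cutoff = scores[int(0.08 * len(scores)) - 1]    # top ~8%
silver_cutoff = scores[int(0.24 * len(scores)) - 1]  # next ~16%
bronze_cutoff = scores[int(0.48 * len(scores)) - 1]  # next ~24%
print(gold_cutoff, silver_cutoff, bronze_cutoff)

# Whether four full solutions (4 x 7 = 28 points) earns gold depends entirely
# on whether the gold cutoff lands at or below 28 that year.
```

The same AI score can land on either side of the gold line depending on how humans fared that year.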
Since each country sends six contestants regardless of its population or the strength of its education system, not every IMO contestant is necessarily among the mathematically best high school students in the world. Each country’s best, though, are plausibly in attendance. The competition certainly boasts star talent: at least a third of Fields medalists were, in their youth, IMO medalists.3
Each day of the IMO typically has one easy, one medium, and one hard problem.4 The human score distributions for the 2024 problems illustrate this: most contestants got P1 and P4; the distributions for P2 and P5 are bimodal, with about one third getting full credit; very few contestants got P3 or P6.5
All IMO problems require a degree of rigor and precision: proofs must be constructed carefully. They also typically require the competent application of background knowledge from an advanced, competition-oriented high school curriculum.
The easier problems don’t require much more. I would characterize the harder problems as adding three additional factors: abstraction, creativity, and depth. Abstraction refers to the need to derive general truths about the subject of the problem, not just chug through formulas. Creativity refers to the need to come up with something new, not just straightforward application of background knowledge. Depth refers to the number of steps required to carry out a solution. Of course, these factors can combine: the very hardest problems require many steps, each of them involving a novel abstraction.
It’s the hardest problems that give the IMO its reputation. P1 and P4 might be routine for competent math undergrads. P3 and P6 are sometimes very difficult even for professional mathematicians.6
AlphaProof already set a high bar, but some key abilities were missing
AlphaProof holds the current AI record on the IMO. While there is at least one fine-tuned LLM in its architecture, AlphaProof is not very similar to today’s leading LLMs. Instead of being able to do many tasks, it only solves math problems. Instead of generating natural language, it only outputs a specialized mathematical proof-writing language called Lean. Thus, I think it’s best to analyze this style of system separately.
AlphaProof solved its hardest problem in a surprisingly uninteresting way
On the 2024 IMO AlphaProof solved P1, P2, and P6: one easy, one medium, and one hard.7 The easy and medium problems have plausibly been matched by more general-purpose LLMs since then, which I’ll discuss later. AlphaProof’s real claim to fame remains solving P6. Almost no humans solved this problem.
Looking deeper, however, the way that AlphaProof solved P6 is telling. P6 is an algebra question. It asks contestants to prove that, if a function satisfies a given property, then it also satisfies another property. The human way to solve this is to prove various abstract properties about such a function, building these up to prove the desired property. For instance, a natural first step is to notice that the given property implies that the function is 1-to-1. It gets harder from there, but the steps are qualitatively similar.
AlphaProof didn’t do this. Instead, it applied the given property over and over to generate a proof that, loosely speaking, amounts to a long, tedious case-by-case analysis. Logically it works—Lean guarantees that—but it’s not something a human would ever want to read, much less write. Furthermore, I think most mathematicians would suspect that this sort of “grind it out” reasoning is a dead end for most interesting math problems. In other words, it probably doesn’t represent a particularly important capability.8
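For a flavor of both what Lean output looks like and the kind of structural “first step” a human would take, here’s a toy Mathlib proof I wrote for a simplified, hypothetical functional equation (deliberately much easier than the real P6 condition):

```lean
import Mathlib

-- Toy illustration only, not AlphaProof's output: if f : ℚ → ℚ satisfies
-- f (x + f y) = f x + y for all x and y, then f is injective (1-to-1).
example (f : ℚ → ℚ) (h : ∀ x y, f (x + f y) = f x + y) :
    Function.Injective f := by
  intro a b hab                 -- assume f a = f b; show a = b
  have h1 := h 0 a              -- f (0 + f a) = f 0 + a
  have h2 := h 0 b              -- f (0 + f b) = f 0 + b
  rw [hab] at h1                -- rewrite f a to f b: f (0 + f b) = f 0 + a
  have key : f 0 + a = f 0 + b := h1.symm.trans h2
  exact add_left_cancel key
```

AlphaProof’s actual P6 proof, by contrast, reads like a long mechanical case manipulation rather than a short chain of lemmas like this.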
AlphaProof didn’t solve problems that required more creativity
Just as interesting are the problems that AlphaProof didn’t get: P3 and P5. These problems don’t seem to have “grind it out” ways to solve them: contestants had to grasp some deeper structure at an abstract level or they couldn’t make progress. They are also both combinatorics problems, which often have the most original, off-the-beaten-path setups.9 Combinatorics problems tend to rely on comparatively little background knowledge and comparatively more creativity, and I think this goes for P3 and P5.
We will probably get at least one hard combinatorics problem this year. According to a well-regarded problem difficulty rating system, 8 of the last 10 IMOs had at least one such problem. However, these ratings also indicate that 2024 was a bit unusual in that the nominally “medium” P5 was on the harder end of the spectrum. Only 3 of the last 10 IMOs had two combinatorics problems at least that hard. We’ll have to keep this problem composition in mind when interpreting the 2025 results.
Geometry won’t tell us much
Euclidean geometry problems are a staple of the IMO, but AlphaProof itself didn’t solve geometry problems: it deferred them to a specialized system called AlphaGeometry2. This system is trained to generate proofs in a domain-specific language that Google devised to represent Euclidean geometry. An earlier version called AlphaGeometry was already capable of solving hard IMO geometry problems. AlphaGeometry2 made short work of P4, the only geometry problem on the 2024 IMO.
Unfortunately, I think this just isn’t so interesting: it’s such a specialized system in such a constrained domain that I don’t think it tells us much about general AI.
If an AlphaProof-like system scores well, we’ll have to look at the specific problems
I can now articulate a more nuanced distinction between the scenarios mentioned earlier.
If the 2025 IMO happens to contain 0 or 1 hard combinatorics problems, it’s entirely possible that AlphaProof will get a gold medal just by grinding out 5 or 6 of the problems—especially if there happens to be a tilt toward hard geometry problems. This would grab headlines, but wouldn’t be much of an update over current capabilities. Still, it seems pretty likely: I give a 70% chance to AlphaProof winning a gold medal overall, with almost all of that chance coming from just such a scenario.
While “grind it out” solutions have a bit of a “know it when you see it” character, we’ll also be able to compare AI solutions to the extensive solution guides that the IMO organizers prepare for problem graders. For example, the 2024 guide has 7 different solutions for P6, and not one of them looks like AlphaProof’s. It’s possible that an AI system will find an insightful solution that humans missed, but IMO organizers really do try to find all angles on the problems—in particular, making sure there isn’t an unintended shortcut. So, if AlphaProof does find a solution not represented in the solution guide, I think it will likely be because it found an uninteresting one.
But it’s also possible that the 2025 IMO will have hard combinatorics problems, or, more generally, problems that don’t admit any “grind it out” solutions. In that case, solving 5 or 6 problems would unambiguously break new ground. It’s even possible that, for whatever reason, AlphaProof fails on an easier problem while still breaking this new ground—and I would consider that almost as big of an update. I give only a 5% chance of AlphaProof impressing us with its creativity.
I should note that the above discussion has been about AlphaProof specifically, but I think the same broadly goes for any system specially designed to generate formal proofs.10
Closing Notes on AlphaProof
There are two other things worth mentioning.
First, on the 2024 IMO, AlphaProof was given several days to solve problems: much longer than the 4.5 hours that humans are given. I don’t find this particularly relevant. The most interesting question is whether an AI system can solve these problems at all. “Mere” engineering can typically deliver post-facto speed-ups.
Second, we’ve heard very little about AlphaProof since last year’s IMO. I find this surprising, because there have been other well-known math competitions since then, like the USAMO or the premier college competition, the Putnam. Why didn’t Google take some of the limelight there? They may have chosen not to invest further in AlphaProof, seeing it as a relatively academic effort even compared to math systems like AlphaEvolve.11 Or they may have felt that further progress was cost-prohibitive, or simply infeasible. Either way, I think there’s some chance we simply don’t hear from AlphaProof this year. My odds above are conditional on AlphaProof’s results being released at all.12
General-purpose LLMs have more headroom
Our best reference point for LLMs is probably the 2025 USAMO, a US national competition used to help select the US team for the IMO, which has an identical format and comparable difficulty to the IMO.13 We can draw data from MathArena, which has posed each USAMO problem to models (sampling four times) and fielded human graders to judge the outputs.
I would summarize the results as: models can solve P1 and P4, but not the rest.14 A few models do score above zero but below 10% on the other problems, i.e., they occasionally pick up ≈1/7 points of partial credit, but I wouldn’t make too much of this. From inspecting the results, I think the models are stumbling onto this partial credit rather than showing real signs of progress.
Deep Think suggests we’ll see something better than this
Google has announced that Gemini 2.5 Pro’s as-yet unreleased Deep Think mode solved P2 on the 2025 USAMO. In broad strokes, I think P2 on the 2025 USAMO is similar in style and difficulty to P2 on the 2024 IMO, the one medium-difficulty problem that AlphaProof solved.15 I wouldn’t characterize either AI solution as “grind it out”: they’re reasonably human-like. Still, to my eye, they require careful application of background knowledge and some depth, but not particularly high creativity.16
In other words, there’s a style of medium-difficulty IMO problem that is probably within reach of LLMs, or, at least, within reach of Gemini 2.5 Pro Deep Think.17 If both medium-difficulty problems on the 2025 IMO are of this style, LLMs could solve 4 problems. But, as P5 on the 2024 IMO shows, there are also medium-difficulty problems that require more creativity and that no AI system, as far as public evidence shows, is close to solving.18 If both medium-difficulty problems on the 2025 IMO are of that more demanding style, LLMs might solve only 2 problems.
So, an LLM solving 3 or 4 problems definitely represents continued incremental progress at solving proof-based problems. Furthermore, there’s upside potential: if one of those problems requires creativity akin to 2024’s P5, then that’s something qualitatively new. We’ll have to look into the problems to see.
The same rough story goes for the harder problems. If LLMs solve P3 or P6 it will, at a minimum, represent a leap forward in their proof-writing capabilities: these will be long proofs, if nothing else. If the LLM proofs are similar to AlphaProof’s “grind it out” style, then it really will represent nothing else. If, instead, the LLM proofs involve substantial creativity and abstraction, then we’re in “wow” territory.
All bets are off if obscure background knowledge cracks problems
IMO problems aren’t supposed to require obscure knowledge. However, problem authors are sometimes inspired by real mathematical work. It’s possible that a trick for solving a problem is covered in a niche journal article, and that competition organizers view the risk of a contestant having read such an article to be minimal. LLMs, on the other hand, have read ≈everything.
For instance, P3 on the 2025 USAMO was related to a topic called Gabriel graphs. We can see from LLMs’ attempted solutions that they recognized this connection. In this case it wasn’t enough of a hint to allow them to solve the problem, but I bet that contest organizers didn’t expect any of the contestants to make this connection at all.
Thus, for all problems that LLMs solve, and especially if they solve harder ones, we’ll want to inspect their solutions to see if they are making use of advanced knowledge. This wouldn’t be “cheating” in any sense, but it can definitely indicate less innate mathematical skill if the cited knowledge removes the need to come up with creative abstractions.
Geometry still probably won’t tell us much
LLMs have their own uninspiring way of tackling geometry problems. Instead of solving them in a more intuitive, spatial manner, they convert them to coordinate systems—the xy-plane or xyz-space—and then churn through pages of algebraic equations. This approach can work, but it’s usually too time-consuming for humans to attempt during the competition. It’s also considered inelegant.
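Here’s a toy example of that style, far easier than anything on an olympiad: drop the figure onto coordinates and let symbolic algebra grind out the conclusion.

```python
# Toy "coordinate bash" (my own example, not an IMO problem): verify that the
# midpoint of the hypotenuse of a right triangle is equidistant from all three
# vertices, by placing the triangle on the xy-plane and doing algebra.
from sympy import symbols, simplify

a, b = symbols("a b", positive=True)

A = (0, 0)          # right angle at the origin
B = (a, 0)          # one leg along the x-axis
C = (0, b)          # the other leg along the y-axis
M = (a / 2, b / 2)  # midpoint of the hypotenuse BC

def dist2(P, Q):
    """Squared distance between two points."""
    return (P[0] - Q[0]) ** 2 + (P[1] - Q[1]) ** 2

assert simplify(dist2(M, A) - dist2(M, B)) == 0
assert simplify(dist2(M, C) - dist2(M, A)) == 0
print(simplify(dist2(M, A)))  # a**2/4 + b**2/4, independent of the triangle
```

A human could do this too; it’s just slow and unilluminating, which is roughly the complaint about LLM geometry solutions.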
The only geometry problem on the 2025 USAMO was P4, and several LLMs can now solve it. I suspect they are not yet ready to apply this coordinate-based approach to significantly harder geometry problems, but it may only be a matter of time. I would be interested if LLMs solved geometry problems with more human-like spatial reasoning, but it wouldn’t be clear if this had broader significance and I mostly don’t expect this to happen.
Closing Notes on LLMs
So what will happen? I think there’s a 60% chance that at least one LLM, most likely Deep Think, gets 3 IMO problems, and thus either a bronze or silver medal. I haven’t seen much evidence of creativity or of the ability to solve harder IMO problems from any LLM. I’d be surprised if we saw that so soon after the results we already have. Maybe next year, but this year I’d only give a 5% chance to either a qualitatively creative solution or a solution to P3 or P6 from an LLM.
Conclusion
In the end, I think it’s moderately unlikely that we learn something surprising from the IMO. The modal outcome might be: AlphaProof wins a gold medal, boosted by some favorable geometry problems; LLMs continue their march up the math competition ladder by solving P2 or P5; in neither case do we see clear evidence of creative problem solving. We’ll have to wait for other benchmarks to tell us about that.
Thanks to Greta Panova for feedback on this post.
Technically AlphaProof together with AlphaGeometry2, as we’ll discuss.
It’s also possible that the IMO problem authors will invest some effort in making problems difficult for publicly available AI systems. This could bias results toward less-impressive AI performance, but also might not be easy to do for unreleased systems.
There is no Nobel Prize in mathematics, but the Fields Medal is widely considered to be the equivalent.
Relatively speaking, that is. Most high school students could not solve even an easier IMO problem.
The difficulty of the final problems on each day is well-known, and thus might be a bit of a self-fulfilling prophecy. For many contestants, it might make more sense not to attempt P3 or P6 and rather spend more time on the easier problems.
The fourth of those problems was actually solved by a different system, AlphaGeometry2, which we’ll discuss shortly.
It also raises the question of how much AlphaGo was, on some level, like this. With math, we can see a bit more of how the sausage is made. Could “move 37” have been more brute-force than we realize?
LLMs can take a crack at generating formal proofs as well, the same way they can take a crack at anything. However, they aren’t very good at it yet, perhaps due to limited training data. I think it would be immensely surprising if an off-the-shelf LLM matched AlphaProof’s performance by generating formal proofs. That’s not to say that an improved version of AlphaProof won’t use an updated LLM within its architecture: if anything, this seems like one of the easiest ways to improve AlphaProof.
AlphaEvolve broke new ground in some research-level math problems. However, these were all centered around numerical optimization: proposed solutions could be verified algorithmically, and AlphaEvolve simply managed to write algorithms that found new optima. This progress is impressive on its own terms, but is a different sort of thing than solving proof-based problems—especially those that rely on abstraction and creativity.
Google might only release AlphaProof’s results if they’re impressive enough. I wasn’t sure how to account for this selection effect. My estimates are really more like “the unconditional results of the best AlphaProof-like system that Google has”, so my conditional estimates are probably biased low.
According to one popular competition difficulty rating system, P3 and P6 on the IMO are sometimes a bit harder than P3 and P6 on the USAMO.
No model solves P4 with perfect reliability, but reliability often improves with scale: if not now, soon.
It’s plausible that Deep Think was trained in part with data generated by AlphaProof.
o3-pro did not solve this problem via ChatGPT when I tried.
I sampled o4-mini-high 100 times on this problem, and it didn’t solve it once.