What will the IMO tell us about AI math capabilities?
Most discussion about AI and the IMO focuses on gold medals, but that's not the thing to pay most attention to.
This year’s International Mathematical Olympiad (IMO) will take place on July 15th and 16th in Sunshine Coast, Australia. It is the pinnacle of high school math competitions. Much like the Olympic Games, the stakes are national pride and personal glory.
AI model developers must be gauging their own chances for pride and glory. No AI system has yet achieved a score equivalent to an IMO gold medal, much less a perfect score. Could this be the year?
In this post, I’ll say what I think different results might mean for AI math capabilities. In particular, I think there are some important distinctions between results that might generate hype and results that will actually tell us something new.
Here are the key background facts I have in mind.
The baseline is high. Google’s AlphaProof1, a specialized system that outputs formal math proofs, solved 4/6 problems on the 2024 IMO. On the 2025 USAMO, a contest similar to the IMO, the best general-purpose LLMs can already solve 2/6 problems. Progress means doing better than that.
LLMs in particular struggle with proofs. The reinforcement learning (RL) that makes LLMs good at math relies heavily on problems with short-form answers, which can be checked automatically (see the short sketch after these background points). IMO problems are different: solutions are judged by whether they rigorously prove a given statement. AlphaProof does this by design, but LLMs have been less adept here.
All AI systems show a lack of creativity. The one very hard problem that AlphaProof solved on the 2024 IMO, it solved in a very brute-force way. This appears to be a consistent limitation for LLMs as well. Some IMO problems, however, seem to require what humans would recognize as creativity.
Medals are a fickle metric. Humans and AI systems find problems difficult for different reasons, but the score required for a gold medal depends only on the distribution of human scores. “AI wins gold” means something very different depending on whether problems play to AI strengths while being especially challenging for humans—or vice versa.
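To make the short-answer point concrete, here’s a minimal sketch (my own illustration, not any lab’s actual training code) of the kind of automatic reward that short-form answers allow. Proofs admit no such one-liner: grading them takes expert humans or a formal proof checker.

```python
# Minimal sketch: a "verifiable reward" for a short-answer math problem.
# The reference answer and model outputs below are made up for illustration.
from fractions import Fraction

def short_answer_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer equals the reference, else 0.0."""
    try:
        return float(Fraction(model_output.strip()) == Fraction(reference_answer))
    except ValueError:
        return 0.0

print(short_answer_reward("7/2", "3.5"))  # 1.0
print(short_answer_reward("4", "3.5"))    # 0.0
```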
Given that, here’s how I would frame the possible outcomes.
It’s entirely possible that we see no improvement. AlphaProof could get 2-4 problems and LLMs could get 1-2 problems. This would be consistent with no progress over prior capabilities, or could just be due to the problems being unusually hard for AI systems.
Even a gold medal from AlphaProof might tell us nothing. If we get unlucky, it may be that no IMO problem requires much creativity. The problems would still be difficult for humans, but AI systems could find brute-force solutions that would not indicate particularly interesting or novel capabilities. This might even be the modal outcome for AlphaProof: it got a bit unlucky last year.
A gold or silver medal from an LLM would mean they’re improving at proofs. LLMs have a ways to go to catch up with AlphaProof’s proof writing capabilities. If they get a gold or silver medal, they’ve probably closed some of that gap. But the relevance of this achievement would be limited: helpful for mathematicians, but not clearly more.
The most interesting result would be evidence of creative problem-solving. This is the big one. Not only would it indicate a new mathematical ability, but it would suggest material progress toward broader utility as well. For better or for worse, we’ll only be able to tell if this has happened by looking into the specific problems and judging how AI systems solved them.
Creative problem-solving gets at an open question: how far can AI systems generalize outside of their training distributions? The more they can, the more likely we are to see AI-driven breakthroughs in all sorts of fields. It’s hard to know what to expect. On the one hand, we have scant examples of this sort of creativity so far. On the other hand, some of the few examples we do have, like in game-playing, come from scaled-up RL. Model developers have been doing nothing if not scaling up RL. Will it pay off or stall out? The IMO could give us a sign, just not via the color of a medal.2
I’ll use the rest of the article to support these points. Let’s delve in!
IMO Background
The IMO spans two days, with contestants given 4.5 hours to solve three problems each day. Countries send teams of up to six contestants, all of whom take the test individually. Solutions take the form of natural language proofs and are graded by human experts. Each solution is scored on a 0-7 scale, with partial credit given sparingly. Some problems ask for a statement to be proven while others ask for a particular quantity or formula to be determined. Even for the latter problem type, solutions are graded based on the rigor of their proofs. A solution that only finds the correct final answer without proving that it is correct will receive at most 1 point.
Medals are awarded roughly to the top half of participants: gold to the top ≈8%, silver to the next ≈16%, and bronze to the next ≈24%. In 3 of the past 10 years, getting full credit on 4 problems was enough for a gold medal. In the other years, it took 5.
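To make the cutoff mechanics concrete, here’s a toy sketch with randomly generated scores (the real cutoffs are computed from each year’s actual score distribution):

```python
# Toy illustration of how medal cutoffs fall out of the human score distribution.
# The scores below are randomly generated for the example, not real IMO data.
import random

random.seed(0)
# ~600 contestants, six problems each scored 0-7, for a maximum total of 42.
scores = sorted(
    (sum(random.choices([0, 1, 7], weights=[5, 2, 3], k=6)) for _ in range(600)),
    reverse=True,
)

gold_cutoff = scores[int(0.08 * len(scores)) - 1]    # top ~8%
silver_cutoff = scores[int(0.24 * len(scores)) - 1]  # next ~16%
bronze_cutoff = scores[int(0.48 * len(scores)) - 1]  # next ~24%
print(gold_cutoff, silver_cutoff, bronze_cutoff)

# Whether four full solutions (4 x 7 = 28 points) earns gold depends entirely
# on whether the gold cutoff lands at or below 28 that year.
```

The same AI score can land on either side of the gold line depending on how humans fared that year.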
Since each country sends six contestants regardless of its population or the strength of its education system, not every IMO contestant is necessarily among the mathematically best high school students in the world. Each country’s best, though, are plausibly in attendance. The competition certainly boasts star talent: at least a third of Fields medalists were, in their youth, IMO medalists.3
Each day of the IMO typically has one easy, one medium, and one hard problem.4 The human score distributions for the 2024 problems illustrate this: most contestants got P1 and P4; the distributions for P2 and P5 are bimodal, with about one third getting full credit; very few contestants got P3 or P6.5
All IMO problems require a degree of rigor and precision: proofs must be constructed carefully. They also typically require the competent application of background knowledge from an advanced, competition-oriented high school curriculum.
The easier problems don’t require much more. I would characterize the harder problems as adding three additional factors: abstraction, creativity, and depth. Abstraction refers to the need to derive general truths about the subject of the problem, not just chug through formulas. Creativity refers to the need to come up with something new, not just straightforward application of background knowledge. Depth refers to the number of steps required to carry out a solution. Of course, these factors can combine: the very hardest problems require many steps, each of them involving a novel abstraction.
It’s the hardest problems that give the IMO its reputation. P1 and P4 might be routine for competent math undergrads. P3 and P6 are sometimes very difficult even for professional mathematicians.6
AlphaProof already set a high bar, but some key abilities were missing
AlphaProof holds the current AI record on the IMO. While there is at least one fine-tuned LLM in its architecture, AlphaProof is not very similar to today’s leading LLMs. Instead of being able to do many tasks, it only solves math problems. Instead of generating natural language, it only outputs a specialized mathematical proof-writing language called Lean. Thus, I think it’s best to analyze this style of system separately.
AlphaProof solved its hardest problem in a surprisingly uninteresting way
On the 2024 IMO AlphaProof solved P1, P2, and P6: one easy, one medium, and one hard.7 The easy and medium problems have plausibly been matched by more general-purpose LLMs since then, which I’ll discuss later. AlphaProof’s real claim to fame remains solving P6. Almost no humans solved this problem.
Looking deeper, however, the way that AlphaProof solved P6 is telling. P6 is an algebra question. It asks contestants to prove that, if a function satisfies a given property, then it also satisfies another property. The human way to solve this is to prove various abstract properties about such a function, building these up to prove the desired property. For instance, a natural first step is to notice that the given property implies that the function is 1-to-1. It gets harder from there, but the steps are qualitatively similar.
AlphaProof didn’t do this. Instead, it applied the given property over and over to generate a proof that, loosely speaking, amounts to a long, tedious case-by-case analysis. Logically it works—Lean guarantees that—but it’s not something a human would ever want to read, much less write. Furthermore, I think most mathematicians would suspect that this sort of “grind it out” reasoning is a dead end for most interesting math problems. In other words, it probably doesn’t represent a particularly important capability.8
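For a flavor of both what Lean output looks like and the kind of structural “first step” a human would take, here’s a toy Mathlib proof I wrote for a simplified, hypothetical functional equation (deliberately much easier than the real P6 condition):

```lean
import Mathlib

-- Toy illustration only, not AlphaProof's output: if f : ℚ → ℚ satisfies
-- f (x + f y) = f x + y for all x and y, then f is injective (1-to-1).
example (f : ℚ → ℚ) (h : ∀ x y, f (x + f y) = f x + y) :
    Function.Injective f := by
  intro a b hab                 -- assume f a = f b; show a = b
  have h1 := h 0 a              -- f (0 + f a) = f 0 + a
  have h2 := h 0 b              -- f (0 + f b) = f 0 + b
  rw [hab] at h1                -- rewrite f a to f b: f (0 + f b) = f 0 + a
  have key : f 0 + a = f 0 + b := h1.symm.trans h2
  exact add_left_cancel key
```

AlphaProof’s actual P6 proof, by contrast, reads like a long mechanical case manipulation rather than a short chain of lemmas like this.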
AlphaProof didn’t solve problems that required more creativity
Just as interesting are the problems that AlphaProof didn’t get: P3 and P5. These problems don’t seem to have “grind it out” ways to solve them: contestants had to grasp some deeper structure at an abstract level or they couldn’t make progress. They are also both combinatorics problems, which often have the most original, off-the-beaten-path setups.9 Combinatorics problems tend to rely on comparatively little background knowledge and comparatively more creativity, and I think this goes for P3 and P5.
We will probably get at least one hard combinatorics problem this year. According to a well-regarded problem difficulty rating system, 8 of the last 10 IMOs had at least one such problem. However, these ratings also indicate that 2024 was a bit unusual in that the nominally “medium” P5 was on the harder end of the spectrum. Only 3 of the last 10 IMOs had two combinatorics problems at least that hard. We’ll have to keep this problem composition in mind when interpreting the 2025 results.
Geometry won’t tell us much
Euclidean geometry problems are a staple of the IMO, but AlphaProof itself didn’t solve geometry problems: it deferred them to a specialized system called AlphaGeometry2. This system is trained to generate proofs in a domain-specific language that Google devised to represent Euclidean geometry. An earlier version called AlphaGeometry was already capable of solving hard IMO geometry problems. AlphaGeometry2 made short work of P4, the only geometry problem on the 2024 IMO.
Unfortunately, I think this just isn’t so interesting: it’s such a specialized system in such a constrained domain that I don’t think it tells us much about general AI.
If an AlphaProof-like system scores well, we’ll have to look at the specific problems
I can now articulate a more nuanced distinction between the scenarios mentioned earlier.
If the 2025 IMO happens to contain 0 or 1 hard combinatorics problems, it’s entirely possible that AlphaProof will get a gold medal just by grinding out 5 or 6 of the problems—especially if there happens to be a tilt toward hard geometry problems. This would grab headlines, but wouldn’t be much of an update over current capabilities. Still, it seems pretty likely: I give a 70% chance to AlphaProof winning a gold medal overall, with almost all of that chance coming from just such a scenario.
While “grind it out” solutions have a bit of a “know it when you see it” character, we’ll also be able to compare AI solutions to the extensive solution guides that the IMO organizers prepare for problem graders. For example, the 2024 guide has 7 different solutions for P6, and not one of them looks like AlphaProof’s. It’s possible that an AI system will find an insightful solution that humans missed, but IMO organizers really do try to find all angles on the problems—in particular, making sure there isn’t an unintended shortcut. So, if AlphaProof does find a solution not represented in the solution guide, I think it will likely be because it found an uninteresting one.
But it’s also possible that the 2025 IMO will have hard combinatorics problems, or, more generally, problems that don’t admit any “grind it out” solutions. In that case, solving 5 or 6 problems would unambiguously break new ground. It’s even possible that, for whatever reason, AlphaProof fails on an easier problem while still breaking this new ground—and I would consider that almost as big of an update. I give only a 5% chance of AlphaProof impressing us with its creativity.
I should note that the above discussion has been about AlphaProof specifically, but I think the same broadly goes for any system specially designed to generate formal proofs.10
Closing Notes on AlphaProof
There are two other things worth mentioning.
First, on the 2024 IMO, AlphaProof was given several days to solve problems: much longer than the 4.5 hours that humans are given. I don’t find this particularly relevant. The most interesting question is whether an AI system can solve these problems at all. “Mere” engineering can typically deliver post-facto speed-ups.
Second, we’ve heard very little about AlphaProof since last year’s IMO. I find this surprising, because there have been other well-known math competitions since then, like the USAMO or the premier college competition, the Putnam. Why didn’t Google take some of the limelight there? They may have chosen not to invest further in AlphaProof, seeing it as a relatively academic effort even compared to math systems like AlphaEvolve.11 Or they may have felt that further progress was cost-prohibitive, or simply infeasible. Either way, I think there’s some chance we simply don’t hear from AlphaProof this year. My odds above are conditional on AlphaProof’s results being released at all.12
General-purpose LLMs have more headroom
Our best reference point for LLMs is probably the 2025 USAMO, a US national competition used to help select the US team for the IMO, which has an identical format and comparable difficulty to the IMO.13 We can draw data from MathArena, which has posed each USAMO problem to models (sampling four times) and fielded human graders to judge the outputs.
I would summarize the results as: models can solve P1 and P4, but not the rest.14 A few models do score above zero but below 10% on the other problems, i.e., they occasionally pick up ≈1/7 points of partial credit, but I wouldn’t make too much of this. From inspecting the results, I think the models are stumbling onto this partial credit rather than showing real signs of progress.
Deep Think suggests we’ll see something better than this
Google has announced that Gemini 2.5 Pro’s as-yet unreleased Deep Think mode solved P2 on the 2025 USAMO. In broad strokes, I think P2 on the 2025 USAMO is similar in style and difficulty to P2 on the 2024 IMO, the one medium-difficulty problem that AlphaProof solved.15 I wouldn’t characterize either AI solution as “grind it out”: they’re reasonably human-like. Still, to my eye, they require careful application of background knowledge and some depth, but not particularly high creativity.16
In other words, there’s a style of medium-difficulty IMO problem that is probably within reach of LLMs, or, at least, within reach of Gemini 2.5 Pro Deep Think.17 If both medium-difficulty problems on the 2025 IMO are of this style, LLMs could solve 4 problems. But, as P5 on the 2024 IMO shows, there are also medium-difficulty problems that require more creativity and that no AI system, as far as public evidence shows, is close to solving.18 If both medium-difficulty problems on the 2025 IMO are of that more demanding style, LLMs might solve only 2 problems.
So, an LLM solving 3 or 4 problems definitely represents continued incremental progress at solving proof-based problems. Furthermore, there’s upside potential: if one of those problems requires creativity akin to 2024’s P5, then that’s something qualitatively new. We’ll have to look into the problems to see.
The same rough story goes for the harder problems. If LLMs solve P3 or P6 it will, at a minimum, represent a leap forward in their proof-writing capabilities: these will be long proofs, if nothing else. If the LLM proofs are similar to AlphaProof’s “grind it out” style, then it really will represent nothing else. If, instead, the LLM proofs involve substantial creativity and abstraction, then we’re in “wow” territory.
All bets are off if obscure background knowledge cracks problems
IMO problems aren’t supposed to require obscure knowledge. However, problem authors are sometimes inspired by real mathematical work. It’s possible that a trick for solving a problem is covered in a niche journal article, and that competition organizers view the risk of a contestant having read such an article to be minimal. LLMs, on the other hand, have read ≈everything.
For instance, P3 on the 2025 USAMO was related to a topic called Gabriel graphs. We can see from LLMs’ attempted solutions that they recognized this connection. In this case it wasn’t enough of a hint to allow them to solve the problem, but I bet that contest organizers didn’t expect any of the contestants to make this connection at all.
Thus, for all problems that LLMs solve, and especially if they solve harder ones, we’ll want to inspect their solutions to see if they are making use of advanced knowledge. This wouldn’t be “cheating” in any sense, but it can definitely indicate less innate mathematical skill if the cited knowledge removes the need to come up with creative abstractions.
Geometry still probably won’t tell us much
LLMs have their own uninspiring way of tackling geometry problems. Instead of solving them in a more intuitive, spatial manner, they convert them to coordinate systems—the xy-plane or xyz-space—and then churn through pages of algebraic equations. This approach can work, but it’s usually too time-consuming for humans to attempt during the competition. It’s also considered inelegant.
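Here’s a toy example of that style, far easier than anything on an olympiad: drop the figure onto coordinates and let symbolic algebra grind out the conclusion.

```python
# Toy "coordinate bash" (my own example, not an IMO problem): verify that the
# midpoint of the hypotenuse of a right triangle is equidistant from all three
# vertices, by placing the triangle on the xy-plane and doing algebra.
from sympy import symbols, simplify

a, b = symbols("a b", positive=True)

A = (0, 0)          # right angle at the origin
B = (a, 0)          # one leg along the x-axis
C = (0, b)          # the other leg along the y-axis
M = (a / 2, b / 2)  # midpoint of the hypotenuse BC

def dist2(P, Q):
    """Squared distance between two points."""
    return (P[0] - Q[0]) ** 2 + (P[1] - Q[1]) ** 2

assert simplify(dist2(M, A) - dist2(M, B)) == 0
assert simplify(dist2(M, C) - dist2(M, A)) == 0
print(simplify(dist2(M, A)))  # a**2/4 + b**2/4, independent of the triangle
```

A human could do this too; it’s just slow and unilluminating, which is roughly the complaint about LLM geometry solutions.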
The only geometry problem on the 2025 USAMO was P4, and several LLMs can now solve it. I suspect they are not yet ready to apply this coordinate-based approach to significantly harder geometry problems, but it may only be a matter of time. I would be interested if LLMs solved geometry problems with more human-like spatial reasoning, but it wouldn’t be clear if this had broader significance and I mostly don’t expect this to happen.
Closing Notes on LLMs
So what will happen? I think there’s a 60% chance that at least one LLM, most likely Deep Think, gets 3 IMO problems, and thus either a bronze or silver medal. I haven’t seen much evidence of creativity or of the ability to solve harder IMO problems from any LLM. I’d be surprised if we saw that so soon after the results we already have. Maybe next year, but this year I’d only give a 5% chance to either a qualitatively creative solution or a solution to P3 or P6 from an LLM.
Conclusion
In the end, I think it’s moderately unlikely that we learn something surprising from the IMO. The modal outcome might be: AlphaProof wins a gold medal, boosted by some favorable geometry problems; LLMs continue their march up the math competition ladder by solving P2 or P5; in neither case do we see clear evidence of creative problem solving. We’ll have to wait for other benchmarks to tell us about that.
Thanks to Greta Panova for feedback on this post.
Technically AlphaProof together with AlphaGeometry2, as we’ll discuss.
It’s also possible that the IMO problem authors will invest some effort in making problems difficult for publicly available AI systems. This could bias results toward less-impressive AI performance, but also might not be easy to do for unreleased systems.
There is no Nobel Prize in mathematics, but the Fields Medal is widely considered to be the equivalent.
Relatively speaking, that is. Most high school students could not solve even an easier IMO problem.
The difficulty of the final problems on each day is well-known, and thus might be a bit of a self-fulfilling prophecy. For many contestants, it might make more sense not to attempt P3 or P6 and rather spend more time on the easier problems.
The fourth of those problems was actually solved by a different system, AlphaGeometry2, which we’ll discuss shortly.
It also raises the question of how much AlphaGo was, on some level, like this. With math, we can see a bit more of how the sausage is made. Could “move 37” have been more brute-force than we realize?
LLMs can take a crack at generating formal proofs as well, the same way they can take a crack at anything. However, they aren’t very good at it yet, perhaps due to limited training data. I think it would be immensely surprising if an off-the-shelf LLM matched AlphaProof’s performance by generating formal proofs. That’s not to say that an improved version of AlphaProof won’t use an updated LLM within its architecture: if anything, this seems like one of the easiest ways to improve AlphaProof.
AlphaEvolve broke new ground in some research-level math problems. However, these were all centered around numerical optimization: proposed solutions could be verified algorithmically, and AlphaEvolve simply managed to write algorithms that found new optima. This progress is impressive on its own terms, but is a different sort of thing than solving proof-based problems—especially those that rely on abstraction and creativity.
Google might only release AlphaProof’s results if they’re impressive enough. I wasn’t sure how to account for this selection effect. My estimates are really more like “the unconditional results of the best AlphaProof-like system that Google has”, so my conditional estimates are probably biased low.
According to one popular competition difficulty rating system, P3 and P6 on the IMO are sometimes a bit harder than P3 and P6 on the USAMO.
No model solves P4 with perfect reliability, but reliability often improves with scale: if not now, soon.
It’s plausible that Deep Think was trained in part with data generated by AlphaProof.
o3-pro did not solve this problem via ChatGPT when I tried.
I sampled o4-mini-high 100 times on this problem, and it didn’t solve it once.