GPT-5.4 set a new record on FrontierMath
Solved one Tier 4 problem that no model had solved before
GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems!
We had pre-release access to evaluate the model. On Tiers 1–3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%.
See below for commentary and additional experiments.
FrontierMath was funded by OpenAI, which has exclusive access to: all 290 problems in Tiers 1–3; solutions to 237 of these problems; 28 of the 48 problems in Tier 4; solutions to these 28 problems. Epoch holds out the rest.
On Tiers 1–3 GPT-5.4 Pro solved 52% of the non-held-out set and 42% of the held-out set. On Tier 4, GPT-5.4 Pro solved 25% of the non-held-out set and 55% of the held-out set. Neither of these differences is statistically significant.
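As a rough sanity check on the significance claim, here is a minimal sketch of a two-sided Fisher exact test on the held-out versus non-held-out splits. The post does not say which test was used, so Fisher's test is an assumption here. The solved counts are also assumptions: they are reconstructed by rounding the reported percentages against the set sizes given above (237 and 53 for Tiers 1–3, taking "held-out" to mean the problems whose solutions are held out; 28 and 20 for Tier 4). The function and variable names are illustrative only.

```python
from math import comb


def solved_count(pct, n):
    """Reconstruct a solved count from a rounded percentage (an assumption)."""
    return round(pct / 100 * n)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two problem sets; columns are solved / unsolved.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k solved in row 1 with margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Tier 4: 28 non-held-out problems, 20 held-out problems.
t4_open = solved_count(25, 28)
t4_held = solved_count(55, 20)
p_tier4 = fisher_exact_two_sided(t4_open, 28 - t4_open, t4_held, 20 - t4_held)

# Tiers 1-3: 237 problems with shared solutions, 53 with held-out solutions.
t13_open = solved_count(52, 237)
t13_held = solved_count(42, 53)
p_tiers_1_3 = fisher_exact_two_sided(
    t13_open, 237 - t13_open, t13_held, 53 - t13_held
)

print(f"Tier 4:    {t4_open}/28 vs {t4_held}/20, p = {p_tier4:.3f}")
print(f"Tiers 1-3: {t13_open}/237 vs {t13_held}/53, p = {p_tiers_1_3:.3f}")
```

Under these reconstructed counts, both p-values come out above the conventional 0.05 threshold (roughly 0.07 for Tier 4), which is consistent with the statement that neither difference is statistically significant. The Tier 4 gap is the closer call of the two, in part because the sets are so small.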
GPT-5.4 Pro solved one Tier 4 problem that no model had solved before. In a preliminary analysis, it appeared to have found a preprint from 2011 which let it shortcut much of the intended work. The problem author was unaware of this preprint.
We ran GPT-5.4 (xhigh) an additional ten times on Tier 4 to get a pass@10 score. This was 38%. In one of these runs, it solved another problem no model had solved before. This problem was by Bartosz Naskręcki, who responded as follows:
Across all runs ever, 42% (20/48) of the problems in Tier 4 have now been solved at least once.
We also evaluated GPT-5.4 Pro on FrontierMath: Open Problems. It did not solve any problems. It made some novel observations on one problem, but of a form that the author had anticipated and characterized as relatively uninteresting. More here.
Check out our website for more results and commentary about FrontierMath overall!



Well, the whole time I was wondering what Tiers 1, 2, 3, and 4 are.
It seems the post is more concerned with plainly reporting the data than with whether the reader will understand what all of these numbers mean.
The bit about finding a 2011 preprint to shortcut a Tier 4 problem is remarkable. That's not brute-force reasoning; that's something closer to research. I covered the broader benchmark picture, and the FrontierMath results were one of the more surprising data points. The computer use numbers got the headlines, but this deserves more attention. Full breakdown: https://reading.sh/gpt-5-4-just-dropped-heres-your-explainer-8fcc0126d84d?sk=ad5982c9f3b9382ff8fea9c32491a811