AI has solved one of the problems in FrontierMath: Open Problems, our benchmark of real research problems that mathematicians have tried and failed to solve.
How do I submit my solution?
"Congratulations to Kevin Barreto and Liam Price, who first elicited a solution from GPT-5.4 Pro". So the breakthrough here is supposedly that two engineers/mathematicians, together with the mathematician who formulated the problem, miraculously managed in a combined effort to find the right prompt/response sequence that makes GPT-5.4 Pro produce a solution to the problem. We are deep into Clever Hans territory, if you ask me.
If several existing models were capable of solving this problem, why did we only learn this now? Are the scaffolding and prompting very important? I wonder whether there are other problems in this dataset that current models can also solve. Have you checked?
The other models weren't able to one-shot it, which is all we had tried previously. We learned this as soon as we set up our scaffold. It's not so involved, and we'll open-source it once we've finished testing it on all problems. You can see a bit more here for now:
https://x.com/GregHBurnham/status/2036242043412848800?s=20