AI + Math Chat #1 - Thinking smarter, not harder
“How close are today’s large language models to genuine mathematical creativity?” with mathematicians Richard Borcherds, Sergei Gukov, Daniel Litt, and Ken Ono (originally uploaded Aug 18, 2025).
Note: we have updated our previous video with a transcript!
In this first installment of AI + Math Chat, Epoch AI’s lead mathematician Elliot Glazer sits down with four world-class researchers to ask a simple question with sweeping implications: How close are today’s large language models to genuine mathematical creativity?
Elliot Glazer and Ege Erdil from Epoch AI invite Richard Borcherds, Sergei Gukov, Daniel Litt, and Ken Ono to give their thoughts on AI’s math abilities and to guess which of the sample problems from our FrontierMath problem set had been solved as of the recording of this video.
Originally recorded in March 2025.
Transcript
Introduction and Guest Introductions [00:00]
Elliot Glazer
Hi everyone, I’m Elliot Glazer. I am the lead mathematician at Epoch AI. I am a researcher in choiceless set theory who, by a curious sequence of events, ended up leading this project to study AI’s progress in mathematics. Joining me today are the guests for my first AI and math conversation, hopefully the first of a few more to come. As per mathematical tradition, I will introduce my guests in alphabetical order of last name.
First off, that means Richard Borcherds. I think Richard doesn’t really need an introduction, but I’ll give one anyway. He’s a professor of math at Berkeley, famous for his research in a bunch of things, but particularly his work on Monstrous Moonshine. He’s actually the only one of you guys who I interviewed before, back when we were just starting FrontierMath, and I got to hear some of his opinions when our project was still at an early stage. I’ll be quite curious to hear today how his opinions have changed since then, if at all. Of the people I talked to back then, which also included Timothy Gowers and Terence Tao, he was actually the most bullish, which I think ends up looking good in retrospect.
Next up we have Sergei Gukov, who is a professor of math at Caltech. Also someone who researches a lot of things. Actually, when I looked at your web page, I was so impressed by the diversity of it. One thing that came to my mind is that it almost looked like how an AI-generated math web page would look. It just feels like almost the stereotypical everyman of mathematics. Two things in particular to note were geometric Langlands and quantum topology, and also, particularly relevant to us, he has begun doing some research on machine learning applications to mathematics.
Next up we have Daniel Litt, assistant professor of math at the University of Toronto, who studies a lot of algebraic geometry and number theory and has a really epic presence on Math Twitter. That’s kind of how I first interacted with him, just seeing his posts there. When I saw that he was interested in doing FrontierMath, I was like, hell yeah. So we’ll likely talk a little about one of your problems later, and I’ll be curious to hear your opinions on how well AI has been doing on that problem.
Finally, we have Ken Ono, who is the Marvin Rosenblum Professor of Mathematics at the University of Virginia, just an all-around great researcher in number theory, and one of our first contributors. I believe he also used to run an REU at Emory University.
Ken Ono
That’s right. And also at the University of Virginia, and previously at the University of Wisconsin.
Elliot Glazer
Yeah, because that’s actually where I first heard of you, from a lot of my friends having been former students of yours. So it’s funny to see your name pop up again in this context all these years later. The last person in the call is my colleague Ege Erdil, who is a senior researcher at Epoch AI. He’ll most likely be in the background but will chime in if we ever get involved in some of the more technical aspects of AI and how it’s been tackling math problems. So I figured we should have someone on the call who is qualified to talk about such questions if the conversation topic arises.
Opinions on the State of AI [03:46]
Elliot Glazer
Let’s start off with the first topic. I want to hear everyone’s opinions about AI and math. What have your interactions been with AI? Which ones have you tried? Have any impressed you? Have any made you go, “oh no, our jobs are in danger, or at least my students’ jobs are in danger?” What are your thoughts?
Ken Ono
My opinion with regard to AI and large language models is probably typical. When ChatGPT was released to the world a few years ago, it came as a shock, but I thought for sure that the life and work of the pure mathematician would be fairly safe. I even gave a talk at the University of Virginia where I described AI as more recognition than cognition. It seemed like it was doing a wonderful job of memorizing or having access to the accumulation of human knowledge, which is impressive in its own right. But I thought for sure that the more creative aspects of our craft in pure mathematics would be decades away.
But here we are today. My opinion has evolved. It’s kind of frightening to see that the more standard questions that you might ask even of a second or third year PhD student in number theory can be routinely answered as if it was trivial. It’s mind-boggling. And so where does that leave us for the future? Well, I guess that’s what Epoch AI is all about.
Daniel Litt
I think one thing I’ve kind of learned from the successes of the more recent LLMs and reasoning models in doing mathematics is that maybe actually a lot of what we do is more about recognition than cognition. Looking at how they approach the sort of standard questions I’ve been asking them, I kind of get the feeling that a lot of what they’re doing, but also a lot of what I do, is drawing on some kind of collection of facts that I know. Sometimes you put those facts together in creative ways, but it’s pretty unusual that you have a really fundamental new idea that changes how you’re thinking about something.
So far, I feel like I personally have had important new ideas, and I haven’t seen that in an LLM. But on the other hand, I think I do that very rarely. In terms of proving a lemma or something that’s not so far from the standard, I think I was very surprised when I started seeing LLMs be able to do this thing. But I think in retrospect, maybe I shouldn’t have been. Yeah, it’s been a very useful way of changing how I think about the profession.
Sergei Gukov
These days I spend 100% of my time developing AI tools to help with very hard research-level math problems. The reason is that five years ago, I tried to use AI and ML to help with everyday math problems and it was useful. So that brought me to this area. I’m heavily involved with several Silicon Valley companies and have an AI and Math Lab at Caltech, so I actually build AI systems on a daily basis. And what I see is that scaling LLMs and scaling generative and other models is not going to cut it.
When I give talks, I usually emphasize that there are many levels of math, going from elementary school, middle school, high school, IMO level, undergraduate level, graduate level, everyday research, high-level research, and Millennium Prize problems. Going from one level to another is a lot of work AI-wise. It requires a lot of scaling, a lot of clean data. Also, as we move up this hierarchy from level to level, the amount of data plummets. For example, we have lots of data to train on IMO-level problems, but much less when it comes to the Riemann hypothesis or the Andrews–Curtis conjecture, which is something I worked on recently.
So what we’re doing is developing very specific hacks and tricks which would augment LLMs. I may be wrong, but my prediction is that the context windows of attention-based models such as Transformers are just not going to be enough for that. So I’m pretty confident that we’re safe for the next five years. I’d love to be wrong, but basically what we’re doing is trying to help AI replace us as mathematicians. But I believe this will require some additional work, some clever tricks.
Elliot Glazer
What do you mean when you say you’d love to be wrong? Are you saying you don’t want us to be safe for five years, or do you want us to be safe for longer?
Sergei Gukov
Well, I would love AI to be a useful tool where it can solve the Andrews–Curtis conjecture and the Riemann hypothesis and other problems as routinely as it does undergraduate-level math. But even at the undergrad level, I’m sure all of you have had the experience where you try to ask more specialized questions and it just gives you gibberish. So this is very typical for LLMs and generative models, where they do well on the bulk of the distribution. But solving hard math problems really requires going way out far in the tail, finding a needle in a haystack. A proof of the Riemann hypothesis, which I think all of us believe exists, is going to be not just 100 steps, which is like the longest game of chess. It’s going to be millions of steps, which is something that LLMs can’t play right now.
In that sense, I think hard math problems are more like black swans, more like needles in a haystack. And generic thinking patterns are not good enough. What you need is really the ability to build on existing datasets, being able to generalize not just epsilon away from training data, but megaparsecs away. And we don’t have this yet. We don’t have this insight in AI science. Sorry for being the pessimist, but that’s my opinion. I’m trying to develop it, so I’m eager to see us produce such technology.
Elliot Glazer
Actually, before we get to Richard (I have not forgotten that Richard’s supposed to give his opinion), I’m curious about one point you brought up, which is the idea of a model generating a proof of the Riemann hypothesis that is millions of steps long. If that occurs and it turns out to be utterly inscrutable to humans, how satisfied will you guys be with this proof?
Ken Ono
First of all, I’d be satisfied that it is a proof. And almost certainly there will be steps that will be new to us as mathematicians. And so in addition to the confirmation of the Riemann hypothesis, there will probably be a whole new body of mathematics. Maybe something akin to what happened with the proof of Fermat’s Last Theorem that opened up a whole new generation of ideas that can go on to have applications that matter to people beyond the world of mathematics.
Sergei Gukov
I agree with Ken. I would say, just like AlphaGo pointed out completely new moves in Go, like the famous move 37, and similarly in chess, where Magnus Carlsen developed new theories based on what the engines did, I think there will be a whole group of people analyzing and white-boxing what happened there. And of course, data science will help with that as well. So I’m quite optimistic. I don’t know how much good stuff will come out, but that something good will come out, I’m pretty confident.
Elliot Glazer
Finally, to round out the previous question: Richard, I’m curious what your thoughts are on AI and how you’ve played with it thus far.
Richard Borcherds
Well, I’ve played with it a bit, and one thing that’s really struck me is how similar to humans it is. It even makes human-like mistakes. Like it will forget about the case X equals zero in some proof or something like that. And it also has that rather human habit that when it comes across a question it can’t do, it will sort of bluff and produce some gibberish, like a weak undergraduate trying to convince you it knows what’s going on.
But yeah, it’s only a matter of time before it puts us all out of business. Very roughly, it seems to be improving at about the same rate as a human undergraduate or graduate student, one year per year. So in a few years’ time, if it continues improving at this rate, it’s going to be putting us out of business.
Yeah, I actually tried a little Turing test on it. It fails the Turing test because it’s too good. The only reason it fails the Turing test is that it keeps getting answers right to questions that humans wouldn’t be able to answer. As a simple example, even if you ask it to pretend to be a human, you can still ask it to multiply two 10-digit numbers and it will happily multiply them, thus giving away the fact that it’s too smart to be a human. But I was really unable to think of anything that wouldn’t have convinced me that it was a human, apart from the fact that it’s too good.
Elliot Glazer
Yeah. When I’ve looked at these traces, one of the things that really stands out to me as being inhuman is knowledge retention. Not surprisingly, that is one huge leg up AI has over us: being able to just memorize the whole literature. That allows it to casually cite theorems in a way that you wouldn’t be able to do yourself unless you had spent a week looking up the literature relevant to this exact problem in advance. I’m not saying it reasons that greatly once it has all of these things, but that in itself is a very obviously inhuman aspect of its approach.
All right. Well, thank you all for giving me your initial takes on artificial intelligence and its use in math.
FrontierMath History [14:50]
Elliot Glazer
So let me start talking a little bit about my project. The first version of it that has been completed is called FrontierMath. It actually started before my time at Epoch AI. So I might ask Ege to fill in some details if anything I’m saying about the early parts is a bit off.
My understanding was: it was originally conceived as a benchmark that could reliably test AI on material of higher difficulty than most of the benchmarks being saturated at the time, which often relied too heavily on multiple-choice questions. The objectivity of mathematics made it a prime choice for such a benchmark. And Ege was the one originally leading it. My understanding was that from your original conversations with OpenAI, they were saying, look, you don’t want to have problems that are designed to be just barely outside of current AI’s range. Our models are improving so fast that it’ll be saturated overnight unless you’re willing to go further ahead and produce math problems that are, as far as you can tell, completely outside of what AI is currently capable of doing. Is that basically how that conversation went?
Ege Erdil
Yeah, we were initially trying to produce, I think, a broader distribution of difficulty, because we wanted the benchmark scores to scale more continuously, to have better scaling behavior. But when we talked to OpenAI, we gave them a pilot run of FrontierMath, which consisted of substantially easier problems, where the median problem was something that maybe a PhD student could solve in 5 or 10 minutes, which is a lot easier than FrontierMath. When we showed those to OpenAI, they were like, no, no, no, you should just go way harder than this. This is nowhere near difficult enough.
That feedback was pretty valuable. They didn’t tell us what they were working on. We didn’t know about reasoning models or anything like that. But they did give us the feedback that, “no, you should make this much harder.” And so yeah, we ended up doing that. I’m very glad that we were able to find Elliot for the project.
There’s a funny story where we got a lot of applications for the project lead role, and we chose the person to hire to lead the project by doing a two-week work trial, where we gave people a certain budget, like $1,000 or $2,000, and asked them to get us as many difficult questions as they could with this budget. Most people did fairly conventional things like emailing people. Elliot, on the other hand, had the idea of organizing a house party in Berkeley with a lot of graduate students and getting them all to work together, and he came up with way more problems than anyone else did. So that convinced me that he was a very good fit for the role. I’m very happy that we made that decision.
Elliot Glazer
To be clear: I had zero problems until like two days before the deadline, when I realized how difficult it is to get people to do stuff by cold emailing. So I just blew all my budget in one night, and it turned out to be the most efficient approach. But yeah.
What Happens if You Ask AI to Write Questions? [18:13]
Richard Borcherds
I think, to ask the obvious question, what happens if you ask AI to suggest good problems for AI?
Daniel Litt
Yeah, that’s interesting.
Ege Erdil
Honestly, I don’t know. That’s a good question. I think it probably wouldn’t work that well right now, just based on my experience with the systems. But I think eventually it’s going to work better.
Elliot Glazer
I just hope it wouldn’t… especially if you tell it that it has to give us a problem it itself cannot solve. How good would it be at that? Coming up with a problem that’s pitched just right… come to think of it, that is actually probably an important research skill, right? What we research mathematicians do is think about the questions that are just outside of our current knowledge, so that we have a tractable next step to investigate. So it’s actually probably something worth experimenting with more.
Daniel Litt
I’ve tried to use o3-mini-high to generate problems, mostly out of curiosity, but I’ve never gotten it to generate anything that’s even slightly non-standard. So far, at least within my capabilities, it doesn’t seem like I can get the models to produce anything interesting.
Context on Sample Problems [20:00]
Elliot Glazer
And we realized that we need harder problems. That’s when they did the search for a mathematician, and I got the job, and I’ve been running it ever since. Definitely, since I was hired, it’s been more a matter of a mathematician learning to be a manager than the other way around. I definitely had a lot of difficulty just having to manage such a large-scale project. But I’ve learned a lot, and hopefully won’t repeat mistakes in the next project. I’m going to do a write-up at some point where I explicitly lay out what errors I made. But let’s go on to a more specific topic: I want to talk about the public sample problems.
So just for context on why there’s a public sample: the vast majority of FrontierMath, which is 300 problems, is private. We don’t want an AI company to basically just hire a mathematician or several to solve some problems and feed in the answers or anything like that. So we had to keep the data secure. And we definitely don’t want the models to just directly train on the data of these problems. However, we do have a public sample online so that people can at least get a better sense of what the problems are like.
Currently on our website are our first five public problems, which were published back in November. Those were selected randomly, essentially one from each quintile of difficulty, and they’re listed in order from hardest to easiest. We will soon be making public the last five sample problems. These were not chosen by a random process. Actually, because we have been so surprised thus far about what AI models can and can’t solve, we are completely reworking our internal difficulty analysis process so that it becomes much more reliable. For now, I don’t really trust our vague difficulty ratings, and instead chose five problems that I found particularly informative based on what I’ve learned from how AI has tackled them. So I have sent our guests those five problems, and those will be going public shortly, probably within the week. Of these ten problems, I only sent them to you yesterday, so I would understand if you guys haven’t had much time to give them a close look. But do any of them stand out as ones where you would go, no way AI is going to solve this soon? Or, that seems kind of hard, but when I think about it, I can actually see how an automated system would approach it?
Daniel Litt
So Elliot, first, just for some context, will you share with us how well AI did on these problems? Or are you saving that for after the discussion?
Elliot Glazer
What I can tell you is that, of these ten, there are three that have never been solved by AI. And I will reveal those at the end. So keep that in mind. Which ones feel to you like they’re most likely to be currently out of reach?
Tsirelson Space [22:53]
Richard Borcherds
Are you asking us to try and guess which ones we think the AI can’t solve? I mean, if you want me to guess, I thought the one about Tsirelson space, I would be kind of impressed if AI got that. But I mean, I think that’s highly non-standard. And the answer went on for several pages.
Elliot Glazer
This was a problem written by Will Sawin, who I had actually known before from his really good posts on MathOverflow, answering a lot of interesting set theory and topology questions. And I really enjoyed this problem. It seemed like a very clever way to start extracting some concrete numbers out of some really abstract structures. We don’t have many questions about Banach space theory for, well, probably obvious reasons. We’d like to have more such problems, but not many people in Banach space theory felt confident that they could send us something.
Richard Borcherds
I’d say this isn’t really a problem about Banach space theory at all. It’s a problem about finite combinatorics.
Elliot Glazer
That is definitely a common reality among the more abstract topics, where the author took some finite combinatorics and then disguised it in the language of a more abstract field. But that’s a keen observation. Yes, a lot of these will have that property.
Daniel Litt
Yeah, I agree that that was probably one of the three unsolved problems.
Elliot Glazer
So we got two in agreement on the Tsirelson space problem.
Sergei Gukov
I’m not sure about this problem, because of exactly the aspects we just discussed. And it’s actually far from quantum topology, which is closer to my area. But of the five problems that are already available on the website, I would predict that the last three, the ones with counting in the name (basically counting solutions, counting some extensions, and counting degree-19 something something), have all been solved.
Daniel Litt
I don’t have a strong guess about which ones have already been solved, but I would not be surprised if all of them were solved by the end of the year.
Elliot Glazer
Oh, I see. Okay, by end of year. So whichever three they are, they’re probably just barely not solved yet.
Daniel Litt
Somehow, anything where the solution is half a page or a page means that you’ve taken three or five standard facts and put them together. And the AI knows every standard fact. A lot of what makes these hard is knowing a standard fact, right? I don’t know a lot about the Tsirelson space or whatever, but what makes it hard for me is not what makes it hard for the computer.
Solutions vs Proofs [26:31]
Elliot Glazer
So the Tsirelson space problem has been solved. And I was also surprised; this was one that I also thought would have lasted a while. I actually took a more careful look at the structure of the solution, where, yes, the full solution with the full proof ends up being four pages. But the key thing is that the answer actually first appears in the middle of the second page of the solution. And this is a common theme among problems for which our internal difficulty estimates ended up being significantly beyond how hard it was for AI: it turns out that, because of the way our benchmark is formatted, for problems where a lot of the labor is in proving lemmas or verifying natural guesses, the model can just skip that, notice the value that comes up naturally when doing the relevant computations on the combinatorial structure underlying the problem statement, and just submit that. And that actually...
Daniel Litt
Yeah, that’s a good point. We’re not asking AI to give a proof that this number is the correct answer.
Elliot Glazer
That’s right. So let me say a little bit more about some of the difficulties with FrontierMath, which is that we intentionally have this sort of contrived format where all problems have to come down to some closed form. This was largely done out of necessity, because the alternatives are either to request a natural language proof, which would require human grading, and that would be a nightmare to deal with, or to have something like a Lean-based benchmark, where AI contributes proofs directly in the language of Lean. We are actually interested in doing such a project, but that is a very different endeavor, and it also suffers from the fact that lots of mathematics, even at the undergraduate level, has not been formalized in Lean. So we felt there was still a need for a natural language benchmark. And this puts a sort of onus on the writer to find some way to extract numbers from their work such that finding the number really draws on the core skills of the work, not just on proving correctness. That is something we’re going to be harping on more as we guide the next steps of the project: ensuring that authors understand what it would mean to have a number whose value can only be found by going through all the key reasoning steps, so that you don’t end up in a situation where the answer is found halfway through the solution and then a lot more reasoning is done afterward.
Lean Autoformalization [29:00]
Richard Borcherds
You said that one of the problems is that a lot of mathematics hasn’t yet been formalized in Lean. So obvious question: has AI yet reached a state where you can just simply formalize all known mathematics in Lean using it?
Elliot Glazer
It is not there yet, and that seems to be a big active research topic. So do you know anything about this, Sergei?
Sergei Gukov
Yeah, I know some. And what’s going to happen is that my prediction is about a year or two from now, the process called auto-formalization will take place, and then it will be like an avalanche. So basically English to Lean will be seamless, very much like English to German. But that requires a lot of things, including tons of clean data, which is not there yet. Luckily, many teams, literally dozens, if not hundreds, are going after it. So my prediction with the current rate is that it will take a year or two until we get to auto-formalization.
Ege Erdil
I don’t know if I agree. I mean, that seems plausible. I haven’t looked that closely into it. My sense is that lack of data is just generally an issue, even for the reasoning models. One of the interesting things is that, say, for DeepSeek’s reasoning model, we know how it was trained, and it only takes on the order of billions of tokens to turn a base model that has no reasoning training at all into a reasoner that’s actually able to solve math questions. And that’s compared to the tens of trillions of tokens on which the models are trained at the start. So the amount of data that’s actually needed to give models these reasoning capabilities seems to be very small compared to how much you need to get them up to the level of just being a good chatbot or something. It’s just that that data doesn’t exist. People don’t write out reasoning traces that you can train on. So you have to generate it synthetically. And that’s expensive.
So I think there are a bunch of domains like this where maybe competence only takes a small amount of data, but it’s just very expensive data. It’s not something that has already been generated. So either we have to generate it through simulation, through reinforcement learning, or maybe human teams have to get together and work very hard for one or two years to create the necessary training data. Honestly, even the training of the chatbots has, to a large extent, been relying on the huge corpus of text that has been generated on the internet. If you had been trying to train LLMs in 1980, you would have had such a smaller base of data that you probably wouldn’t have been able to get anywhere near this performance. So the data we have naturally generated over the years has been very useful, I think, and will continue being useful.
Daniel Litt
Do you guys think auto-formalization will take off equally throughout math? I mean, my sense is that combinatorics or something is very near to the ground: almost any argument, you can kind of just start from axioms in Lean. But if you want to do something in the Langlands program, the language isn’t even in Lean yet. Schemes were just recently formalized, right? So, I don’t know, do you think the capabilities are going to get to the point where, if I want to formalize some paper doing Hodge theory, I can just do it? Or is it going to be, you know, a big difference between areas?
Sergei Gukov
It’s going to be like an avalanche because there are many teams which are pursuing many different domains right now. That’s point number one. So many areas seem to be reasonably equally covered. Another point is that many teams are also working on transfer learning and analogies. So people know that it’s useful to have a geometric way of reasoning for solving an algebra problem. And many teams are purposefully encoding this in formalization so that there is infrastructure which allows making these bridges. And again, the point is with auto-formalization, once it takes off, my sense is it’s going to be like an avalanche. Everybody will have it, and pretty much every area will be covered very quickly.
Daniel Litt
So, Sergei, for my benefit, can you give an example of a high-water mark, where auto-formalization already proves or solves a particularly strikingly difficult problem?
Sergei Gukov
Well, there are several things. Auto-formalization is the process of translating from natural language to Lean, Isabelle, Coq, formal languages like that. So that’s what I was talking about. The question that we’re also discussing, which is needed for this, is to what extent formal proofs are developed in each given area. So for example, the question Daniel was asking is based on the fact that in algebra and combinatorics, and also number theory, it’s reasonably well developed. Whereas in one of the areas where I work, knot theory, it did not even exist until two years ago. It took me giving many, many talks around the globe. It was only last year that the first couple of submissions in knot theory appeared, and it’s not enough for me to even start teaching a course using a formal language.
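(Editor’s note: as a deliberately trivial illustration of what “translating from natural language to Lean” means, here are two toy statements with one possible Lean 4 formalization of each. These are not FrontierMath problems, and the theorem names are made up for the example.)

```lean
-- English: "For every natural number n, n + 0 = n."
-- One possible Lean 4 formalization; the proof holds by definitional unfolding.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- English: "There is no largest natural number."
-- Formalized as: for every n there exists an m with n < m.
theorem no_largest_nat (n : Nat) : ∃ m, n < m := ⟨n + 1, Nat.lt_succ_self n⟩
```

Auto-formalization, as Sergei describes it, is the task of producing translations like these automatically from ordinary mathematical prose, at the scale and difficulty of research mathematics.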
Richard Borcherds
All right, I had one example of something that might be quite difficult to formalize, which is if you take something like…
Sergei Gukov
Yeah, that’s the problem in knot theory and Venn diagrams, that’s exactly…
Richard Borcherds
Yeah. So formalizing that will impress me. Yeah, I mean there’s a calculation you do just by doing little doodles. I mean it’s really hard to describe in words.
Sergei Gukov
Well, it will be done, because again, not to go too far into the technical stuff, but for example, when we teach a course in knot theory, the way we define a knot is not as an embedding of S1 in S3. We start with polygonal knots, because that’s actually a definition for which we can prove something. That’s cheating, especially if you have in mind further applications. And of course we have to remedy this cheating later in the course. But I think that’s how it can proceed. So yeah, I think if we believe our theorems, then there should be a way to formalize them.
Ken Ono
So a question for you, Richard. Do you think that the classification of finite simple groups is amenable to AI?
Richard Borcherds
Yeah. I mean, that’s obvious. I don’t mean doing the classification. But formalizing the classification by AI would be, well, I mean, you know… there’s no such thing as a 10,000-page human-written proof without a mistake in it. So formalizing that would reassure people a lot. I mean, of course the Feit–Thompson theorem was formalized several years ago, so...
Sergei Gukov
The four color theorem was also formalized, I believe.
Smooth Conics Problem [36:00]
Elliot Glazer
So I’ll take the time for one more sample problem before we finally go to our final topic, which will be Tier 4. But actually I’d like to talk a little bit about Daniel’s problem, if you are in the mood to talk about that.
Daniel Litt
Basically the question is: you choose five random smooth conics over a finite field F_q. You look at the space of conics tangent to all five, and you want to know what the expected number of components of that space is as q goes to infinity. So yeah, this is again kind of a hidden combinatorics problem, but the combinatorics you need to do involves solving, or at least understanding, some kind of enumerative geometry problem. Maybe I shouldn’t say more, because I kind of know what the outcome of the AI benchmarking has been on this problem. But if anyone else wants to guess before I comment on that.
Elliot Glazer
Do you want to make a guess? Solved or not?
Daniel Litt
I’m going to say solved.
Richard Borcherds
If you want me to guess, I also guess that it’s solved by AI, since, as in the previous problem, it’s a lot easier if you allow it to do some inspired guessing rather than proving everything.
Daniel Litt
Yeah, yeah, I think so. This is sort of what made me realize that you don’t have to be able to prove this stuff to get the right answers. My understanding is that the original history here is there was some initial run of benchmarking in which I guess o3-mini maybe got this problem wrong 100% of the time, and then some second round in which it got it right at least once. Is that right, Elliot?
Elliot Glazer
That’s right. In the earliest evaluations it had 0% success rate. But it has since had a successful evaluation. And I can actually show off a little bit of the reasoning trace.
Daniel Litt
And I would say that now, if you give this problem to o3-mini-high, it gets it right 100% of the time. So I would be very interested in understanding what has happened between those initial evaluations and the current publicly available model, which can just solve the problem 100% of the time. I don’t know if it’s literally 100%, but every time I’ve put it in over the last month or so, many times, it’s given the right answer.
Ken Ono
And in this trace, does the Chebotarev density theorem actually appear, or is it...?
Daniel Litt
Yeah, it goes through exactly the same reasoning, the same heuristic reasoning, I give in the solution. If you ask it to justify any of the steps, it cannot successfully justify them. But yeah, I mean, it knows the relevant facts: it knows the enumerative geometry of conics, it knows the Galois group is S3264, or it guesses it. And it knows that the Frobenius should be equidistributed in that group, and it knows the statistics of random permutations. If you ask it to justify literally any of those steps, it cannot do it successfully. But of course you don’t have to justify them to get the right answer.
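(Editor’s note: for readers unfamiliar with the heuristic Daniel describes, here is a minimal sketch of the kind of “statistics of random permutations” involved. Under the equidistribution heuristic, the expected number of components corresponds to the expected number of cycles of a random element of the Galois group acting on the solutions; for the full symmetric group on n elements that expectation is the harmonic number H_n. The value n = 3264 below is the classical count of conics tangent to five general conics. This is only an illustration of the statistic, not the problem’s official solution.)

```python
import random

def expected_cycles_exact(n: int) -> float:
    """Expected number of cycles of a uniformly random permutation of n
    elements: the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def expected_cycles_monte_carlo(n: int, trials: int = 2000) -> float:
    """Estimate the same quantity by sampling random permutations and
    counting their cycles directly."""
    total_cycles = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        seen = [False] * n
        for start in range(n):
            if not seen[start]:
                total_cycles += 1  # found a new cycle
                j = start
                while not seen[j]:
                    seen[j] = True
                    j = perm[j]
    return total_cycles / trials

if __name__ == "__main__":
    n = 3264  # conics tangent to five general conics
    print(expected_cycles_exact(n))        # roughly 8.67
    print(expected_cycles_monte_carlo(n))  # should land close to the above
```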
Elliot Glazer
By the way, is the text on my screen legible?
Daniel Litt
Yes.
Elliot Glazer
So you can sort of see the funny thing it does. It’s constantly making these guesses and little deductions, then asking questions and going backwards. For example, actually… if I control + F for Chebotarev density theorem, coming to a random part of it… These things are very long. This thing is like 30 pages of text. But you can see, for example: it answers with a number that’s relevant but is not actually what we asked for. And then it realizes, oh wait, that’s not what we asked for. Okay, never mind. That’s not what’s asked for. So it’s really weird. But on some level this actually feels rather human, and honest to the actual internal monologue that at least I have but would never write down when explaining my approach to a problem, where there are lots of deductions that go nowhere, or lots of temporary false epiphanies that I later realize don’t make sense.
Daniel Litt
I would like to understand how it decides to submit the final answer, because it gets a wrong answer like a hundred times in this thought process before it actually says the right answer. And then it gets the wrong answer again some more, and then it gets the right answer finally. So it’s a very weird...
Elliot Glazer
Well, again, I’ll skip to the bottom of the trace and you can see what it’s doing. It basically does the same thing, except it repeatedly affirms itself: “It seems to be 866. I’m going to box 866, but let me check a subtle point.”
Daniel Litt
But it also does the same thing before with the wrong answer. So yeah, it’s kind of weird.
Richard Borcherds
Looking at the reasoning, I kind of wonder, am I just a large language model? I mean, it seems to be doing the same thing that I would be doing.
Ken Ono
Right? Yeah. It certainly can pass an oral qualifying exam at the University of Virginia. Yeah, we wouldn’t even let it get that far in that trace. Sergei has a comment?
Sergei Gukov
Well, actually, that’s a question. I think again, since we’re discussing traces, for more context, it will be good to know what exactly was the prompt used for this, and was there any fine-tuning?
Does Prompting Improve Performance? [42:48]
Daniel Litt
So I think that the answer to the second question...right, Elliot?
Elliot Glazer
Yeah. What was the second part?
Daniel Litt
Was there any fine-tuning?
Sergei Gukov
Was there any fine-tuning? And what was exactly the prompt? Was there much prompt engineering, in other words?
Daniel Litt
What I can tell you is that when I run it on o3-mini-high now, I’ve tried many different variants and the prompt seems not to have any effect. But yeah, I think the file we were sent does include some prompt. From my memory, and maybe Elliot is opening it now, it’s something like just the problem, plus “enter the answer as a three-digit number” or something like that, as an integer.
Elliot Glazer
It looks like it was given the verbatim problem. But I don’t think the reasoning trace really fully reveals everything it was asked, so I’m not fully sure of the exact thing it was told. To be clear, this was from OpenAI’s evaluation, which is separate from the ones that we are doing. Our paper has full instructions regarding exactly what we tell our models. And when we update our benchmark hub with FrontierMath results, which I believe will be within the next week, those will also be very clear about the choice of prompts that we use in our evaluations.
Richard Borcherds
Are there any prompts that improve performance on pure math questions?
Elliot Glazer
It’s mostly just… you need to do some nudging to sometimes get them to really, fully try. We always say, for example, “this is a hard math problem,” so that it knows to allot a sufficient amount of computational resources. If it’s not warned, it will sometimes heuristically put in only as much effort as it would for a casual conversation, and then there’s no chance of it solving the problem. So it’s good to at least put it in the mindset, I guess, of really, truly trying to solve a problem. But yeah, you can see some of the prompts we use in our paper.
Daniel Litt
Yeah. One comment I just wanted to make about this specific problem, because I think it does reveal something about what’s going on behind the scenes. You can ask the same question with lines on a cubic surface instead of conics tangent to five conics. And there’s a difference, which is that the Galois group of the enumerative problem is different: it’s the Weyl group of E6 instead of the full symmetric group. So when I try this on o3-mini-high, it does the exact same thing. It uses the correct heuristic. It says, I want to understand the statistics of a random element of the Weyl group of E6 acting on a set that has 27 elements. And then it just bullshits. So it happens to have memorized the statistics for the symmetric group, but not for the Weyl group of E6. I’d be very interested to know whether the very high-effort models OpenAI has access to could actually compute the statistics of a different Weyl group from scratch, or happen to have memorized them.
Richard Borcherds
The statistics of E6 are very easy to work out.
Daniel Litt
Sure, sure. Easy for us. I’ve learned that I just have no way of telling what’s easy for the LLM.
Pass@1 and Pass@k [44:29]
Ege Erdil
So one interesting thing that we’ve noticed, both when running the evaluations ourselves and also when looking at OpenAI’s results, is that there’s this evaluation setting called pass@k, which means you run the model, say, k times, maybe 5 or 10 times, and then the model is counted as getting the problem right if any of the answers are correct. When designing the benchmark, we might have naively supposed that, since the answers are these large integers with at least three digits, maybe ten digits, whatever, just giving the model five tries instead of one try wouldn’t help very much if the model doesn’t know how to solve the problem.
But in fact, pass@k scaling is pretty substantial. I think pass@5 instead of pass@1 increases the score of the model by something like 10% on FrontierMath. And that is very interesting. I think it reveals some things about how the model goes about solving problems. Sometimes it just makes a mistake and gets the wrong answer, so pass@k reduces the impact of any individual mistake. But it’s also that the model guesses… it figures out roughly what it’s supposed to do. Maybe it comes up with a template for how it’s supposed to solve the problem. But then it is uncertain about some points in the solution and it just guesses those points. And sometimes it guesses wrong and gets the wrong answer, but sometimes it gets the right answer.
And there’s this quantity called the mean log pass rate in benchmarking, which usually scales a lot more smoothly than the raw accuracy score. For each problem in your benchmark, you look at, on average, roughly how many tries the model needs before it gets the right answer. Then you take the log of that and average it over all the problems in the benchmark, and that just scales a lot more smoothly. And I think that means that if you are already doing very well on a problem at pass@10 or pass@20 or even pass@100, it’s actually pretty reasonable to expect that problem to get solved at pass@1 if you just scale your resources a bunch more. So I think this pass@k scaling has been surprising, and it suggests that maybe there’s just even more potential underneath these models that is waiting to be uncovered, maybe when people finally scale up the test-time compute to something like how much they’re spending on training. It’s just something that we found interesting.
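(Editor’s note: a minimal sketch of how these two quantities are commonly computed. The pass@k formula is the standard unbiased estimator used in LLM code and math evaluations; the mean-log-pass-rate function is one plausible reading of Ege’s description, with the clipping floor for never-solved problems being an arbitrary choice, not Epoch AI’s exact metric.)

```python
import math
from typing import List

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n sampled answers, c of which are
    correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_log_pass_rate(correct_counts: List[int], n: int,
                       floor: float = 1e-3) -> float:
    """Average of log(empirical pass rate) over problems. Problems the model
    never solved are clipped at `floor` so the log stays finite."""
    logs = []
    for c in correct_counts:
        rate = max(c / n, floor)
        logs.append(math.log(rate))
    return sum(logs) / len(logs)

# Toy example: four problems, ten samples each; the model answered them
# correctly 0, 1, 3, and 10 times respectively.
correct = [0, 1, 3, 10]
print([round(pass_at_k(10, c, k=5), 3) for c in correct])  # [0.0, 0.5, 0.917, 1.0]
print(round(mean_log_pass_rate(correct, n=10), 3))
```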
Ken Ono
Ege, can I ask a quick question? So in coming up with these metrics, is there literature, for example, on high school algebra students where people have studied the learning rate of students who struggle… I mean, are you seeing parallels? I mean, how humanistic is this process?
Ege Erdil
That’s a good question. I mean, I honestly don’t know if anyone has studied mean log pass rates. You need to do a lot of tries, and I’m not sure if that’s something that’s easy to do in a high school setting. But I would be very interested to see those results as well. I don’t really have a good answer. But I would not be surprised if, done properly, it turns out to be similar. I think people just haven’t studied this question very much for humans.
FrontierMath Tier 4 [48:00]
Elliot Glazer
Okay, so since we’re running low on time, I’m going to quickly get to our final topic. I guess I will not reveal which ones remain unsolved, but since these problems will all be public, I guess people will have more opportunity to speculate there. But seeing how surprisingly fast...
Daniel Litt
Maybe you can tell us by email.
Elliot Glazer
I will let you know if you want to know. Or you can send me your guesses first. But yeah. So we were shocked by how fast AI is creeping up on FrontierMath, the base project. So we need to make a harder project, and we want something that will actually last years – if that is possible at all, which it might not be. But we’re doing our best here. So that brings us to our follow-up project, which is FrontierMath Tier 4, named as such because we had a very rough three-tier difficulty system in the base project. Let me go back to presenter mode… But we have our website up now showing off the layout for Tier 4. So we posted this last week as a new tab on our FrontierMath website. And right at the top is a contribution from Ken Ono and his former PhD student, Michael Griffin. Would you like to say something about this problem? The solution is not public. So just like, what is this problem? What’s interesting?
Ken Ono on His Tier 4 Sample Problem [49:25]
Ken Ono
So I’d like to describe what I was thinking without giving specifics about the problem, because that would be a set of hints that is couched in the discussion that we’ve just had. So in this problem I defined two functions in the complex variable z on the upper half of the complex plane, one called F and one called G. One, F, is presented as a complicated infinite sum, and G is presented as a complicated infinite product. And my guess is that an LLM or a chatbot won’t recognize these functions (F and G), might look for them, and I don’t think it will find them.
Then I ask for a search for a pair of primes L1 and L2 that satisfy four conditions. And somewhat analogous to what Daniel’s problem had as a solution, this is more or less along the lines of, well, the AI knows the standard objects in algebraic number theory. Will it be able to compute the standard things quickly? So does it know about discriminants of quadratic fields? Does it know about class numbers of quadratic fields? Does it know what it means for a residue class to be a primitive root? And finally, does it know how to compute the Mordell-Weil group of a fairly standard elliptic curve, which every graduate student in this field would be able to do?
Richard Borcherds
The answer is AI can do all of these.
Ken Ono
Yeah. But I wanted to include that as a benchmark so that when I follow the tracing I want to say check, check, check, check, check. Then my guess is, and I hope I’m not wrong, but if I am wrong, I’ll be surprised. But we’ve been surprised many times already.
I want to use these primes, L1 and L2, to compute a number called “alpha,” which is presented as a very curious limit based on the values of the functions F and G. If you try to compute these limits for F and G individually, they both diverge. So a computer trying to compute them will just get stuck. So there’s something very special about how alpha is defined, so that you have this algebraic symmetric expression in divergent functions that gives rise to a number. And I give away in the next sentence that this number actually happens to be algebraic. So if it computes 50,000 decimal places, it might not know that this number is algebraic. But I give it as a clue that it is.
So I’m going to ask the computer to compute its minimal polynomial, compute its Galois group, and determine the splitting field. And then, using this data, compute a number, which is “capital omega.” So my guess is that it will have a difficult time determining a good closed form for this algebraic number alpha. And then it won’t have the minimal polynomial. If it doesn’t have the minimal polynomial, it has no Galois group to compute. So I think it’s going to get halfway through this problem. It might compute this number alpha to 5,000 or 10,000 decimal places. Good luck with that. But then what will it do? So that is the challenge.
Daniel Litt
And maybe you don’t want to say, but if the minimal polynomial doesn’t have too-huge coefficients, you could guess it just by computing the first 100 decimal places.
Ken Ono
So this will be a very interesting test. I didn’t tell you what the degree is. The degree isn’t super small, but it also is not astronomical. I think it would be good... I can easily make this much harder.
Richard Borcherds
There are standard algorithms in computational number theory that will find the minimal polynomial of a real number that’s suspected to be algebraic.
Ken Ono
Yeah. And this is not a real number.
Richard Borcherds
It will do equally well for complex numbers.
Ken Ono
That will be very interesting to me, to see if it knows how to do that. Of course, as a number theorist, I know about the algorithms that will guess a minimal polynomial. I know we’re not asking the computer to prove the value. And so that’s why I proposed this problem that way.
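(Editor’s note: the “standard algorithms” Richard refers to are integer-relation methods such as PSLQ and LLL; PARI/GP’s algdep and mpmath’s findpoly are common implementations. Below is a minimal sketch using mpmath on a toy algebraic number; Ken’s actual alpha is private, so the target, degree bound, and coefficient bound here are arbitrary choices for illustration. In practice one would try increasing degree bounds until a relation is found.)

```python
from mpmath import mp, sqrt, findpoly

# Integer-relation methods need many correct digits, so work at high precision.
mp.dps = 60

# Toy target whose minimal polynomial we pretend not to know:
# sqrt(2) + sqrt(3) is a root of x^4 - 10x^2 + 1.
alpha = sqrt(2) + sqrt(3)

# Search for an integer polynomial of degree at most 4 with alpha as a root.
coeffs = findpoly(alpha, n=4, maxcoeff=10**6)
print(coeffs)  # expect the coefficients of x^4 - 10x^2 + 1, up to overall sign
```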
Ege Erdil
By the way, here’s another interesting detail about the way the reasoning models are trained. We don’t know how OpenAI trains their models, but at least for DeepSeek, we do know, because they disclose the details. They did not use any process-based grading for math. They only used outcome-based grading. So that means the model just gets a reward signal based on: did you get the final answer right or not? The process doesn’t matter. The same is true for, say, coding, where it only gets a signal for whether it passed all the unit tests; it doesn’t get a signal for the process.
And that, I think, incentivizes the model to learn how to be good at guessing. And that’s, I think, just a very interesting feature of the way these models have been trained. I think we see artifacts of this when we look at the way the models are reasoning. So basically, I think what we should assume is that the models are just going to be extremely competent at guessing things and at having the right sort of intuition in a very cheap way. When the model can circumvent a difficulty by making an assumption and not proving it, you should just assume it is going to do that, because of the way it is trained.
Richard Borcherds
So that’s exactly what humans do, of course.
Ege Erdil
Yeah. That’s right. But I mean, this would probably not have been true for, say, DeepMind’s model that did AlphaProof, which did well on the IMO. That was trained on Lean. So that did have a process-based reward model because you can see if the proof is a valid proof or not. And that model probably in some ways is actually less capable. Like it’s less capable of having these kinds of intuitions and making these kinds of leaps of logic without having the right proof. But I mean, which one is better? I don’t know. But it’s just something to keep in mind when we are designing FrontierMath type questions: that models are being optimized to just do well on these questions, even aside from the benchmark, just because of the way they’re trained.
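(Editor’s note: a schematic sketch of the distinction Ege is drawing. Neither function is DeepSeek’s or OpenAI’s actual reward implementation; the point is only that an outcome-based reward looks at the final answer alone, while a process-based reward scores the intermediate steps, for example via a formal proof checker as in Lean-based training. The `step_checker` callable is a hypothetical stand-in for such a checker.)

```python
def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome-based grading: reward depends only on whether the final answer
    matches the reference. The reasoning steps are never scored."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

def process_reward(proof_steps: list[str], step_checker) -> float:
    """Process-based grading (schematic): each intermediate step is scored,
    e.g. by a proof checker, and the per-step scores are aggregated."""
    if not proof_steps:
        return 0.0
    checked = [1.0 if step_checker(step) else 0.0 for step in proof_steps]
    return sum(checked) / len(checked)
```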
Sergei Gukov
So from this point of view, it sounds like what Ken did is a great problem because it feels like a quest. So there are many questions, and you have to get previous stages right in order to even start considering the next stage. So it’s like several problems put together. So isn’t that good for avoiding naive guesswork?
Ege Erdil
I think it’s better than definitely having the problems be apart. But I guess what I’m trying to say is our experience has been that the models are often just unexpectedly good at guessing. Even before the reasoning models, when we were doing the pilot of FrontierMath, these were for much easier problems. Often we would see things like the model would simulate something in Python and come up with a decimal number, you know, like just five, six digits out there at some point. And it would be like, oh, this looks like the square root of three plus four over five or something like that. And it would just look at the number and it would recognize the closed form. It wouldn’t actually use a computer algebra package or anything. It would just know that.
And when you take that level of base model knowledge and you apply this very heavy reinforcement learning on just outcome-based reward signals, I think what ends up happening is just a model which is extremely competent at guessing things. So even if you have a lot of steps, if each step’s difficulty can be circumvented by clever guessing, I think probably when these problems are solved – for a lot of them, even for Tier 4 – we’re going to find out that there were just so many opportunities to take clever shortcuts and guess things that we didn’t consider. So I’m just saying we should be prepared for that. Maybe we should design the problems in a way that makes them less amenable to that. I don’t know how much you can do that.
If AI Solves Ken’s Problem, What Does That Tell Us? [57:54]
Elliot Glazer
To make it more operationalized: if it does get solved, but you don’t get to see the reasoning trace, so you just know it got the right answer, what conclusions are you willing to draw right now about this model?
Ken Ono
I’d be flabbergasted. I designed this particular problem with Michael because I want to see the trace, and I predict it will compute and guess alpha to many decimal places, and I will be very impressed if it can actually recognize what that number is. I totally grant that numbers like square root of three plus something, you know, even with pis in them, are guessable. And I think this is kind of at the edge of that. And then, even if you can guess that you have a minimal polynomial, will it be able to compute, maybe not its Galois group, but at least its splitting field, which is basically like computing its Galois group, and that is generally a hard problem? Not for this particular alpha, if you want a hint. But yeah, those are the little fiendish issues that I want to see an AI try to solve. I think it’ll get halfway through my problem straight away.
Richard Borcherds
One of the things AIs will do when solving problems like that is they will call on other packages. So it will probably just call Sage in order to solve all these number theory calculations.
Ken Ono
That’s fine. That’s fine.
Elliot Glazer
Fine in the sense that it will try that, but that still won’t trivialize it.
Ken Ono
It’ll get halfway through the problem. It’ll find L1 and L2.
Ege Erdil
I want to know the median date that you expect that this problem is going to get solved by AI.
Ken Ono
I don’t know many humans that could do this, so I honestly think it would be… and that’s the last comment I kind of want to make about this particular problem. If a computer solves this, I’ll be very impressed, because I don’t know many people that would know how to go about solving this. And this would be a test of how beneficial to mathematics the careful and intelligent guesswork a computer can do, and a human can’t, would be. And that’s what I would learn from this problem being solved. It would prove to me that, absolutely, a computer can be a remarkable assistant in pure mathematics. And I already believe that.
Elliot Glazer
I’m going to be risky here and say a median of… I feel like before this project began, I would have said a decade. And I feel like I have just been constantly burned by underestimating AI’s growth. So I feel very loath to say AI won’t be able to do this in less than three years. So I don’t mean to… what do you think, Ege? I haven’t even seen the full solution file. Ege, just from this conversation, what’s your guess?
Ken Ono
I won’t take it personally.
Ege Erdil
No, no, I think it’s longer than one year, and definitely less than ten years. It’s two or three years. Also, I think there’s an outside view where benchmarks generally get saturated in a couple of years. And I think the reason this seems to happen is that people only start to design benchmarks when the point at which they’re going to get solved is within view. They don’t start designing something like Tier 4 in 2010, when there’s no reason to design Tier 4; models are just way too bad at math for that to be relevant. So I think there’s just this natural effect where benchmarks only get designed right around the time when they’re going to get saturated. I think that’s probably an efficient way to go about benchmark design. But I just trust this outside view that benchmarks should get saturated in a couple of years in general. I mean, I think OpenAI said the same thing to us about FrontierMath. They were like, look, it’s just another benchmark, and benchmarks generally get saturated in one or two years, so you should just expect the same for this benchmark. And on the inside view, it seemed very different at the start, but it didn’t turn out that way. It wasn’t so different, just like any other benchmark. So that has convinced me to lean more on the outside view.
What Would Truly Test Mathematical Mastery in AI? [1:02:47]
Elliot Glazer
So okay, one last question: what would a problem have to be like to give you some confidence that it would last? I know it’s very hard with this format, because it’s so hard to imagine AI even five years out. But if a problem could last five years or even a decade, what features would allow it to do so?
Richard Borcherds
This focus on problems that you can calculate an answer to is a bit misguided, since most mathematics research is not actually about that at all. I’d be much more impressed if you could switch to having your model produce proofs that can be checked in Lean as an answer, rather than just calculating something which half the time it just seems to guess.
Ken Ono
Could that be Frontier Five, Elliot?
Elliot Glazer
We’re working on our follow-up project. Stay tuned for that. But for now, Tier 4 is probably going to be our hardest benchmark in this question-answer number format.
Ken Ono
Because there’s quite a bit of value in Richard’s question, right? The idea is that the argumentation… that’s what really builds the edifice of mathematics. If, for Tier 4, some version of ChatGPT that is enhanced in some way can get 30% of these problems right, we’ll be very impressed. But we probably all expect that there will be a fair amount of guesswork. And so I propose that for Frontier Five, let’s ask, as Richard said, for checkable proofs, stuff that will build the body of knowledge. Make it so that our job isn’t to be fiendish and try to trip up AI. Then we play an important role in understanding, or trying to get some clue as to, what’s going on underneath the hood, which is so important.
Your neighbor knows that ChatGPT can get lots of questions right, but then get some questions wrong that a kindergartner or a fifth grader can get right. And so I do think it’s very important that, for this kind of project, the role that pure mathematics plays in this benchmarking has the kind of impact that speaks to everyone. My next-door neighbor couldn’t care less whether a large language model can solve the problem that I just gave you. But they would like to know that this research is ongoing, and that it is improving this technology for mankind.
Elliot Glazer
Well, I’ll wrap it up here. Thank you all for your time. We will be posting this on our website, and we are continuing to recruit more mathematicians for Tier 4. We would like to have a problem on par with, or even exceeding, Ken’s sample problem from each of the major subjects of math in this project. So that is our main priority right now. But we are also absolutely brainstorming all the ways we can take next steps, such as a Lean project or a benchmark that catalogs interesting open problems. Keep a lookout for all those projects.
