Why frontier AI can't solve this professor's math problem - Greta Panova
USC mathematician and Putnam problem-writer Greta Panova wrote a math problem so difficult that today's most advanced AI models don't know where to begin, even if you give them the right steps and papers to read.
She thinks that when AI finally can solve it, it will have crossed a threshold in general human-level reasoning.
Transcript:
Writing an Unsolvable Math Problem for AI [00:00:00]
Daria
Could you introduce yourself and your background?
Greta
My name is Greta Panova. I’m a professor in mathematics at the University of Southern California. I do algebraic combinatorics mostly, but I also do research in connections with computational complexity theory, probability, statistical mechanics, and sometimes I work in molecular biology, which is a completely separate endeavor.
Daria
What brought you to the FrontierMath Symposium?
Greta
I’m a problem writer and I’m on the editorial board right now for the Putnam Mathematical Competition. I’m involved in creating problems. Before there was any AI, I’ve always been very curious about problems—what’s doable, what’s not, whether it’s for humans or AI. Of course I would join.
Daria
I’m wondering how you approached writing the problem for the symposium. How would it be similar or different to coming up with hard questions for students, like your work with the Putnam?
Greta
For the symposium, it took several iterations because Eliot wanted the problems not to be guessable. I realized that I could come up with something hard that basically only I know how to do. There are some things like that; it's just not a very mainstream area. Eventually I found a case which I could solve, but which was out of reach of anyone who hadn't heard the ideas it involves. Of course, we tested it with the current AI systems. Even if you tell it basically what to do, what papers to look at, it has no idea how to proceed.
Daria
Do you think a research mathematician or a graduate student would be able to do it? If so, how long would it take them?
Greta
It is possible, but they would need to get all the hints somehow. If they’re given the problem from scratch, it’s not a canonical thing. It involves several steps, several non-trivial steps. There are very few people who would know how to approach it. That’s something I was doing when I was doing my PhD. It took me quite some time to figure that out. It took me some more time to figure out the current problem.
How Hard Is It Compared to the Putnam? [00:03:08]
Daria
Would you say it’s much harder than the hardest problems on the hardest tests, that being the Putnam?
Greta
Oh, yeah. Yes. Because it also involves theoretical constructions that are graduate material and all these other things. It’s not an elementary problem.
Daria
How long do you expect AIs to struggle with it? How long will they take to solve it?
Greta
Initially I had a forecast of about one to two years. Now I'm not so sure. I've spent a lot more time with the ChatGPT Pro version, testing problems for Putnam, and it's not so good at extrapolating. It's not that clear how well and in what direction it will develop. To be honest, it might take longer than two years. Of course, it depends on how the community, and OpenAI and all these other companies involved in training models, will continue to pursue this and how serious they will be about theoretical math.
Testing ChatGPT and Frontier Models on Math [00:04:25]
Daria
How good are the current frontier models you’ve been interacting with at math questions?
Greta
Well, these models have some abilities that humans don't. They have a huge database. They can search very quickly. They can mix and match. They can compute, they can run Python code. They can do all sorts of things that for us would take some time. At the same time, right now, what they seem to be doing is piecing together different arguments from the literature if they are more mainstream. If something appears in several textbooks and exercises and things like that, the model will know how to apply it. But something that is very special and maybe appears only in one paper or is pretty obscure, it wouldn't even know to look at it. Also, the logic is not properly built in yet, so it makes logical errors. It can try to prove an inequality and at some point arbitrarily reverse the arrow, the direction. It's just piecing together things that seem to look right. That's probably how it was trained, too. Sometimes it is right.
Daria
I think for many models search is not available. But then o3 is able to just Google things. Have you seen a difference between those? I'm guessing the search models would be better at finding some very specific results.
Greta
One thing that ChatGPT tends to do is give fake references. For references I used Copilot more often—that one can actually find papers and summarize information correctly, at least to some superficial extent. But then other features are missing. One should take it with a grain of salt if it’s searching the web. There was this joke among programmers: “I told you to take it from Stack Overflow.” “Well, yes I did.” “You should have taken it from the answers, not from the question.” Sometimes somebody posts some code saying, where is the error in this one? And it may not realize that it’s getting the wrong thing.
Why AI Still Fails at Logic [00:07:00]
Daria
If the AI or one of the reasoning models was your student, is there any advice you could give it?
Greta
The reasoning models are quite different from a student. The thing that is missing is really some connection with the logic base, a proof verification system so that it makes sure that all the arguments are correct. Maybe not undergraduate students, but graduate students certainly know how to write a proof, and they will notice immediately when there is a gap. Right now, the AI models don’t see the gaps. They can invent some steps. It’s still pretty rudimentary.
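For a concrete sense of what such a proof verification system buys, here is a minimal Lean 4 sketch (an illustrative example, not one from the interview): the kernel accepts only arguments that fully check, loudly flags any gap marked `sorry`, and rejects a reversed inequality outright.

```lean
-- The true direction of the inequality checks against the kernel:
example (n : Nat) : n ≤ n + 1 := Nat.le_succ n

-- A gap can be papered over with `sorry`, but Lean flags it loudly
-- instead of letting it pass as a finished proof:
example (a b : Nat) : a + b = b + a := by sorry

-- The reversed arrow is simply rejected as a type error:
-- example (n : Nat) : n + 1 ≤ n := Nat.le_succ n  -- does not compile
```

A model forced to route its arguments through a checker like this could not silently reverse an arrow or skip a step, which is precisely the failure mode described above.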
Using AI in Mathematical Research [00:08:03]
Daria
Do you use AI models in your research or teaching at all for math?
Greta
I try. Unfortunately, the problems I cannot solve, they also cannot solve so far. It hasn't been very helpful in that respect, but it is helpful in generating code and testing hypotheses. It can even run that code itself: let's say, find me some examples, compute the first values, things like that. It can also, if you give it a paper, at least understand some of it and summarize it. If there is an algorithm described in the paper, it may actually try to apply it and run it and get you some output from that algorithm. It's not clear how correct it is. Of course, if the algorithm is not described very well, it will not take this into account and will give you errors. But things like that, more mechanical things, it can do. This is really helpful. But one should be very careful with what it's giving you, because it sounds very convincing. It gives you something that looks like an answer, and somewhere in the middle of it, something might be switched and it makes the whole thing completely wrong.
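The find-examples-and-compute-first-values workflow she describes can be as small as the following sketch (a hypothetical Python illustration, with the Catalan numbers standing in for whatever sequence is actually under study): check a conjectured closed form against an independent brute-force recurrence on small cases.

```python
from math import comb

def catalan_closed_form(n: int) -> int:
    """Conjectured closed form: C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def catalan_recursive(n: int) -> int:
    """Independent check: C_0 = 1, C_{m+1} = sum_{i<=m} C_i * C_{m-i}."""
    values = [1]
    for m in range(n):
        values.append(sum(values[i] * values[m - i] for i in range(m + 1)))
    return values[n]

# If the first values agree, the hypothesis survives; one mismatch refutes it.
for n in range(10):
    assert catalan_closed_form(n) == catalan_recursive(n)
print([catalan_closed_form(n) for n in range(10)])  # [1, 1, 2, 5, 14, ...]
```

The point of computing the two sides independently is that an error in either the formula or the recurrence shows up as a mismatch on some small case, which is exactly the cheap falsification step she has the model do for her.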
Daria
Looking into the future, do you expect the role of a mathematician to shift? What will it be? Pretty much the same?
Greta
It really depends on how complicated a task you give it and how advanced the mathematics you want it to do. I'm not yet sure how we're going to build in the error checking and the logic. I can't give predictions, but it's developing like humans in some sense. I do expect it to be able to do what humans can, with more computing power than we have.
Daria
Is there a specific time when an AI model surprised you? Something related to math, whether it was great or disappointing?
Greta
The first time I started using AI was a few months ago. I wrote up a proof of something that we call a combinatorial interpretation: the coefficients of some sequence have a positive combinatorial formula in the sense that they are counting some things. What came out of this proof was something horrendous. It was counting some things, but it's like some trees with some labels on the leaves and then some kind of global condition on all the leaves. It's just not your friend. I sent it to some people who were looking at these things and they're like, what is this thing? I don't understand anything. Can you give us an example? I was myself really frustrated with the whole thing. I couldn't; I didn't want to spend too much time on this. Just for the sake of having an example, I gave the paper that I wrote to ChatGPT, and I asked, can you run the algorithm from this paper? Compute the example for whatever values. The first time it ran it, it gave me something wrong. Then I realized I forgot to specify something in the writeup, but once I wrote it in, it actually gave something completely correct. That part surprised me positively in some sense.
Other parts surprised me, maybe negatively. It's not really a surprise, but you ask a question and it gives you something completely wrong. It's like a really unprepared but ambitious student who comes and says, I want an A. Look here, I did this. It's not quite right, but can I have some more points? I wish it weren't doing this. Maybe it shouldn't be acting like a human that much. That's not the value of these models.
But the third thing is that it does seem sometimes surprisingly good. Sometimes it gives the correct answer with the wrong reason, but I have no idea where the answer came from. It's almost like Ramanujan, who was coming up with identities for huge infinite sums over partitions. It wasn't even clear how he came up with them or how they were derived.
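A classic instance of the kind of identity she means (a standard textbook example, not one discussed in the interview) is the first Rogers-Ramanujan identity, an infinite sum over partition-counting terms equal to an infinite product:

$$\sum_{n=0}^{\infty}\frac{q^{n^2}}{(1-q)(1-q^2)\cdots(1-q^n)} \;=\; \prod_{n=0}^{\infty}\frac{1}{(1-q^{5n+1})(1-q^{5n+4})}$$

The left side counts partitions whose parts differ by at least two; the right side counts partitions into parts congruent to 1 or 4 mod 5. Nothing in the formulas hints at why the two counts should match, which is the Ramanujan-style opacity Greta is pointing at.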
How AI “Searches” Like Humans [00:13:52]
Daria
Do you think it's possibly quite good at the sort of creative or intuitive kind of thinking? This surprises me, because what I've heard a lot so far is that it can find an example or write some code, do something concrete; it has access to a bunch of data, so it can pull from that. But then you're mentioning intuitive creative leaps, which to me feel like a very different flavor of mathematical thinking.
Greta
That's true. But in some sense, this is the part of our own thinking process, the reasoning process, that we don't quite understand: when we solve a problem that is not some straightforward algorithm we need to apply, and we have some creative idea. We identify it as creative, but in reality, in the back of our minds, we are probably also searching the space of various mathematical objects and identities or whatever, and somehow matching what we have from two different fields. That's probably the most creative part of mathematics: seeing some object and realizing it can correspond to some other object where you already have some tools available, and applying those. It seems that these AI systems are also doing something like that. That's how they become creative. It's not that mysterious. It's just that we don't see the search process that led to it.
What Happens If AI Surpasses Mathematicians [00:15:41]
Daria
Given that AI already seems to be doing both the creative thinking in math and the gathering of data and information, is it conceivable that it will be as capable as a current research mathematician in some amount of time? You can tell me how long you expect that to be. And in that case, what would happen to mathematicians and how would their role shift?
Greta
In the early stages, the AI is going to get better and better and it will help us. Probably it will do proofs of simple statements, maybe do various calculations, give us examples to lead the way. All these little things that we spend time on when we do research will be easier and faster. In the short run it's probably going to be good, with one little caveat: this will also lead people to produce more papers with questionable content. That will flood the space, and the journals, and everything. I'm not sure that will be so good. The human aspect in the whole thing is that we still have a big picture in mind that the AI doesn't seem to have. The human part in the process is going to be the selection, the editorial part, where we figure out what's valuable, what's not, what direction to go in, which problems to solve, things like that. But in the long run, it is conceivable that AI will become better than any human at anything.
I have a claim that if the AI actually can do math on the level of a math professor, then it will be able to do anything else that is just intellectual work, anything that doesn't require some kind of dexterity or manual work. Then the whole of humanity has a problem in some sense, because then suddenly, all jobs will disappear. The hope is that others will be created, but I'm not that positive. It's not going to be all good. Of course, there are alignment problems: whether the AI will become autonomous at some point and start doing things against our own good. There are other problems, like what is actually good. It might be good for one person, maybe bad for another. People are excited in the short term. But if you start extrapolating, you will soon realize that a few years from now, who knows? Maybe we can't really imagine. It's not going to be like in the movie Terminator. It's going to be something much more benign looking.
The Risk of “Math Collapse” If AI Solves Open Problems [00:20:00]
Daria
Is there one thing that worries you most?
Greta
For mathematics specifically, the most concerning part is that if at some point AI starts to solve actual open problems, maybe the Riemann hypothesis or some other Millennium Prize problem, without serious human intervention, that would lead to a collapse of the whole field somehow.
Daria
On a somewhat brighter note, is there any advice you would give to either the math community or new math students in relation to AI?
Greta
For the math community, I think we should all get involved in what's going on in this benchmarking process, in steering the AI in the right direction to aid us rather than distort the field. For grad students: even though there may be an AI system which can solve all the simple problems in the textbook, we should still be able to do this ourselves, because at some point that AI system may not be there. The less we understand how things work around us, the more vulnerable we are. We should not just rely on the AI to do things for us. Even knowing how to compute an integral matters, even though Mathematica could already do most integrals 20 years ago. We should still know how to do these things.
We should all try to understand how these large language models actually work, what makes them go in certain directions. This is both for AI safety and for general alignment and usefulness. This is still going to require mathematics. We should continue to learn mathematics. It's just that maybe the end game is not going to be what it used to be. Maybe we will have to solve different types of problems.
Why Benchmarks Like FrontierMath Matter [00:22:12]
Daria
You mentioned the importance of being involved, thinking about these tools and also the benchmarking process. Coming back to FrontierMath, what do you see as the value of this effort? Why do we want to come up with these hard problems? In general, how do you see that?
Greta
In all areas, we like to have benchmarks to test how capable the tools we develop are. Even in mathematics, in pure math research, you're trying to find some bounds. Can you do one step? You have one bound and you have to do another step, another bound. All these steps are also benchmarks in some sense. You're trying to improve something. You're trying to beat it. It is important to know what the AI systems can and cannot do, whether in math or in any other field. And it is important to interpret the results of what they can do correctly. In that sense, mathematicians should be the ones to decide whether AI can do math. They should be providing the problems and the ways to assess the answers to those problems.
Right now with FrontierMath, because there are not so many mathematicians involved in the assessment process, the benchmarks are made so that the answer is some huge complicated number that cannot be guessed, nor just computed with brute force. When it gets to that number, you know it did something right. Unfortunately, that doesn't necessarily mean that it has actually solved the problem correctly, because in many fields the hard part is to prove something, even though it may look obvious or intuitively clear. Finding the right arguments to address it and derive it is really the hard part, for example in analysis. We should be more involved in the assessment part, because we don't want to get to the point where, let's say, the next day a model gets the right numerical answers to all these FrontierMath questions that we submitted, and then somebody higher up says: oh, well, then math is done. AI can do math. Just stop all funding for math. Stop doing math. Go do something else. That would be horrible.
Daria
Do you expect the problems from the symposium as a whole to stand up against AIs for a long time? How big is the fear that AI will just solve everything pretty soon?
Greta
I don't know all the problems, and I don't know about other fields. It's a complicated question, because sometimes somebody may think a problem is hard because it requires knowing many theories and interpreting something in various ways, but then the AI does some pattern matching immediately, puts it together, and solves it. There may be problems like that; those would be easy to solve soon. There are also problems which require some computation that a human would not be able to do by hand, but if you actually plug it in, the answer comes out immediately. That is hard for a human, but not for a computer.
At least some of those problems, at least a few that I've seen, should survive for a while, just because they require many non-trivial steps that are not available in the literature. In some sense, if only a few people know how to do this, it's going to take a while for the AI to start learning from that. In fact, that's one aspect in which humans are still better: humans can learn from just one example. We can extrapolate from very little data, while the way the models are trained, they require a lot of data to learn the same thing. They're not so efficient. Somehow humans seem to be more efficient, and we don't exactly know how our brains work to say why. Problems that require some very specific knowledge and ideas that are not readily available in the literature will take a while. Whether it is one year or two years, I don't know. Or five.
What Would Truly Surprise Greta in AI Math [00:27:49]
Daria
What will it take for an AI to surprise you in the future? Because you’re quite familiar with what they’re capable of now. What would need to happen for you to go, wow, that’s actual math right there?
Greta
Well, if I see an argument that was not a modification of something already there, something that we would interpret as honestly creative, something that is not available in the literature, something completely on its own. Like when AlphaGo was training for Go: it was learning from humans and human moves, and it was already extraordinary. Then at some point it just played a move that nobody would ever make. It was a new move and it started winning. We haven't seen anything like that in math yet. But if it does something like that, that would be scary and surprising.
Daria
A bit more broadly, to what degree do you expect AIs to reshape math?
Greta
It may well be that certain fields even die. I'm not exactly sure, but certain fields may disappear and it will create other fields. It's not a lose-lose situation. It will change the problems that we care about or that we think are significant. Just as a simple example, maybe 30 or 40 years ago, people were publishing papers with binomial identities. Eventually the technology developed; you can even use some software to try and guess these identities. This is not considered serious math anymore, for the most part. The same thing will happen when AI comes.
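The mechanical checking she alludes to takes only a few lines of code. Here is a hypothetical Python sketch that verifies one classical binomial identity, Vandermonde's convolution, on a grid of small cases (the guessing tools she mentions, in the tradition of Gosper's and Zeilberger's algorithms, go much further and produce proofs automatically):

```python
from itertools import product
from math import comb

# Vandermonde's convolution: sum_k C(m, k) * C(n, r - k) == C(m + n, r).
# math.comb(a, b) returns 0 when b > a, so out-of-range terms vanish.
for m, n, r in product(range(8), repeat=3):
    lhs = sum(comb(m, k) * comb(n, r - k) for k in range(r + 1))
    assert lhs == comb(m + n, r), (m, n, r)
print("Vandermonde's identity verified on all small cases tested.")
```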
Daria
On a scale from zero to ten, where zero is the level of a pocket calculator and ten is mathematicians going obsolete, how much do you expect AI to reshape math?
Greta
I would say eight.
Daria
A similar, related question: how much will AI reshape the world, where zero is again a pocket calculator and ten is at least as much as the Industrial Revolution?
Greta
Maybe nine.
How Fast Should AI Progress? [00:30:56]
Daria
If you could choose whether to stop AI progress—so that would be a zero—or accelerate as fast as possible, where would you land?
Greta
That one is hard to answer, because I don't think any of us has the power to actually stop it. There are so many companies, and the technology is available. There's just no way you can stop people from developing large language models further, even though the European Union is trying to do this right now. It's just impossible. But I wouldn't want to accelerate it. Maybe the answer would be four or five, just because it's something that's going to develop no matter what. We should at least have some control over how it develops. We should try to exert control. If it develops too fast, then we will lose control, because humans themselves don't develop so fast; we cannot keep up.
Daria
If you had to put a number, maybe in years or months, how long will it be until any AI system solves your problem? Do you have a rough estimate on that?
Greta
I would say a year and a half to two years. It depends on how much effort is put into Lean. There is some work required from mathematicians in the process. If you just keep training large language models, I'm not sure it will succeed. It will need something else.
How Mathematicians Can Help Train AI [00:33:00]
Daria
Do you think there is a way a model could be made better at math with the help of mathematicians? Is there some special fine-tuning or something that could be done to make it better at that specific kind of thinking?
Greta
Yeah, definitely. From what I understand, if it's not trained by mathematicians, it's just not going to be that much better. It needs mathematicians to figure out what to feed it, how to evaluate the answers, in what directions to steer the model, and things like that. It cannot just randomly learn math; I don't think it's that good. Not just from MathOverflow, from the questions there.
Correcting Misconceptions [00:34:09]
Greta
We are doing this some time after the symposium. There was a Scientific American article and also a Financial Times article that came out about the FrontierMath Symposium. The Scientific American article somehow took a quote from Ken Ono saying that ChatGPT is as good as a graduate student, as a PhD, as a good PhD student, or something to that effect. First of all, this is wrong. Of course, Ken Ono was really amazed that it solved one of the problems that he created. But overall, most of our opinions are that it's not nearly as good. Of course, it depends on the graduate student; certainly the graduate students are much better than that. In general, we should be careful with statements like this. AI is not nearly close to solving mathematics, the hard parts of mathematics. Nobody should be making such insinuations.
Many mathematicians are excited about the possibilities. But if you extrapolate far ahead, things become dangerous in some sense. As a community, we should think about this, and we should try to be involved in the process of AI development and see where things go.