AI + Math Chat #2 - Formalizing Proofs & Breaking Codes
“Can LLMs go beyond pattern-matching to produce fully formalized mathematical proofs and novel cryptographic attacks?” with Kevin Buzzard, Alex Kontorovich, Kristin Lauter, and Yang-Hui He.
Note: we have updated our previous video with a transcript!
AI + Math Chat: “Can large language models go beyond pattern-matching to produce fully formalized mathematical proofs and novel cryptographic attacks?” with mathematicians Kevin Buzzard and Alex Kontorovich, mathematician and cryptographer Kristin Lauter, and mathematical physicist Yang-Hui He (originally uploaded Aug 18, 2025).
In this second installment of AI + Math Chat, Epoch AI’s lead mathematician gathers four experts at the intersection of formal mathematics, cryptography, and AI to explore a compelling challenge: can large language models go beyond pattern-matching to produce fully formalized mathematical proofs and novel cryptographic attacks?
Elliot Glazer from Epoch welcomes Kevin Buzzard, Yang-Hui He, Alex Kontorovich, and Kristin Lauter to discuss the potential and obstructions for AI formalizing mathematics in Lean, and how LLMs learning the rules of arithmetic through pattern-matching might pave the way to discovering novel attacks in cryptography.
Originally recorded in March 2025.
Transcript
Introduction [00:00:00]
Elliot Glazer
Hi, everyone. It’s Elliot Glazer from Epoch AI. Back for our second AI and mathematics chat, this time with another set of four distinguished guests. As before, I’ll introduce you guys in alphabetical order, because that’s how we do it here in mathematics. That means I’ll be beginning with Kevin Buzzard. You’re lucky I’m not introducing you in order of your blog names, because then Kevin would be last.
Kevin Buzzard intro [00:01:13]
Elliot Glazer
Yeah. Kevin is a professor of math at University of London, and, well, I guess I don’t really have a better way to describe you, but “the Lean guy.” You’ve been the ambassador for the Lean project to the math community, and are currently working on formalizing the proof of Fermat’s Last Theorem, if I’m not mistaken.
Kevin Buzzard
Yeah, and I’m at Imperial College London.
Elliot Glazer
Oh, whoops. And dumb question, but do you have an official affiliation with the Lean project or are you just really, really enthusiastic about it?
Kevin Buzzard
No, no, no. Yeah. I mean, it’s a free and open source project. I’m a maintainer of its mathematics library now, as of a few months ago.
Elliot Glazer
Oh, congratulations. You finally got promoted.
Kevin Buzzard
That’s right. Yeah.
Elliot Glazer
I actually met him at a Lean meetup where I was complaining to him about the unnecessary use of choice and inaccessibles in number theory proofs that come from Lean formalization. But that’s just the kind of thing I care about because I’m a meta-mathematician.
Kevin Buzzard
Yeah, we don’t care so much about constructivism in Lean’s maths library.
Elliot Glazer
Well, that’s fine for our purposes, since your main goal right now is just to get it more accessible to the math world at large. Cool.
Yang-Hui He intro [00:01:52]
Elliot Glazer
Next up we’ve got Yang-Hui He, a mathematical physicist at the London Institute. And it looks like you work on a wide array of topics: quantum field theory, string theory, algebraic geometry. And it sounds like you’re getting more involved with applying machine learning to these topics. How did you get into that?
Yang-Hui He
Yeah. That’s right. So my background is in the interface between algebraic geometry and string theory. But I happen to have a very large database of Calabi-Yau manifolds at my disposal. So back in 2017, I started playing around with them from the point of view of machine learning. I wrote one of the first papers on this – on what I guess has now become AI-assisted mathematics. And since then, I’ve just made a lot of friends in every field of mathematics I can understand, from arithmetic geometry to representation theory, in terms of getting the data to try to raise conjectures and formulate problems.
Elliot Glazer
Yeah. Well, so this Calabi-Yau manifold database, how is this actually structured?
Yang-Hui He
So I mean, this Calabi-Yau landscape came out of interactions between string theory and algebraic geometry. And in the years between the late ’80s and the early 2000s, there was a sort of group effort in the community just to list as many of them as you can, in terms of constructions. They’re publicly available. So it was natural that sooner or later someone would apply some kind of machine learning technique to this. And, you know, that was a familiar database to me at the time.
Elliot Glazer
It seemed like a natural first step. But I’m definitely excited to see how machine learning will be applicable to the field as a whole.
Yang-Hui He
Yeah, exactly. I mean, at the time, I was just trying to play around with predicting topological invariants like Hodge numbers and the Euler characteristic just by looking – literally looking – at these varieties, because they’re defined by multi-degree polynomials in some kind of embedding. And you can just look at the degrees as a picture and try to predict. I was expecting to get complete nonsense, but I was completely surprised that, you know, even a not-too-deep neural network, without any knowledge of algebraic geometry, can predict with very high precision what these things are. So that got me really, really excited. So I’ve devoted the last 7 or 8 years to just making friends, trying to ask: what kind of mathematical data is amenable to this kind of analysis? And a lot of it really is.
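[Illustration: a minimal Python sketch of the kind of experiment described here – a shallow network predicting a topological invariant from a flattened degree matrix. The data below is a random placeholder standing in for a real database such as the CICY list; only the shape of the pipeline is the point.]

```python
# Placeholder experiment: predict an integer invariant from a degree matrix.
# Random data stands in for the real Calabi-Yau database, and the target is
# a trivial stand-in for a Hodge number, so the numbers here mean nothing.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(1000, 12 * 15)).astype(float)  # flattened "configuration matrices"
y = X.sum(axis=1)                                           # stand-in invariant

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A "not-too-deep" network, as in the anecdote: two hidden layers.
net = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("R^2 on held-out matrices:", net.score(X_test, y_test))
```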
Elliot Glazer
Well, it’s exciting to see mathematicians using all the tools at their disposal to engage in research.
Alex Kontorovich intro [00:04:55]
Elliot Glazer
Alright, next up, Alex Kontorovich, distinguished professor of math at Rutgers, specializing in analytic number theory and harmonic analysis. He’s been getting more involved with both the Lean community and with using online media like Twitter to better connect the math community with the rest of the world. And he’s got a very distinguished resume, not least of which is that he was my complex analysis professor back when I was an undergrad. To be clear, his admission into this call was completely meritocratic; it was not out of personal bias or favor towards him.
How Alex and Kevin got into Lean [00:05:45]
Alex Kontorovich
I remember that you were an undergrad at Rutgers. I didn’t remember actually teaching you. So that’s good to know.
Elliot Glazer
I think I’ve done more to distinguish myself since then. But yes, back then, I might have just been undergrad number 25. But yeah, how did you get into Lean?
Alex Kontorovich
I heard Kevin’s talks.
Kevin Buzzard
Yay!
Alex Kontorovich
I’ve always enjoyed playing with computers. I use Mathematica a lot just in my research. Anytime I have some complicated calculation—it’s harder in analytic number theory because… I think Dick Gross told me, there are two kinds of number theorists: number theorists who use an equal sign and number theorists who use a less than sign. And I’m one that uses a less than sign.
Whenever I have an equals, there’s a Big O at the other side of that equals. But on the rare occasion that I have an actual exact identity – you know, especially if you’re chaining a whole lot of exact identities in some 100-page paper that has a thousand constants, which all have to perfectly line up at the end and coalesce – it’s very hard to keep all of that in one’s RAM at once while working on it. I found I needed to get my kids off to bed, get my wife off to bed, and then just go to my office at 10 p.m. From then until 5 a.m., I would have seven hours uninterrupted. There’s no interesting emails coming in, no phone calls, no graduate students knocking on my door. And the first 2 to 3 hours of those seven is just re-uploading into my RAM the details of these hundred parameters, which one needs to be estimated by what, and how to get this thing all worked out.
And so there’s some small amount of that that you can do in something like Mathematica. You know, even if you have identities, you can just stick in values for some of the parameters and make sure that you have the identities exactly right (you’re not off by a minus sign, which can sometimes completely kill an argument). And then I saw Kevin’s lectures and I said, “oh, this is exactly the tool that I need in order to write 100-page analytic number theory papers, because it’ll keep track of it for me.” And all I have to do is make my local steps and not worry about, ‘oh, wait… I made some decision a hundred lines ago or a thousand lines ago, and that was the wrong decision, I had to change it – and how will that affect the next thousand lines?’ Just keeping track of these things is an absolute nightmare. And it’s actually a miracle that we managed to write any papers that are big, long, and complicated before this technology.
Kevin Buzzard
Yeah, because all our papers are correct, right? There’s no errors in them when we’ve actually finished writing them. Right.
Elliot Glazer
You are morally correct. Who’s going to read them line by line?
Alex Kontorovich
You know, it’s interesting, because Kevin really was pointing to the accuracy, the correctness of the mathematics. And I really was not concerned about the accuracy, because – well, that’s the beauty of graduate students, right? You write some 100-page paper and you put it out there, a referee reads it, and a journal publishes it. And then ten years later, a graduate student comes along and wants to read this paper. And so they look at it, and that’s maybe the first time anyone other than your collaborator has – although depending on your collaborator, maybe your collaborator doesn’t read what you’ve written and just says, “yeah, you know, I know how it’s supposed to go, you wrote it up, that sounds good to me.” But a graduate student is where, you know, things get really sharpened. “You wrote this lemma, but this lemma is not true.” “Yes, sure, that’s not what I meant there; here’s how you really do it.” That’s where mathematics really lives currently – but it doesn’t have to.
But again, for me it was not… I found Kevin’s main point was: let’s get mathematics actually correct – really, really correct. And I wasn’t so concerned about that, because of this feedback loop with younger people coming along and checking things. Year after year, as they’re reading different papers or whatever, anything that’s important gets checked over enough times by new people coming through – or it should, as long as our culture continues to move along in a healthy way. And throughout the history of mathematics, there have been lots of serious mistakes in both directions: things that were thought to be correct that later turned out not to be, and vice versa. So sure, in the age of everyone doing everything in Lean, that will be gone. But I was in it for a completely different reason, which was just that I saw it as a better mechanism by which to write papers – if and when it becomes a better mechanism by which to write papers. That’s my dream.
Kevin Buzzard
So I was having some kind of midlife crisis five years ago. I was very worried about accuracy in mathematics. And somehow people have convinced me that this is not something worth worrying about, really, now. So I’m not really in it for that reason any more… Yeah. But at that time, I was very concerned. Now I’m worried about other things. I mean, Fermat’s Last Theorem is fine, right? I’m not teaching a computer Fermat’s Last Theorem because I’m worried about it. It’s fine, you know – I’m doing it for different reasons.
Elliot Glazer
Yeah. Now I’m starting to appreciate why in your complex analysis class, you were insistent that we write our proofs with Big O notation, that you didn’t want to see 100 problem sets, each with really complicated ad hoc equations, when that doesn’t really convey the essential aspects of an analytic argument.
Alex Kontorovich
That’s the beauty of abstraction. Mathematics is really all about seeing larger structures at a scale that our human RAM can hold. And so seeing a thousand constants, and knowing that this constant is less than that constant, and that constant less than that – that’s just too much. I can’t hold that, but I can hold Big O. And I know how Big O percolates through the computation. And the beauty of where I think Lean is going is that it’s going to be massively better at that – at hiding, at suppressing all the technical details that we don’t need to know, so we can really just sweat the big stuff.
Elliot Glazer
But it seems to be the case that Lean is still disproportionately used by algebraically-focused researchers rather than analytic ones.
Alex Kontorovich
That’s a function of how good Kevin is at getting people in versus how good I am at getting people in.
Elliot Glazer
All right, Kevin’s winning right now. But we’ll see if Alex is able to take it back.
Alex Kontorovich
I don’t need to take it back. I just need to do my part in getting people in.
Yang-Hui He
I’m very happy to be hosting Kevin next Friday.
Kevin Buzzard
You are?
Yang-Hui He
Yeah, exactly. Yeah. Next Friday, Kevin. And we have sort of a London Lean community coming in. So it’ll be—we’ll continue our conversation over that.
Kevin Buzzard
And you will see that there are people doing all sorts of things. There’ll be combinatorics and topology. There’ll be toric varieties. There’ll be everything.
Yang-Hui He
Awesome.
Elliot Glazer
Yeah. This really is a small world.
Alex Kontorovich
Sorry, just one last point. I think there’s actually some technical sense in which it’s harder to prove theorems with less-than signs than with equals signs. Because with equals signs, there’s only one thing… if it’s true, there’s only one thing that can go on the other side. Whereas with the less-than sign, there’s so much flexibility about what the argument means and what there can be. And what that means is you have to work. When you’re doing something like commutative algebra, something very abstract, all you have is the definitions – but the definitions were made to prove the theorems. Whereas if you’re doing something where the definitions are really, really close to the axioms, and the definitions were not designed to help you prove the theorems, you actually have to slog through the various steps. I think there actually is a technical sense in which analysis is harder to formalize right now, until we get enough tactics that do the messy stuff for us. I think… the more abstract something is, the more amenable it is to formalization. So there’s both the community and the culture – how many people have gotten involved, how successful it’s been, and how grateful we all are to Kevin for doing the proselytizing that he has. So there’s getting people in, but then there’s, I think, also a technical component to it.
Elliot Glazer
This has also been applicable to FrontierMath, where analytic number theory is much less represented than algebraic number theory, just because the whole inequality thing makes it hard for our rather specific, concrete, closed-form answer format. So that has also been a technical barrier that we’ve been doing our best to overcome. Actually, I’ll be meeting with Alex tomorrow to discuss one of his ideas for a problem, so I’m curious to see what your approach is.
Kristin Lauter intro [00:14:09]
Elliot Glazer
But finally, it’s time to introduce our last guest, Kristin Lauter, who is an expert in algebraic geometry and elliptic curve cryptography. And your career is quite unique among those that I’ve interviewed so far, since you’ve done a lot of your research in industry – including at Microsoft Research and Facebook’s FAIR. Tell me a little bit more about your professional journey.
Kristin Lauter
Oh, sure. Thanks for inviting me. So I was actually in academia: I was an assistant professor at the University of Michigan in the 1990s. And then I’ve spent most of my career in research in industry – 22 years at Microsoft Research, and then the last four years at FAIR, which is Fundamental AI Research at what used to be called Facebook and is now called Meta. So the amazing thing for me has been being able to do mathematics in industry, at Microsoft Research all of those years and now at FAIR… I’ve been able to collaborate across disciplines with people that I would probably not otherwise have had the opportunity to collaborate with. And so that’s been really exciting. So when I started – I got into AI around 2011 at Microsoft Research, when I was working on privacy technologies. So I kind of ramped up the field of private AI and released SEAL, which is the Simple Encrypted Arithmetic Library, for doing AI on encrypted data. So that was the whole private AI area that I was very passionate about – especially passionate about, you know, protecting data such as health data and genomic data, that kind of thing.
But then I decided to kind of pivot away from that, just to do something new – and partly also because people don’t seem to care as much about their own privacy as I did as a privacy researcher. But anyway, when I went to Meta (to Facebook, at the time), it was specifically to work on what I now call AI for crypto. So I had these ideas, and I really needed collaborators and compute and all the stuff that it requires to do these big, huge projects. And I wanted to build AI tools to attack the new kind of post-quantum cryptosystems – in particular the lattice systems, Learning with Errors, which has recently been standardized by NIST as one of the main public key cryptosystems to protect our data for the next 25 years. So I think it’s super important to understand how well AI can do on these problems. And I’ve been working on that for the last four years; I have a team of really great collaborators at FAIR.
And so I do think that, within AI for math, it’s a little bit different in flavor from Lean and formalizing – you know, formal verification, formal mathematics – all of which I think is very, very important. But I think AI for crypto is a different area, where you can see what happens when you take hard computational problems that humans cannot do (I mean, that we can’t do in our lifetime) and try to see if AI can do better – or if we can at least get insights from training AI models to work on them. And so both of those are nice directions and threads that I’ve been very much caught up with in the last four years.
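[Illustration: a toy sketch of the Learning with Errors problem mentioned above. The parameters are far too small to be secure; they only show the structure of the samples an attacker sees.]

```python
# Learning with Errors, toy scale: the attacker sees (A, b) with
# b = A·s + e (mod q) and must recover the secret s. Without the small
# error e this is just linear algebra mod q; the noise is what makes it hard.
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 8, 16, 251                      # secret dimension, samples, modulus

s = rng.integers(0, q, size=n)            # secret vector
A = rng.integers(0, q, size=(m, n))       # public random matrix
e = rng.integers(-2, 3, size=m)           # small error terms
b = (A @ s + e) % q                       # noisy inner products

print("public A (first rows):", A[:2])
print("public b (first entries):", b[:2])
```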
Implications of AI finding a cryptographic attack [00:17:31]
Elliot Glazer
What would be your reaction if an AI system you were studying does actually manage to break through a previously thought-to-be-secure cryptosystem? Would you be more excited at the mathematical breakthrough of it? Or would you be very concerned due to the fact that this would seemingly undermine much of the protection of our current data?
Kristin Lauter
Well, no, actually, I’d be extremely, extremely excited about that possibility. But you have to remember the way the crypto community works worldwide – and I’m talking about academic researchers, industry researchers, and government researchers in all governments around the world. So it’s really an international thing, not just an American thing or a Canadian thing or something. What the community does is try to profile all known approaches to attacking existing math problems, and then you try to scale them up. And sometimes there’s distributed… for RSA there were these distributed efforts around the world, with labs contributing compute for sieving and things like that, then gathering it in one central place and doing a big linear algebra stage on a supercomputer. So all of these types of efforts, international efforts to understand the security, get profiled. And then we try to predict how strong they’ll be going forward. And so the thing that is disruptive, or potentially dangerous, is if a new technique comes in that’s not anticipated and has not been profiled…
So quantum is predicted to be that disruptive technology, right? So when quantum computers exist, then you can do polynomial time – we actually know how good the polynomial-time algorithms will be once the quantum computer exists. So that’s why we have this post-quantum effort, to have these new systems that don’t have quantum attacks against them. The problem is that some of them may have classical attacks against them – like AI. And so what I’m pushing very hard right now is for the community to be jumping in and helping us with our AI attacks, so that we can profile them better and know how we should scale the parameters to avoid some kind of disruptive situation. So that’s how I view… that’s how most crypto researchers view the process of assuring safety in the future.
Kevin Buzzard
I was going to say you’re looking for patterns like Yang. You’re looking for patterns in data, but patterns that humans can’t spot.
Kristin Lauter
Yes, because if you look at the transformer models, the number of layers and the size of them is very big. I would have to say, honestly, Alex, even probably bigger than the space that you’re working in. And so if we can kind of look down into the representation layer and find what the computer is doing as it’s processing more and more data, that’s the type of pattern that you wouldn’t get from a human working on that same problem.
Alex Kontorovich
I was going to say, Kristin gave a really beautiful talk at the JMM. It was really fun to listen to. And it made me think… and maybe you actually said this, I can’t remember, but, you know, an LLM can guess: you give it two 100-digit numbers, and it can somehow guess the product. Even though we have, of course, super-fast deterministic algorithms that will multiply two numbers, LLMs somehow guess this thing. So then you wonder, well, could it do the reverse? You feed it a 200-digit semiprime, and can it just guess some of the digits of the factors? Because if it can guess a whole bunch of the digits, then maybe you’re left with just a logarithmic amount of fiddling – and you have no proof that this algorithm works, no reason to know why it works.
Kevin Buzzard
Who cares?
Alex Kontorovich
But who cares because it keeps breaking semiprimes, right. Is that, Kristin, is that an accurate sort of version of what you’re describing?
Kristin Lauter
Yeah. But so for factoring… I mean that’s something where we haven’t made a lot of progress yet, because one of the things that’s kind of interesting to me is how bad our AI models are so far at just doing modular arithmetic. So that’s an example where humans with our algebraic formulas are really good at modular arithmetic. You just use the Euclidean algorithm and you can add stuff and you can multiply and even divide, etc. But for some reason or other, the AI models that we’ve been working on, they really struggle with doing modular arithmetic. And so it’s been fun making progress on that problem. And so doing the modular multiplication is one example of that. And then I feel like once we can do modular multiplication, then I feel like we’ll be better able to work on factorization.
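[Illustration: the “divide” step referred to here – modular division via an inverse computed with the extended Euclidean algorithm. A small self-contained sketch:]

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

def mod_div(a, b, p):
    """Compute a / b mod p, using the extended Euclidean inverse of b."""
    g, inv, _ = ext_gcd(b % p, p)
    assert g == 1, "b must be invertible mod p"
    return (a * inv) % p

print(mod_div(25, 17, 101))  # 49, since 17 * 49 = 833 ≡ 25 (mod 101)
```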
Yang-Hui He
Actually, that’s a really good point. I have a fun anecdote about this. You know, when ChatGPT came out (this was maybe 2020, 2021, after it had just kind of stabilized), I was very excitedly telling my various colleagues that it was solving undergraduate problems in mathematics – solving fairly advanced-looking problems in calculus. You ask it to compute the divergence of a vector field and whatnot. I was so excited: this is amazing stuff. And Kevin will appreciate this – this was in the SCR in Oxford, and who should walk in at that moment but Andrew Wiles himself. And I was telling Andrew about how exciting this is. And Andrew says, “why don’t you ask it to find the 10th digit of 25 divided by 17?” And he was like, it’s not going to be able to do that one. I was like, seriously, you know, come on – this is basic arithmetic. And lo and behold, that was exactly the kind of question that got these LLMs really hung up.
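[Illustration: Wiles’s challenge is mechanical for an exact long-division loop – precisely the digit-by-digit carrying that trips up a pure token predictor:]

```python
def nth_decimal_digit(num, den, n):
    """The n-th digit after the decimal point of num/den, by long division."""
    rem = num % den
    for _ in range(n):
        rem *= 10
        digit, rem = divmod(rem, den)
    return digit

print(nth_decimal_digit(25, 17, 10))  # -> 2, since 25/17 = 1.4705882352...
```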
Kevin Buzzard
They were not good at arithmetic in 2021, 2022. They were terrible.
Alex Kontorovich
Why would they be?
Kevin Buzzard
Yeah, humans are not good at arithmetic.
Yang-Hui He
And yet it was doing fairly, you know, advanced, you know, second, third year undergrad math.
Kevin Buzzard
Yeah. That’s because it’s doing pattern recognition.
Yang-Hui He
Exactly, exactly. Now, in retrospect, I understand a lot more what’s actually happening.
When/how will AI jump from pattern-matching to rigorous mathematical reasoning? [00:24:15]
Elliot Glazer
That actually gets me into the first topic that I want to go over. By the way, that was a good, long introduction – it’s a good thing that you guys all have your own great experiences with technology and mathematics, and had a lot to go off of there. But the main thing I really want to talk about today was this question of how models are going to reach a point where they’re not just pattern matching, but are actually able to engage in some sort of properly rigorous reasoning.
So have you guys all seen the Terence Tao blog post where he talks about how mathematicians go through this development process? First we reason pre-rigorously. Then, as undergrads, we learn what it means to make rigorous mathematical deductions. But then, once we’re actually researchers: okay, it turns out it’s too inefficient for the small-RAM human brain to always be doing formal deductions. So we do this sort of enlightened post-rigorous reasoning, where we still think intuitively – but we know what things we’re saying can be made rigorous if push came to shove. And I’m wondering what it will look like for AI models to make this same leap. Because I have gotten the pleasure of seeing – as the models have tackled this collection of problems I’ve been compiling for FrontierMath – the ways in which they have improved their pattern matching and their pre-rigorous reasoning. But I’m still seeing these models make false deductions, because they don’t fully understand what rigorous reasoning is, or even have full mastery of objective operations like arithmetic.
So this is why I’m interested in the idea of seeing models develop their skills through Lean or any other formal client, and mastering math… mastering rigorous mathematical reasoning by engaging in auto-formalized versions of all the math that has been developed. And then pushing it further by a reinforcement learning process the same way chess and Go got mastered by models. So what are your thoughts on this approach to AI development and whether you believe that this is something that is feasible?
Kevin Buzzard
Are you sure they’re going to go through the rigorous stage? Let them just skip straight to the post-rigorous.
Alex Kontorovich
They’re there already, aren’t they? They’re wrong, but they’re there.
The math/chess analogy: can math capabilities be RL’d through synthetic data? [00:27:26]
Kevin Buzzard
So if you want a language model to be rigorous, I think it has to be backed up by a theorem prover. I don’t see any other way. And this is one, you know, one big application, potential application for software like Lean. But there’s still a huge amount of engineering to be done. I mean, these things… they multiply ten-digit numbers together because they ask Python, right? They don’t… that’s my understanding of how a language model does it now. So if you ask them to prove a theorem, you know, maybe they should ask Lean. But it’s not as simple as that because you’ve got to—there’s billions of lines of Python code on the internet, but only millions of lines of Lean code. And so, you know, maybe first you’ve got to formalize all the textbooks. But we can’t do that manually. So we’ve got to get a system to do that. And that’s a hard problem.
Elliot Glazer
So that would be the significance of synthetic data. For example, if my memory is right, the way AlphaZero mastered chess was it taught itself how to play chess. It learned the official ruleset first, of course. That had to be given to it. But then it simulated an ungodly number of chess games against itself – and very, very slowly started pattern matching the most successful ways to go forward. That was purely synthetic data. And similarly, once you have an internal Lean client, there’s nothing stopping you from manufacturing every possible combination of symbols until you start slowly bumbling towards an idea of what is the most effective way to find new mathematical ideas.
Kevin Buzzard
But a big difference between chess and math is that in chess, there’s a good value function, right? It’s like, if you’ve got a queen and the other guy doesn’t have a queen, you know. You can look at a position and give an approximate – you know, very good – estimate for who’s winning. But in math, you don’t know if you’re two steps away from proving your theorem, or whether you’ve gone completely down the wrong route. It’s much more difficult to judge the value of a partially solved problem.
Alex Kontorovich
Well, one interesting thing about AlphaZero is that it sort of learned its own value function, right? That was one of the big ingredients there, which is why it was willing to sacrifice a queen for a positional advantage that only materialized, you know, six moves later. So it’s actually very interesting how human-like the AlphaZero chess engine is in that sense – that it makes these kinds of sacrifices, which are things that you wouldn’t see purely from a human value function, just added up: you know, a queen is nine points or whatever. But there is something like a value function in – I mean, here’s a very crude value function: you don’t know the answer of the value function until the theorem has been proved. In a similar way, by the way, you know, when AlphaZero first started, it had some just randomly generated value function; then it plays games and learns if each game was a win or a loss – and that’s the only information; then it generates another game under some other value function, and learns whether that’s a win or loss. So in the same way, you could go through Mathlib, and you could take every single theorem that has a tactic proof, and you could say: here’s the goal state, and here’s how many more lines of code it took until the goal state was solved. So that is some proxy for the quality of the chess position, right?
Whatever the chess imbalance is, the more imbalance, the closer you are to winning. If it’s plus 4.3, then white should really win this game. And similarly as you move through the goal state and see how it changes from one line to the next, given that next line of code that was typed into Lean, that’s something like a value reduction function, like an energy minimization or something, where once there are no more lines of code needed to solve the theorem, then you’re done. And someone could—I don’t know if anyone’s actually tried this, tried learning on just lines of code as some proxy for how much progress you’re making.
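[Illustration: a rough sketch of this proposal, under heavy assumptions. Extracting (goal state, lines remaining) pairs from Mathlib would need real Lean tooling; `proof_traces` below is a hypothetical stand-in, and the simple regressor is only a placeholder for a learned value function.]

```python
# Train a crude value function: "how many more tactic lines until the goal
# closes?" Lower predicted values mean a state is (supposedly) closer to done.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

proof_traces = [  # hypothetical data extracted from Mathlib tactic proofs
    ("⊢ a + b = b + a", 1),
    ("⊢ Continuous f → Continuous g → Continuous (g ∘ f)", 7),
    ("⊢ IsCompact s → IsClosed s", 12),
]
states, lines_remaining = zip(*proof_traces)

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
value_fn = Ridge().fit(vec.fit_transform(states), lines_remaining)

print(value_fn.predict(vec.transform(["⊢ b + a = a + b"])))
```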
Kevin Buzzard
But you’re talking about analyzing complete proofs when you know you’re at the end. And then you can say, here’s halfway, you know, 50-line proof, 25 lines through, halfway through, you know; 40 lines through, 80% through. But if we don’t know where we’re going, if you want to generate synthetic data, I mean… you people who know a lot more about AI than me—can AI figure out if we’re halfway through the proof of a theorem? That sounds like a really hard problem.
Alex Kontorovich
In the same way that when AlphaZero is in the process of… playing a game, it doesn’t know if it’s a win or loss until it gets to the end. So similarly, you could imagine an LLM generating a line. First you have to check if Lean says yes, it’s OK. Maybe it generates 20 potential next lines. And you just follow each of those down some big tree. And at the end some of those actually succeed in solving theorems.
Kevin Buzzard
If you’re lucky.
Alex Kontorovich
If you’re lucky, and if the theorem is not too complicated and so on. You can imagine, you know, trying to generate synthetic data in this way. The problem is math is just so many orders of magnitude larger than chess. I mean, at any particular… Yes, chess is a massive state space. But in any particular position, there’s 16 pieces, each of which might have, you know, four possible moves. That’s really microscopic on the scale of how many possible lines of code, lines of text there could be for the next Lean proof to make progress. So that’s why… at the same time, I can imagine how you might try to generate synthetic data, but I’m also very skeptical that it’ll work – just because the space of math is so many orders of magnitude larger.
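[Illustration: a schematic of the search loop being described. `propose_tactics`, `lean_apply`, and `score` are hypothetical interfaces – stand-ins for an LLM proposer, a Lean checker, and a value function – not real APIs.]

```python
import heapq
import itertools

def best_first_proof_search(initial_state, propose_tactics, lean_apply, score,
                            max_nodes=10_000, branching=20):
    """Expand the most promising proof states first; return a tactic list or None."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(score(initial_state), next(counter), initial_state, [])]
    expanded = 0
    while frontier and expanded < max_nodes:
        _, _, state, proof = heapq.heappop(frontier)
        expanded += 1
        for tactic in propose_tactics(state, n=branching):  # LLM suggests next lines
            result = lean_apply(state, tactic)              # Lean rejects illegal steps
            if result is None:
                continue
            if result.no_goals:                             # all goals closed: success
                return proof + [tactic]
            heapq.heappush(frontier,
                           (score(result), next(counter), result, proof + [tactic]))
    return None                                             # "if you're lucky" didn't pan out
```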
Elliot Glazer
One thing I’ve wondered about is, even in this approach, it still depends on having some pre-established body of the key theorems we want to see the AI figure out how to prove. But I wonder if we could make this analogy with chess stronger by turning math into a competitive game between two parties, both simulated by AIs of this form.
One model starts thinking about various... possible legal arguments, and eventually reaches some conclusion. And then it shows the other simulated entity the final result, but doesn’t tell it the proof, and asks it to reach that end state as efficiently as possible. And it wins if it manages to outsmart the opponent – come up with a very clever conclusion which the other one cannot quickly figure out how to prove – and vice versa. Doing this could sort of turn mathematics into a competitive game, fully formalized in Lean, that with a huge, huge amount of data might eventually result in competent mathematical reasoning.
Does the hierarchical nature of math make it feasible to search for reasonable next steps in a deduction? [00:35:15]
Kristin Lauter
So I think that’s a really exciting idea, and it mirrors something happening with LLMs: we have this LLM-as-judge kind of thing for comparing the outputs of two LLMs, and you keep iterating on that. So I do think that’s a very interesting and fruitful direction to explore. But one thing that’s interesting to me is that I don’t find the comparison with chess games as inspiring. For me, I think about AI for math – like the whole community of mathematicians and the theorems that we’ve proved and stuff – as very fertile ground for helping improve our ability to do reasoning where the end goal is not mathematics, but doing reasoning in the real world. So when you think about real-world problems – like, let’s say you’re going to have a robot helping you in your house, or an LLM as an assistant – there’s a lot of reasoning that has to do with the real world. And it can be kind of hard to organize the real world, right? You can think of our physical space; you can think of inputs in different modalities; and you can think of reasoning over time, or reasoning back in time. There are all these different axes that you can think of for how you should organize your information so that you can reason about things – and talk about a big space to reason over, right.
But if you go back to mathematics, and you think about mathematics as a space where you want to reason, I feel like we have at least a chance (like with Mathlib) of organizing our information well enough... I think of it as being possible to organize mathematics hierarchically. So if you have a theorem that’s been proved, you’ve got the steps – the lemmas, the things it’s proved from. So down at the bottom you’ve got objects and definitions, and then you’ve got lemmas, and you’ve got theorems. And it’s very exciting that sometimes a theorem is proved by something that’s way over somewhere else in the graph, really far away. And you don’t know that; you potentially have to explore this whole graph in order to find this bizarre connection, you know, that you wouldn’t have thought of. But maybe that’s where human intuition can still help. But the fact is that if we’re able both to organize mathematics well – in this kind of hierarchical graph structure – and also to navigate that and to reason: to me, that’s the foundation for what we could or should be able to do in the real world in terms of reasoning. So that’s my main motivation in mathematics. I know that’s AI for math with a little bit of a different spin than a lot of mathematicians give it, but that’s how I think of it.
Kevin Buzzard
I was going to say that in some sense, the Mathlib community is trying to organize mathematics in that way. It’s just, it’s a manual process and it’s taking a very long time.
Kristin Lauter
It’s very impressive.
Elliot Glazer
Do you think auto-formalization is going to speed that process up?
Kevin Buzzard
That would be lovely.
Yang-Hui He
It’s kind of interesting because, you know, this kind of pure mathematical data has, in the community, become known as Platonic data. And it’s a beautiful word because, you know, it’s without error. It’s a very special kind of data that Kristin and I have been playing with, whether it’s classifications of polytopes or distributions of L-functions or whatever. But this is kind of really nice. And Alex was suggesting, I mean, this is really like meta-Platonic data: have some kind of LLM just keep on generating large sets of Lean sentences, piece them together, and then maybe you can spot some patterns within the sentences. Is there anyone in the Lean community doing that? Because I was talking to people… I think, in fact, your postdoc, Patrick – he was telling me one of the greatest challenges towards auto-formalization is the fact that, exactly as Kevin says, there are billions of lines of code on the internet, but only millions of lines of Lean. So there’s not enough training data to even start doing this auto-formalization. But if we started just having people keep on generating Lean code that’s valid, could we at least generate this kind of synthetic meta-Platonic data?
Alex Kontorovich
But I think that’s what AlphaProof, you know—I mean, I haven’t seen a paper, and obviously I haven’t played with the model. But I think what they try to do is really implement this: generate a whole bunch of lines and see which of them work—there’s some functional that they’re minimizing as they move through the graph, where they have an LLM generating potential next lines, and then they have Lean check whether each one makes sense, and go by that.
Yang-Hui He
And is that kind of randomly generated? Or do they have some kind of idea of what might be worth generating?
Kevin Buzzard
This is the joys of working with tech companies, isn’t it? We don’t know. It’s still no paper.
Kristin Lauter
Hey, not all tech companies. I have to shout out for FAIR, Meta: open science, open source, all details, models, and code released. People ignore that Llama’s the only really foundational open model. Sorry, you provoked me, Kevin.
Kevin Buzzard
That’s fine.
Elliot Glazer
I think you got cut off for a second. Can you repeat that sentence?
Kristin Lauter
Well, it’s just that the Llama models that we release are open, which means the model is available. So if you look in the medical community, for example, these models can be used, not just the output of the model. So it’s really valuable for the startup community, for the academic community, for the medical community, AI for science – all these things – for the actual models to be open. So I was trying to counter Kevin’s criticism of the tech companies because we are a tech company. But we are not in that bucket.
Alex Kontorovich
Lean itself is a product of Microsoft, and it’s completely open source.
On keeping FrontierMath’s problems private [00:40:31]
Kevin Buzzard
So yeah, these open source models are crucial for Terence Tao’s AIMO project, of course, because, you know, the rules are: you upload your model to Kaggle, and they’ve got 50 secret questions. It’s just like Elliot’s data, just like FrontierMath – you’ve got 50 secret questions, and the answers are numbers. But the way the questions are kept secret – this is a problem that Epoch AI will have. They’ve got their secret database, but if OpenAI, or somebody with a closed model, wants to try something, then you’re going to have to figure out what that discussion looks like. Whereas with the AIMO, you have to use an open source model, and you upload your open source model with no access to the internet. And then you get the results of your experiment afterwards – you know, you got 30 out of 50 of these questions. So open source models are absolutely crucial. And one great thing about the AIMO is that it’s pushing for bigger and better open source models – it can see that this is an important issue.
Elliot Glazer
And this really is one of the challenges we face as we’re doing our research. Yeah, it’s preferable when companies have open source models that we can simply run on our own GPUs and fully see exactly what’s going on as we’re testing it. Unfortunately, since Epoch is trying to understand the state-of-the-art frontier models, this does require us to deal with the super secret closed source models of companies like OpenAI. We can’t say we understand the state of the art of mathematical progress if we tell OpenAI that you can only give us open source models. They’ll simply just say, well, okay, I guess you’re not going to see what we’ve got cooking right now. So it’s a really tough compromise that we have to deal with.
Kevin Buzzard
Do you basically have the problem that, for any tech company, you can only give them the questions once? Because… my understanding is that your database is getting bigger… Like the FrontierMath database – you never said what the size was, but I heard a rumor that it was a certain size, and then later I heard a rumor that it was bigger. It’s still growing, right? And then Tier 4 is going to make it grow more.
Elliot Glazer
I mean, I can say the numbers right here. We finished the base FrontierMath dataset only a couple of weeks ago. The final number is 300 problems, distributed as 75 Tier 1, 150 Tier 2, and 75 Tier 3. We were vague about the numbers earlier because it was still growing then… Now we’re trying to produce Tier 4, and our goal there is to produce 50 research-level problems.
Kevin Buzzard
I read the description of the Tier 4 problems carefully. And it’s like you’re asking humans to do a compression algorithm. You’re sort of saying: we want a question, and the answer has to be short and easily checkable, and yet to figure out the answer you must necessarily have had to do something really non-trivial mathematically – maybe you must have constructed a careful, rigorous argument. There’s no way you can get that... So you’re asking for a witness of a complicated argument, but that witness has to be an extremely small thing. So this is a sort of strange and very difficult compression problem. And listening to Daniel Litt in the Tier 4 Epoch AI video, he said, you know, “I had this hard problem and the answer was a big number, and the model got it right. But then I was pressing it on the details, and actually it didn’t know the details. It had just got a good guess” – ruling things out: oh no, I can’t do this bit, but here’s a good guess. You know, the model is sort of breaking the compression problem there. It hasn’t really done the hard work, but it’s got the right number anyway. So I don’t know how—this is a difficult problem.
How to benchmark a model’s competence in producing Lean proofs [00:45:27]
Elliot Glazer
It is super difficult. And I mean, again, I am a set theorist. I think set theory, as well as analytic number theory, are two of the subjects that really suffer from having research that is really, really bad at producing exact numbers. Now, for me, I actually enjoyed the process of trying to figure out what aspect of my own research would best fit into the format, and ended up coming up with something pretty clever based on my research into choiceless spectrum theory. But it’s a pretty tough thing to do. And this is why future projects probably will not be in this exact same format of ‘find the integer.’
For example, I am interested in running a project for Lean formalized math proofs. We’re still trying to figure out right now what would be the best format for that. So actually, let’s make this a conversation topic. Say one year from now some big AI company says our model is super good at generating Lean proofs. What would be a good way to test this out?
Alex Kontorovich
Give it Bourbaki. Here’s a Bourbaki book in PDF. Read it and give me the formalization.
Elliot Glazer
Okay, so this is what I’ve been thinking. I think we really do have to start with a lot of formalization.
Alex Kontorovich
Well, if they’re saying they can formalize Lean proofs, you need a source of Lean proofs. Here’s a source of Lean proofs.
Kevin Buzzard
It’s not so difficult to formalize questions. It’s much harder… Just, you know, formalize some questions. Formalize the statement of the classification of finite simple groups. That’s not asking so much.
Alex Kontorovich
But one of the really hard things – and I learned this from talking to Kevin, I think – is that it gets… especially if you’re doing something like a Bourbaki book. Let’s say it’s on groups. And the very first thing is the definition of a group. And Kevin has a whole rant about “what is a group?” and how you open up ten textbooks and there are ten different definitions. And in Mathlib there was a preliminary definition which seemed just fine. And a thousand lines of code later, you realize you’re hitting up against the same issues again and again and again. And then you go and refactor it, and now you can get to 10,000 lines of code; and then you realize there’s another issue; and then you refactor it again and rewrite all that code yet again. Now you can get to 100,000 lines. And finally, it feels like you’ve found the right definition.
And so auto-formalization is so much more than just: get the facts in there. It’s get them in there in a way that they’re usable, which requires refactoring. And so you would want a tool that can see a large amount of Lean data, and see that you’re applying the same little rewrites again and again. Or maybe you should write a tactic for this. Or maybe you should go all the way back to the beginning and change the definition – and then everything will zip up and work, you know, in a nice way. The fact that it’s executable is really important. It’s not just a theoretical thing, it actually has to work.
Elliot Glazer
Actually, it’s really funny that you chose Bourbaki as your example of something to see auto-formalized, because Bourbaki is infamous for how many very specific details they laboriously describe. I’m linking one of my favorite blog posts on this right now, if you look in the chat.
Alex Kontorovich
I mean, the reason I said Bourbaki, if you can’t do Bourbaki, you can’t do anything else. They’re the closest thing to fully rigorous, every line written out.
Yang-Hui He
Or Russell and Whitehead’s Principia, which Elliot may like.
Alex Kontorovich
Even better. Yeah.
Elliot Glazer
Yes, the infamous 200-page proof that one plus one equals two – which isn’t actually that long. And yeah, Bourbaki’s definition of the number one is supposedly 10 to the 12 or so symbols long, or something ridiculous, if you actually lay it out. This was a calculation done by Adrian Mathias.
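[Illustration: by contrast, in Lean 4 the Principia’s famous proposition is a one-liner, because both sides of the equation reduce to the same numeral:]

```lean
-- Both `1 + 1` and `2` unfold to `Nat.succ (Nat.succ Nat.zero)`,
-- so reflexivity closes the goal.
example : 1 + 1 = 2 := rfl
```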
Testing LLMs’ arithmetic abilities in the absence of tools [00:48:34]
Kristin Lauter
But Elliot, I just wanted to mention something. Sometimes as mathematicians, we can have a tendency to make things too complicated. I remember, in my first year of grad school, Professor Bob Fefferman giving me some advice: if you’re asked a question, always give the simplest proof, the simplest answer. And one thing that I wonder if you’ve done in your 300 FrontierMath questions is: what about just asking it to do modular arithmetic? So the thing is, in crypto, the sizes – we’re talking about primes that take 256 bits to represent. So there’s no way a model can memorize all the answers. Like if you ask it to do arithmetic mod 17 or something, of course it can memorize stuff. But if you take really large problems… I’m pretty sure it can’t do modular multiplication, for example. So I know it’s not as fancy as giving a nice math proof, but it’s also similar to what Yang mentioned that Wiles said – can it actually do long division and give you the 10th digit accurately? And the answer right now is no, right?
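[Illustration: a sketch of the suggested test – modular multiplications at sizes where memorization is impossible. The modulus below is the Curve25519 prime, chosen only as a convenient standard large prime.]

```python
# Generate a modular-multiplication question with ~255-bit operands:
# far beyond anything a model could have memorized.
import secrets

P = 2**255 - 19          # the Curve25519 prime, a standard large modulus

def modmul_question():
    a = secrets.randbelow(P)
    b = secrets.randbelow(P)
    return f"What is {a} * {b} mod {P}?", (a * b) % P

question, answer = modmul_question()
print(question)
print("expected:", answer)
```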
Elliot Glazer
This would work for an open source model, where I can see exactly what it’s doing. The problem is, if I take a long multiplication problem and put it into ChatGPT (as I think Kevin mentioned earlier), there’s nothing stopping o3-mini from just pulling up its copy of Python, plugging it in there, and spitting out the answer. Whereas if it’s an open source model, I can see whether it’s actually really doing it by hand or something.
Kristin Lauter
Yeah. But I mean, it’s meaningless to have… I mean, obviously you can use tools to solve these problems. So if you’re calling tools then it’s game over.
Elliot Glazer
But yeah, I’m just saying it’s very hard to tell when it is.
Alex Kontorovich
Where in the rules does it say it has to be a token predictor? You know, maybe the token predictor recognizes that it should send something to Wolfram Alpha, get the computation, and give you the answer. I’m tool agnostic. If it will solve 13 over 27 – or whatever that calculation was – by opening Wolfram Alpha and doing it there, great. You know? Why does an LLM need to know how to do modular multiplication, right?
Elliot Glazer
And that is why FrontierMath is designed the way it is. FrontierMath problems were all designed with the intention that they would be impressive no matter which tool did them. I think this modular arithmetic idea is interesting, but only as applied to specific tools. Because calculators exist, it wouldn’t work as a tool-agnostic math test.
Kevin Buzzard
You could ask about solutions to equations over finite fields and come up with something – you know, this is modular arithmetic – but come up with something where the tools would have to be pushed much harder. Or where maybe you even needed some understanding. That’s ideally what you want to be testing, right? Understanding is one thing that the systems don’t have, you know; they don’t have understanding. You could say it doesn’t even make sense to ask “does a language model understand anything?” It’s not the right word; it’s the kind of thing that humans do. But you could imagine that Frontier—there was a finite field question in the FrontierMath benchmark, but unfortunately the answer was sort of too easy to guess. So I do wonder, I think there might be scope in that direction. If Kristin wanted to come up with a super tricky elliptic-curves-over-finite-fields question – maybe you’re the person.
Kristin Lauter
Well, that’s exactly the direction I was thinking. The whole motivation for seeing if you can train AI models to do modular arithmetic is as the foundation for attacking more and more cryptography problems. So we already can do it well enough to attack lattice-based crypto because the primes are relatively small, they’re only about 20 bits, 20 to 40 bit primes. So when you can do 2000 bit primes, then you can start working on RSA. Or if you want to take the example that Kevin just gave, multivariate cryptography is based on finding solutions to equations in many variables – you know, nonlinear equations in many variables, many equations. And so that’s the foundation for being able to solve crypto problems better than humans can solve them.
Kevin Buzzard
So you really want to take Python out of the equation. You want these things to learn modular arithmetic, but without cheating, right?
Studying how transformers learn to pattern-match increasingly complex mathematical phenomena, towards discovering new algorithms [00:53:11]
Kristin Lauter
Right. I mean, that’s what we do already. We’ve published several papers about that, but it’s just one example of a hard problem where you can study the transformers, and then you can look at the interpretability questions inside the layers of the transformer – what is happening there, and how is it learning. So that’s the motivation for working on these kinds of things as well.
Alex Kontorovich
And could that potentially point to an algorithm that you hadn’t thought of?
Kristin Lauter
Yes, that’s the idea.
Alex Kontorovich
Can I ask actually in terms of open source, the crypto community is maybe best suited to discuss issues of open versus closed, right? These are matters of national security. So there’s some things that are not just closed, like classified. So I don’t know. Do you have, Kristin, having seen this side of it, do you have suggestions for what the non-national security side could be doing in terms of open versus closed models and good ways of interacting with them?
Kristin Lauter
The cryptography, the non-government cryptography community, or you could call it the open cryptography community, is a really good example. It has historically, you know, existed over the last 3 to 5 decades. And people are strongly in favor of open everything: patent-free systems, open code available, full description available, everything available and verifiable, and shared. So most of the competitions to determine the next cryptosystems are open international competitions. So it’s a very strong ethic within the crypto community to do things openly.
Elliot Glazer
Let me quickly ask one question. I’d be interested in your take on… Actually this is more of a follow up of something I think Alex said, which was that: if a model does reach a point where it is solving elliptic curve cryptography problems really fast, how confident are you that you’d be able to extract the algorithm from it? Or would it be more like the model as a whole would have to be taken (with its pattern matching capabilities) as some sort of vast black box efficient elliptic curve attacking algorithm?
Kristin Lauter
So I just want to say, first of all, we’re really far away from AI models being able to do elliptic curve discrete logarithm problems. Like I said, we first need to be able to do modular arithmetic. Either that, or we call modular arithmetic as a tool and don’t train the models to do it. But if we haven’t learned to do modular arithmetic, I’m not confident that you’re going to learn any elliptic curve stuff – because the elliptic curve discrete logarithm problem just involves doing repeated modular arithmetic computations to add elliptic curve points together. So it’s possible some kind of approach could work with interleaving tools and learning. But we’re not close to solving the elliptic curve discrete logarithm problem, in my opinion.
I almost think factoring is an easier target. If you go back to what Alex said earlier – let’s say you can guess the first half of the digits of the factors. We already have work in the crypto community that uses lattice-based solutions, like Coppersmith or whatever, to recover the rest of the digits of the prime factors if you’re given some partial information. So I think that’s a much more likely path forward to start with than the elliptic curve discrete logarithm problem.
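[Illustration: a toy, brute-force version of the partial-information recovery described here. Real attacks use Coppersmith’s lattice method, which can recover the remaining bits once roughly half are known; this sketch only shows the setup at a tiny scale.]

```python
def recover_factor(N, p_high, known_bits, total_bits):
    """Recover a factor p of N from its top `known_bits` bits, brute-forcing the rest."""
    unknown = total_bits - known_bits
    base = p_high << unknown
    for low in range(1 << unknown):   # feasible only for tiny `unknown`;
        p = base | low                # Coppersmith replaces this loop in practice
        if p > 1 and N % p == 0:
            return p
    return None

p, q = 40241, 57859                       # small primes; N is a toy semiprime
N = p * q
print(recover_factor(N, p >> 8, 8, 16))   # top 8 of p's 16 bits known -> 40241
```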
Epoch would be excited to use Lean as a way to design new math benchmarks without the shortcomings of FrontierMath [00:56:45]
Elliot Glazer
One last thing I want to say is that Epoch is very interested in trying to find new strategies for tracking AI’s progress in mathematics. As Kevin has pointed out, there are definitely downsides to our current strategy – which was: we have these problems of the form ‘find a simple answer to a very specific, if contrived, math problem.’ This format was born out of necessity, due to it being the easiest to verify. But I’m really excited about the idea of making, for example, Lean-based benchmarks in the future, especially as we begin to track the formalized side of mathematical reasoning as it progresses among the models. So I’ll be continuing to brainstorm various possible projects we can do of this form. And if any of you at any point have specific ideas you’d be excited to see my team try to implement, by all means, email me, and I can talk to you more about how we’re going to make that into a reality. But yeah, thank you all for coming. I thought that was an enlightening conversation. Take care.
