Some of the earliest work on artificial intelligence (AI) saw mathematics as a major target and a key to making breakthroughs quickly. In 1961, leading computer scientist and AI pioneer John McCarthy argued at the Fifth Symposium in Pure Mathematics that the job of checking mathematical proofs would likely be “one of the most interesting and useful applications of automatic computers.”
McCarthy envisioned mathematicians quickly trying out different ideas for proofs, which computers would then test for correctness. More than 60 years later, a proof assistant that fully delivers on that vision has yet to appear. But recent developments in both mathematics and computer science may see a breakthrough sooner rather than later.
Much of the work of verifying proofs formally using a computer still relies on a lot of manual effort by specialists such as Kevin Buzzard, professor of pure mathematics at the U.K.’s Imperial College London. Last year, he kicked off a project, funded for five years by the U.K.’s Engineering and Physical Sciences Research Council, to formalize the proof of Fermat’s Last Theorem developed by Andrew Wiles 30 years ago. Buzzard estimates it will take some 100 person-years of work to complete the process. Much of the help is coming from a community of volunteers who have, in recent years, shown how well crowdsourcing can work in this area. That, in turn, may provide an easier route for AI to finally make an entrance.
A key characteristic of projects like Buzzard’s is that the work readily separates into modules with clearly defined interfaces. This attribute helped one of the first major crowdsourced proof verification projects to be completed in just a couple of years. The Liquid Tensor Experiment (LTE) took shape in 2020, when Peter Scholze, professor of mathematics at Germany’s University of Bonn, asked the community for help in checking the 600-page proof he and Dustin Clausen had painstakingly threaded together by hand. Scholze wrote in his appeal how he had lost confidence in his ability to comprehend all the subtleties of the proof because of its sheer size and intricacy.
Languages such as Lean, used in both the LTE and Fermat projects, use keywords like “sorry” to mark unfinished components. This makes it possible to sketch out a skeleton of the overall proof that team members fill in gradually, until they are ready to have the proof engine used by these languages check the result and mark that section as complete. When working on LTE, Johan Commelin, assistant professor at the Netherlands’ Utrecht University, said he would wake up in the morning and find new parts of the proof had appeared overnight.
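As a minimal illustration of that workflow (a hypothetical fragment, not drawn from either project’s codebase), a Lean 4 file can state a top-level result and stub out the missing pieces, compiling even while gaps remain:

```lean
-- Hypothetical skeleton: each `sorry` marks a gap a contributor can fill later.
theorem helper_step (n : Nat) : n + 0 = n := by
  sorry  -- still open; Lean accepts the file but flags this placeholder

theorem main_result (a b : Nat) : (a + 0) + b = b + a := by
  rw [helper_step]  -- the stubbed lemma can already be used downstream
  sorry             -- remaining step, to be replaced with a real proof
```

Lean warns about every declaration that still depends on a `sorry`, effectively giving the team a live to-do list for the proof.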
Researchers see AI benefiting from the same approach. Instead of expecting the software to work on complete proofs, tools could work on much smaller and more manageable pieces. The current generation of AI also benefits from the data spun off by the crowdsourcing efforts, which can be used to train the models. Lean’s mathematics library now holds the equivalent of an undergraduate course in mathematics and is rapidly catching up to the size of the Mathematical Components library developed over a longer period for the older language Coq.
In principle, the large language model (LLM) makes a good choice for pulling together the elements of a proof. However, the technology has problems, as Michail Karatarakis reported to colleagues at the Conference on Artificial Intelligence and Theorem Proving (AITP) in Aussois, France, in 2024. The Radboud University Ph.D. student used the Mistral LLM to generate sketches of proofs that, with some editing, could be checked by the Lean proof engine.
For the test, Karatarakis used two lemmas from a textbook on number theory, one picked because the components needed were already in the library. The other included a definition not yet in the library, “offering a way to see how Mistral handles unfamiliar definitions,” he explained.
Despite having a large body of existing material to draw upon and only being required to produce “sketches,” or small components of a larger proof, the LLM’s output needed many corrections, particularly to the syntax. Among other issues, the model’s training set appeared to include many examples written for older versions of the Lean language, whereas Karatarakis was using the latest version, which differs in important ways.
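The differences are not merely cosmetic. As a simple illustration (not taken from Karatarakis’s test set), the same elementary lemma reads quite differently in Lean 3, which dominates older training material, and in the current Lean 4:

```lean
-- Illustrative only: one elementary lemma in the two dialects.
--
-- Lean 3 style, common in older training data:
--   lemma add_zero_example (n : ℕ) : n + 0 = n :=
--   begin
--     exact nat.add_zero n,
--   end
--
-- Lean 4 style, expected by the current proof engine:
theorem add_zero_example (n : Nat) : n + 0 = n := by
  exact Nat.add_zero n
```

A model that has mostly seen the older `begin ... end` blocks and lowercase library names will happily emit them, and the current checker rejects the result.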
That LLMs can struggle to build even proof sketches is not unusual. A 2024 review of activity in automated theorem proving by an international group led by Xujie Si, assistant professor of computer science at Canada’s University of Toronto, found advanced LLMs at best could completely formalize just a quarter of high-school and undergraduate-level problems.
One key problem that LLMs face with full proofs is their limited capacity for reasoning.
“Since the task in our case was to provide proof sketches rather than full proofs, syntax issues had a larger impact than reasoning,” Karatarakis said. “For tasks involving complete proofs, reasoning remains the primary challenge, even with improved syntax handling.”
Injecting more feedback into the training process could address some of the issues faced by LLMs. This intuition drove the creation of Draft, Sketch, and Prove (DSP) by researchers working at the U.K. universities of Cambridge and Edinburgh. The tool uses an LLM to create an initial sketch of an idea, which then goes to an automated theorem prover that can work at a more informal level; several such provers have been developed over the past couple of decades to assist mathematicians working in languages such as Coq and Isabelle. The last part of DSP is a separate formal engine that checks the work. The three engines form a loop in which the LLM retrains on the proofs the symbolic engine accepts.
Said Wenda Li, lecturer in hybrid AI at the U.K.’s University of Edinburgh, “Sometimes the gap is too large for the automated theorem prover to bridge. But generating new drafts is relatively cheap. We can sample millions of them to get just the right gap for the prover to bridge.”
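In outline, the loop Li describes can be pictured with a short sketch like the one below; the function and method names are hypothetical stand-ins, not the researchers’ actual code:

```python
# Conceptual sketch of a draft-sketch-prove loop; all names are illustrative.

NUM_DRAFTS = 64  # drafts are cheap to sample, so many can be tried per problem


def dsp_round(problem, llm, auto_prover, formal_checker, training_buffer):
    """Run one draft -> sketch -> prove pass for a single problem statement."""
    for _ in range(NUM_DRAFTS):
        # 1. The LLM drafts an informal outline of a possible proof.
        draft = llm.draft(problem)
        # 2. The LLM turns the draft into a formal sketch with open sub-goals.
        sketch = llm.formal_sketch(problem, draft)
        # 3. An automated theorem prover tries to close the remaining gaps.
        candidate = auto_prover.fill_gaps(sketch)
        # 4. A separate formal engine checks the assembled proof end to end.
        if candidate is not None and formal_checker.verify(candidate):
            # Accepted proofs become synthetic data for retraining the LLM.
            training_buffer.append((problem, candidate))
            return candidate
    return None  # no sampled draft led to a verified proof this round
```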
Accuracy improved to over 50% with DSP’s combination of feedback and division of responsibilities. In a further step presented last summer, the team added a fourth module in a version called Sketch, Prove, Add Details & Repeat (SPADeR). The extra module calls on GPT-4o to fill in blanks that would otherwise block a full proof, which increased the number of successfully verified problems in one test set from 85 with the earlier DSP tool to 93.
For practical mathematical work, AI’s output may still need to be refined even after a proof has been completed successfully. One common thread in the manual formalization efforts is the importance of finding good definitions and proof steps that support the process. At one stage of the LTE project, it looked as though just a few lines in one key part of the proof would mean formalizing an immense body of knowledge as a prerequisite. Commelin led work to avoid the problem by building a new underpinning that was far easier to implement.
In Li’s work with colleagues on a formalized version of another number-theory textbook, the group found that a formal definition of a key type of function of complex numbers was better expressed using an arithmetic expansion rather than the more intuitive approach taken in the informal source. That choice would help underpin a larger set of dependent proofs, he explained. In some projects, such considerations have led to changes to underlying definitions, sometimes repeatedly.
“Over time, it becomes increasingly difficult to refactor definitions or lemmas due to growing dependencies,” Li said. “AI could potentially assist with this, much like other code-assistance tools, such as Copilot.”
Other goals may provide targets that AI can serve more easily than autoformalization itself. Patrick Massot, professor of mathematics at France’s University of Paris-Saclay, argues that one lingering issue with current formal languages such as Coq and Lean is that they are too opaque to non-computer scientists. An “informalizer” would help scholars read the verified proofs. It could, as a byproduct, provide the foundation for building interactive mathematics textbooks, in which students are able to drill down into the background of any proof or lemma they see.
A couple of teams have used LLMs to try to do the work. Though LLMs make fewer mistakes here than in formalization, the task demands much higher accuracy than they can deliver. In his work on informalization with Massot, Kyle Miller, assistant professor at the University of California, Santa Cruz, has been exploring more traditional symbolic AI techniques, which involve far more manual engineering than training an LLM. Simply mapping the grammar of a language like Lean into English is not enough; the output needs further transformation. For example, the code created to check a proof formally can contain a lot of repetition that the tool would ideally remove from the human-readable version.
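A toy sketch (with invented names, and far simpler than Miller and Massot’s work) conveys the flavor of the symbolic approach: map each tactic to an English phrase, and collapse the repetition a checker tolerates but a reader would not:

```python
# Toy rule-based "informalizer"; the phrase table and collapsing rule are
# purely illustrative, not the actual tool.

PHRASES = {
    "intro": "introduce",
    "rw": "rewrite using",
    "exact": "conclude directly from",
}


def informalize(tactic_steps):
    """Render a list of (tactic, argument) pairs as one English sentence."""
    sentences = []
    previous = None
    for tactic, argument in tactic_steps:
        sentence = f"{PHRASES.get(tactic, tactic)} {argument}"
        if sentence != previous:  # drop verbatim repetition of the same step
            sentences.append(sentence)
        previous = sentence
    if not sentences:
        return ""
    text = "; then ".join(sentences) + "."
    return text[0].upper() + text[1:]  # capitalize without touching identifiers


# Two identical rewrites collapse into a single readable step.
print(informalize([("rw", "helper_step"), ("rw", "helper_step"),
                   ("exact", "Nat.add_comm a b")]))
```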
If successful, the work on informalizers in turn may help close the loop for theorem-proving engines based on LLMs and similar technologies. The output from these tools would provide a rich resource of synthetic data that can be used to retrain the AI engines as they create new proofs.
The multiple feedback loops that are now appearing may mean the logjam that has held up one of computer science’s major applications is finally breaking. But it may take a lot more research into hybrid schemes to strengthen AI’s reasoning skills and, perhaps, finally make autoformalization work for mathematicians.
Further Reading
- Avigad, J. Mathematics and the Formal Turn, Bulletin of the American Mathematical Society 61 (2024), 225–240
- Li, Zhaoyu, Sun, J., Murphy, L., Su, Q., Li, Zenan, Zhang, X., Yang, K., and Si, Xujie. A Survey on Deep Learning for Theorem Proving, arXiv:2404.09939 (2024)
- Karatarakis, M. Leveraging Large Language Models for Autoformalizing Theorems: A Case Study, Ninth Conference on Artificial Intelligence and Theorem Proving (AITP 2024)
- Tarrach, G., Jiang, A.Q., Raggi, D., Li, W., and Jamnik, M. More Details, Please: Improving Autoformalization with More Detailed Proofs, AI for MATH Workshop at the International Conference on Machine Learning (ICML 2024)
- McCarthy, J. Computer programs for checking mathematical proofs, Proceedings of the Fifth Symposium in Pure Mathematics of the American Mathematical Society, pages 219–227 (1961)