Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof of concept, we focus on geometry problems. Our approach is twofold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI's o1 model (58%–70% improvement); both the analogous problems and the verifier's feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy, and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.
We demonstrate our ideas in the domain of Euclidean geometry. Our input is a geometry problem, described both in natural language and via a formal representation. The description includes the geometric entities involved (e.g., lines, angles), their relationships (e.g., perpendicular, collinear), and measurements or algebraic expressions over them. We are also given a goal: a quantity to be determined (e.g., the length of a line). In addition, the model has access to a dictionary of theorems that may be used in the proof. The output is a formal proof that derives the goal from the given conditions and theorems, along with the final numeric answer. The proof consists of steps, each applying a specific theorem from the dictionary.
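To make this setup concrete, here is a minimal sketch of how such a problem instance and proof step might be represented. The class and field names are illustrative assumptions for exposition, not the actual schema used by FormalGeo-7k or our pipeline.

```python
from dataclasses import dataclass

@dataclass
class GeometryProblem:
    """Illustrative schema; field names are assumptions, not the dataset's exact format."""
    text: str                  # natural-language description of the problem
    construction: list[str]    # entities and relations, e.g. "Perpendicular(AB, CD)"
    conditions: list[str]      # measurements/algebraic facts, e.g. "MeasureOfAngle(ABC) = 40"
    goal: str                  # quantity to determine, e.g. "LengthOfLine(AD)"

@dataclass
class ProofStep:
    theorem: str               # a theorem name from the theorem dictionary
    premises: list[str]        # facts the theorem is applied to
    conclusion: str            # new fact derived by this step
```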
Our neuro-symbolic approach. Given a target problem from the FormalGeo-7k dataset, we first convert it into an abstract form by replacing entity names (e.g., lines, angles) and specific numeric values with placeholders. We then retrieve structurally similar problems from the abstracted dataset by computing Jaccard similarity over key formal components: construction (entities and geometric relations), conditions (e.g., angle equalities, segment lengths), and goal (the conclusion to be proven). This is based on the observation that structurally similar problems often share proof patterns. The retrieved problems, along with their corresponding formal proofs, are presented to an LLM as in-context examples, together with the available theorems from the Geometry Theorem Dictionary, to guide proof generation for the target problem. Finally, a symbolic verifier iteratively checks the generated proof and provides feedback until a correct proof is produced or a retry limit is reached.
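The following is a minimal sketch of this pipeline, building on the illustrative `GeometryProblem` representation above. The Jaccard-based retrieval is computed as described; the equal weighting of the three components is an assumption, and `abstract()`, `build_prompt()`, `call_llm()`, and `verify()` are hypothetical interfaces standing in for the real abstraction step, prompt construction, LLM call, and symbolic verifier.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two sets of abstracted formal statements."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def similarity(p: GeometryProblem, q: GeometryProblem) -> float:
    # Average Jaccard similarity over the three formal components.
    # Equal weighting is an assumption, not necessarily the paper's exact choice.
    return (jaccard(set(p.construction), set(q.construction))
            + jaccard(set(p.conditions), set(q.conditions))
            + jaccard({p.goal}, {q.goal})) / 3

def retrieve_analogies(target, dataset, k=3):
    # abstract() (replacing entity names and numeric values with placeholders)
    # is a hypothetical helper.
    t = abstract(target)
    return sorted(dataset, key=lambda q: similarity(t, abstract(q)), reverse=True)[:k]

def generate_proof(target, dataset, theorems, max_retries=5):
    examples = retrieve_analogies(target, dataset)
    feedback = None
    for _ in range(max_retries):
        # build_prompt(), call_llm(), and verify() are hypothetical interfaces.
        proof = call_llm(build_prompt(target, examples, theorems, feedback))
        ok, feedback = verify(target, proof)  # symbolic verifier: (is_correct, error message)
        if ok:
            return proof
    return None  # no correct proof within the retry budget
```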
Percentage of correct proofs per difficulty level (50 samples, 10 per level). Our analogy-based method outperforms the o1 base model (non-analogy) in all settings. Analogy retrieval, verifier feedback, and multiple runs each contribute significantly to performance. Our full pipeline (blue triangle) outperforms the baseline at every level, reaching an average aggregated accuracy of 80%. Even without multiple runs (blue square), performance remains strong at 68%, far exceeding the base model's 10% (red hollow circle).
Average number of retries (top) and runs (bottom) per problem by difficulty level. The dashed line marks the maximum allowed. Our analogy-based method consistently outperforms the base model, requiring fewer retries and runs across all levels.
Error distribution by tier for our method vs. the base model. Our method reduces errors across all tiers, with tier-1 errors being most frequent in both.
@article{sultan2025towards,
  title={Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach},
  author={Sultan, Oren and Stern, Eitan and Shahaf, Dafna},
  journal={arXiv preprint arXiv:2505.14479},
  year={2025}
}