Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove it; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. The recent success of large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp 2025). Our results suggest that LLMs perform remarkably well at predicting program termination: GPT-5 and Claude Sonnet-4.5 (with test-time scaling) would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
Given a C program, an LLM that predicts non-termination must additionally output a witness automaton, in JSON format, as a proof. The witness automaton models a potentially infinite execution: nodes correspond to program states, and edges correspond to transitions. The predicted JSON is converted to GraphML and validated using a witness validator (e.g., UAutomizer).
The example illustrates a loop in which the variable i is initialized in the range [−5, 5] and eventually reaches 0 (as specified by the assumption on edge E2), after which the loop executes indefinitely.
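To make the setup concrete, the following is a minimal sketch (not the exact schema or prompt used in the paper) that encodes such a witness as a Python dictionary with hypothetical field names and converts it to GraphML with networkx; a real SV-Comp witness carries additional metadata that validators such as UAutomizer expect.

```python
# Illustrative only: the field names ("nodes", "edges", "assumption", ...) are
# hypothetical and do not reproduce the exact witness schema used in the paper.
import networkx as nx

witness = {
    "nodes": ["N0", "N1", "N2"],  # N0: entry; N2: head of the non-terminating loop
    "edges": [
        {"id": "E1", "source": "N0", "target": "N1",
         "assumption": "i >= -5 && i <= 5"},   # non-deterministic initialization
        {"id": "E2", "source": "N1", "target": "N2",
         "assumption": "i == 0"},              # i eventually reaches 0
        {"id": "E3", "source": "N2", "target": "N2",
         "assumption": "i == 0"},              # self-loop: the loop runs forever
    ],
}

def witness_to_graphml(w: dict, path: str) -> None:
    """Convert a nodes/edges witness dictionary into a GraphML file."""
    g = nx.DiGraph()
    g.add_nodes_from(w["nodes"])
    for e in w["edges"]:
        # Store the edge label and assumption as GraphML data attributes.
        g.add_edge(e["source"], e["target"], edge_id=e["id"], assumption=e["assumption"])
    nx.write_graphml(g, path)

witness_to_graphml(witness, "witness.graphml")
```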
We evaluate on SV-Comp 2025, the largest public verification benchmark, focusing on the Termination category (2,328 C programs), which spans bit-precise arithmetic, complex control flow, heap manipulation, and real-world code.
We adopt SV-Comp's asymmetric scoring, treating non-termination (NT) as the positive class. Correct termination predictions (true negatives, TN) receive +2 points. Correct non-termination predictions receive +1 point when accompanied by a valid witness (TP, valid witness), and 0 points when the witness is invalid (TP, invalid witness) or the prediction is marked as unknown. Incorrect non-termination predictions (false positives, FP) are penalized with −16 points, while incorrect termination predictions (false negatives, FN) incur a −32-point penalty. This asymmetric scheme reflects a safety-first priority: misclassifying a non-terminating program as terminating is far more harmful than unnecessary conservatism.
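As a minimal sketch, this scoring rule can be written as a lookup over per-program outcomes; the outcome labels below are ours, not SV-Comp's.

```python
# Hypothetical outcome labels; the point values follow the scheme described above.
SCORE = {
    "TN": 2,                   # correct termination prediction
    "TP_valid_witness": 1,     # correct non-termination prediction, witness validated
    "TP_invalid_witness": 0,   # correct non-termination prediction, witness rejected
    "unknown": 0,              # model abstains
    "FP": -16,                 # terminating program predicted as non-terminating
    "FN": -32,                 # non-terminating program predicted as terminating
}

def svcomp_score(outcomes: list[str]) -> int:
    """Sum the asymmetric SV-Comp points over per-program outcomes."""
    return sum(SCORE[o] for o in outcomes)

print(svcomp_score(["TN", "TP_valid_witness", "FN"]))  # 2 + 1 - 32 = -29
```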
Mean SV-Comp scores for LLMs (averaged over 100 bootstrap samples), the top SV-Comp 2025 verification tools, and the maximum achievable score (minimum possible score: −50,064). GPT-5 with test-time scaling (TTS) and Claude Sonnet-4.5 (TTS) would rank 2nd and 3rd with scores of 3,520 and 3,448, respectively, trailing the gold-medal system PROTON. CWM ranks just below UAutomizer. In contrast, GPT-4o performs substantially worse, scoring 546 with TTS and −5,145 without TTS. * Test-time scaling (TTS) uses consensus voting: for each instance, we sample 10 of 20 predictions and output the label only if all agree (ignoring "unknown"); otherwise, we predict "unknown" to reflect uncertainty and avoid risky errors.
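A minimal sketch of this consensus-voting rule, with our own function and label names:

```python
import random

def tts_vote(predictions: list[str], k: int = 10, seed: int = 0) -> str:
    """Consensus voting for test-time scaling (TTS).

    Sample k of the available predictions, drop "unknown" votes, and return the
    shared label only if every remaining vote agrees; otherwise abstain.
    """
    rng = random.Random(seed)
    sampled = rng.sample(predictions, k)
    votes = [p for p in sampled if p != "unknown"]
    if votes and all(v == votes[0] for v in votes):
        return votes[0]
    return "unknown"

# 20 predictions for one program (labels are ours): mostly "termination".
preds = ["termination"] * 18 + ["unknown", "non-termination"]
print(tts_vote(preds))  # "termination" or "unknown", depending on the sampled subset
```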
LLM F1 scores per class. F1 (T) and F1 (NT) are averaged over 100 bootstrap runs. F1 decreases under TTS due to increased unknown predictions (counted as errors in F1). Rankings align with the SV-Comp 2025 scores.
LLM witness prediction success rates (witnesses validated by UAutomizer). Metrics: P (Precision): (%) of validated-witness NT predictions out of all NT predictions; R (Recall): (%) of validated-witness NT predictions out of all NT samples; V (Validity): (%) of validated-witness NT predictions out of all correct NT predictions. GPT-5 and Claude lead in prediction rates, while CWM, GPT-4o, and Qwen3-32B perform lower, highlighting the challenge of generating a valid witness.
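A minimal sketch of these three rates computed from aggregate counts (the variable names are ours):

```python
def witness_rates(n_nt_predictions: int, n_nt_samples: int,
                  n_correct_nt_predictions: int, n_validated: int) -> dict[str, float]:
    """Precision, recall, and validity of validated-witness NT predictions.

    n_nt_predictions:         NT predictions made by the model
    n_nt_samples:             ground-truth NT programs in the benchmark
    n_correct_nt_predictions: correct NT predictions (valid or invalid witness)
    n_validated:              NT predictions whose witness was validated
    """
    def pct(a: int, b: int) -> float:
        return 100.0 * a / b if b else 0.0

    return {
        "P": pct(n_validated, n_nt_predictions),
        "R": pct(n_validated, n_nt_samples),
        "V": pct(n_validated, n_correct_nt_predictions),
    }

print(witness_rates(n_nt_predictions=120, n_nt_samples=200,
                    n_correct_nt_predictions=100, n_validated=80))
```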
LLM unknown-prediction distribution. Unk: (%) of unknown predictions, aggregated over 20 generations per entry. TTS-Unk: (%) of cases without unanimous agreement among 10 sampled model predictions, resulting in an unknown prediction. GPT-5 and Claude Sonnet-4.5 rarely predict unknown, while GPT-4o does so most often. For TTS-Unk, GPT-5 and Claude show high prediction unanimity, whereas CWM, Qwen3-32B, and GPT-4o show more disagreement, which improves SV-Comp 2025 scores but lowers F1 scores.
LLM mean SV-Comp 2025 score vs. code length. Dataset examples are grouped into three equal-size bins by code length, measured in tokens with the Llama 3 Instruct tokenizer (x-axis). The mean SV-Comp 2025 score per bin (y-axis) is shown for all models (20 predictions per sample). Scores decrease with code length; GPT-5 leads, followed by Claude and CWM.
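A minimal sketch of the length binning, assuming a Hugging Face Llama 3 Instruct tokenizer (the exact checkpoint name below is our assumption) and roughly equal-size bins:

```python
import numpy as np
from transformers import AutoTokenizer

# Assumed checkpoint: the paper only specifies a "Llama 3 Instruct" tokenizer.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def length_bins(programs: list[str], n_bins: int = 3) -> list[np.ndarray]:
    """Group program indices into roughly equal-size bins by token count (shortest first)."""
    lengths = np.array([len(tok(p)["input_ids"]) for p in programs])
    order = np.argsort(lengths)            # indices sorted by token length
    return np.array_split(order, n_bins)   # split into equal-size bins

bins = length_bins(["int main() { return 0; }", "while (1);", "/* ...longer programs... */"])
```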
To obtain an interpretable alternative to the complex SV-Comp witness format, models are prompted to output a logical expression over non-deterministic variables characterizing conditions for non-termination (e.g., x < 0 ∧ y = 0). We focus on relatively short programs with at most two non-deterministic variables to enable reliable human annotation, and use the Z3 theorem prover to verify logical equivalence between predicted and ground-truth expressions. On these short SV-Comp non-termination samples, all models achieve high Pass@1 and Pass@3 accuracy, with consistently stronger performance on the Domain witness format.
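A minimal sketch of such an equivalence check using Z3's Python bindings; the predicted and ground-truth expressions below are illustrative, not drawn from the benchmark:

```python
from z3 import Ints, And, Solver, unsat

x, y = Ints("x y")

def equivalent(pred, gold) -> bool:
    """Two formulas are equivalent iff no assignment can make them disagree."""
    s = Solver()
    s.add(pred != gold)  # satisfiable iff some assignment distinguishes the formulas
    return s.check() == unsat

predicted    = And(x < 0, y == 0)      # e.g., the model's expression
ground_truth = And(y == 0, x <= -1)    # an equivalent formulation over the integers

print(equivalent(predicted, ground_truth))  # True
```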
@article{sultan2026llms,
title={LLMs versus the Halting Problem: Revisiting Program Termination Prediction},
author={Sultan, Oren and Armengol-Estape, Jordi and Kesseli, Pascal and Vanegue, Julien and Shahaf, Dafna and Adi, Yossi and O'Hearn, Peter},
journal={arXiv preprint arXiv:2601.18987},
year={2026}}