Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove it; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. The recent success of large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp 2025). Our results suggest that LLMs perform remarkably well at predicting program termination: GPT-5 and Claude Sonnet-4.5 (with test-time scaling) would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
Given a C program, an LLM that predicts non-termination must additionally output a witness automaton, in JSON format, as a proof. The witness automaton models a potentially infinite execution: nodes correspond to program states, and edges correspond to transitions. The predicted JSON is converted to GraphML and validated using a witness validator (e.g., UAutomizer).
The example illustrates a loop in which the variable i is initialized in the range [−5, 5] and eventually reaches 0 (as specified by the assumption on edge E2), after which the loop executes indefinitely.
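To make the setup concrete, the following is a minimal sketch (not the exact schema or prompt used in the paper) that encodes such a witness as a Python dictionary with hypothetical field names and converts it to GraphML with networkx; a real SV-Comp witness carries additional metadata that validators such as UAutomizer expect.

```python
# Illustrative only: the field names ("nodes", "edges", "assumption", ...) are
# hypothetical and do not reproduce the exact witness schema used in the paper.
import networkx as nx

witness = {
    "nodes": ["N0", "N1", "N2"],  # N0: entry; N2: head of the non-terminating loop
    "edges": [
        {"id": "E1", "source": "N0", "target": "N1",
         "assumption": "i >= -5 && i <= 5"},   # non-deterministic initialization
        {"id": "E2", "source": "N1", "target": "N2",
         "assumption": "i == 0"},              # i eventually reaches 0
        {"id": "E3", "source": "N2", "target": "N2",
         "assumption": "i == 0"},              # self-loop: the loop runs forever
    ],
}

def witness_to_graphml(w: dict, path: str) -> None:
    """Convert a nodes/edges witness dictionary into a GraphML file."""
    g = nx.DiGraph()
    g.add_nodes_from(w["nodes"])
    for e in w["edges"]:
        # Store the edge label and assumption as GraphML data attributes.
        g.add_edge(e["source"], e["target"], edge_id=e["id"], assumption=e["assumption"])
    nx.write_graphml(g, path)

witness_to_graphml(witness, "witness.graphml")
```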
We evaluate on SV-Comp 2025, the largest public verification benchmark, focusing on the Termination category (2,328 C programs), which spans bit-precise arithmetic, complex control flow, heap manipulation, and real-world code.
We adopt SV-Comp's asymmetric scoring, treating non-termination (NT) as the positive class. Correct termination predictions (true negatives, TN) receive +2 points. Correct non-termination predictions receive +1 point when accompanied by a valid witness (TP, valid witness), and 0 points when the witness is invalid (TP, invalid witness) or the prediction is marked as unknown. Incorrect non-termination predictions (false positives, FP) are penalized with −16 points, while incorrect termination predictions (false negatives, FN) incur a −32-point penalty. This asymmetric scheme reflects a safety-first priority: misclassifying a non-terminating program as terminating is far more harmful than unnecessary conservatism.
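As a minimal sketch, this scoring rule can be written as a lookup over per-program outcomes; the outcome labels below are ours, not SV-Comp's.

```python
# Hypothetical outcome labels; the point values follow the scheme described above.
SCORE = {
    "TN": 2,                   # correct termination prediction
    "TP_valid_witness": 1,     # correct non-termination prediction, witness validated
    "TP_invalid_witness": 0,   # correct non-termination prediction, witness rejected
    "unknown": 0,              # model abstains
    "FP": -16,                 # terminating program predicted as non-terminating
    "FN": -32,                 # non-terminating program predicted as terminating
}

def svcomp_score(outcomes: list[str]) -> int:
    """Sum the asymmetric SV-Comp points over per-program outcomes."""
    return sum(SCORE[o] for o in outcomes)

print(svcomp_score(["TN", "TP_valid_witness", "FN"]))  # 2 + 1 - 32 = -29
```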
Mean SV-Comp scores for LLMs (averaged over 100 bootstrap samples), the top SV-Comp 2025 verification tools, and the maximum achievable score (minimum possible score: −50,064). GPT-5 with test-time scaling (TTS) and Claude Sonnet-4.5 (TTS) would rank 2nd and 3rd with scores of 3,520 and 3,448, respectively, trailing the gold-medal system PROTON. CWM ranks just below UAutomizer. In contrast, GPT-4o performs substantially worse, scoring 546 with TTS and −5,145 without TTS. * Test-time scaling (TTS) uses consensus voting: for each instance, we sample 10 of 20 predictions and output the label only if all agree (ignoring "unknown"); otherwise, we predict "unknown" to reflect uncertainty and avoid risky errors.
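A minimal sketch of this consensus-voting rule, with our own function and label names:

```python
import random

def tts_vote(predictions: list[str], k: int = 10, seed: int = 0) -> str:
    """Consensus voting for test-time scaling (TTS).

    Sample k of the available predictions, drop "unknown" votes, and return the
    shared label only if every remaining vote agrees; otherwise abstain.
    """
    rng = random.Random(seed)
    sampled = rng.sample(predictions, k)
    votes = [p for p in sampled if p != "unknown"]
    if votes and all(v == votes[0] for v in votes):
        return votes[0]
    return "unknown"

# 20 predictions for one program (labels are ours): mostly "termination".
preds = ["termination"] * 18 + ["unknown", "non-termination"]
print(tts_vote(preds))  # "termination" or "unknown", depending on the sampled subset
```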
LLM F1 scores per class. F1 (T) and F1 (NT) are averaged over 100 bootstrap runs. F1 decreases under TTS due to increased unknown predictions (counted as errors in F1). Rankings align with the SV-Comp 2025 scores.
LLM witness prediction success rates (witnesses validated by UAutomizer). Metrics: P (Precision): (%) of validated-witness NT predictions out of all NT predictions; R (Recall): (%) of validated-witness NT predictions out of all NT samples; V (Validity): (%) of validated-witness NT predictions out of all correct NT predictions. GPT-5 and Claude lead in prediction rates, while CWM, GPT-4o, and Qwen3-32B perform lower, highlighting the challenge of generating a valid witness.
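A minimal sketch of these three rates computed from aggregate counts (the variable names are ours):

```python
def witness_rates(n_nt_predictions: int, n_nt_samples: int,
                  n_correct_nt_predictions: int, n_validated: int) -> dict[str, float]:
    """Precision, recall, and validity of validated-witness NT predictions.

    n_nt_predictions:         NT predictions made by the model
    n_nt_samples:             ground-truth NT programs in the benchmark
    n_correct_nt_predictions: correct NT predictions (valid or invalid witness)
    n_validated:              NT predictions whose witness was validated
    """
    def pct(a: int, b: int) -> float:
        return 100.0 * a / b if b else 0.0

    return {
        "P": pct(n_validated, n_nt_predictions),
        "R": pct(n_validated, n_nt_samples),
        "V": pct(n_validated, n_correct_nt_predictions),
    }

print(witness_rates(n_nt_predictions=120, n_nt_samples=200,
                    n_correct_nt_predictions=100, n_validated=80))
```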
LLM unknown-prediction distribution. Unk: (%) of unknown predictions, aggregated over 20 generations per entry. TTS-Unk: (%) of cases without unanimous agreement among 10 sampled model predictions, resulting in an unknown prediction. GPT-5 and Claude Sonnet-4.5 rarely predict unknown, while GPT-4o does so most often. For TTS-Unk, GPT-5 and Claude show high prediction unanimity, whereas CWM, Qwen3-32B, and GPT-4o show more disagreement, which improves SV-Comp 2025 scores but lowers F1 scores.
LLM mean SV-Comp 2025 score vs. code length. Dataset examples are grouped into three equal-size bins by code length, measured in tokens with the Llama 3 Instruct tokenizer (x-axis). The mean SV-Comp 2025 score per bin (y-axis) is shown for all models (20 predictions per sample). Scores decrease with code length; GPT-5 leads, followed by Claude and CWM.
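A minimal sketch of the length binning, assuming a Hugging Face Llama 3 Instruct tokenizer (the exact checkpoint name below is our assumption) and roughly equal-size bins:

```python
import numpy as np
from transformers import AutoTokenizer

# Assumed checkpoint: the paper only specifies a "Llama 3 Instruct" tokenizer.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def length_bins(programs: list[str], n_bins: int = 3) -> list[np.ndarray]:
    """Group program indices into roughly equal-size bins by token count (shortest first)."""
    lengths = np.array([len(tok(p)["input_ids"]) for p in programs])
    order = np.argsort(lengths)            # indices sorted by token length
    return np.array_split(order, n_bins)   # split into equal-size bins

bins = length_bins(["int main() { return 0; }", "while (1);", "/* ...longer programs... */"])
```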
To obtain an interpretable alternative to the complex SV-Comp witness format, models are prompted to output a logical expression over non-deterministic variables characterizing conditions for non-termination (e.g., x < 0 ∧ y = 0). We focus on relatively short programs with at most two non-deterministic variables to enable reliable human annotation, and use the Z3 theorem prover to verify logical equivalence between predicted and ground-truth expressions. On these short SV-Comp non-termination samples, all models achieve high Pass@1 and Pass@3 accuracy, with consistently stronger performance on the Domain witness format.
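A minimal sketch of such an equivalence check using Z3's Python bindings; the predicted and ground-truth expressions below are illustrative, not drawn from the benchmark:

```python
from z3 import Ints, And, Solver, unsat

x, y = Ints("x y")

def equivalent(pred, gold) -> bool:
    """Two formulas are equivalent iff no assignment can make them disagree."""
    s = Solver()
    s.add(pred != gold)  # satisfiable iff some assignment distinguishes the formulas
    return s.check() == unsat

predicted    = And(x < 0, y == 0)      # e.g., the model's expression
ground_truth = And(y == 0, x <= -1)    # an equivalent formulation over the integers

print(equivalent(predicted, ground_truth))  # True
```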
@article{sultan2026llms,
title={LLMs versus the Halting Problem: Revisiting Program Termination Prediction},
author={Sultan, Oren and Armengol-Estape, Jordi and Kesseli, Pascal and Vanegue, Julien and Shahaf, Dafna and Adi, Yossi and O'Hearn, Peter},
journal={arXiv preprint arXiv:2601.18987},
year={2026}}