This paper asks whether LLMs can estimate the probability of their own success before they start solving a task, and whether those estimates become more accurate as the work progresses. Turns out this is a distinct ability, and a poorly developed one.
The authors test it across three different scenarios, ranging from single-step problems to multi-step agentic processes.
First, they use BigCodeBench, a set of 1,140 single-step Python tasks. For each task, the model is asked in advance to state the probability that it will succeed, and only then does it actually attempt the task. This allows a direct comparison between stated confidence and real performance.
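A minimal sketch of that elicit-then-attempt loop, assuming hypothetical `ask_model` and `run_tests` helpers that stand in for the LLM call and the BigCodeBench test harness (not the authors' actual code):

```python
def evaluate_task(task_prompt, ask_model, run_tests):
    """Elicit a success probability first, only then let the model attempt the task."""
    # 1. Ask for a probability before any solution is produced.
    raw = ask_model(
        f"Task:\n{task_prompt}\n\n"
        "Before solving, estimate the probability (0.0-1.0) that your solution "
        "will pass the hidden tests. Reply with a single number."
    )
    predicted = float(raw.strip())

    # 2. Only afterwards ask for the actual solution and run the tests.
    solution = ask_model(f"Task:\n{task_prompt}\n\nWrite the Python solution.")
    succeeded = run_tests(task_prompt, solution)  # True/False from the benchmark harness

    return predicted, succeeded
```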
The result is the same across the board: every model is systematically overconfident, with predicted success probabilities consistently higher than actual success rates. Importantly, greater capability does not guarantee better self-calibration: in the GPT and LLaMA families, calibration does not meaningfully improve as the models get stronger, and within the Claude family overconfidence shrinks somewhat but never disappears.
At the same time, the models can distinguish easier tasks from harder ones better than chance on average. In other words, they have some sense of relative difficulty, but their absolute confidence remains inflated.
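In calibration terms these are two separate quantities: the overconfidence gap (mean stated probability minus actual success rate) and discrimination (whether higher-confidence tasks really succeed more often). Setting aside the paper's exact metrics, a simple way to compute both from the `(predicted, succeeded)` pairs collected above, using AUROC as the discrimination measure:

```python
import numpy as np

def overconfidence_gap(predicted, succeeded):
    """Mean stated probability minus empirical success rate; positive means overconfident."""
    return float(np.mean(predicted) - np.mean(succeeded))

def discrimination_auroc(predicted, succeeded):
    """Chance that a solved task received higher stated confidence than an unsolved one.
    0.5 means no ability to tell easier tasks from harder ones; 1.0 is perfect ranking."""
    predicted = np.asarray(predicted, dtype=float)
    succeeded = np.asarray(succeeded, dtype=bool)
    pos, neg = predicted[succeeded], predicted[~succeeded]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```

Better-than-chance discrimination combined with a stubbornly positive gap is exactly the pattern described above.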
The second experiment introduces a more realistic setting: contracts with risk.
The model receives a sequence of nine tasks. Each success earns +1, each failure costs −1. Before each task, the model must decide whether to accept or decline the contract, based on its own predicted probability of success. The tasks are chosen so that the success rate is roughly 50%, so blindly accepting everything yields no advantage.
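With these payoffs, the rational policy given a stated success probability p is straightforward: accepting has expected value p·(+1) + (1−p)·(−1) = 2p − 1, so a contract is worth taking only when p > 0.5. A minimal sketch of that rule (my formulation of the obvious expected-value criterion, not necessarily the paper's exact one):

```python
def should_accept(stated_p: float, reward: float = 1.0, penalty: float = -1.0) -> bool:
    # Expected value of accepting at the model's own stated probability; declining is worth 0.
    expected_value = stated_p * reward + (1.0 - stated_p) * penalty  # = 2p - 1 with the +1/-1 payoffs
    return expected_value > 0.0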
Here the core issue becomes clear. Even after a series of failures, models continue to believe that the next task will succeed. Their subjective probability of success stays above 0.5, despite the evidence.
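For contrast, an agent that actually updated on its own track record would lower that number quickly. A toy Beta-Bernoulli baseline (purely illustrative, not something the paper proposes): start from a uniform prior over one's success rate on this pool of tasks and report the posterior mean after each outcome.

```python
def posterior_success_estimate(successes: int, failures: int,
                               prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior mean of the success rate under a Beta(prior_a, prior_b) prior.
    Example: after three straight failures from a uniform prior this gives
    (1 + 0) / (2 + 3) = 0.2, well below the >0.5 the models keep reporting."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)
```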
Some models (notably Claude Sonnet and GPT-4.5) do end up earning more, but not because they become better at judging which tasks they can solve. Instead, they simply accept fewer tasks overall, becoming more risk-averse. Their gains come from declining more often, not from better self-assessment.
The authors also check whether the models’ decisions are rational given their own stated probabilities. And they largely are. The problem is not decision-making - it is that the probabilities themselves are too optimistic.
The third experiment is the most relevant for agentic systems. Using SWE-Bench Verified, the authors evaluate real multi-step tasks involving tools. Models are given budgets of up to 70 steps. After each step, the model is asked to estimate the probability that it will ultimately complete the task successfully.
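A rough sketch of how such per-step probing can be wired into an agent loop; `agent_step`, `ask_confidence`, and the `task` interface are hypothetical placeholders for the tool-use harness and the confidence prompt, not the authors' implementation:

```python
def run_with_confidence_probes(task, agent_step, ask_confidence, max_steps: int = 70):
    """Run a tool-using agent on a SWE-Bench-style task, asking after every step
    how likely it thinks it is to ultimately resolve the issue (0.0-1.0)."""
    state = task.initial_state()
    trajectory = []  # (step, stated_probability) pairs

    for step in range(1, max_steps + 1):
        state, done = agent_step(state)   # one tool call, file edit, or test run
        trajectory.append((step, ask_confidence(state)))
        if done:
            break

    resolved = task.check_resolved(state)  # does the final patch pass the hidden tests?
    return trajectory, resolved
```

Plotting stated probability against step index for each trajectory is enough to see whether confidence tracks actual progress or simply drifts upward.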
For most models, overconfidence does not decrease, and for some it actually increases as the task unfolds. Claude Sonnet shows this particularly clearly: confidence rises during execution even when final success does not become more likely. Among all tested models, only GPT-4o shows a noticeable reduction in overconfidence over time.
Notably, so-called reasoning models do not show an advantage in self-assessment. The ability to reason for longer does not translate into the ability to accurately judge one’s chances of success.
The overall conclusion of the paper is blunt: LLMs are already fairly good at solving tasks, but still poor at understanding the limits of their own capabilities. They can act, but they cannot reliably tell when they are likely to fail.
chatgpt7 - we taught it to gaslight itself



