TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

May 16, 2025

Tunyu Zhang

Equal contribution

Haizhou Shi

Equal contribution

Yibin Wang

Hengyi Wang

Xiaoxiao He

Zhuowei Li

Haoxian Chen

Ligong Han

Kai Xu

Huan Zhang

Dimitris Metaxas

Hao Wang



Abstract

While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model’s reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
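The perturb-then-aggregate idea in the abstract can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the perturbation is applied to a single toy output projection `W`, the noise is Gaussian with a hypothetical scale `sigma` and rank `rank`, token uncertainty is split with the standard entropy-based decomposition (total = aleatoric + epistemic), and `sequence_uncertainty` aggregates by a plain mean.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) over the last axis."""
    return -(p * np.log(p + eps)).sum(axis=-1)


def perturbed_token_distributions(h, W, rank=2, sigma=0.1, n_samples=8, seed=0):
    """Sample next-token distributions under low-rank weight noise.

    h: hidden state, shape (d,).  W: toy output projection, shape (vocab, d).
    Each sample uses W + B @ A, where B is (vocab, rank) and A is (rank, d).
    The Gaussian noise and its scaling are illustrative assumptions.
    Returns an array of shape (n_samples, vocab).
    """
    rng = np.random.default_rng(seed)
    vocab, d = W.shape
    probs = []
    for _ in range(n_samples):
        B = rng.normal(scale=sigma, size=(vocab, rank))
        A = rng.normal(scale=1.0 / np.sqrt(d), size=(rank, d))
        probs.append(softmax((W + B @ A) @ h))
    return np.stack(probs)


def token_uncertainty(probs):
    """Entropy-based decomposition for one token position.

    total     = entropy of the averaged predictive distribution
    aleatoric = average entropy of the individual sampled distributions
    epistemic = total - aleatoric (mutual information, non-negative)
    """
    total = entropy(probs.mean(axis=0))
    aleatoric = entropy(probs).mean()
    return total, aleatoric, total - aleatoric


def sequence_uncertainty(per_token_values):
    """Aggregate token-level scores into one response-level score (mean; an assumption)."""
    return float(np.mean(per_token_values))
```

In this sketch, a response whose tokens show high epistemic uncertainty is one where the sampled weight perturbations disagree about the next token. With `sigma=0.0` all samples coincide and the epistemic term vanishes.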


Publication

The Fourteenth International Conference on Learning Representations (ICLR), 2026

Last updated on February 24, 2026

Bayesian Deep Learning, Trustworthy AI, Large Language Models

Authors

Yibin Wang (he/him)

Incoming Ph.D. student

I am an incoming Ph.D. student in the Computer Science Department at Rutgers University. I received my Bachelor’s degree from Huazhong University of Science and Technology in 2024. I have worked under the guidance of Prof. Kun He (HUST), Prof. Hao Wang (Rutgers), and Prof. Huan Zhang (UIUC).

From such a gentle thing, from such a fountain of all delight, my every pain is born.
—— Michelangelo
