This repository was archived by the owner on Jan 14, 2026. It is now read-only.

holi-lab/ThinkBrake-v1

ThinkBrake: Mitigating Overthinking in Tool Reasoning

Figure (Overthinking Example): Example of overthinking in tool reasoning: the model reaches a correct solution but continues reasoning and produces an incorrect final output.

Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. ThinkBrake is a training-free decoding heuristic that addresses this issue by monitoring the log-probability margin between </think> and the current top token at sentence boundaries, triggering early termination when the margin becomes small.
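The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names, the dictionary-of-log-probabilities input, and the exact comparison (`margin <= threshold`) are assumptions made for the example.

```python
import math

END_THINK = "</think>"  # end-of-reasoning token named in the description above


def think_margin(logprobs: dict[str, float], end_token: str = END_THINK) -> float:
    """Return the log-probability gap between the top candidate and `end_token`.

    `logprobs` maps candidate next tokens to their log-probabilities
    (hypothetical input format, chosen for this sketch).
    """
    top = max(logprobs.values())
    # A missing end token is treated as having zero probability.
    end = logprobs.get(end_token, -math.inf)
    return top - end


def should_brake(logprobs: dict[str, float], threshold: float,
                 at_sentence_boundary: bool) -> bool:
    """Decide whether to stop reasoning at this decoding step.

    The check only runs at sentence boundaries; thinking is terminated
    when the margin is small (assumed here to mean `<= threshold`).
    """
    if not at_sentence_boundary:
        return False
    return think_margin(logprobs) <= threshold


# "</think>" is almost as likely as the top token, so the brake triggers.
close_call = should_brake({"</think>": -1.0, "Wait": -0.9}, 0.25, True)
# The model strongly prefers to keep reasoning, so it continues.
keep_going = should_brake({"</think>": -3.0, "Wait": -0.2}, 0.25, True)
```

When the brake triggers, a decoder would emit `</think>` in place of the top token and move on to the final answer. How the sentence boundary is detected is left out of this sketch.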

Installation

Quick Start

1. Clone the repository

git clone https://github.com/holi-lab/ThinkBrake.git
cd ThinkBrake

2. Install dependencies

Using uv (recommended):

bash install.sh

Or manually:

uv init . -p 3.10
uv venv -p 3.10

source .venv/bin/activate
pip install -e .
pip install bfcl-eval vllm flashinfer-python

Usage

Environment Setup

Set the output directory for experiment results:

export THINK_BRAKE_PROJECT_ROOT=/path/to/your/outputs

Running Experiments

1. Generate Predictions

Run the generation script with your desired model and test categories:

python scripts/generate.py \
  --model Qwen/Qwen3-4B-Thinking-2507 \
  --test-category non_live live \
  --temperature 0.7 \
  --gpu-memory-utilization 0.95

Available test categories:

  • non_live: Simple function calling tasks
  • live: Live function calling tasks
  • single_turn: All single-turn categories
  • Individual categories: simple_python, simple_java, simple_javascript, multiple, parallel, parallel_multiple, live_simple, live_multiple, live_parallel, live_parallel_multiple
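To see how the group names might relate to the individual categories, the short sketch below splits the individual names listed above by their `live_` prefix. Whether `scripts/generate.py` defines `non_live` and `live` exactly this way is an assumption; the helper `group_by_live_prefix` is hypothetical and not part of the repository.

```python
# Individual category names copied from the list above.
INDIVIDUAL_CATEGORIES = [
    "simple_python", "simple_java", "simple_javascript",
    "multiple", "parallel", "parallel_multiple",
    "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple",
]


def group_by_live_prefix(categories: list[str]) -> dict[str, list[str]]:
    """Split category names into 'live' and 'non_live' by the 'live_' prefix.

    Assumption: the group names follow this naming convention.
    """
    groups: dict[str, list[str]] = {"non_live": [], "live": []}
    for name in categories:
        key = "live" if name.startswith("live_") else "non_live"
        groups[key].append(name)
    return groups


groups = group_by_live_prefix(INDIVIDUAL_CATEGORIES)
```

Under this assumption, passing `--test-category non_live live` would cover all ten individual categories.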

2. Evaluate Results

Evaluate the generated predictions:

python scripts/evaluate.py \
  --model Qwen/Qwen3-4B-Thinking-2507 \
  --test-category non_live live \
  --threshold 0.25

Using Shell Scripts

For convenience, you can use the provided shell scripts:

# Generate predictions
bash run_generate.sh

# Evaluate results
bash run_evaluate.sh

Edit these scripts to customize other parameters.

Supported Models

Currently supported models:

Citation

@article{oh2025thinkbrake,
  title={ThinkBrake: Mitigating Overthinking in Tool Reasoning}, 
  author={Minjae Oh and Sangjun Song and Seungkyu Lee and Sungmin Jo and Yohan Jo},
  year={2025},
  eprint={2510.00546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.00546}, 
}
