This repository investigates how iterative inference steps, self-reflection, and feedback loops affect model reasoning performance.
Single-pass inference forces models to produce answers in one forward pass, which can lead to errors when complex reasoning is required. Multi-step inference allows models to refine their thinking by evaluating intermediate outputs and adjusting their approach. This is particularly relevant for tasks that benefit from self-correction, such as mathematical problem-solving where an initial attempt might contain calculation errors or logical missteps.
Two agents are compared:

- **Basic agent** — queries the model once and returns the answer without any retries or reflection. This serves as the single-pass inference baseline.
- **Reflexive agent** — attempts a solution, evaluates whether the output seems correct using simple heuristics (checking for uncertainty keywords, very short responses, or question repetition), and retries with the previous output as context if the answer looks uncertain. Limited to a maximum of three attempts.
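The retry loop described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the `model_fn` callable, the keyword list, and the exact thresholds are assumptions.

```python
# Sketch of a reflexive retry loop. model_fn, the keyword set, and the
# thresholds below are illustrative assumptions, not the repo's code.
UNCERTAINTY_KEYWORDS = {"not sure", "unclear", "cannot determine", "maybe"}
MAX_ATTEMPTS = 3

def seems_uncertain(question: str, answer: str) -> bool:
    """Heuristic confidence check: uncertainty keywords, very short
    responses, or the answer merely repeating the question."""
    lowered = answer.lower()
    if any(kw in lowered for kw in UNCERTAINTY_KEYWORDS):
        return True
    if len(answer.split()) < 3:
        return True
    if question.strip().lower() in lowered:
        return True
    return False

def reflexive_answer(model_fn, question: str) -> tuple[str, int]:
    """Query model_fn up to MAX_ATTEMPTS times, feeding the previous
    output back as context whenever the heuristics flag uncertainty.
    Returns the final answer and the number of inference steps used."""
    prompt = question
    answer = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        answer = model_fn(prompt)
        if not seems_uncertain(question, answer):
            return answer, attempt
        prompt = (f"{question}\n\nYour previous attempt was:\n{answer}\n"
                  "Re-check it and give a corrected answer.")
    return answer, MAX_ATTEMPTS
```

The basic agent is the degenerate case of this loop: call `model_fn` once and return whatever comes back.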
- Run the evaluation script to compare both agents on the dataset: `python benchmarks/evaluate.py`. This generates `benchmarks/results.json` with accuracy, inference step counts, and per-question results.
- Generate the visualization: `python plots/plot_results.py`. This creates `plots/reasoning_improvement.png` showing the accuracy comparison.
The reflexive agent typically uses more inference steps (1-3 per question) compared to the basic agent (always 1 step). Whether this translates to improved accuracy depends on the model's ability to self-correct and the effectiveness of the uncertainty detection heuristics. The trade-off between computational cost (more steps) and potential accuracy gains is a key consideration in evaluating these approaches.
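One way to weigh that trade-off is to aggregate the per-question results into accuracy and mean step count. The sketch below assumes each record carries `correct` and `steps` fields; the actual schema of `benchmarks/results.json` may differ.

```python
# Post-hoc summary of per-question results. The field names "correct"
# and "steps" are assumptions about the results schema, not confirmed.
def summarize(results: list[dict]) -> dict:
    """Return overall accuracy and mean inference steps per question,
    so accuracy gains can be compared against extra compute spent."""
    n = len(results)
    accuracy = sum(bool(r["correct"]) for r in results) / n
    mean_steps = sum(r["steps"] for r in results) / n
    return {"accuracy": accuracy, "mean_steps": mean_steps}
```

For the basic agent `mean_steps` is always 1.0, so any accuracy difference divided by the step overhead gives a rough cost-effectiveness comparison between the two agents.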