# HAERAE-Vision

Evaluation code for the HAERAE-Vision benchmark, a Korean visual QA dataset of real-world, under-specified questions.

- Paper: What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
- Dataset: HAERAE-HUB/HAERAE-VISION
- Leaderboard: https://board.haerae.world/
## Setup

Requirements: Python >= 3.9

```bash
pip install -r requirements.txt
cp env.example .env  # Add your API keys (see env.example for details)
```

Note: `OPENAI_API_KEY` is required for the judge model.
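The exact variable names are defined in `env.example`; a minimal `.env` might look like the sketch below. Only `OPENAI_API_KEY` is confirmed by this README — the other key names are common litellm conventions and are shown as assumptions:

```shell
# .env — OPENAI_API_KEY is required for the judge model (Stage 2)
OPENAI_API_KEY=sk-...

# Optional, only for evaluating the corresponding cloud models in Stage 1.
# These names are assumptions; check env.example for the authoritative list.
# ANTHROPIC_API_KEY=...
# GEMINI_API_KEY=...
# OPENROUTER_API_KEY=...
```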
The dataset includes two question types:

- `original`: Under-specified, authentic user queries
- `explicit`: Clarified queries with full context

Both share the same images and reference answers, allowing controlled evaluation of query under-specification.
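Because the two variants differ only in the question text, switching conditions is just a column selection. A minimal sketch (the field names `question_original` and `question_explicit` come from the output-format description; `pick_question` itself is a hypothetical helper, not part of this repo):

```python
def pick_question(row: dict, question_type: str) -> str:
    """Select which query variant to send to the model; all other fields stay identical."""
    if question_type not in ("original", "explicit"):
        raise ValueError(f"unknown question type: {question_type}")
    return row[f"question_{question_type}"]

# Illustrative row (field values invented for the example)
row = {
    "question_original": "What is this?",
    "question_explicit": "What dish is shown in this photo, and what are its main ingredients?",
}
print(pick_question(row, "original"))  # → What is this?
```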
## Evaluation Pipeline

The evaluation process consists of two stages:

1. Inference (`main.py`): Generate model responses
2. Scoring (`score.py`): Evaluate responses with a judge model (GPT-5-mini)

Note: Stage 2 always requires `OPENAI_API_KEY` for the judge model. Stage 1 requires API keys only for cloud models (GPT, Claude, Gemini, etc.); local vLLM models do not need API keys.
### Stage 1: Inference

```bash
# Evaluate with ORIGINAL questions (under-specified)
python main.py \
    --engine litellm \
    --model gpt-4o \
    --question-type original \
    --output results/gpt4o_original.csv

# Evaluate with EXPLICIT questions (clarified)
python main.py \
    --engine litellm \
    --model gpt-4o \
    --question-type explicit \
    --output results/gpt4o_explicit.csv

# Using OpenRouter (access multiple models with one API key)
python main.py \
    --engine litellm \
    --model openrouter/anthropic/claude-3.5-sonnet \
    --question-type original \
    --output results/claude_original.csv
```
```bash
# Using vLLM for local models (requires GPU)
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --engine vllm \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --question-type original \
    --output results/qwen_original.csv
```

Arguments:

- `--engine`: `vllm` or `litellm`
- `--model`: Model name (examples below, but any litellm/vLLM-compatible model is supported)
  - GPT models: `gpt-4o`, `gpt-4o-mini`, `gpt-5-mini`
  - Gemini: `gemini/gemini-2.5-pro`, `gemini/gemini-2.5-flash`
  - Claude: `claude-3.5-sonnet`, `claude-3-opus`
  - Via OpenRouter: `openrouter/{provider}/{model}`
  - Local vLLM: `Qwen/Qwen2.5-VL-3B-Instruct`, `OpenGVLab/InternVL3-8B`, etc.
- `--question-type`: `original` or `explicit` (default: `original`)
- `--output`: Output CSV path (the `results/` directory is created automatically)
- `--max_tokens`: Max generation tokens (default: 512)
- `--temperature`: Sampling temperature (default: 0.2)
### Stage 2: Scoring

After generating responses, evaluate them using a judge model (default: GPT-5-mini).

Required: `OPENAI_API_KEY` must be set for the judge model.
```bash
python score.py \
    --input results/gpt4o_original.csv \
    --output results/gpt4o_original_scored.csv \
    --model gpt-5-mini
```

Arguments:

- `--input`: CSV from Stage 1
- `--output`: Output CSV with scores
- `--model`: Judge model (default: `gpt-5-mini`)
- `--question-col`: Question column name (default: `question_used`)
- `--answer-col`: Response column name (default: `response`)
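Since `score.py` writes one score per row, aggregating to a benchmark-level number is straightforward. A sketch using only the standard library (the `score` column is documented in the output format below; `mean_score` is a hypothetical helper, not part of this repo):

```python
import csv


def mean_score(rows, score_col: str = "score") -> float:
    """Average the judge scores (0.0-1.0) over an iterable of CSV dict rows."""
    scores = [float(row[score_col]) for row in rows]
    return sum(scores) / len(scores)


# Usage on a scored CSV produced by score.py:
# with open("results/gpt4o_original_scored.csv", newline="", encoding="utf-8") as f:
#     print(mean_score(csv.DictReader(f)))
```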
## Output Format

After Stage 1 (`main.py`), the output CSV contains:

- All dataset fields (`question_original`, `question_explicit`, images, etc.)
- `response`: The model's generated answer
- `question_type`: Which question type was used
- `question_used`: The actual question text used

After Stage 2 (`score.py`):

- All Stage 1 fields, plus:
  - `judge_response`: The judge model's detailed evaluation
  - `score`: Final score (0.0 to 1.0)
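The comparison the benchmark is built around is the score gap between the `original` and `explicit` runs of the same model. A sketch for computing it from two scored CSVs, assuming rows can be aligned by an `id` column (a hypothetical key — substitute whatever per-example identifier your dataset fields carry):

```python
import csv


def scores_by_id(path: str, id_col: str = "id", score_col: str = "score") -> dict:
    """Map each example id to its judge score in a scored CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[id_col]: float(row[score_col]) for row in csv.DictReader(f)}


def under_specification_gap(original_csv: str, explicit_csv: str) -> float:
    """Mean (explicit - original) score over examples present in both runs."""
    orig = scores_by_id(original_csv)
    expl = scores_by_id(explicit_csv)
    shared = orig.keys() & expl.keys()
    return sum(expl[i] - orig[i] for i in shared) / len(shared)
```

A positive gap indicates the model benefits from clarified queries, i.e. under-specification is costing it accuracy.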
## Leaderboard Submission

To submit your model for official evaluation on the full test set:

1. Visit the HAERAE-VISION Leaderboard
2. Log in to your account
3. Click the Submit (제출하기) button
4. Follow the submission instructions
## Citation

```bibtex
@misc{choi2026usersleaveunsaid,
  title={What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models},
  author={Dasol Choi and Guijin Son and Hanwool Lee and Minhyuk Kim and Hyunwoo Ko and Teabin Lim and Ahn Eungyeol and Jungwhan Kim and Seunghyeok Hong and Youngsook Song},
  year={2026},
  eprint={2601.06165},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.06165},
}
```

## Contact

- Dataset: Dasol Choi ([email protected])
- Leaderboard: Guijin Son ([email protected])