(NeurIPS24) NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Links:

🚩 News

Usage

  • VQA Task

    There are two ways to use and evaluate the NaturalBench benchmark:

    1. Evaluation based on the example code:

    Learn how to use and evaluate NaturalBench by reviewing the simple example in naturalbench_vqa.py; a rough scoring sketch is also provided at the end of this section.

    2. Evaluation with lmms-eval and VLMEvalKit:

    Please refer to the official documentation of lmms-eval and VLMEvalKit for more details.

    • lmms-eval:

      python3 -m accelerate.commands.launch \
          --num_processes=1 \
          -m lmms_eval \
          --model llava_onevision \
          --model_args pretrained="lmms-lab/llava-onevision-qwen2-7b-ov" \
          --tasks naturalbench \
          --batch_size 1 \
          --log_samples \
          --log_samples_suffix llava_onevision_naturalbench \
          --output_path ./logs/
    • VLMEvalKit:

      python run.py --data NaturalBenchDataset --model llava-onevision-qwen2-7b-ov-hf --verbose
  • Retrieval Task

    To run the retrieval task, first install the t2v_metrics package, then run the evaluation code:

    python naturalbench_retrieval.py
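
A minimal usage sketch of the t2v_metrics scorer that the retrieval evaluation builds on is shown below. This is an illustration only, not the contents of naturalbench_retrieval.py: the VQAScore model name ("clip-flant5-xxl"), the placeholder image paths and captions, and the assumption that the scorer returns an (images x texts) score matrix are assumptions on our part; check the t2v_metrics documentation for the exact interface.

    # Hypothetical sketch, NOT the official naturalbench_retrieval.py.
    # Assumes the t2v_metrics package exposes a VQAScore class that returns a
    # (num_images x num_texts) score matrix; verify against the package docs.
    import t2v_metrics

    score_model = t2v_metrics.VQAScore(model="clip-flant5-xxl")  # assumed model id

    images = ["example_image_0.jpg", "example_image_1.jpg"]              # placeholder paths
    texts = ["caption describing image 0", "caption describing image 1"]  # placeholder captions

    scores = score_model(images=images, texts=texts)  # assumed shape: (2, 2)

    # A retrieval-style check: each image should score highest with its own caption.
    correct = bool(scores[0, 0] > scores[0, 1]) and bool(scores[1, 1] > scores[1, 0])
    print(scores)
    print("retrieval correct:", correct)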
    
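Returning to the VQA task above: the official scoring lives in naturalbench_vqa.py, but as a rough, hedged sketch of the idea, the snippet below aggregates per-sample correctness into question-, image-, and group-level accuracies. The 2-images-by-2-questions grouping and the metric names (Acc, Q-Acc, I-Acc, G-Acc) follow our reading of the NaturalBench paper; the field names and exact logic in naturalbench_vqa.py may differ.

    # Hypothetical illustration of NaturalBench-style scoring, NOT the official
    # naturalbench_vqa.py. Each sample pairs 2 images with 2 questions, giving
    # 4 (image, question) cases; grid[i][q] marks whether the model answered
    # question q correctly on image i.
    from typing import Dict, List

    def score_samples(correct_grids: List[List[List[bool]]]) -> Dict[str, float]:
        n = len(correct_grids)
        acc = q_acc = i_acc = g_acc = 0
        for grid in correct_grids:
            # Acc: plain accuracy over all 4 (image, question) cases.
            acc += sum(grid[0]) + sum(grid[1])
            # Q-Acc: a question counts only if it is correct on BOTH images.
            q_acc += sum(grid[0][q] and grid[1][q] for q in range(2))
            # I-Acc: an image counts only if BOTH of its questions are correct.
            i_acc += sum(all(grid[i]) for i in range(2))
            # G-Acc: the sample counts only if all 4 cases are correct.
            g_acc += int(all(grid[0]) and all(grid[1]))
        return {"Acc": acc / (4 * n), "Q-Acc": q_acc / (2 * n),
                "I-Acc": i_acc / (2 * n), "G-Acc": g_acc / n}

    if __name__ == "__main__":
        # Two toy samples: one fully correct, one correct on 3 of 4 cases.
        print(score_samples([
            [[True, True], [True, True]],
            [[True, True], [True, False]],
        ]))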

Citation Information

@inproceedings{naturalbench,
  title={NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples},
  author={Li, Baiqi and Lin, Zhiqiu and Peng, Wenxuan and Nyandwi, Jean de Dieu and Jiang, Daniel and Ma, Zixian and Khanuja, Simran and Krishna, Ranjay and Neubig, Graham and Ramanan, Deva},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=Dx88A9Zgnv}
}