This is the official code implementation for the AAAI 2025 paper "Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization".
- Image Data: The image dataset can be downloaded by filling out the form at this link.
- Text Data: The text data has already been uploaded. You can access it from the repository.
We would like to thank the creators of the following datasets for their contributions:
- Flickr30k: A large-scale image dataset with natural language descriptions.
- SNLI (Stanford Natural Language Inference): A dataset for developing and evaluating models for natural language inference.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/Defeasible_Visual_Entailment.git
  cd Defeasible_Visual_Entailment
  ```

- Install the necessary dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The dataset consists of images paired with text hypotheses. Each pair is annotated for visual entailment (whether the image entails, contradicts, or is neutral to the text) and is augmented with textual updates that either strengthen or weaken that relationship.
| Split | Number of Samples | Weakener Count | Strengthener Count | Unique Images |
|---|---|---|---|---|
| Training | 93,082 | 46,541 | 46,541 | 9,507 |
| Validation | 1,888 | 944 | 944 | 195 |
| Test | 1,972 | 986 | 986 | 203 |
Each sample contains:
- An image premise
- A text hypothesis
- A textual update that either strengthens or weakens the hypothesis
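As a rough illustration, a single sample can be represented as follows. The field names and example values here are assumptions for illustration only; the actual column names in the CSV files may differ.

```python
from dataclasses import dataclass

# Hypothetical representation of one DVE sample; the real CSV schema
# in DVE_train.csv may use different column names.
@dataclass
class DVESample:
    image_path: str   # premise: path to a Flickr30k image (example path, not a real file)
    hypothesis: str   # text hypothesis about the image
    update: str       # textual update that modifies belief in the hypothesis
    update_type: str  # "strengthener" or "weakener"

sample = DVESample(
    image_path="flickr30k_images/example.jpg",
    hypothesis="A man is playing guitar.",
    update="He is strumming the strings on stage.",
    update_type="strengthener",
)
print(sample.update_type)
```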
To train the model, run the following command:
```bash
python visual_text_training.py \
    --train_csv_file ../Data/DVE_train.csv \
    --val_csv_file ../Data/DVE_dev.csv \
    --test_csv_file ../Data/DVE_test.csv \
    --image_dir ../Data/flickr30k_images \
    --epochs 20 \
    --lr 5e-6 \
    --batch_size 32 \
    --wandb_project "Defeasible_Visual_Entailment" \
    --output_model "DVE_model.pth" \
    --gpu 0 \
    --classification_weight 0.9 \
    --use_classification_head
```

The model is designed to handle visual and textual inputs, combining them to predict the relationship (entailment, contradiction, or neutral) between the image and the caption.
The model integrates a reasoning evaluator, which assesses the strength of updates and their impact on visual entailment tasks. This allows for reward-driven optimization, improving model performance over time.
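A minimal sketch of the reward idea: the reward for an update is the change in entailment strength it induces. Everything below is illustrative, not the repository's API; `entailment_score` is a toy stand-in for the learned evaluator, which in reality scores an image-hypothesis-update triple with the trained model.

```python
from typing import Optional

def entailment_score(hypothesis: str, update: Optional[str] = None) -> float:
    """Toy stand-in for the learned evaluator: returns an
    entailment-strength score in [0, 1]. A real evaluator would
    run the visual-text model on the image and the text."""
    base = 0.5  # score for the hypothesis alone (illustrative constant)
    if update is None:
        return base
    # Toy heuristic purely for demonstration purposes.
    return base + 0.3 if "strumming" in update else base - 0.3

def update_reward(hypothesis: str, update: str) -> float:
    """Reward = change in entailment strength caused by the update:
    positive for strengtheners, negative for weakeners."""
    return entailment_score(hypothesis, update) - entailment_score(hypothesis)

print(update_reward("A man is playing guitar.",
                    "He is strumming the strings on stage."))
```

In reward-driven optimization, such a signal can steer generation or fine-tuning toward updates whose measured effect matches their intended direction.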
The pre-trained evaluator model can be downloaded from Hugging Face:
You can also download it via wget or curl:

```bash
wget "https://huggingface.co/skywalkerzhang19/DVE_evaluator/resolve/main/evaluator_weights.pth?download=true" -O evaluator_weights.pth
```

or

```bash
curl -L "https://huggingface.co/skywalkerzhang19/DVE_evaluator/resolve/main/evaluator_weights.pth?download=true" -o evaluator_weights.pth
```

To evaluate the model, use the following command:
```bash
python visual_text_inference.py \
    --test_csv_file ../Data/DVE_test.csv \
    --image_dir ../Data/flickr30k_images \
    --model_path evaluator_weights.pth \
    --output_file "test_results.csv" \
    --gpu 0 \
    --test_batch_size 64
```

For running inference on specific data using `inference_demo.py`, execute the following command:
```bash
python inference_demo.py \
    --image_path path/to/your/image.jpg \
    --text "Your hypothesis text" \
    --update "Your update text" \
    --model_path Evaluator/evaluator_weights.pth \
    --output_file "inference_results.txt" \
    --gpu 0
```