Vision-Language Programs (VLP) is a framework for inducing executable programs that explain Bongard-like visual reasoning tasks. The system combines vision-language models (VLMs) with a symbolic DSL and searches for structured programs that satisfy a dataset's positive/negative image constraints. This repository contains the full training, inference, and evaluation stack used in our experiments.
| Path | Description |
|---|---|
| `main.py` | End-to-end pipeline that orchestrates symbol grounding, DSL construction, and program search. |
| `method/` | DSL definitions, program search algorithms, grammar utilities, and type system. |
| `models/` | Prompter implementations for each supported VLM, including caching logic in `models/*/memory/`. |
| `prompts/` | Prompt templates for variable discovery, baselines, judgment, etc. |
| `utils/` | Argument parsing, dataset helpers, prompter factory, and GPU reservation utilities. |
| `scripts/` | Convenience launchers for running sweeps across datasets, models, and seeds. |
All experiment outputs are written to results/<dataset>/... and include discovered programs, cached image representations, and token usage summaries.
- Clone and create an environment

  ```bash
  git clone <repo-url>
  cd vision-language-programs
  python -m venv .venv && source .venv/bin/activate
  ```
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

  Qwen3 VL models depend on the bleeding-edge Transformers package:

  ```bash
  pip install --upgrade pip
  pip install git+https://github.com/huggingface/transformers
  ```
- Configure model credentials
  - Hugging Face models (InternVL, Ovis, Molmo, Qwen) require cached checkpoints or access tokens.
  - The GPT backend expects an API key.
  - `models/*/memory` will cache question/answer pairs per dataset, so make sure the folders are writable.
- Hardware
  - Experiments assume at least one CUDA GPU.
  - `utils/util.py::reserve_gpus()` pins a tensor on every detected device so LLM loading does not preempt another job (see the sketch below).
  - Memory requirements grow with the chosen model (InternVL3-14B and Qwen3 30B need model-parallel setups; see the device mapping in `models/internvl/main.py`).
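The reservation trick looks roughly like the following (a minimal sketch assuming PyTorch; the actual logic lives in `utils/util.py`):

```python
import torch

def reserve_gpus():
    """Sketch: pin a small tensor on every detected CUDA device before the
    model weights are loaded, so the devices are already claimed."""
    holders = []
    for idx in range(torch.cuda.device_count()):
        # Allocating a tensor initializes the device's CUDA context and
        # reserves memory there; keeping the reference holds the reservation.
        holders.append(torch.ones(1, device=f"cuda:{idx}"))
    return holders
```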
Datasets should be placed under data/<dataset-name>/... following the paths assumed in utils/dataset_utils.py. The loader handles train/test splits and enforces that test images are disjoint from train.
To obtain the datasets you can use the following commands.
Bongard-HOI:

```bash
wget https://zenodo.org/record/7079175/files/bongard_hoi_images.tar?download=1 -O bongard_hoi_images.tar
tar -xvf bongard_hoi_images.tar -C data/
```

Bongard-RWR:

```bash
mkdir -p data
cd data
git clone https://github.com/pavonism/Bongard-RWR.git
cd ..
```

Bongard-OpenWorld:
- Images are expected under `data/bongard-op/`.
- Metadata is loaded via `datasets.load_dataset("rujiewu/Bongard-OpenWorld")`; make sure you have a local Hugging Face cache (see the snippet below).
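To warm the cache, the metadata can be fetched once up front (a minimal sketch using the `datasets` library):

```python
from datasets import load_dataset

# Downloads the Bongard-OpenWorld metadata into the local Hugging Face cache;
# the images themselves are read from data/bongard-op/.
bongard_ow = load_dataset("rujiewu/Bongard-OpenWorld")
print(bongard_ow)  # inspect the available splits and fields
```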
CLEVR-Hans3:

```bash
wget https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/2611/CLEVR-Hans3.zip
unzip CLEVR-Hans3.zip -d data/
```

COCOLogic:

```bash
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip train2017.zip -d data/cocologic/coco
unzip val2017.zip -d data/cocologic/coco
unzip annotations_trainval2017.zip -d data/cocologic/coco
python data/cocologic/cocologic.py
```

If you add a new dataset, register it in `utils/args.py`, implement a loader in `utils/dataset_utils.py`, and optionally define a custom DSL in `method/DSL/`.
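A new loader might look roughly like this (a hypothetical sketch; the function name, directory layout, and return format are illustrative and should be adapted to the existing loaders in `utils/dataset_utils.py`):

```python
# Illustrative sketch of a loader in utils/dataset_utils.py -- not the repo's actual API.
from pathlib import Path

def load_my_dataset(root: str = "data/my-dataset"):
    """Hypothetical loader returning one dict per task with positive/negative image paths."""
    tasks = []
    for task_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        tasks.append({
            "name": task_dir.name,
            "positive": sorted(str(p) for p in (task_dir / "positive").glob("*.jpg")),
            "negative": sorted(str(p) for p in (task_dir / "negative").glob("*.jpg")),
        })
    return tasks
```

Remember to also add the dataset name to the choices in `utils/args.py` so it can be selected via `--dataset`.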
The main entry point is main.py, which loops over every task in a dataset, discovers variables, builds a DSL, and searches for programs:
```bash
python main.py \
  --dataset bongard-op \
  --model InternVL3-8B \
  --search_timeout 10 \
  --n_objects 10 \
  --n_properties 10 \
  --n_actions 3 \
  --max_program_depth 4 \
  --max_imgs 6 \
  --variable_distribution naive_weighted \
  --seed 0
```

Important flags (see `utils/args.py` for the full list):
- `--dataset`: one of the loaders implemented in `utils/dataset_utils.py`.
- `--model`: any backend supported by `utils/prompters.get_prompter()`.
- `--max_program_depth` / `--search_timeout`: trade off search completeness and runtime.
- `--variable_distribution`: choose `uniform`, `naive_frequency`, `naive_weighted`, or `positive_ratio` PCFG weighting.
- `--no_sampling`, `--xil_remove_confounders`, `--xil_add_functions`, `--xil_add_properties`: toggles for ablations.
Results are written to `results/<dataset>/.../discovered_programs_<args>.json` along with cached image representations and a TXT file containing total VLM token usage.
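Because the filename encodes the arguments, globbing is the easiest way to collect everything for a dataset (a minimal sketch; the JSON structure depends on the run):

```python
import glob
import json

# Collect every discovered-programs file written for a dataset.
for path in glob.glob("results/bongard-op/**/discovered_programs_*.json", recursive=True):
    with open(path) as f:
        programs = json.load(f)
    print(path, "->", len(programs), "entries")
```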
Use `run_single_task.py` to focus on one Bongard puzzle, inspect the discovered variables, and return the best program/DSL pair:
```bash
python run_single_task.py --dataset bongard-op --model Qwen2.5-VL-7B-Instruct --seed 4
```

`baseline.py` and `baseline_structure.py` implement direct prompting approaches that predict rules without program synthesis. Each baseline uses `prompts/baseline_prompt.txt` (or alternatives) and logs qualitative predictions to `results/qualitative`.
xil_experiment.py mirrors main.py but augments the DSL with extra functions/properties or removes confounders. Use the flags above or run the script directly for fine-grained control.
scripts/*.sh encode common experiment grids (datasets × models × seeds). They assume a UNIX shell with CUDA visibility (e.g., CUDA_VISIBLE_DEVICES=<id> bash scripts/run_experiment.sh). Adapt these scripts to your cluster scheduler if needed.
- `eval.py` contains helpers to aggregate accuracies from the JSON result files (see `n_tasks_per_dataset` and `params_per_dataset` for recommended settings). You can import `eval.eval(path, n_tasks)` in a notebook or a small driver script.
- `eval_variable_discovery.py` and `eval_qualitative.py` benchmark the discovery prompts and produce qualitative grids of predictions.
- `plots/`, `motivating_example.ipynb`, and `qualitative_examples.ipynb` show example visualizations of discovered programs and rules.
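A small driver might look like this (a minimal sketch; the result path and task count are placeholders, and the assumption that `eval.eval` returns an aggregate accuracy should be checked against `eval.py`):

```python
import eval  # eval.py at the repository root

# Placeholder path and task count -- substitute the values for your run
# (see n_tasks_per_dataset / params_per_dataset in eval.py).
path = "results/bongard-op/discovered_programs_example.json"
accuracy = eval.eval(path, n_tasks=100)  # assumed to return an aggregated accuracy
print(f"Accuracy: {accuracy}")
```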
- Prompt editing: Modify files in `prompts/` to adjust how variables are queried, how baselines describe rules, or how judges verify predictions.
- DSL extensions: Add new primitives under `method/DSL/` and update the semantics/types so they appear in the CFG (see `method/DSL/dsl_with_img_repr.py` for reference).
- Additional VLMs: Implement a new prompter in `models/<name>/main.py` that exposes `prompt_with_images`, register it in `utils/prompters.get_prompter`, and add any required dependencies to `requirements.txt` (see the sketch below).
- Caching: Every prompter writes JSON files under `models/<model>/memory/<dataset>/` to avoid repeated API calls.
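As a rough sketch, a new prompter could take the following shape (the constructor arguments, method signature, and cache format here are assumptions; mirror one of the existing prompters in `models/` for the real interface):

```python
# models/myvlm/main.py -- illustrative sketch, not an actual backend in the repo.
import json
import os

class MyVLMPrompter:
    """Hypothetical prompter exposing prompt_with_images with per-dataset caching."""

    def __init__(self, dataset: str):
        self.cache_path = os.path.join("models", "myvlm", "memory", dataset, "cache.json")
        os.makedirs(os.path.dirname(self.cache_path), exist_ok=True)
        self.cache = json.load(open(self.cache_path)) if os.path.exists(self.cache_path) else {}

    def prompt_with_images(self, prompt: str, image_paths: list[str]) -> str:
        key = prompt + "|" + "|".join(image_paths)
        if key not in self.cache:
            # Call the underlying VLM and store the question/answer pair in the cache.
            self.cache[key] = self._call_backend(prompt, image_paths)
            with open(self.cache_path, "w") as f:
                json.dump(self.cache, f)
        return self.cache[key]

    def _call_backend(self, prompt: str, image_paths: list[str]) -> str:
        raise NotImplementedError("Wire up your VLM backend here.")
```

Once the class exists, add a branch for it in `utils/prompters.get_prompter` so it can be selected via `--model`.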