The goal of this work is to model the human vision process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low.
We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation to difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition. The complete details for this study are in this paper.
- PyTorch 1.1 or later
- sklearn
- scikit-image
- tqdm
- tensorboard
- sacred
The folder raw_data contains minimal image examples and their human full interpretation data (referred to below as annotations), which were developed and used for the experiments in the paper above. The raw annotations for each minimal image category are stored in a MAT file, which contains both the minimal images and the human 'full interpretation' for them. More details are in this page.
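A quick way to see what a raw annotation file holds is to list its variables before loading it. The sketch below assumes the files are in a MAT format that SciPy can read (v7.2 or earlier; v7.3 files are HDF5 and need a different reader). The variable names in the demo (`images`, `labels`) are made up for illustration and are not the field names used in raw_data.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat, whosmat


def summarize_mat(path):
    """Return {variable_name: shape} for every variable stored in a MAT file."""
    return {name: shape for name, shape, _dtype in whosmat(path)}


def load_mat(path):
    """Load a MAT file, squeezing singleton dimensions and dropping header keys."""
    data = loadmat(path, squeeze_me=True, struct_as_record=False)
    # loadmat adds bookkeeping entries such as __header__ and __version__.
    return {key: value for key, value in data.items() if not key.startswith("__")}


if __name__ == "__main__":
    # Stand-in file with hypothetical variable names; point this at a file
    # from raw_data to inspect the real annotations.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.mat")
        savemat(demo_path, {"images": np.zeros((3, 30, 30)), "labels": np.arange(3)})
        print(summarize_mat(demo_path))
        print(sorted(load_mat(demo_path)))
```

With `struct_as_record=False`, MATLAB structs come back as objects whose fields are plain attributes, which tends to be easier to explore interactively.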
The human full interpretation labels include the following object categories and their internal parts:
represented by a set of contours, points, and bboxes, e.g.,
Download this tar file (~21M size) and extract it into your data/ folder. Update DATA_DIR in your
The folder contains both images and contour annotation maps. A summary of the data follows:
This data file also includes hard negative examples, which are visually similar to the minimal images. To add more negative examples, you can use the random crop procedure described in the next section.
In your data/ folder, create a new folder nonfour, with sub-folders nonfour/train and nonfour/test.
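The folder layout described above can also be created from Python. This is a minimal sketch; the helper name `make_negative_dirs` is hypothetical and not part of the repo, and it assumes the data folder is passed in explicitly.

```python
from pathlib import Path


def make_negative_dirs(data_dir):
    """Create data_dir/nonfour/train and data_dir/nonfour/test; return both paths."""
    root = Path(data_dir) / "nonfour"
    created = []
    for split in ("train", "test"):
        split_dir = root / split
        # parents=True also creates nonfour/; exist_ok=True makes reruns harmless.
        split_dir.mkdir(parents=True, exist_ok=True)
        created.append(split_dir)
    return created


if __name__ == "__main__":
    for path in make_negative_dirs("data"):
        print(path)
```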
Then run
python -i data/imgs/negatives/nonfour_samples/train -o nonfour/train --mode sliding
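For intuition about what a sliding-window mode does, the sketch below enumerates square crop boxes over an image on a regular grid. It is an illustration only, not the repo's script: the function name and the `win` and `stride` parameters are assumptions and do not correspond to the script's command-line flags.

```python
def sliding_window_boxes(img_w, img_h, win, stride):
    """Yield (left, top, right, bottom) boxes of size win x win inside the image."""
    if win <= 0 or stride <= 0:
        raise ValueError("win and stride must be positive")
    # Stop early enough that every window lies fully inside the image.
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield (left, top, left + win, top + win)


if __name__ == "__main__":
    boxes = list(sliding_window_boxes(img_w=100, img_h=60, win=30, stride=15))
    print(len(boxes), boxes[0], boxes[-1])
```

The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so each one can be passed to it directly to cut out a candidate negative patch.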
To use selective search mode (recommended over sliding window) clone this repo:
git clone
then use --mode selective rather than --mode sliding.
You can also run the script with more parameters, e.g.,
python -i data/imgs/negatives/nonfour_samples/train/ -o nonfour/train -ns 10 -lm 400 --mode selective
To use files from the VOC dataset, download the VOC dataset from,
then change paths in
and run e.g.,
python -i voc_horse -o nonfour -ns 1 -lm 100000000 --mode selective
- 'interp' -- UNet-based interpretation (without classification)
- 'dual' -- UNet-based interpretation + Bottom-Up classification
- 'dualtd' -- UNet-based interpretation + Top-Down classification
- 'dualmulti' -- Multiple streams (default=2) of UNet-based interpretation + classification
Training a classification model:
python with vanilla dataset=HorseHead
Training a segmentation model:
python with mis interp deeplab veryverylong
Training the interpretation-only model:
python with interp dataset=HorseHead
Training the dual model:
python with dual equal dataset=HorseHead
Training the dual top-down model:
python with dualtd equal dataset=HorseHead loss_ratio=[1.0,1.0] subset=10000 epochs=1000
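The `loss_ratio=[1.0,1.0]` option suggests that the interpretation and classification losses are combined with one weight each. The sketch below shows that kind of weighted sum under this assumption; the function name is hypothetical, and which weight belongs to which loss is not stated in the text above, so check the training code before relying on the order.

```python
def combine_losses(losses, loss_ratio):
    """Return the weighted sum of per-task losses, one weight per loss."""
    if len(losses) != len(loss_ratio):
        raise ValueError("need exactly one weight per loss")
    # Works for plain floats and for tensor-like objects that support * and +.
    return sum(weight * loss for weight, loss in zip(loss_ratio, losses))


if __name__ == "__main__":
    # Two made-up loss values, first with equal weights, then with unequal ones.
    print(combine_losses([0.5, 2.0], [1.0, 1.0]))
    print(combine_losses([0.5, 0.02], [1.0, 100.0]))
```

Equal weights treat both tasks alike, while a large second weight lets a numerically small loss contribute as much as the first.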
Training the multi-stream model:
python with dualmulti equal dataset=HorseHead
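The commands above follow the Sacred command-line convention, where everything after `with` is either a named configuration (a bare word such as `dual`) or a `key=value` update. The sketch below mimics that split to show how a command line is read. It is an illustration, not Sacred's own parser, and it ignores the case where a bare token is a path to a config file.

```python
import ast


def split_with_args(tokens):
    """Split the tokens after `with` into named configs and key=value updates."""
    named, updates = [], {}
    for token in tokens:
        if "=" in token:
            key, raw = token.split("=", 1)
            try:
                # Values that look like Python literals are parsed as such.
                updates[key] = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                # Anything else (e.g. HorseHead) is kept as a plain string.
                updates[key] = raw
        else:
            named.append(token)
    return named, updates


if __name__ == "__main__":
    line = "dualtd equal dataset=HorseHead loss_ratio=[1.0,1.0] subset=10000 epochs=1000"
    named, updates = split_with_args(line.split())
    print(named)
    print(updates)
```

Because values are read as literals, a list such as `[1.0,1.0]` must be written without spaces or be quoted so that the shell passes it as a single token.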
Experiments include 14,061,063 negative examples used for training.
python with weights=storage_unix/logs/HorseHead/RecUNetMirc/\[1.0\,\ 100.0\]/6/weights_HorseHead_RecUNetMirc_best.pth dataset=HorseHead
or with plots:
python with weights=storage_unix/logs/HorseHead/RecUNetMirc/\[1.0\,\ 100.0\]/6/weights_HorseHead_RecUNetMirc_best.pth dataset=HorseHead plot
If you use the code or data in this repo please cite the following paper:
title={Full interpretation of minimal images},
author={Ben-Yosef, Guy and Assif, Liav and Ullman, Shimon},
Other relevant papers include:
- G. Ben-Yosef, L. Assif, D. Harari, and S. Ullman, A model for full local image interpretation. Proceedings of the 37th Annual Meeting of the Cognitive Science Society, 2015.
- S. Ullman, L. Assif, E. Fetaya, D. Harari, Atoms of recognition in human and computer vision. Proceedings of the National Academy of Sciences, 113, 2016.
- G. Ben-Yosef, L. Assif, and S. Ullman, Structured learning and detailed interpretation of minimal object images. Workshop on Mutual Benefits of Cognitive and Computer Vision, the International Conference on Computer Vision, 2017.
- G. Ben-Yosef, A. Yachin, and S. Ullman, A model for interpreting social interactions in local image regions. AAAI spring symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence, 2017.
- G. Ben-Yosef and S. Ullman, Image interpretation above and below the object level. Journal of The Royal Society Interface Focus, 8(4), 20180020, 2018.
- S. Srivastava, G. Ben-Yosef*, X. Boix*, Minimal images in deep neural networks: Fragile Object Recognition in Natural Images. International Conference on Learning Representations, 2019. (* equal contribution)
- Y. Holzinger, S. Ullman, D. Harari, M. Behrmann, G. Avidan, Minimal Recognizable Configurations Elicit Category-selective Responses in Higher Order Visual Cortex. Journal of Cognitive Neuroscience, 2019, 31(9): 1354-1367.
- H. Benoni, D. Harari and S. Ullman, What takes the brain so long: Object recognition at the level of minimal images develops for up to seconds of presentation time. arXiv:2006.05249, 2020, q-bio.NC.
Related work on interpretation and action recognition in minimal video configurations is in this github repo.