🐒 simages 🐒

Find similar images within a dataset.

Useful for removing duplicate images from a dataset after scraping images with google-images-download.

The Python API returns pairs, distances, where pairs are the (ordered) closest pairs and distances are the corresponding embedding distances.

Install

See the installation docs for all details.

pip install simages

or install from source:

git clone https://github.com/justinshenk/simages
cd simages
pip install .

To install the interactive interface, install mongodb and use pip install "simages[all]" instead.

Demo

Minimal command-line interface with simages-show:

Interactive image deletion with simages add/find:

Usage

Two interfaces exist:

minimal interface which plots the duplicates for visual inspection
mongodb + flask interface which allows interactive deletion [optional]

Minimal Interface

In your console, enter the directory with images and use simages-show:

$ simages-show --data-dir .

usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM] [-s]

  -h, --help            show this help message and exit
  --data-dir DATA_DIR, -d DATA_DIR
                        Folder containing image data
  --show-train, -t      Show training of embedding extractor every epoch
  --epochs EPOCHS, -e EPOCHS
                        Number of passes of dataset through model for
                        training. More is better but takes more time.
  --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                        Number of channels for data (1 for grayscale, 3 for
                        color)
  --pairs PAIRS, -p PAIRS
                        Number of pairs of images to show
  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                        takes more time)
  -s, --show            Show closest pairs

Web Interface [Optional]

Note: To install the web interface API, install and run mongodb and use pip install "simages[all]" to install optional dependencies.

Add your pictures to the database (this will take some time depending on the number of pictures):

simages add <images_folder_path>

A webpage will come up with all of the similar or duplicate pictures:

simages find <images_folder_path>

Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]

Python APIs

Numpy array

from simages import find_duplicates
import numpy as np

array_data = np.random.random((100, 3, 48, 48))  # N x C x H x W
pairs, distances = find_duplicates(array_data)

Folder

from simages import find_duplicates

data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)

Default options for find_duplicates are:

def find_duplicates(
    input: Union[str, np.ndarray],
    n: int = 5,
    num_epochs: int = 2,
    num_channels: int = 3,
    show: bool = False,
    show_train: bool = False,
    **kwargs
):
    """Find duplicates in dataset. `input` is either a folder path or an array.

    Args:
        input (str or np.ndarray): folder directory or N x C x H x W array
        n (int): number of closest pairs to identify
        num_epochs (int): how long to train the autoencoder (more is generally better)
        num_channels (int): number of channels in the data (1 for grayscale, 3 for color)
        show (bool): display the closest pairs
        show_train (bool): show output every epoch
        z_dim (int): size of compression (more is generally better, but slower)
        kwargs (dict): etc, passed to `EmbeddingExtractor`

    Returns:
        pairs (np.ndarray): indices for closest pairs of images, n x 2 array
        distances (np.ndarray): distances of each pair to each other
    """
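
Since find_duplicates returns the n closest pairs regardless of how close they are, some of them may not be true duplicates. The following is a minimal sketch of one way to keep only the pairs below a distance cutoff, assuming pairs and distances have the shapes documented above. The helper filter_pairs, the cutoff value, and the stand-in arrays are hypothetical and not part of simages.

```python
import numpy as np


def filter_pairs(pairs, distances, max_distance):
    """Keep only the pairs whose embedding distance is at most max_distance.

    Hypothetical helper for illustration; not part of simages.
    """
    pairs = np.asarray(pairs)
    distances = np.asarray(distances)
    mask = distances <= max_distance  # boolean mask, one entry per pair
    return pairs[mask], distances[mask]


# Stand-in values shaped like the documented return values (n x 2 and n)
pairs = np.array([[0, 7], [3, 4], [1, 9]])
distances = np.array([0.001, 0.002, 0.250])

close_pairs, close_distances = filter_pairs(pairs, distances, max_distance=0.01)
print(close_pairs)
```

A suitable cutoff depends on the dataset and on the training settings, so it is worth inspecting a few pairs visually (for example with show=True) before settling on a value.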

`Embeddings` API

from simages import Embeddings
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)

# Access the array
array = embeddings.array # N x z (compression size)

# Get 5 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)

In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])

In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
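
Once the pairs are known, a typical next step is to drop one image from each pair. The sketch below shows one possible approach using the pairs printed above. The helper indices_to_drop and the zero-filled stand-in array are hypothetical, and the sketch assumes that the pair indices refer to positions along the first axis of the input array.

```python
import numpy as np


def indices_to_drop(pairs):
    """Return sorted indices to remove so that one image of each pair is kept.

    Hypothetical helper for illustration; not part of simages.
    """
    drop = set()
    for first, second in np.asarray(pairs).tolist():
        # If either image is already marked for removal, this pair is resolved.
        if first in drop or second in drop:
            continue
        drop.add(second)
    return sorted(drop)


# Pairs copied from the example output above
pairs = np.array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])
drop = indices_to_drop(pairs)

# Stand-in for the original N x H x W data (assumption: indices address axis 0)
data = np.zeros((1000, 28, 28))
kept = np.delete(data, drop, axis=0)
print(drop, kept.shape)
```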

`EmbeddingExtractor` API

from simages import EmbeddingExtractor
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1) # grayscale

# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)

Class attributes and parameters:

class EmbeddingExtractor:
    """Extract embeddings from data with models and allow visualization.

    Attributes:
        trainloader (torch loader)
        evalloader (torch loader)
        model (torch.nn.Module)
        embeddings (np.ndarray)

    """
    def __init__(
        self,
        input: Union[str, np.ndarray],
        num_channels=None,
        num_epochs=2,
        batch_size=32,
        show_train=True,
        show=False,
        z_dim=8,
        **kwargs,
    ):
        """Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, performs training and validation.

        Args:
            input (np.ndarray or str): data
            num_channels (int): grayscale = 1, color = 3
            num_epochs (int): more is better (generally)
            batch_size (int): number of images per batch
            show_train (bool): show intermediate training results
            show (bool): show closest pairs
            z_dim (int): compression size
            kwargs (dict)

        """

Specify the number of pairs to identify with the parameter n.

How it works

simages uses a convolutional autoencoder with PyTorch and compares the latent representations with closely 📐.
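
To illustrate the comparison step, here is a minimal brute-force sketch that finds the closest pairs among a set of embedding vectors with NumPy. It is a stand-in for illustration only and makes no claim about how closely or simages implement the search. The function closest_pairs, the choice of Euclidean distance, and the sample points are assumptions.

```python
import numpy as np


def closest_pairs(embeddings, n=5):
    """Return the n closest index pairs and their Euclidean distances.

    Brute-force O(N^2) illustration; not the closely implementation.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Pairwise differences via broadcasting: shape N x N x z
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count each unordered pair once by taking the strict upper triangle (i < j)
    rows, cols = np.triu_indices(len(embeddings), k=1)
    pair_dist = dist[rows, cols]
    order = np.argsort(pair_dist)[:n]  # smallest distances first
    pairs = np.stack([rows[order], cols[order]], axis=1)
    return pairs, pair_dist[order]


# Four 2-dimensional stand-in embeddings forming two tight clusters
points = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.0, 5.3]])
pairs, distances = closest_pairs(points, n=2)
print(pairs, distances)
```

The brute-force approach needs memory proportional to N x N x z, so it is only practical for small N.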

Dependencies

simages depends on the following packages:

closely
torch
torchvision
scikit-learn
matplotlib

The following dependencies are required for the interactive deleting interface:

pymongo
fastcluster
flask
jinja2
dnspython
python-magic
termcolor

Cite

If you use simages, please cite it:

    @misc{justin_shenk_2019_3237830,
      author       = {Justin Shenk},
      title        = {justinshenk/simages: v19.0.1},
      month        = jun,
      year         = 2019,
      doi          = {10.5281/zenodo.3237830},
      url          = {https://doi.org/10.5281/zenodo.3237830}
    }
