We only need to calculate the per-image metrics once, when we first run the install script. The installer saves the results to intermediate files, which can then be loaded much faster than recalculating the metrics from scratch.
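As a rough illustration of that caching step, something like the sketch below would work. The `image_metrics.json` file name and the `compute_fn` callback are placeholders for illustration, not the installer's actual code:

```python
import json
from pathlib import Path

METRICS_FILE = Path("image_metrics.json")  # hypothetical intermediate file name

def load_or_compute_metrics(image_paths, compute_fn):
    """Load cached per-image metrics if present; otherwise compute and save them."""
    if METRICS_FILE.exists():
        return json.loads(METRICS_FILE.read_text())
    metrics = {str(path): compute_fn(path) for path in image_paths}
    METRICS_FILE.write_text(json.dumps(metrics))
    return metrics
```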
Below is the list of algorithms we tested for finding duplicates. One option is to compare per-image metrics produced by various image "similarity" algorithms. Unfortunately, no single one of these, nor any combination of them, works particularly well across the entire dataset; they all produce far too many false positives and false negatives to be useful on their own:
- MD5 checksum: use the `hashlib` Python package to calculate each image's MD5 checksum (a sketch follows this list).
- Perceptual image hashes: available through the OpenCV contrib add-on package beginning with OpenCV 3.3.0 (sketched below).
- Cross entropy?
- Shannon entropy? (sketched below)
- Grey-level co-occurrence matrix (wiki) texture features, available through scikit-image (see also Harris Geospatial; sketched below):
  - energy
  - contrast
  - homogeneity
  - correlation
- Is solid?
- Ship counts
- Binary pixel difference
- Absolute pixel difference (both pixel-difference metrics are sketched below)
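Here is a minimal sketch of the MD5 approach using `hashlib`; exact duplicates share a digest, so grouping files by digest finds them:

```python
import hashlib

def md5_checksum(path, chunk_size=65536):
    """Hex MD5 digest of a file, read in chunks to keep memory use low."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Note this only catches byte-identical files; re-encoded or otherwise altered copies need the other metrics.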
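For the perceptual hashes, a sketch using OpenCV's `img_hash` module (this assumes a contrib build, e.g. the `opencv-contrib-python` package; plain `opencv-python` does not ship `cv2.img_hash`):

```python
import cv2  # must be a contrib build to expose cv2.img_hash

def phash_distance(path_a, path_b):
    """Hamming distance between the pHash values of two images (smaller = more similar)."""
    hasher = cv2.img_hash.PHash_create()
    hash_a = hasher.compute(cv2.imread(path_a))
    hash_b = hasher.compute(cv2.imread(path_b))
    return hasher.compare(hash_a, hash_b)
```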
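Shannon entropy can be computed per image with scikit-image, roughly like this:

```python
from skimage.io import imread
from skimage.measure import shannon_entropy

def image_entropy(path):
    """Shannon entropy of the grayscale pixel distribution; usable as a cheap
    per-image fingerprint for pre-filtering candidate duplicates."""
    return shannon_entropy(imread(path, as_gray=True))
```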
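And a sketch of the GLCM texture features via scikit-image (releases before 0.19 spell the functions `greycomatrix`/`greycoprops`; the distance and angle choices below are arbitrary):

```python
from skimage.feature import graycomatrix, graycoprops
from skimage.io import imread
from skimage.util import img_as_ubyte

def glcm_features(path):
    """Energy, contrast, homogeneity and correlation of a single-offset GLCM."""
    img = img_as_ubyte(imread(path, as_gray=True))
    glcm = graycomatrix(img, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    return {prop: float(graycoprops(glcm, prop)[0, 0])
            for prop in ("energy", "contrast", "homogeneity", "correlation")}
```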
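The two pixel-difference metrics are just elementwise comparisons of same-sized images, e.g.:

```python
import numpy as np
from skimage.io import imread

def pixel_differences(path_a, path_b):
    """Binary and absolute pixel differences between two same-sized images."""
    a = imread(path_a, as_gray=True)
    b = imread(path_b, as_gray=True)
    binary_diff = int(np.count_nonzero(a != b))   # how many pixels differ at all
    absolute_diff = float(np.abs(a - b).sum())    # total magnitude of the differences
    return binary_diff, absolute_diff
```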
Further reading:
- https://en.wikipedia.org/wiki/Relative_change_and_difference
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.455.8550&rep=rep1&type=pdf
- http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html