A Marimba Pipeline for processing and standardizing historical slide-film imagery collections from marine ecological surveys. The Pipeline specializes in reconstructing spatio-temporal context from historical records while preserving image quality and experimental metadata.
The Image Digitization Pipeline is designed to process historical slide-film imagery collected during CSIRO marine monitoring campaigns between 1993 and 1996. It handles data from the CSIRO Osprey Camera System (OCS) and Towed Camera Array (TACOS), which captured benthic habitat imagery across the continental shelf.
Key capabilities include:
- Automated processing of digitized slide-film photographs
- Spatio-temporal interpolation for reconstructing image positions
- Integration of survey metadata including GPS coordinates, depths, and timestamps
- Quality-controlled image rotation and sequence reconstruction
- Generation of FAIR-compliant datasets with embedded metadata
The Image Digitization Pipeline is built on the Marimba framework which includes most necessary dependencies for this Pipeline. Additional requirements include:
- openpyxl
Create a new Marimba project and add the Image Digitization Pipeline:
marimba new project my-digitization-project
cd my-digitization-project
marimba new pipeline image_digitization https://github.com/csiro-fair/image-digitization-pipeline.gitThe Pipeline uses a configuration-free approach, with all necessary parameters derived from input data.
Each Collection requires:
batch_data_path: Path to batch processing metadata CSVinventory_data_path: Path to film inventory Excel fileimport_path: Base path for source image directories
Import collections with batch-specific configurations:
marimba import batch1a /path/to/source/batch1a/* --operation link \
--config '{
"batch_data_path": "/path/to/batch1a.csv",
"inventory_data_path": "/path/to/inventory.xlsx",
"import_path": "/path/to/source/batch1a/"
}'For large digitization projects, multiple batches can be imported:
# Import multiple batches
marimba import batch1a /path/to/batch1a/* --config '...'
marimba import batch1b /path/to/batch1b/* --config '...'
marimba import batch1c /path/to/batch1c/* --config '...'The Pipeline expects digitized slide-film data organized by sequential folders:
source/
└── batch1a/
├── 00000001/
│ └── *.jpg # Digitized slide-film images
├── 00000002/
│ └── *.jpg
└── ...
marimba processDuring processing, the Image Digitization Pipeline:
- Creates a hierarchical directory structure by survey, platform, and deployment
- Interpolates geographic coordinates between deployment start/end points
- Rotates and sequences images based on film orientation
- Integrates station metadata (coordinates, depth, timestamp)
- Generates thumbnail overviews for visual validation
marimba package SEFHES \
--operation link \
--version 1.0 \
--contact-name "Keiko Abe" \
--contact-email "[email protected]"The --operation link flag creates hard links instead of copying files, optimizing storage for large datasets.
SEFHES/ # Root dataset directory
├── data/ # Directory containing all processed data
│ └── image_digitization/ # Image Digitization Pipeline data directory
│ ├── SS*/ # Survey directories organized by year
│ │ └── [OCS|TACOS]/ # Platform directories
│ │ └── SS*_*/ # Individual deployment directories
│ │ ├── data/ # Deployment data files (CSV)
│ │ ├── stills/ # Full resolution images
│ │ ├── thumbnails/ # Image thumbnails
│ │ └── [OCS|TACOS]_SS*_*_OVERVIEW.JPG # Deployment overview image
│ └── *.csv # Collection of imported CSV files
├── logs/ # Directory containing all processing logs
│ ├── pipelines/ # Pipeline-specific logs
│ │ └── image_digitization.log # Logs from Image Digitization Pipeline
│ ├── dataset.log # Dataset packaging logs
│ └── project.log # Overall project processing logs
├── pipelines/ # Directory containing pipeline code
│ └── image_digitization/ # Pipeline-specific directory
│ ├── repo/ # Pipeline source code repository
│ │ ├── image_digitization.pipeline.py # Pipeline implementation
│ │ ├── LICENSE # Pipeline license file
│ │ ├── README.md # Pipeline README file
│ │ └── requirements.txt # Pipeline dependencies
│ └── pipeline.yml # Pipeline configuration
├── ifdo.yml # Dataset-level iFDO metadata file
├── manifest.txt # File manifest with SHA256 hashes
├── map.png # Spatial visualization of dataset
└── summary.md # Dataset summary and statistics
The Pipeline captures comprehensive metadata including:
- GPS coordinates
- Sampling depths
- Collection times
- Platform configurations
- Environmental parameters
- Image acquisition parameters
- Camera configurations
- Processing parameters
- Quality metrics
- Spatial coordinates (interpolated)
- Temporal information
- Platform details
- Deployment context
All metadata is standardized using the iFDO schema (v2.1.0) and embedded in both image EXIF tags and dataset-level files.
The Image Digitization Pipeline was developed by:
- Christopher Jackett (CSIRO)
- Candice Untiedt (CSIRO)
- Franziska Althaus (CSIRO)
- David Webb (CSIRO)
- Ben Scoulding (CSIRO)
The Image Digitization Pipeline is distributed under the CSIRO BSD/MIT license.
For inquiries related to this repository, please contact:
- Chris Jackett
Software Engineer, CSIRO
Email: [email protected]