
Reformatting

Jerome White edited this page Apr 3, 2023 · 7 revisions

The default data format is unlikely to suit a modelling framework directly. This repository contains several scripts that convert the default format into the formats common frameworks expect. This section describes those reformatting scripts and how to use them.

This section assumes that the user has followed the previous steps to acquire the repository and the data, and that the user's current working directory is the repository root. As mentioned previously:

$> git clone https://github.com/WadhwaniAI/pest-management-opendata.git
$> cd pest-management-opendata
$> aws s3 sync --no-progress s3://wadhwaniai-agri-opendata/ data/

In addition, the scripts use the Python logger to report progress. To get the most complete set of information from each script, set the PYTHONLOGLEVEL environment variable to "info":

$> export PYTHONLOGLEVEL=info

Alternatively, discard this information entirely by redirecting stderr to a file.
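For readers writing their own scripts alongside these, the following is a minimal sketch of how an environment variable such as PYTHONLOGLEVEL could be turned into a logger level. It is an assumption about the mechanism, not a copy of the repository's code; the function name resolve_level and the fallback to "WARNING" are illustrative.

```python
import logging
import os

def resolve_level(default="WARNING"):
    """Translate PYTHONLOGLEVEL into a numeric logging level."""
    # Level names are case-insensitive here: "info" and "INFO" both work
    name = os.environ.get("PYTHONLOGLEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string for names it does not recognise,
    # in which case the default is used instead
    if not isinstance(level, int):
        level = logging.getLevelName(default.upper())
    return level

def configure():
    # Log records go to stderr by default, which is why redirecting
    # stderr silences them
    logging.basicConfig(level=resolve_level())
```

Calling configure() once at the start of a script is enough; subsequent logging.info calls are then shown or hidden according to the environment.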

Detectron2

Detectron2 is an object detection framework maintained by Meta. Installation of their library is required to run the scripts outlined in this section.

Basic usage

$> ./bin/to-detectron2.sh -d data

This will create two JSON files corresponding to a train and a validation set. The location of those files is reported (to stdout) at the end of the script's execution.

By default this will use the most recent metadata version. The version can be explicitly controlled using the -v option:

$> ls data/metadata/
20220629-1312
$> ./bin/to-detectron2.sh -d data -v 20220629-1312
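To find the most recent version programmatically, a sketch like the following may help. It assumes version directories are named with a sortable timestamp such as 20220629-1312, so that lexicographic order matches chronological order; whether the shell script selects its default this way is not confirmed here, and latest_version is a hypothetical helper.

```python
from pathlib import Path

def latest_version(metadata_dir):
    """Return the name of the newest version directory under metadata_dir."""
    # Assumption: names look like YYYYMMDD-HHMM, so sorting them as
    # strings also sorts them by time
    versions = sorted(p.name for p in Path(metadata_dir).iterdir() if p.is_dir())
    if not versions:
        raise FileNotFoundError(f"no metadata versions in {metadata_dir}")
    return versions[-1]
```

The returned value, for example from latest_version('data/metadata'), can be passed to the -v option.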

Advanced usage

For more advanced usage, the Python generation script can be run directly. First, set up your environment:

$> ROOT=`git rev-parse --show-toplevel`
$> export PYTHONPATH=$ROOT:$PYTHONPATH
$> export PYTHONLOGLEVEL=info

Then run the script:

$> zcat $ROOT/data/metadata/20220629-1312/dev.csv.gz | \
	python src/detectron2_/build-output.py \
		--data-root $ROOT/data \
		--split train > train.json
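Before converting, it can be useful to look at the metadata that zcat feeds to the script. The following is a minimal sketch that reads the header and a few rows straight from the gzipped CSV. The function name peek is illustrative, and no particular column names are assumed.

```python
import csv
import gzip

def peek(path, rows=3):
    """Return the header and the first few rows of a gzipped CSV."""
    # "rt" decompresses and decodes to text in one step
    with gzip.open(path, "rt", newline="") as fp:
        reader = csv.reader(fp)
        header = next(reader)
        # zip stops after `rows` records or at the end of the file
        sample = [row for _, row in zip(range(rows), reader)]
    return header, sample
```

For example, peek('data/metadata/20220629-1312/dev.csv.gz') returns the column names along with a small sample, without unpacking the file on disk.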

Inside the framework

The Detectron2 format represents labels as integers, and the framework provides a means of mapping those integers back to their human-readable values. The JSON file holding those values can be generated as follows:

$> zcat $ROOT/data/metadata/20220629-1312/dev.csv.gz | \
	python src/detectron2_/things.py > things.json

Assume that things.json and the train/val JSONs are in the following directory structure:

$> tree /foo/bar/jsons
/foo/bar/jsons
├── data
│   ├── train.json
│   └── val.json
└── things.json

1 directory, 3 files

The instructions in the Detectron2 documentation can be augmented as follows:

import json
from pathlib import Path

from detectron2.data import DatasetCatalog, MetadataCatalog

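def retrieve(path):
    # Detectron2 expects a zero-argument callable that returns the
    # dataset, so wrap the file read in a closure
    def get():
        with path.open() as fp:
            return json.load(fp)

    return get

sources = Path('/foo/bar/jsons')
things = (sources
          .joinpath('things')
          .with_suffix('.json'))
thing_classes = retrieve(things)

# Register each split (train, val) under its file stem and attach
# the label names to its metadata
for i in sources.joinpath('data').iterdir():
    DatasetCatalog.register(i.stem, retrieve(i))
    MetadataCatalog.get(i.stem).thing_classes = thing_classes()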
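The label file can also be used outside Detectron2, for instance when inspecting predictions. The sketch below assumes things.json holds a JSON array of label names whose positions are the integer IDs, which is what assigning it to thing_classes implies; label_lookup is a hypothetical helper.

```python
import json

def label_lookup(path):
    """Map integer category IDs to human-readable label names."""
    with open(path) as fp:
        names = json.load(fp)
    # Assumption: the file is a list, and list position is the ID
    return dict(enumerate(names))
```

With this, label_lookup('/foo/bar/jsons/things.json')[0] gives the name of the first class.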

Wadhwani AI

The object detection framework designed around this data is available from the Wadhwani Institute for Artificial Intelligence.

Basic usage

$> ./bin/to-wadhwaniai.sh -d data > wadhwaniai.json

Advanced usage

For more advanced usage, the Python generation script can be run directly. First, set up your environment:

$> ROOT=`git rev-parse --show-toplevel`
$> export PYTHONPATH=$ROOT:$PYTHONPATH
$> export PYTHONLOGLEVEL=info

Then run the script:

$> python src/wadhwaniai_/build-output.py \
	--data-root $ROOT/data \
	--source $ROOT/data/metadata/20220629-1312/dev.csv.gz \
	--source $ROOT/data/metadata/20220629-1312/test.csv.gz > wadhwaniai.json

MMDetection

This format is experimental.

MMDetection is an object detection framework developed by OpenMMLab.

Basic usage

$> ./bin/to-mmdetection.sh -d data

This will create two JSON files corresponding to a train and a validation set. The location of those files is reported (to stdout) at the end of the script's execution.

Advanced usage

For more advanced usage, the Python generation script can be run directly. First, set up your environment:

$> ROOT=`git rev-parse --show-toplevel`
$> export PYTHONPATH=$ROOT:$PYTHONPATH
$> export PYTHONLOGLEVEL=info

Then run the script:

$> zcat $ROOT/data/metadata/20220629-1312/dev.csv.gz | \
	python src/mmdetection_/build-output.py \
		--data-root $ROOT/data \
		--split train > train.json

Ultralytics YOLOv5

YOLO is a popular object detection system. There are several implementations online; we support "v5", which is actively maintained by Ultralytics.

Basic usage

$> ./bin/to-ultralytics.sh -d data -o /desired/output/location

After execution, the output location (-o) will contain a structure that YOLO can parse:

$> tree -F -L 1 /desired/output/location
/desired/output/location
├── config.yaml
├── images/
├── labels/
├── test.txt
├── train.cache
├── train.txt
├── val.cache
└── val.txt

2 directories, 6 files

Training can then commence by pointing YOLOv5's train script at that location:

$> python /path/to/yolov5/train.py --data /desired/output/location/config.yaml ...

Flattened image naming

The images directory that is created will contain symlinks back to the original images in data. To honor the flat-file structure that YOLOv5 expects, image symlink names are changed from xx/yy.jpg to xx-yy.jpg. Run:

$> tree /desired/output/location/images/ | head

to see this difference. Keep this in mind when performing evaluation: YOLOv5 may resolve the symlink path when loading data, and consequently when reporting inference boxes.
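When results need to be matched back to the flattened names, a small lookup can bridge the two. The following is a sketch only: flatten reproduces the xx/yy.jpg to xx-yy.jpg renaming described above (how deeper paths are handled in the repository is an assumption), and flat_name_index is a hypothetical helper that maps each resolved image path to its symlink name.

```python
from pathlib import Path

def flatten(relative):
    """Turn a relative image path such as xx/yy.jpg into xx-yy.jpg."""
    # Assumption: every directory separator becomes a hyphen
    return "-".join(Path(relative).parts)

def flat_name_index(images_dir):
    """Map each symlink's resolved target to the symlink's flat name."""
    return {
        p.resolve(): p.name
        for p in Path(images_dir).iterdir()
        if p.is_symlink()
    }
```

If an evaluation reports a resolved path, looking it up in flat_name_index('/desired/output/location/images') recovers the name YOLOv5 was originally given.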