Reformatting
It is unlikely that the default data format is conducive to a given modelling framework. This repository contains several scripts to convert the default format into other standard formats. This section details the reformatting implementations in this repository, and how they can be used.
This section assumes the user has followed the previous steps of acquiring the repository and the data, and that the user's current working directory is the repository root; as mentioned previously:
$> git clone https://github.com/WadhwaniAI/pest-management-opendata.git
$> cd pest-management-opendata
$> aws s3 sync --no-progress s3://wadhwaniai-agri-opendata/ data/
In addition, the scripts use the Python logger to report progress. You can get the most complete set of information from each script by setting the PYTHONLOGLEVEL environment variable to "info":
$> export PYTHONLOGLEVEL=info
Alternatively, you can ignore this information completely by redirecting stderr to a file.
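For readers curious how an environment variable like this can drive log output, the following is a minimal sketch and not the repository's actual code. It assumes the variable holds a standard level name; the function name `configure_logging` and the fallback to "warning" are hypothetical.

```python
import logging
import os

# Standard level names mapped to their logging constants.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}

def configure_logging(default="warning"):
    """Configure the root logger from PYTHONLOGLEVEL (hypothetical helper)."""
    name = os.environ.get("PYTHONLOGLEVEL", default).lower()
    # Unrecognised names fall back to WARNING instead of raising.
    level = LEVELS.get(name, logging.WARNING)
    # The default handler installed by basicConfig writes to stderr.
    logging.basicConfig(level=level)
    return level
```

Because the default handler writes to stderr, redirecting stderr is enough to keep log messages out of any JSON written to stdout.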
Detectron2 is an object detection framework maintained by Meta. Installation of their library is required to run the scripts outlined in this section.
$> ./bin/to-detectron2.sh -d data
This will create two JSON files corresponding to a train and a validation set. The location of those files is reported (to stdout) at the end of the script's execution.
By default this will use the most recent metadata version. The version can be explicitly controlled using the -v option:
$> ls data/metadata/
20220629-1312
$> ./bin/to-detectron2.sh -d data -v 20220629-1312
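If you want to find the most recent version programmatically, the sketch below shows one way to do it. It assumes that version directories are named with a sortable timestamp such as 20220629-1312, so that lexicographic order matches chronological order; the helper name `latest_version` is hypothetical and the shell script's own selection logic may differ.

```python
from pathlib import Path

def latest_version(data_root):
    """Return the newest metadata version name under data_root/metadata.

    Assumes directory names are sortable timestamps (e.g. 20220629-1312).
    """
    metadata = Path(data_root) / "metadata"
    versions = sorted(p.name for p in metadata.iterdir() if p.is_dir())
    if not versions:
        raise FileNotFoundError(f"no metadata versions under {metadata}")
    # The last name in sorted order is the most recent timestamp.
    return versions[-1]
```

The returned name can then be passed to the -v option.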
For more advanced usage, the Python generation script can be run directly. First, set up your environment:
$> ROOT=`git rev-parse --show-toplevel`
$> export PYTHONPATH=$ROOT:$PYTHONPATH
$> export PYTHONLOGLEVEL=info
Then run the script:
$> zcat $ROOT/data/metadata/20220629-1312/dev.csv.gz | \
python src/detectron2_/build-output.py \
--data-root $ROOT/data \
--split train > train.json
The Detectron2 format manages labels using integers. They support a means for mapping those integers back to their human-readable values. We can generate that JSON as follows:
$> zcat $ROOT/data/metadata/20220629-1312/dev.csv.gz | \
python src/detectron2_/things.py > things.json
Assume that this, and the train/val JSONs are in the following directory structure:
$> tree /foo/bar/jsons
/foo/bar/jsons
├── data
│ ├── train.json
│ └── val.json
└── things.json
1 directory, 3 files
The instructions in the Detectron2 documentation can be augmented as follows:
import json
from pathlib import Path

from detectron2.data import DatasetCatalog, MetadataCatalog

def retrieve(path):
    # Return a zero-argument loader, as DatasetCatalog.register expects
    def get():
        with path.open() as fp:
            return json.load(fp)
    return get

sources = Path('/foo/bar/jsons')
things = (sources
          .joinpath('things')
          .with_suffix('.json'))
thing_classes = retrieve(things)

# Register each JSON under its file stem ("train", "val")
for i in sources.joinpath('data').iterdir():
    DatasetCatalog.register(i.stem, retrieve(i))
    MetadataCatalog.get(i.stem).thing_classes = thing_classes()
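Before training, it can be useful to check that the integer labels map back to the expected names. The sketch below does this without requiring Detectron2 to be installed. It assumes the generated files follow Detectron2's standard dataset dictionaries (each record has an "annotations" list whose entries carry a "category_id") and that things.json is a list of names indexed by that integer. The helper `label_counts` and the sample records are hypothetical stand-ins, not data from this repository.

```python
from collections import Counter

def label_counts(records, thing_classes):
    """Count annotations per human-readable label (hypothetical helper)."""
    counts = Counter()
    for record in records:
        for annotation in record.get("annotations", []):
            # category_id is assumed to index into thing_classes.
            counts[thing_classes[annotation["category_id"]]] += 1
    return counts

# In-memory stand-ins for the contents of things.json and train.json.
thing_classes = ["class-a", "class-b"]
records = [
    {"file_name": "a.jpg", "annotations": [{"category_id": 0}, {"category_id": 1}]},
    {"file_name": "b.jpg", "annotations": [{"category_id": 1}]},
]
counts = label_counts(records, thing_classes)
```

In practice, `records` and `thing_classes` would come from calling `json.load` on the train/val and things files respectively.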
The object detection framework designed around this data is available from the Wadhwani Institute for Artificial Intelligence.
$> ./bin/to-wadhwaniai.sh -d data > wadhwaniai.json
For more advanced usage, the Python generation script can be run directly. First, set up your environment:
$> ROOT=`git rev-parse --show-toplevel`
$> export PYTHONPATH=$ROOT:$PYTHONPATH
$> export PYTHONLOGLEVEL=info
Then run the script:
$> python src/wadhwaniai_/build-output.py \
--data-root $ROOT/data \
--source $ROOT/data/metadata/20220629-1312/dev.csv.gz \
--source $ROOT/data/metadata/20220629-1312/test.csv.gz > wadhwaniai.json
This format is experimental.
MMDetection is an object detection framework developed by OpenMMLab.
$> ./bin/to-mmdetection.sh -d data
This will create two JSON files corresponding to a train and a validation set. The location of those files is reported (to stdout) at the end of the script's execution.
For more advanced usage, the Python generation script can be run directly. First, set up your environment:
$> ROOT=`git rev-parse --show-toplevel`
$> export PYTHONPATH=$ROOT:$PYTHONPATH
$> export PYTHONLOGLEVEL=info
Then run the script:
$> zcat $ROOT/data/metadata/20220629-1312/dev.csv.gz | \
python src/mmdetection_/build-output.py \
--data-root $ROOT/data \
--split train > train.json
YOLO is a popular object detection system. There are several implementations online; we support "v5", which is rigorously maintained by Ultralytics.
$> ./bin/to-ultralytics.sh -d data -o /desired/output/location
After execution, the output location (-o) will contain a structure that YOLO can parse:
$> tree -F -L 1 /desired/output/location
/desired/output/location
├── config.yaml
├── images/
├── labels/
├── test.txt
├── train.cache
├── train.txt
├── val.cache
└── val.txt
2 directories, 6 files
Training can then commence by pointing YOLOv5's train script at that location:
$> python /path/to/yolov5/train.py --data /desired/output/location/config.yaml ...
The images directory that is created will contain symlinks back to the original images in data. To honor the flat-file structure that YOLOv5 likes, image symlink names are changed from xx/yy.jpg to xx-yy.jpg; perform:
$> tree /desired/output/location/images/ | head
to see this difference. Please keep this in mind when performing evaluation, as YOLOv5 may resolve the symlink path when loading data and, in turn, when reporting inference boxes.
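When matching inference results back to the flattened names, a lookup from original relative paths to symlink names may help. The sketch below assumes the renaming simply joins the path components with hyphens, as the xx/yy.jpg to xx-yy.jpg example suggests; the conversion script's exact rule may differ, and both function names are hypothetical.

```python
from pathlib import PurePosixPath

def flat_name(relative_path):
    """Turn a relative image path such as xx/yy.jpg into xx-yy.jpg."""
    return "-".join(PurePosixPath(relative_path).parts)

def build_lookup(relative_paths):
    """Map each original relative path to its assumed symlink name."""
    return {path: flat_name(path) for path in relative_paths}

lookup = build_lookup(["xx/yy.jpg", "xx/zz.jpg"])
```

If YOLOv5 reports a resolved path, taking its portion relative to the data directory and passing it through `lookup` gives the corresponding name under images/.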