Add "category" as required var in detections and annotations ds #124

Merged
sfmig merged 6 commits into main from smg/add-category-to-ds-definitions
Dec 12, 2025


@sfmig sfmig commented Dec 11, 2025

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
At the moment, it is possible to pass an annotations dataset that has no "category" data array to the "export to COCO" function. The function fills in dummy values (category ID = -1 and category = '') before exporting. However, the resulting JSON file does not pass the corresponding jsonschema validation, and I get the error:

ValidationError: nan is not of type 'string'

Failed validating 'type' in schema['properties']['categories']['items']['properties']['name']:
    {'type': 'string'}

On instance['categories'][0]['name']:
    nan

This is because the categories section in the JSON is exported as:

  "categories": [
    {
      "id": -1,
      "name": NaN,
      "supercategory": ""
    }
  ],
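The NaN in the exported file can be reproduced with the standard library alone. This is a minimal sketch, assuming the missing category name reaches the JSON writer as a float NaN; the variable names are illustrative and not taken from the package code.

```python
import json
import math

# Hypothetical category entry in which the name is a float NaN, not a string.
category = {"id": -1, "name": float("nan"), "supercategory": ""}

# By default, Python's json module writes NaN as a bare token. That token is
# not valid strict JSON, and it is not a string as the schema requires.
serialised = json.dumps(category)
print(serialised)

# Reading the text back yields a float, which a {"type": "string"} check rejects.
name = json.loads(serialised)["name"]
print(isinstance(name, str), math.isnan(name))

# Passing allow_nan=False surfaces the problem at export time instead.
try:
    json.dumps(category, allow_nan=False)
    export_refused = False
except ValueError as err:
    export_refused = True
    print(f"export refused: {err}")
```

Whether the exporter should use allow_nan=False is a separate design question; the sketch only shows where the NaN token comes from.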

Although this is somewhat documented at the moment, it is confusing and not very useful. It is also inconsistent: an undefined category name is mapped to NaN, whereas an undefined supercategory is mapped to "".

For reference, note that the COCO standard does specify a category_id for each annotation.

What does this PR do?

  • It restricts the definition of bounding box annotation and bbox detection datasets to also require a "category" data variable. This way, if a dataset without a "category" data array is passed to save_bboxes.to_COCO we get a clear error message.
    • I think this makes sense because annotations will likely come from COCO format (which requires category to be specified) or equivalent, and detections will come from a model which likely outputs bounding boxes and classes (I cannot think of a case that outputs boxes only).
    • If this becomes annoying (e.g. when computing bboxes from keypoints), we can revisit it and write a specific validator for the bbox annotations/detections that we want to export as COCO.
  • It adapts the existing tests accordingly.
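The required-variable check described above could look roughly like the following. This is a sketch under stated assumptions, not the validator implemented in this PR: the function name and the "position" and "shape" entries are hypothetical placeholders, and only "category" comes from the description.

```python
# Assumed set of required data variables; only "category" is taken from the
# PR description, the other names are placeholders for illustration.
REQUIRED_DATA_VARS = {"position", "shape", "category"}


def validate_required_data_vars(data_var_names, required=REQUIRED_DATA_VARS):
    """Raise a clear error if any required data variable is missing.

    `data_var_names` is any iterable of data variable names.
    """
    missing = sorted(set(required) - set(data_var_names))
    if missing:
        raise ValueError(
            f"Missing required data variables: {missing}. "
            f"Found: {sorted(data_var_names)}."
        )


# A dataset that uses "label" instead of "category" fails early and explicitly.
try:
    validate_required_data_vars({"position", "shape", "label"})
    error_message = ""
except ValueError as err:
    error_message = str(err)
    print(error_message)

# A dataset with all required variables passes silently.
validate_required_data_vars({"position", "shape", "category"})
```

Failing at validation time means the user sees the missing variable by name, rather than a schema error on the exported file.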

Two additions that are not strictly related to the original issue, but are relevant when creating the intermediate dataframe used to export a dataset to COCO format:

  • If the mapping from category ID to category name is not defined (because the category ID is not a key in the relevant dictionary), the category name is set to an empty string. This way, the COCO jsonschema validation passes and the behaviour is consistent with that of "supercategory".
  • If the supercategory is defined as a data array in the dataset, it is retained in the derived dataframe, cast to string and exported to the COCO JSON file. Previously, it was always set to an empty string.
  • Two tests are added to check these two behaviours.
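The two behaviours above can be sketched with plain Python containers. The function and argument names are hypothetical, and the PR itself works on an intermediate dataframe rather than on lists, so this is only an illustration of the fallback logic.

```python
def build_categories(category_ids, id_to_name, supercategories=None):
    """Build COCO-style category entries (illustrative helper, not the PR's code).

    - An ID missing from `id_to_name` gets an empty-string name.
    - Supercategories, if given, are cast to string; otherwise they are "".
    """
    categories = []
    for i, cat_id in enumerate(category_ids):
        supercat = "" if supercategories is None else str(supercategories[i])
        categories.append(
            {
                "id": int(cat_id),
                "name": id_to_name.get(cat_id, ""),  # empty string, never NaN
                "supercategory": supercat,
            }
        )
    return categories


# ID 7 has no entry in the mapping, so its name falls back to "".
with_supercat = build_categories([0, 7], {0: "crab"}, ["animal", "animal"])
print(with_supercat)

# Without a supercategory array, the supercategory falls back to "".
without_supercat = build_categories([0], {0: "crab"})
print(without_supercat)
```

With both fallbacks being strings, every "name" and "supercategory" value satisfies a {"type": "string"} schema check.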

This PR also adds uv.lock to .gitignore.

References

The issue came up when simulating a proof-reading scenario: a detections dataset with a "label" data array (rather than a "category" data array, and following the work in PR #114) could be passed to save_bboxes.to_COCO, but the output JSON file violated the COCO file schema.

A quick fix is to rename the data array in the input dataset from "label" to "category", but it seems more consistent to include a required "category" data array in the definition of a bbox dataset if that is the underlying meaning.

How has this PR been tested?

Tests pass locally and in CI

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Yes, the docstrings have been updated as part of this PR.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit


codecov Bot commented Dec 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.45%. Comparing base (6fea349) to head (4333c86).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
- Coverage   99.46%   99.45%   -0.01%     
==========================================
  Files           8        8              
  Lines         558      555       -3     
==========================================
- Hits          555      552       -3     
  Misses          3        3              


@sfmig sfmig marked this pull request as ready for review December 12, 2025 14:02
@sfmig sfmig merged commit 74c3297 into main Dec 12, 2025
18 checks passed
@sfmig sfmig deleted the smg/add-category-to-ds-definitions branch December 12, 2025 16:20