This repository contains action-centric captioning pipelines for two major video segmentation datasets:
- A2D (Actor-Action Dataset)
- ViCaS Dataset
The goal is to generate action-centric captions that enable Referring Image Segmentation (RIS) models to better understand **actions**, not just object categories.
Despite recent advances in RIS, most models are trained on datasets like RefCOCO, where captions are predominantly noun-centric, such as:
"The man on the left", "The dog in the back"
These descriptions lack action-level semantics. Yet RIS often requires distinguishing between multiple objects of the same category, a situation where action-based disambiguation is critical.
To unlock action-level comprehension, we need captions like:
"The man is kicking a ball",
"The child is climbing the stairs"
This repository addresses that gap by proposing pipelines that generate such verb-rich, action-centered captions from the A2D and ViCaS datasets (a minimal sketch follows the list below). These captions:
- Improve the ability of RIS models to distinguish same-category objects based on their actions
- Enhance verb understanding in RIS models
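As a rough illustration of the A2D side of the idea, the sketch below turns actor-action labels (e.g. "adult-running") into present-tense, verb-rich captions with a simple template. The label format, the `ACTION_TO_VERB` mapping, and `caption_from_label` are illustrative assumptions, not the exact code used in this repository.

```python
# Minimal sketch (illustrative only): turn A2D-style "actor-action" labels
# into verb-rich captions with a simple present-continuous template.

ACTION_TO_VERB = {
    "running": "is running",
    "walking": "is walking",
    "climbing": "is climbing",
    "jumping": "is jumping",
    "eating": "is eating",
    "flying": "is flying",
    "rolling": "is rolling",
    "crawling": "is crawling",
}

def caption_from_label(label: str) -> str:
    """Build a simple action-centric caption from an 'actor-action' label."""
    actor, action = label.split("-", 1)
    verb_phrase = ACTION_TO_VERB.get(action, f"is {action}")  # naive fallback
    return f"The {actor} {verb_phrase}"

print(caption_from_label("adult-running"))  # -> "The adult is running"
print(caption_from_label("dog-rolling"))    # -> "The dog is rolling"
```

In practice the repository's pipelines go beyond such templates, but the example shows the kind of verb-centered supervision the captions are meant to provide.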
In this project, we:
- Designed and implemented full captioning pipelines for both A2D and ViCaS
- Applied movability-based filtering and action-focused instruction tuning (see the sketch after this list)
- Significantly enriched RIS training data to enhance verb-level understanding
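To make the movability idea concrete, here is a minimal sketch of movability-based filtering: annotations whose category cannot plausibly move are dropped, since action-centric captions only make sense for movable actors. The `MOVABLE_CATEGORIES` set, the annotation schema, and `filter_movable` are hypothetical placeholders rather than the repository's actual implementation.

```python
from typing import Dict, List

# Hypothetical set of categories that can move (or be moved, e.g. "ball").
MOVABLE_CATEGORIES = {"adult", "baby", "ball", "bird", "car", "cat", "dog"}

def filter_movable(annotations: List[Dict]) -> List[Dict]:
    """Keep only annotations whose category is considered movable."""
    return [ann for ann in annotations if ann.get("category") in MOVABLE_CATEGORIES]

sample = [
    {"category": "dog", "caption": "The dog is rolling"},
    {"category": "tree", "caption": "The tree on the left"},  # static, filtered out
]
print(filter_movable(sample))  # -> only the dog annotation remains
```

Filtering out static categories before caption generation keeps the resulting training data focused on instances where an action verb actually disambiguates the target.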
If you have any questions, feel free to reach out!