A continually updated list of papers on referring video object segmentation😊
* indicates that the method has no official name🫡
Other awesome projects: Awesome-Video-Instance-Segmentation
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
AL-Ref-SAM 2 | Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation | AAAI | Code | |
MTCM | Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation | ICASSP | Code | |
Sa2VA | Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos | Arxiv | Code | |
VRS-HQ | The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | Arxiv | Code | |
MPG-SAM 2 | MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | Arxiv | ||
ReferDINO | ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | Arxiv | Code |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
LoSh | LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation | CVPR | Code | |
DsHmp | Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | CVPR | Code | |
UniVS | UniVS: Unified and Universal Video Segmentation with Prompts as Queries | CVPR | Code | |
GLEE | General Object Foundation Model for Images and Videos at Scale | CVPR | Code | |
TCE-RVOS | Temporal Context Enhanced Referring Video Object Segmentation | WACV | Code | |
MUTR | Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation | AAAI | Code | |
GroPrompt | GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation | CVPRW | ||
FTEA | Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation | IP&M | ||
HTR | Temporally Consistent Referring Video Object Segmentation with Hybrid Memory | TCSVT | Code | |
VD-IT | Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | ECCV | Code | |
VISA | VISA: Reasoning Video Object Segmentation via Large Language Models | ECCV | Code | |
Ref-AVS | Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes | ECCV | Code | |
VideoLISA | One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos | NeurIPS | Code | |
UniPHD | Referring Human Pose and Mask Estimation In the Wild | NeurIPS | Code | |
BIFIT | Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation | PR | ||
MHTMA | Mamba-driven hierarchical temporal multimodal alignment for referring video object segmentation | Neurocomputing | ||
TrackGPT | Tracking with Human-Intent Reasoning | Arxiv | Code | |
LTCA | LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation | Arxiv | Code | |
VLP-RVOS | Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models | Arxiv | ||
ViLLa | ViLLa: Video Reasoning Segmentation with Large Language Model | Arxiv | Code | |
REM | ReferEverything: Towards Segmenting Everything We Can Speak of in Videos | Arxiv | Code | |
OMFormer | Show Me When and Where: Towards Referring Video Object Segmentation in the Wild | Arxiv | Code | |
HyperSeg | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | Arxiv | Code | |
InstructSeg | InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models | Arxiv | Code | |
SAMWISE | SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation | Arxiv | Code | |
MoRA | Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | Arxiv | Project | |
SOLA | Referring Video Object Segmentation via Language-aligned Track Selection | Arxiv | Project | |
Video-LLaVA-Seg | ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | Arxiv | Project |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
OnlineRefer | OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation | ICCV | Code | |
LMPM | MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions | ICCV | Code | |
SgMg | Spectrum-guided Multi-granularity Referring Video Object Segmentation | ICCV | Code | |
TempCD | Temporal Collection and Distribution for Referring Video Object Segmentation | ICCV | Project | |
HTML | HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation | ICCV | Project | |
R2VOS | Robust Referring Video Object Segmentation with Cyclic Structural Consensus | ICCV | Code | |
FS-RVOS | Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples | ICCV | Code | |
UniRef | Segment Every Reference Object in Spatial and Temporal Spaces | ICCV | Code | |
SOC | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | NeurIPS | Code | |
DMFormer | Decoupling Multimodal Transformers for Referring Video Object Segmentation | TCSVT | Code | |
UniMM* | Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning | TCSVT | ||
Locater | Local-Global Context Aware Transformer for Language-Guided Video Segmentation | TPAMI | Code | |
VLT | VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | TPAMI | Code | |
LASTC* | Language-Aware Spatial-Temporal Collaboration for Referring Video Segmentation | TPAMI | ||
CLUE | CLUE: Contrastive language-guided learning for referring video object segmentation | PRL | ||
EPCFormer | EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation | Arxiv | Code | |
RefSAM | Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | Arxiv | Code | |
UniRef++ | UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces | Arxiv | Code | |
SimRVOS | Learning Referring Video Object Segmentation from Weak Annotation | Arxiv |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
MTTR | End-to-End Referring Video Object Segmentation with Multimodal Transformers | CVPR | Code | |
ReferFormer | Language as Queries for Referring Video Object Segmentation | CVPR | Code | |
LBDT | Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation | CVPR | Code | |
MLRL* | Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation | CVPR | ||
MANet | Multi-Attention Network for Compressed Video Referring Object Segmentation | ACM MM | Code | |
YOFO | You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation | AAAI | ||
OATNet | Object-Agnostic Transformers for Video Referring Segmentation | TIP | ||
EFCMA* | Referring Segmentation via Encoder-Fused Cross-Modal Attention Network | TPAMI | ||
RefVOS | A Closer Look at Referring Expressions for Video Object Segmentation | MTA | Code |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
VOSRE | Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions | BMVC | ||
CSTM* | Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | CVPR | ||
CMSA | Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network | TPAMI | ||
CMPC | Cross-Modal Progressive Comprehension for Referring Segmentation | TPAMI | Code | |
ClawCraneNet | ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | Arxiv | ||
CVLS | Contrastive Video-Language Segmentation | Arxiv |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
URVOS | URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark | ECCV | Code |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
ACGA | Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query | ICCV |
Model | Title | Venue | Paper | Code |
---|---|---|---|---|
A2D* | Actor and Action Video Segmentation from a Sentence | CVPR | ||
VOSLRE* | Video Object Segmentation with Language Referring Expressions | ACCV |