简体中文 | English
PaddleVideo provides several mainstream, state-of-the-art models for video classification and temporal action localization. Among them, TSN, TSM, and SlowFast are end-to-end video classification models; Attention LSTM is a popular model for video feature sequences; BMN is a temporal action localization model; and TransNetV2 is a video shot-segmentation model. TSN is a classic 2D-CNN-based solution; TSM is a simple and efficient spatio-temporal modeling method based on temporal shift; SlowFast is a 3D video classification model proposed by FAIR at ICCV 2019; and the feature-sequence model Attention LSTM combines high speed with high accuracy. BMN was developed in-house at Baidu and was the winning solution of the 2019 ActivityNet challenge. Building on Baidu PaddlePaddle's industrial practice, we developed and open-sourced PP-TSM, an optimized variant of TSM that delivers a substantial accuracy gain without increasing the model's parameter count or computational cost. These general optimization strategies apply broadly to other video models as well; we plan to optimize more models in the future, such as TPN, SlowFast, and X3D. Stay tuned.
Field | Model | Config | Test Dataset | Metric | Value | Download |
---|---|---|---|---|---|---|
Action Recognition | PP-TSM | pptsm.yaml | Kinetics-400 | Top-1 | 76.16 | PPTSM.pdparams |
Action Recognition | PP-TSN | pptsn.yaml | Kinetics-400 | Top-1 | 75.06 | PPTSN.pdparams |
Action Recognition | PP-TimeSformer | pptimesformer.yaml | Kinetics-400 | Top-1 | 79.44 | ppTimeSformer_k400_16f_distill.pdparams |
Action Recognition | AGCN | agcn.yaml | FSD | Top-1 | 62.29 | AGCN.pdparams |
Action Recognition | ST-GCN | stgcn.yaml | FSD | Top-1 | 59.07 | STGCN.pdparams |
Action Recognition | VideoSwin | videoswin.yaml | Kinetics-400 | Top-1 | 82.40 | VideoSwin.pdparams |
Action Recognition | TimeSformer | timesformer.yaml | Kinetics-400 | Top-1 | 77.29 | TimeSformer.pdparams |
Action Recognition | SlowFast | slowfast_multigrid.yaml | Kinetics-400 | Top-1 | 75.84 | SlowFast.pdparams |
Action Recognition | TSM | tsm.yaml | Kinetics-400 | Top-1 | 70.86 | TSM.pdparams |
Action Recognition | TSN | tsn.yaml | Kinetics-400 | Top-1 | 69.81 | TSN.pdparams |
Action Recognition | AttentionLSTM | attention_lstm.yaml | Youtube-8M | Hit@1 | 89.0 | AttentionLstm.pdparams |
Temporal Action Localization | BMN | bmn.yaml | ActivityNet | AUC | 67.23 | BMN.pdparams |
Shot Segmentation | TransNetV2 | transnetv2.yaml | ClipShots | F1 score | 76.1 | |
Depth Estimation | ADDS | adds.yaml | Oxford_RobotCar | Abs Rel | 0.209 | ADDS_car.pdparams |
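Most rows in the table above report Top-1 accuracy: the fraction of test clips whose highest-scoring predicted class matches the ground-truth label. A minimal sketch of that computation (the function name and toy scores here are illustrative, not part of PaddleVideo):

```python
def top1_accuracy(logits, labels):
    """Fraction of samples whose argmax class equals the ground-truth label."""
    correct = sum(
        max(range(len(row)), key=row.__getitem__) == label
        for row, label in zip(logits, labels)
    )
    return correct / len(labels)

# Three clips scored over four classes; the first two predictions are correct.
scores = [[0.1, 0.7, 0.1, 0.1],
          [0.5, 0.2, 0.2, 0.1],
          [0.3, 0.3, 0.2, 0.2]]
labels = [1, 0, 3]
print(top1_accuracy(scores, labels))  # 2 of 3 correct
```

Hit@1 on Youtube-8M is analogous but counts a sample as correct when any of its ground-truth labels (the dataset is multi-label) is ranked first.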
- Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification, Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen
- BMN: Boundary-Matching Network for Temporal Action Proposal Generation, Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen
- SlowFast Networks for Video Recognition, Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
- Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
- Temporal Shift Module for Efficient Video Understanding, Ji Lin, Chuang Gan, Song Han
- Is Space-Time Attention All You Need for Video Understanding? Gedas Bertasius, Heng Wang, Lorenzo Torresani
- Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, Sijie Yan, Yuanjun Xiong, Dahua Lin
- Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition, Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
- Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks, Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
- TransNet V2: An effective deep network architecture for fast shot transition detection, Tomáš Souček, Jakub Lokoč
- Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation, Lina Liu, Xibin Song, Mengmeng Wang