Pytorch implementation of our IEEE publication "Temporal Feature Exchange and Difference Network for online real-time action detection" has been released.
Y. Liu, F. Yang and D. Ginhac, "TEDdet: Temporal Feature Exchange and Difference Network for Online Real-Time Action Detection," in IEEE Access, vol. 10, pp. 37870-37881, 2022, doi: 10.1109/ACCESS.2022.3164730.
we propose a lightweight action tubelet detector coined TEDdet which unifies complementary feature aggregation and motion modeling modules. Specifically, our Temporal Feature Exchange module facilitates feature interaction by aggregating action-specific visual patterns over successive frames, enabling spatiotemporal modeling on top of 2D CNN. To address actors' location shift in the sequence, our Temporal Feature Difference module approximates pair-wise motion among target frames in their abstract latent space. These modules can be easily integrated with an existing anchor-free detector (CenterNet) to cooperatively model action instances' categories, sizes and trajectories for precise tubelet generation. TEDdet exploits larger temporal strides to efficiently infer actions in a coarse-to-fine and online manner.
-
We present two lightweight temporal modeling modules: Temporal Feature Exchange (TE) and Temporal Feature Difference (TD) to facilitate learning action-specific spatiotemporal pattern and trajectory.
-
We propose TEDdet, an integrated action tubelet detector on top of 2D CenterNet and TE-TD plug-in. Our detector operates in a coarse-to-fine manner. Alongside the online tube generation algorithm, TEDdet's detection speed well exceeds real-time requirement (89 FPS).
-
Comprehensive analysis in terms of TEDdet's accuracy, robustness, and efficiency are conducted on public UCF-24 and JHMDB-21 datasets. Without relying on any 3D CNN or optical flow, our action detector achieves competitive accuracy at an unprecedented speed, suggesting a much more feasible solution pertinent to realistic applications.
Please refer to https://github.com/MCG-NJU/MOC-Detector for detailed instructions.
The current version of TEDdet support ResNet18 as the feature extraction backbone. To proceed with training, first download the COCO pretrained weights in Google Drive. COCO pretrained models come from CenterNet. Move pretrained models to ${TEDdet_ROOT}/experiment/modelzoo/.
To train your own model, run the train.py script along with relevant input arguments. For instance:
$python train.py --K 5 --exp_id K5_model --rgb_model TED_K5 --batch_size 16 --master_batch 16 --lr 2.5e-4 --gpus 0 --num_worker 16 --num_epochs 10 --lr_step 5 --dataset hmdb --split 1 --down_ratio 8 --lr_drop 0.1 --ninput 1 --ninputrgb 5 --auto_stop --pretrain coco
Note that by setting --auto_save, a validation step will be carried out at the end of each training epoch in order to save the best-performing model (i.e., model_best.pth). Instead, using --save_all will save every epoch's training model. For more details on possible input arguments, refer to ${TEDdet_ROOT}/src/opts.py.
To evaluate a trained model, one performs two sequential steps from the det.py and ACT.py scripts. The former performs model inference and saves detection results (i.e., .pkl) accordingly. The latter loads the detection results and evaluates them based on designated metrics. For instance:
$python det.py --task stream --K 5 --gpus 0 --batch_size 1 --master_batch 1 --num_workers 1 --rgb_model TED_K5/model_best.pth --inference_dir ../data0/TED_K5_stream --ninput 1 --dataset hmdb --split 1 --down_ratio 8 --ninputrgb 5
Note that TEDdet supports two inference modes: normal and stream mode. Stream mode takes into account a FIFO-based feature-caching mechanism to more efficiently process incoming video frames (the reported speed in the article corresponds to this mode). The mode needs to be specified explicitly during both inference and evaluation.
$python ACT.py --task frameAP --K 5 --th 0.5 --inference_dir ../data0/TED_K5_stream --dataset hmdb --split 1 --evaluation_model trimmed --inference_mode stream --ninputrgb 5
To compute videoAP, one needs to additionally generate action tubes using ACT.py by first configuring --task to BuildTubes and then evaluate videoAP.
Our codes refer to the work of CenterNet, MOC and TEA. We thank the authors for their structured implementations and comments!
