TY - JOUR
T1 - STAR
T2 - Efficient SpatioTemporal Modeling for Action Recognition
AU - Kumar, Abhijeet
AU - Abrams, Samuel
AU - Kumar, Abhishek
AU - Narayanan, Vijaykrishnan
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/2
Y1 - 2023/2
N2 - Action recognition in video has gained significant attention over the past several years. While conventional 2D CNNs have found great success in understanding images, they are not as effective in capturing temporal relationships present in video. By contrast, 3D CNNs capture spatiotemporal information well, but they incur a high computational cost, making deployment challenging. In video, key information is typically confined to a small number of frames, though many current approaches require decompressing and processing all frames, which wastes resources. Others work directly on the compressed domain but require multiple input streams to understand the data. In our work, we directly operate on compressed video and extract information solely from intracoded frames (I-frames) avoiding the use of motion vectors and residuals for motion information making this a single-stream network. This reduces processing time and energy consumption, by extension, making this approach more accessible for a wider range of machines and uses. Extensive testing is employed on the UCF101 (Soomro et al. in UCF101: a dataset of 101 human actions classes from videos in the Wild, 2012) and HMDB51 (Kuehne et al., in: Jhuang, Garrote, Poggio, Serre (eds) Proceedings of the international conference on computer vision (ICCV), 2011) datasets to evaluate our framework and show that computational complexity is reduced significantly while achieving competitive accuracy to existing compressed domain efforts, i.e., 92.6% top1 accuracy in UCF-101 and 62.9% in HMDB-51 dataset with 24.3M parameters and 4 GFLOPS and energy savings of over 11× for the two datasets versus CoViAR (Wu et al. in Compressed video action recognition, 2018).
AB - Action recognition in video has gained significant attention over the past several years. While conventional 2D CNNs have found great success in understanding images, they are not as effective in capturing temporal relationships present in video. By contrast, 3D CNNs capture spatiotemporal information well, but they incur a high computational cost, making deployment challenging. In video, key information is typically confined to a small number of frames, though many current approaches require decompressing and processing all frames, which wastes resources. Others work directly on the compressed domain but require multiple input streams to understand the data. In our work, we directly operate on compressed video and extract information solely from intracoded frames (I-frames) avoiding the use of motion vectors and residuals for motion information making this a single-stream network. This reduces processing time and energy consumption, by extension, making this approach more accessible for a wider range of machines and uses. Extensive testing is employed on the UCF101 (Soomro et al. in UCF101: a dataset of 101 human actions classes from videos in the Wild, 2012) and HMDB51 (Kuehne et al., in: Jhuang, Garrote, Poggio, Serre (eds) Proceedings of the international conference on computer vision (ICCV), 2011) datasets to evaluate our framework and show that computational complexity is reduced significantly while achieving competitive accuracy to existing compressed domain efforts, i.e., 92.6% top1 accuracy in UCF-101 and 62.9% in HMDB-51 dataset with 24.3M parameters and 4 GFLOPS and energy savings of over 11× for the two datasets versus CoViAR (Wu et al. in Compressed video action recognition, 2018).
UR - http://www.scopus.com/inward/record.url?scp=85138257571&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85138257571&partnerID=8YFLogxK
U2 - 10.1007/s00034-022-02160-x
DO - 10.1007/s00034-022-02160-x
M3 - Article
AN - SCOPUS:85138257571
SN - 0278-081X
VL - 42
SP - 705
EP - 723
JO - Circuits, Systems, and Signal Processing
JF - Circuits, Systems, and Signal Processing
IS - 2
ER -