TY - GEN
T1 - Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions
AU - Li, Zhi
AU - He, Lu
AU - Xu, Huijuan
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - Action understanding has evolved into the era of fine granularity, as most human behaviors in real life differ only in minor details. To detect these fine-grained actions accurately in a label-efficient way, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Without careful design to capture the subtle differences between fine-grained actions, previous weakly-supervised models for general action detection cannot perform well in the fine-grained setting. We propose to model actions as combinations of reusable atomic actions that are automatically discovered from data through self-supervised clustering, in order to capture the commonality and individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels by leveraging the semantic label hierarchy. Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level, and coarse action class level, with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of our proposed weakly-supervised model for fine-grained action detection, where it achieves state-of-the-art results.
UR - http://www.scopus.com/inward/record.url?scp=85144539642&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144539642&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-20080-9_33
DO - 10.1007/978-3-031-20080-9_33
M3 - Conference contribution
AN - SCOPUS:85144539642
SN - 9783031200793
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 567
EP - 584
BT - Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
A2 - Avidan, Shai
A2 - Brostow, Gabriel
A2 - Cissé, Moustapha
A2 - Farinella, Giovanni Maria
A2 - Hassner, Tal
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th European Conference on Computer Vision, ECCV 2022
Y2 - 23 October 2022 through 27 October 2022
ER -