TY - JOUR
T1 - SaGCN
T2 - Semantic-Aware Graph Calibration Network for Temporal Sentence Grounding
AU - Chen, Tongbao
AU - Wang, Wenmin
AU - Han, Kangrui
AU - Xu, Huijuan
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2023/6/1
Y1 - 2023/6/1
N2 - Temporal sentence grounding is a challenging task that aims to localize the semantically corresponding segment in an untrimmed video according to a given language query. Existing methods either utilize a cross-modal matching architecture following a scan-and-rank pipeline or directly predict, for each frame, the probability of being a target boundary based on the entire video content. However, because they ignore query semantic concepts and local and global cross-modal context, such methods perform poorly when some critical semantic concepts in the query are relevant to multiple video segments or when the desired video segment contains a query-irrelevant scene. In this paper, we propose a novel semantic-aware graph calibration network (SaGCN) to address these issues. Specifically, we first introduce a semantic-aware local relational graph module that captures the inherent relationships among local contextual information relevant to specific semantic concepts, enabling fine-grained cross-modal interaction. Then, a semantic-aware global relational graph module is derived to integrate global contextual information and achieve cross-modal alignment. Finally, an attention-based calibration module is designed to eliminate irrelevant information retained in the visual modality under the guidance of the query description. Extensive experiments on two widely used datasets (Charades-STA and TACoS) verify the effectiveness of the proposed SaGCN, which achieves significant and consistent improvements over state-of-the-art approaches.
AB - Temporal sentence grounding is a challenging task that aims to localize the semantically corresponding segment in an untrimmed video according to a given language query. Existing methods either utilize a cross-modal matching architecture following a scan-and-rank pipeline or directly predict, for each frame, the probability of being a target boundary based on the entire video content. However, because they ignore query semantic concepts and local and global cross-modal context, such methods perform poorly when some critical semantic concepts in the query are relevant to multiple video segments or when the desired video segment contains a query-irrelevant scene. In this paper, we propose a novel semantic-aware graph calibration network (SaGCN) to address these issues. Specifically, we first introduce a semantic-aware local relational graph module that captures the inherent relationships among local contextual information relevant to specific semantic concepts, enabling fine-grained cross-modal interaction. Then, a semantic-aware global relational graph module is derived to integrate global contextual information and achieve cross-modal alignment. Finally, an attention-based calibration module is designed to eliminate irrelevant information retained in the visual modality under the guidance of the query description. Extensive experiments on two widely used datasets (Charades-STA and TACoS) verify the effectiveness of the proposed SaGCN, which achieves significant and consistent improvements over state-of-the-art approaches.
UR - https://www.scopus.com/pages/publications/85144029505
U2 - 10.1109/TCSVT.2022.3226488
DO - 10.1109/TCSVT.2022.3226488
M3 - Article
AN - SCOPUS:85144029505
SN - 1051-8215
VL - 33
SP - 3003
EP - 3016
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 6
ER -