TY - JOUR
T1 - Video Question Answering With Semantic Disentanglement and Reasoning
AU - Liu, Jin
AU - Wang, Guoxiang
AU - Xie, Jialong
AU - Zhou, Fengyu
AU - Xu, Huijuan
N1 - Publisher Copyright:
© 2024 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2024/5/1
Y1 - 2024/5/1
N2 - Video question answering aims to provide correct answers given complex videos and related questions, posing high demands on comprehension ability in both video and language processing. Existing works frame this task as a multi-modal fusion process by aligning the video context with the whole question, ignoring the rich semantic details of nouns and verbs separately in the multi-modal reasoning process to derive the final answer. To fill this gap, in addition to the semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure to adapt semantic understanding at different granularities with context exchange, e.g., word-level and sentence-level. To enhance holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model is competent for both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NExT-QA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effectiveness and superior performance of our proposed model. Our code will be released upon the paper's acceptance.
AB - Video question answering aims to provide correct answers given complex videos and related questions, posing high demands on comprehension ability in both video and language processing. Existing works frame this task as a multi-modal fusion process by aligning the video context with the whole question, ignoring the rich semantic details of nouns and verbs separately in the multi-modal reasoning process to derive the final answer. To fill this gap, in addition to the semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure to adapt semantic understanding at different granularities with context exchange, e.g., word-level and sentence-level. To enhance holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model is competent for both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NExT-QA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effectiveness and superior performance of our proposed model. Our code will be released upon the paper's acceptance.
UR - http://www.scopus.com/inward/record.url?scp=85181577253&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85181577253&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3317447
DO - 10.1109/TCSVT.2023.3317447
M3 - Article
AN - SCOPUS:85181577253
SN - 1051-8215
VL - 34
SP - 3663
EP - 3673
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 5
ER -