Video Question Answering With Semantic Disentanglement and Reasoning

Jin Liu, Guoxiang Wang, Jialong Xie, Fengyu Zhou, Huijuan Xu

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Video question answering aims to provide correct answers given complex videos and related questions, placing high demands on comprehension ability in both video and language processing. Existing works frame this task as a multi-modal fusion process that aligns the video context with the whole question, ignoring the rich semantic details carried separately by nouns and verbs in the multi-modal reasoning process that derives the final answer. To fill this gap, in addition to the semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure to adapt the semantic understanding at different granularities, e.g., word-level and sentence-level, with context exchange. To enhance the holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model handles both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NextQA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effectiveness and superior performance of our proposed model. Our code will be released upon the paper's acceptance.
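
The contrastive sampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how "irrelevant" questions are chosen, so the sketch assumes a question is irrelevant when it belongs to a different video and differs in text from the anchor question. The names `QASample` and `sample_negative_questions` and the toy data are hypothetical.

```python
import random
from typing import NamedTuple


class QASample(NamedTuple):
    """Hypothetical container for one VideoQA training example."""
    video_id: str
    question: str
    answer: str


def sample_negative_questions(samples, anchor_index, k, rng):
    """Pick up to k questions that are irrelevant to the anchor sample.

    Assumption (not from the paper): a question is "irrelevant" if it was
    asked about a different video and its text differs from the anchor's.
    """
    anchor = samples[anchor_index]
    # Deduplicate and sort so the candidate order is deterministic.
    candidates = sorted({
        s.question
        for s in samples
        if s.video_id != anchor.video_id and s.question != anchor.question
    })
    return rng.sample(candidates, min(k, len(candidates)))


# Toy data, invented purely for illustration.
samples = [
    QASample("v1", "what is the man holding", "guitar"),
    QASample("v1", "what does the man do after sitting down", "sing"),
    QASample("v2", "how many dogs are running", "two"),
    QASample("v3", "why is the baby laughing", "tickled"),
]
negatives = sample_negative_questions(samples, 0, k=2, rng=random.Random(0))
```

In a training loop, each negative question would be paired with the anchor video and answer, and a contrastive objective would push the model to score the true question higher than the negatives. The exact loss used in the paper is not stated in the abstract.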

Original language: English (US)
Pages (from-to): 3663-3673
Number of pages: 11
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 5
State: Published - May 1 2024

All Science Journal Classification (ASJC) codes

  • Media Technology
  • Electrical and Electronic Engineering
