QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog

Muchao Ye, Quanzeng You, Fenglong Ma

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Audio visual scene-aware dialog (AVSD) is a new and more challenging visual question answering (VQA) task, because the additional modalities make feature extraction and fusion more complex. Although recent methods have achieved early success in improving feature extraction techniques for AVSD, feature fusion still needs further investigation. In this paper, inspired by the success of the self-attention mechanism and the importance of understanding questions in VQA, we propose a question-guided self-attentive multi-modal fusion network (QUALIFIER) to improve AVSD at the feature fusion and answer generation stages. Specifically, after extracting features and learning a comprehensive feature for each modality, we first use the designed self-attentive multi-modal fusion (SMF) module to aggregate each feature with the correlated information learned from the others. Then, prioritizing the question feature, we concatenate it with each fused feature to guide the generation of a natural language response to the question. In experiments, QUALIFIER outperforms the baseline methods on the large-scale AVSD dataset DSTC7. Human evaluation and ablation study results further demonstrate the effectiveness of our network architecture.
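The two stages described in the abstract (self-attentive fusion across modalities, then concatenation with the question feature) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, a residual-style sum for aggregation, and hypothetical modality names (`audio`, `video`, `history`) and toy vectors chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attentive_fusion(features):
    """Fuse each modality feature with an attention-weighted mix of all features.

    `features` maps a modality name to a vector (list of floats, same length).
    Assumption: scaled dot-product attention without learned weights.
    """
    names = list(features)
    dim = len(features[names[0]])
    fused = {}
    for name in names:
        query = features[name]
        scores = [dot(query, features[k]) / math.sqrt(dim) for k in names]
        weights = softmax(scores)
        context = [
            sum(w * features[k][i] for w, k in zip(weights, names))
            for i in range(dim)
        ]
        # Assumed aggregation: original feature plus attended context.
        fused[name] = [q + c for q, c in zip(query, context)]
    return fused


def question_guided(fused, question):
    """Concatenate the question vector onto every fused modality feature."""
    return {name: vec + question for name, vec in fused.items()}


# Toy one-hot features; names and values are hypothetical.
features = {
    "audio": [1.0, 0.0, 0.0, 0.0],
    "video": [0.0, 1.0, 0.0, 0.0],
    "history": [0.0, 0.0, 1.0, 0.0],
    "question": [0.0, 0.0, 0.0, 1.0],
}
fused = self_attentive_fusion(features)
guided = question_guided(fused, features["question"])
```

Each entry of `guided` is twice the original feature length: the fused modality feature followed by the question feature. In a real model these vectors would then feed a response decoder, which is omitted here.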

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2503-2511
Number of pages: 9
ISBN (Electronic): 9781665409155
State: Published - 2022
Event: 22nd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022 - Waikoloa, United States
Duration: Jan 4 2022 to Jan 8 2022

Publication series

Name: Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022

Conference

Conference: 22nd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
Country/Territory: United States
City: Waikoloa
Period: 1/4/22 to 1/8/22

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Computer Science Applications
