TY - GEN
T1 - QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
T2 - 22nd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
AU - Ye, Muchao
AU - You, Quanzeng
AU - Ma, Fenglong
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Audio video scene-aware dialog (AVSD) is a new and more challenging visual question answering (VQA) task because of the higher complexity of feature extraction and fusion introduced by the additional modalities. Although recent methods have achieved early success in improving feature extraction techniques for AVSD, feature fusion still needs further investigation. In this paper, inspired by the success of the self-attention mechanism and the importance of question understanding in VQA, we propose a question-guided self-attentive multi-modal fusion network (QUALIFIER) to improve AVSD in the stages of feature fusion and answer generation. Specifically, after extracting features and learning a comprehensive feature for each modality, we first use the designed self-attentive multi-modal fusion (SMF) module to aggregate each feature with correlated information learned from the others. Then, prioritizing the question feature, we concatenate it with each fused feature to guide the generation of a natural language response to the question. Experimentally, QUALIFIER outperforms baseline methods on the large-scale AVSD dataset DSTC7. Additionally, human evaluation and ablation study results demonstrate the effectiveness of our network architecture.
AB - Audio video scene-aware dialog (AVSD) is a new and more challenging visual question answering (VQA) task because of the higher complexity of feature extraction and fusion introduced by the additional modalities. Although recent methods have achieved early success in improving feature extraction techniques for AVSD, feature fusion still needs further investigation. In this paper, inspired by the success of the self-attention mechanism and the importance of question understanding in VQA, we propose a question-guided self-attentive multi-modal fusion network (QUALIFIER) to improve AVSD in the stages of feature fusion and answer generation. Specifically, after extracting features and learning a comprehensive feature for each modality, we first use the designed self-attentive multi-modal fusion (SMF) module to aggregate each feature with correlated information learned from the others. Then, prioritizing the question feature, we concatenate it with each fused feature to guide the generation of a natural language response to the question. Experimentally, QUALIFIER outperforms baseline methods on the large-scale AVSD dataset DSTC7. Additionally, human evaluation and ablation study results demonstrate the effectiveness of our network architecture.
UR - http://www.scopus.com/inward/record.url?scp=85126101244&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126101244&partnerID=8YFLogxK
U2 - 10.1109/WACV51458.2022.00256
DO - 10.1109/WACV51458.2022.00256
M3 - Conference contribution
AN - SCOPUS:85126101244
T3 - Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
SP - 2503
EP - 2511
BT - Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 January 2022 through 8 January 2022
ER -