TY - GEN
T1 - Multimodal Emotion Recognition Harnessing the Complementarity of Speech, Language, and Vision
AU - Thebaud, Thomas
AU - Favaro, Anna
AU - Guan, Yaohan
AU - Yang, Yuchen
AU - Singh, Prabhav
AU - Villalba, Jesus
AU - Moro-Velazquez, Laureano
AU - Dehak, Najim
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/11/4
Y1 - 2024/11/4
AB - In audiovisual emotion recognition, a significant challenge lies in developing neural network architectures capable of effectively harnessing and integrating multimodal information. This study introduces a methodology for the Empathic Virtual Agent Challenge (EVAC) that uses state-of-the-art speech, language, and image models. Specifically, we leverage pre-trained models, including multilingual variants fine-tuned on French data for each modality, and integrate them through late fusion. Extensive experimentation and validation demonstrate that our approach achieves competitive results on the challenge dataset. Our findings show that multimodal approaches outperform unimodal methods on both the Core Affect Presence and Intensity task and the Appraisal Dimensions task, underscoring the effectiveness of integrating diverse modalities and the importance of leveraging multiple sources of information to capture nuanced emotional states accurately and robustly in real-world applications.
UR - https://www.scopus.com/pages/publications/85212593117
U2 - 10.1145/3678957.3689332
DO - 10.1145/3678957.3689332
M3 - Conference contribution
AN - SCOPUS:85212593117
T3 - ACM International Conference Proceeding Series
SP - 684
EP - 689
BT - ICMI 2024 - Proceedings of the 26th International Conference on Multimodal Interaction
PB - Association for Computing Machinery
T2 - 26th International Conference on Multimodal Interaction, ICMI 2024
Y2 - 4 November 2024 through 8 November 2024
ER -