TY - GEN
T1 - Uncovering Human Traits in Determining Real and Spoofed Audio
T2 - 2024 CHI Conference on Human Factors in Computing Systems, CHI 2024
AU - Han, Chaeeun
AU - Mitra, Prasenjit
AU - Billah, Syed Masum
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s)
PY - 2024/5/11
Y1 - 2024/5/11
N2 - This paper explores how blind and sighted individuals perceive real and spoofed audio, highlighting differences and similarities between the groups. Through two studies, we find that both groups focus on specific human traits in audio, such as accents, vocal inflections, breathing patterns, and emotions, to assess audio authenticity. We further reveal that humans, irrespective of visual ability, can still outperform current state-of-the-art machine learning models in discerning audio authenticity; however, the task proves psychologically demanding. Moreover, detection accuracy scores between blind and sighted individuals are comparable, but each group exhibits unique strengths: the sighted group excels at detecting deepfake-generated audio, while the blind group excels at detecting text-to-speech (TTS) generated audio. These findings not only deepen our understanding of machine-manipulated and neural-rendered audio but also have implications for developing countermeasures, such as perceptible watermarks and human-AI collaboration strategies for spoofing detection.
AB - This paper explores how blind and sighted individuals perceive real and spoofed audio, highlighting differences and similarities between the groups. Through two studies, we find that both groups focus on specific human traits in audio, such as accents, vocal inflections, breathing patterns, and emotions, to assess audio authenticity. We further reveal that humans, irrespective of visual ability, can still outperform current state-of-the-art machine learning models in discerning audio authenticity; however, the task proves psychologically demanding. Moreover, detection accuracy scores between blind and sighted individuals are comparable, but each group exhibits unique strengths: the sighted group excels at detecting deepfake-generated audio, while the blind group excels at detecting text-to-speech (TTS) generated audio. These findings not only deepen our understanding of machine-manipulated and neural-rendered audio but also have implications for developing countermeasures, such as perceptible watermarks and human-AI collaboration strategies for spoofing detection.
UR - http://www.scopus.com/inward/record.url?scp=85194844236&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85194844236&partnerID=8YFLogxK
U2 - 10.1145/3613904.3642817
DO - 10.1145/3613904.3642817
M3 - Conference contribution
AN - SCOPUS:85194844236
T3 - Conference on Human Factors in Computing Systems - Proceedings
BT - CHI 2024 - Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems
PB - Association for Computing Machinery
Y2 - 11 May 2024 through 16 May 2024
ER -