TY - JOUR
T1 - The Role of Rhythm and Vowel Space in Speech Recognition
AU - Lai, Li Fang
AU - van Hell, Janet G.
AU - Lipski, John
N1 - Funding Information:
We thank our participants for providing the research data and Carly Danielson and Aleah Combs, former lab managers at Penn State’s Center for Language Science, who helped conduct the online experiments. We also thank Penn State’s Center for Language Science for their financial support for this study.
Publisher Copyright:
© 2022 International Speech Communication Association. All rights reserved.
PY - 2022
Y1 - 2022
AB - This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analysis were conducted. First, we computed the word error rates between the ground truth and the transcripts generated by DARLA. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and the polygon area. We noticed that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between the polygon area, speech intelligibility, and error rates was found. These results, albeit suggestive, point out some promising directions for improving acoustic modeling in speech recognition.
UR - http://www.scopus.com/inward/record.url?scp=85166278777&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85166278777&partnerID=8YFLogxK
U2 - 10.21437/SpeechProsody.2022-87
DO - 10.21437/SpeechProsody.2022-87
M3 - Conference article
AN - SCOPUS:85166278777
SN - 2333-2042
VL - 2022-May
SP - 425
EP - 429
JO - Proceedings of the International Conference on Speech Prosody
JF - Proceedings of the International Conference on Speech Prosody
T2 - 11th International Conference on Speech Prosody, Speech Prosody 2022
Y2 - 23 May 2022 through 26 May 2022
ER -