The Role of Rhythm and Vowel Space in Speech Recognition

Li Fang Lai, Janet G. van Hell, John Lipski

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analyses were conducted. First, we computed the word error rates of the transcripts generated by DARLA against the ground truth. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and polygon area. We found that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that a more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between polygon area, speech intelligibility, and error rates was found. These results, although only suggestive, point to some promising directions for improving acoustic modeling in speech recognition.
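For readers unfamiliar with the measures named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' pipeline and does not use DARLA. The function names, the input formats, the use of the population standard deviation for ΔV, and the toy values in the usage note are all assumptions made for illustration.

```python
import statistics


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def rhythm_metrics(intervals):
    """Return (%V, deltaV) from (label, duration) pairs labeled 'V' or 'C'.

    %V is the share of total duration that is vocalic; deltaV is the
    standard deviation of vocalic interval durations (population
    standard deviation here, which is an assumption).
    """
    vocalic = [d for label, d in intervals if label == "V"]
    total = sum(d for _, d in intervals)
    percent_v = 100.0 * sum(vocalic) / total
    delta_v = statistics.pstdev(vocalic)
    return percent_v, delta_v


def convex_hull_area(points):
    """Area of the convex hull of (F1, F2) points, via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(reversed(pts))
    # Shoelace formula over the hull vertices.
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0
```

As a toy usage example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words, and `convex_hull_area` ignores any formant point that lies inside the hull, so a more dispersed set of vowel tokens yields a larger area.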

Original language: English (US)
Pages (from-to): 425-429
Number of pages: 5
Journal: Proceedings of the International Conference on Speech Prosody
Volume: 2022-May
DOIs
State: Published - 2022
Event: 11th International Conference on Speech Prosody, Speech Prosody 2022 - Lisbon, Portugal
Duration: May 23 2022 - May 26 2022

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
