Assessing Smartphone Speech Recognition across Diverse English Accents: A Preliminary Study
Claudia Soria
2024
Abstract
Voice-activated artificial intelligence in smartphones is making spoken human-device interactions increasingly common, with many users relying on these systems for everyday tasks such as creating shopping lists, dictating messages, or querying information (Ammari et al., 2019). The success of these interactions relies heavily on the accuracy of the speech recognition technology embedded in devices, which can be significantly affected by accents and dialects. Recent advances have improved the recognition of accents beyond standard British or American English, driven by the need to ensure equitable service and representation for diverse communities (Choe et al., 2022; Koenecke et al., 2020). Although some automatic speech recognition (ASR) systems embedded in smartphones support certain second language (L2) English accents (Lai, 2021), research on their performance remains limited (Chan et al., 2022; Del Rio et al., 2023; Tadimeti et al., 2022). This work presents preliminary findings from a study assessing how well common smartphone speech recognition systems handle a range of L1 (native) and L2 (non-native) English accents. The study used 36 audio clips from the CIRCE corpus, consisting of the same short text read aloud by male and female speakers of four L1 and nine L2 English accents. The L1 accents were Standard American, African American, Standard British, and Multicultural London English; the L2 accents covered Indian, Nigerian, Bosnian, Italian, Turkish, Ukrainian, Chinese, German, and Russian. Each clip averaged 0.32 seconds in length. To simulate typical user experiences, the study evaluated Apple’s Siri voice recognition on two everyday tasks: message/note dictation and voice search. The audio clips were played from a laptop with voice recognition activated on an iPhone using the Notes app. Siri’s different English locales (USA, UK, Australia, Canada, Japan, India, New Zealand, Singapore, and South Africa) were tested for each accent. Each clip was played three times, yielding a total of 702 transcripts. Transcript accuracy was measured using the Word Error Rate (WER) to compare and evaluate the performance of the ASR systems. This new, comparable speech corpus provides insights into which L1 and L2 English accents are best recognized by common smartphones and enables a comparative analysis of how different recognition models handle local Englishes. These preliminary results are also compared with existing literature on the human intelligibility of L1 and L2 accents (Verbeke and Simon, 2023).
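
The evaluation metric named in the abstract, Word Error Rate, is the word-level edit distance between a reference transcript and the ASR output, normalized by the reference length: WER = (S + D + I) / N, where S, D, and I are substitutions, deletions, and insertions. The sketch below illustrates this standard computation; the `wer` function and the example sentences are illustrative assumptions, not code or data from the study.

```python
# Minimal sketch of the standard WER computation (not the study's code):
# WER = (S + D + I) / N, computed via word-level Levenshtein distance.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()  # assumes a non-empty reference
    hyp = hypothesis.lower().split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one substitution in a four-word reference -> 0.25.
print(wer("add milk to lists", "add milk to list"))
```

A WER of 0 indicates a perfect transcript; values can exceed 1 when the ASR output inserts many spurious words, which is why WER is usually reported alongside the reference length.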
| File | Size | Format |
|---|---|---|
| Assessing Smartphone Speech Recognition across Diverse English Accents A Preliminary Study.pdf (open access; Type: Editorial Version (PDF); License: Creative Commons) | 281.68 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


