Assessing Smartphone Speech Recognition across Diverse English Accents: A Preliminary Study

Claudia Soria (first author), Rosalba Nodari, Silvia Calamai
2024

Abstract

Voice-activated artificial intelligence in smartphones is making spoken human-device interaction increasingly common, and many users rely on these systems for everyday tasks such as creating shopping lists, dictating messages, or querying information (Ammari et al., 2019). The success of these interactions depends heavily on the accuracy of the speech recognition technology embedded in the device, which can be significantly affected by accents and dialects. Recent advances have improved the recognition of accents beyond standard British or American English, driven by the need to ensure equitable service and representation for diverse communities (Choe et al., 2022; Koenecke et al., 2020). Although some automatic speech recognition (ASR) systems embedded in smartphones offer recognition for certain second-language (L2) English accents (Lai, 2021), research on their performance remains limited (Chan et al., 2022; Del Rio et al., 2023; Tadimeti et al., 2022). This work presents preliminary findings from a study assessing how well common smartphone speech recognition systems handle a range of L1 (native) and L2 (non-native) English accents. The study used 26 audio clips from the CIRCE corpus, in which the same short text is read aloud by a male and a female speaker of each of four L1 and nine L2 English accents (13 accents × 2 speakers). The L1 accents were Standard American, African American, Standard British, and Multicultural London English; the L2 accents were Indian, Nigerian, Bosnian, Italian, Turkish, Ukrainian, Chinese, German, and Russian. Each clip averaged 0:32 minutes in length. To simulate typical user experiences, the research evaluated Apple’s Siri voice recognition for two everyday tasks: message/note dictation and voice search. The audio clips were played from a laptop while voice recognition was active on an iPhone running the Notes app. Each accent was tested against Siri’s nine English locales (USA, UK, Australia, Canada, Japan, India, New Zealand, Singapore, and South Africa). Each clip was played three times per locale, resulting in a total of 702 transcripts (26 clips × 9 locales × 3 repetitions). Transcript accuracy was measured with the Word Error Rate (WER), which allows recognition performance to be compared across accents and locales. This new comparable speech corpus provides insight into which L1 and L2 English accents are best recognized by common smartphones, and supports a comparative analysis of the recognition models for different local Englishes. Finally, these preliminary results were compared with the existing literature on the human intelligibility of L1 and L2 accents (Verbeke and Simon, 2023).
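The abstract names WER as the accuracy measure but does not say how it was computed. As a rough illustration only, the following Python sketch implements the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference text and a transcript, divided by the number of reference words. The function name `word_error_rate`, the lowercasing and whitespace tokenization, and the sample sentences are assumptions made for this example, not details taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: text is lowercased and split on whitespace; the study's
    own normalization (punctuation, numerals, casing) is not described.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical reference and transcript: one word dropped out of six.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the divisor is the reference length, WER can exceed 1.0 when a transcript contains many inserted words, which is worth keeping in mind when averaging scores across accents or locales.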
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Claudia Soria en
dc.authority.people Rosalba Nodari en
dc.authority.people Silvia Calamai en
dc.authority.project 2022-1-IT02-KA220-SCH-000087602 en
dc.collection.id.s 69aaa6b3-f0f0-47c1-b9a1-040bae867ec3 *
dc.collection.name 04.02 Abstract in conference proceedings *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Not assigned *
dc.date.accessioned 2025/01/22 15:40:09 -
dc.date.available 2025/01/22 15:40:09 -
dc.date.firstsubmission 2025/01/22 14:21:58 *
dc.date.issued 2024 -
dc.date.submission 2025/01/22 14:21:58 *
dc.description.allpeople Soria, Claudia; Nodari, Rosalba; Calamai, Silvia -
dc.description.allpeopleoriginal Claudia Soria, Rosalba Nodari, Silvia Calamai en
dc.description.fulltext open en
dc.description.international no en
dc.description.numberofauthors 3 -
dc.identifier.source manual *
dc.identifier.uri https://hdl.handle.net/20.500.14243/529673 -
dc.identifier.url https://www.filolog.uni.lodz.pl/fileadmin/Wydzialy/Wydzial_Filologiczny/PLIKI_KONFERENCJE/ACCENTS/Accents2024/Accents-2024-BoA.pdf en
dc.language.iso eng en
dc.publisher.country POL en
dc.publisher.name University of Lodz en
dc.publisher.place Lodz en
dc.relation.allauthors Aleksandra Matysiak en
dc.relation.conferencedate 12-14 December 2024 en
dc.relation.conferencename ACCENTS 2024 Accents in various contexts 17th International Conference on Native and Non-native Accents of English en
dc.relation.conferenceplace Lodz, Poland en
dc.relation.firstpage 63 en
dc.relation.ispartofbook ACCENTS 2024 Accents in various contexts 17th International Conference on Native and Non-native Accents of English, Book of Abstract en
dc.relation.lastpage 65 en
dc.relation.numberofpages 3 en
dc.relation.projectAcronym CIRCE en
dc.relation.projectAwardNumber 2022-1-IT02-KA220-SCH-000087602 en
dc.relation.projectAwardTitle Counteracting Accent Discrimination Practices in Education en
dc.relation.projectFunderName Erasmus+ en
dc.relation.projectFundingStream European en
dc.subject.keywords automatic speech recognition, WER, English accents, L1 accents, L2 accents -
dc.subject.singlekeyword automatic speech recognition *
dc.subject.singlekeyword WER *
dc.subject.singlekeyword English accents *
dc.subject.singlekeyword L1 accents *
dc.subject.singlekeyword L2 accents *
dc.title Assessing Smartphone Speech Recognition across Diverse English Accents: A Preliminary Study en
dc.type.circulation International en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.02 Abstract in conference proceedings it
dc.type.impactfactor no en
dc.type.miur 274 -
dc.type.referee Scientific committee en
Appears in the following types: 04.02 Abstract in conference proceedings
Files in this record:
File: Assessing Smartphone Speech Recognition across Diverse English Accents A Preliminary Study.pdf
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 281.68 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/529673