CNR Institutional Research Information System

Assessing Smartphone Speech Recognition across Diverse English Accents: A Preliminary Study

Claudia Soria; Rosalba Nodari; Silvia Calamai

2024

Abstract

Voice-activated artificial intelligence in smartphones is making spoken human-device interaction increasingly common: many users rely on these systems for everyday tasks such as creating shopping lists, dictating messages, or querying information (Ammari et al., 2019). The success of these interactions depends heavily on the accuracy of the speech recognition technology embedded in the device, which accents and dialects can significantly affect. Recent advances have improved the recognition of accents beyond standard British or American English, driven by the need to ensure equitable service and representation for diverse communities (Choe et al., 2022; Koenecke et al., 2020). Although some automatic speech recognition (ASR) systems embedded in smartphones offer recognition for certain second language (L2) English accents (Lai, 2021), research on their performance remains limited (Chan et al., 2022; Del Rio et al., 2023; Tadimeti et al., 2022). This work presents preliminary findings from a study assessing how well common smartphone speech recognition systems handle a range of L1 (native) and L2 (non-native) English accents. The study used 26 audio clips from the CIRCE corpus, in which the same short text is read aloud by a male and a female speaker of each of four L1 and nine L2 English accents. The L1 accents were Standard American, African American, Standard British, and Multicultural London English; the L2 accents were Indian, Nigerian, Bosnian, Italian, Turkish, Ukrainian, Chinese, German, and Russian. Each clip averaged 0.32 seconds in length. To simulate typical user experiences, the research evaluated Apple's Siri voice recognition for two everyday tasks: message/note dictation and voice search. The audio clips were played from a laptop while voice recognition was active on an iPhone running the Notes app. Each accent was tested against Siri's nine English locales (USA, UK, Australia, Canada, Ireland, India, New Zealand, Singapore, and South Africa). Each clip was played three times per locale, giving 26 × 9 × 3 = 702 transcripts in total. Transcript accuracy was measured with the Word Error Rate (WER), which allows the performance of ASR systems to be compared. This new comparable speech corpus provided insights into which L1 and L2 English accents common smartphones recognize best, as well as a comparative analysis of the automatic recognition models for different local Englishes. Finally, these preliminary results were compared with the existing literature on the human intelligibility of L1 and L2 accents (Verbeke and Simon, 2023).
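The abstract names WER as its accuracy measure but does not say how it was computed. As a minimal sketch of the standard definition, (substitutions + deletions + insertions) divided by the number of reference words, the following Python function computes a word-level edit distance between a reference text and a transcript. The function name `word_error_rate`, the lowercase-and-whitespace normalization, and the example sentences are assumptions made for illustration; they are not the authors' scoring procedure, and punctuation handling is deliberately left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both texts are lowercased and split on whitespace only;
    a real evaluation would also need a policy for punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical reference/transcript pair, not taken from the study.
    print(word_error_rate("add milk to the shopping list",
                          "add silk to shopping list"))
```

Because insertions count as errors, WER can exceed 1.0 when a transcript is much longer than the reference, which is worth keeping in mind when averaging scores across clips and locales.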

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Claudia Soria	en
dc.authority.people	Rosalba Nodari	en
dc.authority.people	Silvia Calamai	en
dc.authority.project	2022-1-IT02-KA220-SCH-000087602	en
dc.collection.id.s	69aaa6b3-f0f0-47c1-b9a1-040bae867ec3	*
dc.collection.name	04.02 Abstract in Atti di convegno	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2025/01/22 15:40:09	-
dc.date.available	2025/01/22 15:40:09	-
dc.date.firstsubmission	2025/01/22 14:21:58	*
dc.date.issued	2024	-
dc.date.submission	2025/01/22 14:21:58	*
dc.description.allpeople	Soria, Claudia; Nodari, Rosalba; Calamai, Silvia	-
dc.description.allpeopleoriginal	Claudia Soria, Rosalba Nodari, Silvia Calamai	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	3	-
dc.identifier.source	manual	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/529673	-
dc.identifier.url	https://www.filolog.uni.lodz.pl/fileadmin/Wydzialy/Wydzial_Filologiczny/PLIKI_KONFERENCJE/ACCENTS/Accents2024/Accents-2024-BoA.pdf	en
dc.language.iso	eng	en
dc.publisher.country	POL	en
dc.publisher.name	University of Lodz	en
dc.publisher.place	Lodz	en
dc.relation.allauthors	Aleksandra Matysiak	en
dc.relation.conferencedate	12-14 December 2024	en
dc.relation.conferencename	ACCENTS 2024 Accents in various contexts 17th International Conference on Native and Non-native Accents of English	en
dc.relation.conferenceplace	Lodz, Poland	en
dc.relation.firstpage	63	en
dc.relation.ispartofbook	ACCENTS 2024 Accents in various contexts 17th International Conference on Native and Non-native Accents of English, Book of Abstracts	en
dc.relation.lastpage	65	en
dc.relation.numberofpages	3	en
dc.relation.projectAcronym	CIRCE	en
dc.relation.projectAwardNumber	2022-1-IT02-KA220-SCH-000087602	en
dc.relation.projectAwardTitle	Counteracting Accent Discrimination Practices in Education	en
dc.relation.projectFunderName	Erasmus+	en
dc.relation.projectFundingStream	European	en
dc.subject.keywords	automatic speech recognition, WER, English accents, L1 accents, L2 accents	-
dc.subject.singlekeyword	automatic speech recognition	*
dc.subject.singlekeyword	WER	*
dc.subject.singlekeyword	English accents	*
dc.subject.singlekeyword	L1 accents	*
dc.subject.singlekeyword	L2 accents	*
dc.title	Assessing Smartphone Speech Recognition across Diverse English Accents: A Preliminary Study	en
dc.type.circulation	International	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.02 Abstract in Atti di convegno	it
dc.type.impactfactor	no	en
dc.type.miur	274	-
dc.type.referee	Scientific committee	en
iris.mediafilter.data	2025/04/05 13:18:08	*
iris.orcid.lastModifiedDate	2025/02/05 10:39:58	*
iris.orcid.lastModifiedMillisecond	1738748398376	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.02 Abstract in Atti di convegno

Files in this item:

File	Access	Type	License	Size	Format
Assessing Smartphone Speech Recognition across Diverse English Accents A Preliminary Study.pdf	Open access	Publisher's version (PDF)	Creative Commons	281.68 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/529673
