CNR Institutional Research Information System

Enhancing token boundary detection in disfluent speech

Srivastava Manu; Ferro Marcello; Pirrelli Vito; Coro Gianpaolo

2026

Abstract

This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.
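The abstract mentions an energy-based segmentation step that replaces neural Voice Activity Detection so that pauses and hesitations are preserved. The following is only a minimal sketch of that general idea, not the authors' algorithm: it thresholds short-time log energy and merges speech runs separated by very short gaps. The function names (`frame_energies`, `energy_segments`) and all parameter values (25 ms frames, 10 ms hop, -40 dB threshold, 150 ms minimum pause) are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Short-time log energy (in dB) for each analysis frame."""
    energies = []
    last_start = max(len(samples) - frame_len + 1, 1)
    for start in range(0, last_start, hop):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / max(len(frame), 1)
        # Small floor avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(mean_sq + 1e-12))
    return energies


def energy_segments(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: float = 25.0,       # hypothetical frame length
    hop_ms: float = 10.0,         # hypothetical hop size
    threshold_db: float = -40.0,  # hypothetical speech/silence threshold
    min_pause_ms: float = 150.0,  # gaps shorter than this are not treated as pauses
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of regions above the energy threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    hop = max(1, int(sample_rate * hop_ms / 1000))
    voiced = [e > threshold_db for e in frame_energies(samples, frame_len, hop)]

    # Collect runs of consecutive voiced frames as [start, end) frame indices.
    runs = []
    start = None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i
        elif not is_voiced and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(voiced)))

    # Merge runs separated by gaps shorter than the minimum pause,
    # so that only longer silences survive as segment boundaries.
    min_gap = int(min_pause_ms / hop_ms)
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Convert frame indices to seconds; the last frame of a run ends frame_len samples after its start.
    return [
        (s * hop / sample_rate, ((e - 1) * hop + frame_len) / sample_rate)
        for s, e in merged
    ]
```

Calling `energy_segments(samples, sample_rate)` on a mono signal scaled to [-1, 1] returns a list of speech regions; the gaps between consecutive regions are the candidate pauses. A fixed threshold is the simplest choice here; a real system would more plausibly adapt it to the recording's noise floor.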

Full record (DC)

DC field	Value	Language
dc.authority.ancejournal	INTELLIGENT SYSTEMS WITH APPLICATIONS	en
dc.authority.orgunit	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	en
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Srivastava Manu	en
dc.authority.people	Ferro Marcello	en
dc.authority.people	Pirrelli Vito	en
dc.authority.people	Coro Gianpaolo	en
dc.authority.project	READLET	en
dc.collection.id.s	b3f88f24-048a-4e43-8ab1-6697b90e068e	*
dc.collection.name	01.01 Articolo in rivista	*
dc.contributor.appartenenza	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.appartenenza.mi	973	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/01/02 11:42:55	-
dc.date.available	2026/01/02 11:42:55	-
dc.date.firstsubmission	2025/12/27 23:07:24	*
dc.date.issued	2026	-
dc.date.submission	2025/12/27 23:07:24	*
dc.description.allpeople	Srivastava, Manu; Ferro, Marcello; Pirrelli, Vito; Coro, Gianpaolo	-
dc.description.allpeopleoriginal	Srivastava Manu, Ferro Marcello, Pirrelli Vito, Coro Gianpaolo	en
dc.description.fulltext	open	en
dc.description.numberofauthors	4	-
dc.identifier.doi	10.1016/j.iswa.2025.200614	en
dc.identifier.isi	WOS:001648846800001	-
dc.identifier.scopus	2-s2.0-105025107011	-
dc.identifier.source	bibtex	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/561481	-
dc.identifier.url	https://www.sciencedirect.com/science/article/pii/S2667305325001401	en
dc.language.iso	eng	en
dc.relation.medium	ELETTRONICO	en
dc.relation.numberofpages	14	en
dc.relation.projectAcronym	READLET	en
dc.relation.projectAwardNumber	2017W8HFRX	en
dc.relation.projectAwardTitle	READLET	en
dc.relation.projectFunderName	Ministero dell'Università e della Ricerca	en
dc.relation.projectFundingStream	PRIN 2017	en
dc.relation.volume	29	en
dc.subject.keywordseng	Automatic Speech Recognition, Statistical analysis, Disfluencies, Voice Activity Detection	-
dc.subject.singlekeyword	Automatic Speech Recognition	*
dc.subject.singlekeyword	Statistical analysis	*
dc.subject.singlekeyword	Disfluencies	*
dc.subject.singlekeyword	Voice Activity Detection	*
dc.title	Enhancing token boundary detection in disfluent speech	en
dc.type.circulation	Internazionale	en
dc.type.driver	info:eu-repo/semantics/article	-
dc.type.full	01 Contributo su Rivista::01.01 Articolo in rivista	it
dc.type.impactfactor	si	en
dc.type.miur	262	-
iris.isi.ideLinkStatusDate	2026/06/10 16:39:44	*
iris.isi.ideLinkStatusMillisecond	1781102384170	*
iris.isi.metadataErrorDescription	0	-
iris.isi.metadataErrorType	ERROR_NO_MATCH	-
iris.isi.metadataStatus	ERROR	-
iris.mediafilter.data	2026/01/03 03:24:49	*
iris.orcid.lastModifiedDate	2026/06/10 16:39:44	*
iris.orcid.lastModifiedMillisecond	1781102384118	*
iris.scopus.extIssued	2026	-
iris.scopus.extTitle	Enhancing token boundary detection in disfluent speech	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.bestoahost	publisher	*
iris.unpaywall.bestoaversion	publishedVersion	*
iris.unpaywall.doi	10.1016/j.iswa.2025.200614	*
iris.unpaywall.hosttype	publisher	*
iris.unpaywall.isoa	true	*
iris.unpaywall.journalisindoaj	true	*
iris.unpaywall.landingpage	https://doi.org/10.1016/j.iswa.2025.200614	*
iris.unpaywall.license	cc-by-nc-nd	*
iris.unpaywall.metadataCallLastModified	11/06/2026 03:45:29	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1781142330006	-
iris.unpaywall.oastatus	gold	*
scopus.authority.ancejournal	INTELLIGENT SYSTEMS WITH APPLICATIONS###2667-3053	*
scopus.category	1701	*
scopus.category	1711	*
scopus.category	1707	*
scopus.category	1706	*
scopus.category	1702	*
scopus.contributor.affiliation	Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo” – CNR	-
scopus.contributor.affiliation	Istituto di Linguistica Computazionale “Antonio Zampolli” – CNR	-
scopus.contributor.affiliation	Istituto di Linguistica Computazionale “Antonio Zampolli” – CNR	-
scopus.contributor.affiliation	Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo” – CNR	-
scopus.contributor.afid	60085207	-
scopus.contributor.afid	60008941	-
scopus.contributor.afid	60008941	-
scopus.contributor.afid	60085207	-
scopus.contributor.auid	60017833000	-
scopus.contributor.auid	15759406100	-
scopus.contributor.auid	14833305800	-
scopus.contributor.auid	13104305800	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.name	Manu	-
scopus.contributor.name	Marcello	-
scopus.contributor.name	Vito	-
scopus.contributor.name	Gianpaolo	-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.surname	Srivastava	-
scopus.contributor.surname	Ferro	-
scopus.contributor.surname	Pirrelli	-
scopus.contributor.surname	Coro	-
scopus.date.issued	2026	*
scopus.description.allpeopleoriginal	Srivastava M.; Ferro M.; Pirrelli V.; Coro G.	*
scopus.differences	scopus.subject.keywords	*
scopus.differences	scopus.description.allpeopleoriginal	*
scopus.document.type	ar	*
scopus.document.types	ar	*
scopus.identifier.doi	10.1016/j.iswa.2025.200614	*
scopus.identifier.pui	2042347618	*
scopus.identifier.scopus	2-s2.0-105025107011	*
scopus.journal.sourceid	21101051831	*
scopus.language.iso	eng	*
scopus.publisher.name	Elsevier B.V.	*
scopus.relation.article	200614	*
scopus.relation.volume	29	*
scopus.subject.keywords	Automatic Speech Recognition; Disfluencies; Statistical analysis; Voice Activity Detection;	*
scopus.title	Enhancing token boundary detection in disfluent speech	*
scopus.titleeng	Enhancing token boundary detection in disfluent speech	*
Appears in types:	01.01 Articolo in rivista (journal article)

Files in this record:

File	Size	Format
published_paper.pdf (open access; description: Enhancing token boundary detection in disfluent speech; type: publisher's version (PDF); licence: Creative Commons)	2.66 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/561481
