Enhancing token boundary detection in disfluent speech

Srivastava Manu; Ferro Marcello; Pirrelli Vito; Coro Gianpaolo
2026

Abstract

This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.
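The abstract names two steps that are easy to illustrate in isolation: energy-based segmentation that keeps pauses as explicit gaps, and a bias-compensation step that removes a systematic timing offset. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the 20 ms frame length, the -35 dB threshold, and the use of a mean signed error as the bias estimate are all hypothetical choices made for illustration; the record does not specify any of them.

```python
import math


def frame_energies_db(samples, frame_len):
    """Return the RMS energy in dB of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Floor the RMS so that digital silence does not produce log10(0).
        energies.append(20.0 * math.log10(max(rms, 1e-10)))
    return energies


def energy_segments(samples, sample_rate, frame_ms=20, threshold_db=-35.0):
    """Return (start_s, end_s) spans whose frame energy reaches the threshold.

    Assumed behaviour: low-energy stretches are left as gaps between spans,
    so pauses and hesitations remain visible in the output.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies_db(samples, frame_len)
    segments, seg_start = [], None
    for i, energy in enumerate(energies):
        t = i * frame_len / sample_rate
        if energy >= threshold_db and seg_start is None:
            seg_start = t                      # speech onset
        elif energy < threshold_db and seg_start is not None:
            segments.append((seg_start, t))    # speech offset
            seg_start = None
    if seg_start is not None:                  # audio ends while still in speech
        segments.append((seg_start, len(energies) * frame_len / sample_rate))
    return segments


def compensate_bias(predicted, calib_predicted, calib_reference):
    """Subtract the mean signed error measured on a calibration set.

    Assumption: the systematic timing error is a constant offset.
    """
    errors = [p - r for p, r in zip(calib_predicted, calib_reference)]
    bias = sum(errors) / len(errors)
    return [p - bias for p in predicted]


# Toy signal at 1 kHz: silence, speech, a pause, speech.
signal = [0.0] * 100 + [0.5, -0.5] * 100 + [0.0] * 100 + [0.5, -0.5] * 50
spans = energy_segments(signal, sample_rate=1000)
corrected = compensate_bias([0.55], [1.05, 2.04, 3.06], [1.0, 2.0, 3.0])
```

In this toy example the 100 ms pause between the two bursts survives as a gap between the returned spans, which is the property the abstract attributes to the energy-based approach. A real system would more likely use overlapping frames and an adaptive threshold.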
DC Field Value Language
dc.authority.ancejournal INTELLIGENT SYSTEMS WITH APPLICATIONS en
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Srivastava Manu en
dc.authority.people Ferro Marcello en
dc.authority.people Pirrelli Vito en
dc.authority.people Coro Gianpaolo en
dc.authority.project READLET en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/01/02 11:42:55 -
dc.date.available 2026/01/02 11:42:55 -
dc.date.firstsubmission 2025/12/27 23:07:24 *
dc.date.issued 2026 -
dc.date.submission 2025/12/27 23:07:24 *
dc.description.allpeople Srivastava, Manu; Ferro, Marcello; Pirrelli, Vito; Coro, Gianpaolo -
dc.description.allpeopleoriginal Srivastava Manu, Ferro Marcello, Pirrelli Vito, Coro Gianpaolo en
dc.description.fulltext open en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.1016/j.iswa.2025.200614 en
dc.identifier.scopus 2-s2.0-105025107011 -
dc.identifier.source bibtex *
dc.identifier.uri https://hdl.handle.net/20.500.14243/561481 -
dc.identifier.url https://www.sciencedirect.com/science/article/pii/S2667305325001401 en
dc.language.iso eng en
dc.relation.medium ELETTRONICO en
dc.relation.numberofpages 14 en
dc.relation.projectAcronym READLET en
dc.relation.projectAwardNumber 2017W8HFRX en
dc.relation.projectAwardTitle READLET en
dc.relation.projectFunderName Ministero dell'Università e della Ricerca en
dc.relation.projectFundingStream PRIN 2017 en
dc.relation.volume 29 en
dc.subject.keywordseng Automatic Speech Recognition, Statistical analysis, Disfluencies, Voice Activity Detection -
dc.subject.singlekeyword Automatic Speech Recognition *
dc.subject.singlekeyword Statistical analysis *
dc.subject.singlekeyword Disfluencies *
dc.subject.singlekeyword Voice Activity Detection *
dc.title Enhancing token boundary detection in disfluent speech en
dc.type.circulation Internazionale en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.impactfactor si en
dc.type.miur 262 -
iris.mediafilter.data 2026/01/03 03:24:49 *
iris.orcid.lastModifiedDate 2026/01/03 02:09:23 *
iris.orcid.lastModifiedMillisecond 1767402563177 *
iris.scopus.extIssued 2026 -
iris.scopus.extTitle Enhancing token boundary detection in disfluent speech -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost publisher *
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.1016/j.iswa.2025.200614 *
iris.unpaywall.hosttype publisher *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj true *
iris.unpaywall.landingpage https://doi.org/10.1016/j.iswa.2025.200614 *
iris.unpaywall.license cc-by-nc-nd *
iris.unpaywall.metadataCallLastModified 03/01/2026 03:03:12 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1767405792116 -
iris.unpaywall.oastatus gold *
scopus.authority.ancejournal INTELLIGENT SYSTEMS WITH APPLICATIONS###2667-3053 *
scopus.category 1701 *
scopus.category 1711 *
scopus.category 1707 *
scopus.category 1706 *
scopus.category 1702 *
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo” – CNR -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” – CNR -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” – CNR -
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo” – CNR -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60085207 -
scopus.contributor.auid 60017833000 -
scopus.contributor.auid 15759406100 -
scopus.contributor.auid 14833305800 -
scopus.contributor.auid 13104305800 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.name Manu -
scopus.contributor.name Marcello -
scopus.contributor.name Vito -
scopus.contributor.name Gianpaolo -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.surname Srivastava -
scopus.contributor.surname Ferro -
scopus.contributor.surname Pirrelli -
scopus.contributor.surname Coro -
scopus.date.issued 2026 *
scopus.description.allpeopleoriginal Srivastava M.; Ferro M.; Pirrelli V.; Coro G. *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.description.allpeopleoriginal *
scopus.document.type ar *
scopus.document.types ar *
scopus.identifier.doi 10.1016/j.iswa.2025.200614 *
scopus.identifier.pui 2042347618 *
scopus.identifier.scopus 2-s2.0-105025107011 *
scopus.journal.sourceid 21101051831 *
scopus.language.iso eng *
scopus.publisher.name Elsevier B.V. *
scopus.relation.article 200614 *
scopus.relation.volume 29 *
scopus.subject.keywords Automatic Speech Recognition; Disfluencies; Statistical analysis; Voice Activity Detection; *
scopus.title Enhancing token boundary detection in disfluent speech *
scopus.titleeng Enhancing token boundary detection in disfluent speech *
Appears in types: 01.01 Articolo in rivista (journal article)
Files in this item:
File: published_paper.pdf
Access: open access
Description: Enhancing token boundary detection in disfluent speech
Type: Publisher's version (PDF)
License: Creative Commons
Size: 2.66 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/561481