Enhancing token boundary detection in disfluent speech
Srivastava Manu; Ferro Marcello; Pirrelli Vito; Coro Gianpaolo
2026
Abstract
This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.

| DC field | Value | Language |
|---|---|---|
| dc.authority.ancejournal | INTELLIGENT SYSTEMS WITH APPLICATIONS | en |
| dc.authority.orgunit | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | en |
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Srivastava Manu | en |
| dc.authority.people | Ferro Marcello | en |
| dc.authority.people | Pirrelli Vito | en |
| dc.authority.people | Coro Gianpaolo | en |
| dc.authority.project | READLET | en |
| dc.collection.id.s | b3f88f24-048a-4e43-8ab1-6697b90e068e | * |
| dc.collection.name | 01.01 Articolo in rivista | * |
| dc.contributor.appartenenza | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.appartenenza.mi | 973 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2026/01/02 11:42:55 | - |
| dc.date.available | 2026/01/02 11:42:55 | - |
| dc.date.firstsubmission | 2025/12/27 23:07:24 | * |
| dc.date.issued | 2026 | - |
| dc.date.submission | 2025/12/27 23:07:24 | * |
| dc.description.allpeople | Srivastava, Manu; Ferro, Marcello; Pirrelli, Vito; Coro, Gianpaolo | - |
| dc.description.allpeopleoriginal | Srivastava Manu, Ferro Marcello, Pirrelli Vito, Coro Gianpaolo | en |
| dc.description.fulltext | open | en |
| dc.description.numberofauthors | 4 | - |
| dc.identifier.doi | 10.1016/j.iswa.2025.200614 | en |
| dc.identifier.scopus | 2-s2.0-105025107011 | - |
| dc.identifier.source | bibtex | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/561481 | - |
| dc.identifier.url | https://www.sciencedirect.com/science/article/pii/S2667305325001401 | en |
| dc.language.iso | eng | en |
| dc.relation.medium | ELETTRONICO | en |
| dc.relation.numberofpages | 14 | en |
| dc.relation.projectAcronym | READLET | en |
| dc.relation.projectAwardNumber | 2017W8HFRX | en |
| dc.relation.projectAwardTitle | READLET | en |
| dc.relation.projectFunderName | Ministero dell'Università e della Ricerca | en |
| dc.relation.projectFundingStream | PRIN 2017 | en |
| dc.relation.volume | 29 | en |
| dc.subject.keywordseng | Automatic Speech Recognition, Statistical analysis, Disfluencies, Voice Activity Detection | - |
| dc.subject.singlekeyword | Automatic Speech Recognition | * |
| dc.subject.singlekeyword | Statistical analysis | * |
| dc.subject.singlekeyword | Disfluencies | * |
| dc.subject.singlekeyword | Voice Activity Detection | * |
| dc.title | Enhancing token boundary detection in disfluent speech | en |
| dc.type.circulation | Internazionale | en |
| dc.type.driver | info:eu-repo/semantics/article | - |
| dc.type.full | 01 Contributo su Rivista::01.01 Articolo in rivista | it |
| dc.type.impactfactor | si | en |
| dc.type.miur | 262 | - |
| iris.mediafilter.data | 2026/01/03 03:24:49 | * |
| iris.orcid.lastModifiedDate | 2026/01/03 02:09:23 | * |
| iris.orcid.lastModifiedMillisecond | 1767402563177 | * |
| iris.scopus.extIssued | 2026 | - |
| iris.scopus.extTitle | Enhancing token boundary detection in disfluent speech | - |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.bestoahost | publisher | * |
| iris.unpaywall.bestoaversion | publishedVersion | * |
| iris.unpaywall.doi | 10.1016/j.iswa.2025.200614 | * |
| iris.unpaywall.hosttype | publisher | * |
| iris.unpaywall.isoa | true | * |
| iris.unpaywall.journalisindoaj | true | * |
| iris.unpaywall.landingpage | https://doi.org/10.1016/j.iswa.2025.200614 | * |
| iris.unpaywall.license | cc-by-nc-nd | * |
| iris.unpaywall.metadataCallLastModified | 03/01/2026 03:03:12 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1767405792116 | - |
| iris.unpaywall.oastatus | gold | * |
| scopus.authority.ancejournal | INTELLIGENT SYSTEMS WITH APPLICATIONS###2667-3053 | * |
| scopus.category | 1701 | * |
| scopus.category | 1711 | * |
| scopus.category | 1707 | * |
| scopus.category | 1706 | * |
| scopus.category | 1702 | * |
| scopus.contributor.affiliation | Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo” – CNR | - |
| scopus.contributor.affiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” – CNR | - |
| scopus.contributor.affiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” – CNR | - |
| scopus.contributor.affiliation | Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo” – CNR | - |
| scopus.contributor.afid | 60085207 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60085207 | - |
| scopus.contributor.auid | 60017833000 | - |
| scopus.contributor.auid | 15759406100 | - |
| scopus.contributor.auid | 14833305800 | - |
| scopus.contributor.auid | 13104305800 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.name | Manu | - |
| scopus.contributor.name | Marcello | - |
| scopus.contributor.name | Vito | - |
| scopus.contributor.name | Gianpaolo | - |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.surname | Srivastava | - |
| scopus.contributor.surname | Ferro | - |
| scopus.contributor.surname | Pirrelli | - |
| scopus.contributor.surname | Coro | - |
| scopus.date.issued | 2026 | * |
| scopus.description.allpeopleoriginal | Srivastava M.; Ferro M.; Pirrelli V.; Coro G. | * |
| scopus.differences | scopus.subject.keywords | * |
| scopus.differences | scopus.description.allpeopleoriginal | * |
| scopus.document.type | ar | * |
| scopus.document.types | ar | * |
| scopus.identifier.doi | 10.1016/j.iswa.2025.200614 | * |
| scopus.identifier.pui | 2042347618 | * |
| scopus.identifier.scopus | 2-s2.0-105025107011 | * |
| scopus.journal.sourceid | 21101051831 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Elsevier B.V. | * |
| scopus.relation.article | 200614 | * |
| scopus.relation.volume | 29 | * |
| scopus.subject.keywords | Automatic Speech Recognition; Disfluencies; Statistical analysis; Voice Activity Detection; | * |
| scopus.title | Enhancing token boundary detection in disfluent speech | * |
| scopus.titleeng | Enhancing token boundary detection in disfluent speech | * |

Appears in types: 01.01 Articolo in rivista (journal article)

| File | Access | Description | Type | License | Size | Format |
|---|---|---|---|---|---|---|
| published_paper.pdf | Open access | Enhancing token boundary detection in disfluent speech | Publisher's version (PDF) | Creative Commons | 2.66 MB | Adobe PDF |

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


