This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.

Enhancing token boundary detection in disfluent speech

Manu Srivastava
Primo
;
Marcello Ferro;Vito Pirrelli;Gianpaolo Coro
Ultimo
2026

Abstract

This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.
2026
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Automatic Speech Recognition, Statistical analysis, Disfluencies, Voice Activity Detection
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/561481
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ente

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact