
Improving Clinical Report Classification with Sentence Boundary Detection

Santoro, Mario
Last author
2025

Abstract

The increasing availability of clinical reports creates opportunities for natural language processing (NLP) applications in healthcare. Large Language Models (LLMs), including BERT-based architectures and generative models, have shown promise in text classification, summarization, and semantic analysis. However, applying LLMs to Electronic Health Records (EHRs) is challenging because of token limits and the complexity of clinical text. Sentence Boundary Detection (SBD), which segments text into meaningful units, is a critical preprocessing step for handling token constraints and improving model interpretability, particularly in text classification. This study benchmarks several SBD methods, including traditional approaches (e.g., NLTK, Stanza, PySBD) and state-of-the-art transformer-based models such as Segment Any Text (SAT), fine-tuned with low-rank adaptation (LoRA) for the clinical domain. The models were evaluated on a dataset of Italian clinical reports from the Gemelli hospital in Rome, using metrics such as F1-score to measure segmentation quality. PySBD achieved the best performance, aligning most closely with the gold standard, with a median F1-score of 83%. We also assessed the impact of segmentation on a downstream metastasis classification task, comparing a transformer-based model applied to unsegmented reports with the same model applied to reports segmented by PySBD. Classification on segmented reports outperformed classification on entire reports (F1-score of 92% versus 88%), indicating that SBD improves text classification by preserving semantic coherence, respecting token constraints, and providing sentence-level explainability. By benchmarking traditional and transformer-based SBD models, this study supports segmentation as a critical preprocessing step for clinical NLP, improving both the quality and interpretability of downstream tasks and offering insights for improving performance and clinical relevance in the processing of EHRs.
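The abstract reports segmentation quality as an F1-score against a gold standard but does not say how that score is defined. One common reading, sketched below purely as an assumption, treats each sentence end as a character offset and scores predicted offsets against gold offsets. The function names `boundary_offsets` and `boundary_f1` and the sample sentences are hypothetical and not taken from the study; the sketch also assumes both segmentations concatenate back to the same text, whitespace included.

```python
def boundary_offsets(sentences):
    """Return the set of character offsets at which each sentence ends.

    Assumes the sentences, concatenated in order, reproduce the original
    text exactly (whitespace included).
    """
    offsets, position = set(), 0
    for sentence in sentences:
        position += len(sentence)
        offsets.add(position)
    return offsets


def boundary_f1(gold_sentences, predicted_sentences):
    """Boundary-level F1 between a gold and a predicted segmentation.

    Hypothetical helper: one plausible definition, not necessarily the
    one used in the study.
    """
    gold = boundary_offsets(gold_sentences)
    predicted = boundary_offsets(predicted_sentences)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative (invented) report fragment: the prediction wrongly splits
# after the abbreviation "Pt." and misses the second boundary.
gold = ["Pt. stable. ", "No focal lesions. ", "Follow-up in 6 months."]
predicted = ["Pt. ", "stable. ", "No focal lesions. Follow-up in 6 months."]
score = boundary_f1(gold, predicted)
```

Under this definition the end-of-text boundary always matches, so a stricter variant might exclude it before scoring. A per-report score computed this way could then be summarized by its median across reports, which is how the abstract reports the 83% figure.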
Istituto Applicazioni del Calcolo "Mauro Picone"
Electronic Health Records
Large Language Models
Sentence Boundary Detection
Text Classification
Text Segmentation

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/584004