CNR Institutional Research Information System

Improving Clinical Report Classification with Sentence Boundary Detection

Lilli, Livia; Patarnello, Stefano; Capocchiano, Nikola Dino; Masciocchi, Carlotta; Santoro, Mario

2025

Abstract

The increasing availability of clinical reports offers valuable opportunities for natural language processing (NLP) applications in healthcare. Large Language Models (LLMs), such as BERT-based architectures and generative models, have shown great promise in text classification, summarization, and semantic analysis. However, applying LLMs to Electronic Health Records (EHRs) is challenging because of token limits and the complexity of clinical text. Sentence Boundary Detection (SBD), which segments text into meaningful units, is a critical preprocessing step that addresses token constraints and improves model interpretability, particularly for tasks such as text classification. This study benchmarks several SBD methods, including traditional approaches (e.g., NLTK, Stanza, PySBD) and state-of-the-art transformer-based models, such as Segment Any Text (SAT), fine-tuned with low-rank adaptation (LoRA) for the clinical domain. The models were evaluated on a dataset of Italian clinical reports from the Gemelli hospital in Rome, using metrics such as the F1-score to measure segmentation quality. PySBD achieved the best performance, aligning most closely with the gold standard, with a median F1-score of 83%. We also assessed the impact of segmentation on a downstream metastasis classification task by comparing a transformer-based model applied to unsegmented reports with the same task on reports segmented with PySBD. Classification on segmented reports outperformed classification on whole reports (F1-score of 92% versus 88%), indicating that SBD improves text classification by ensuring semantic coherence, adhering to token constraints, and providing sentence-level explainability. In conclusion, this study highlights the importance of SBD in enhancing both the quality and the interpretability of downstream NLP tasks in healthcare. By benchmarking traditional and transformer-based SBD models, we validate the role of segmentation as a critical preprocessing step for clinical NLP applications and offer insights for improving performance and clinical relevance in the processing of EHRs.
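
The abstract reports segmentation quality as an F1-score against a gold standard but does not say how boundaries are matched. As a rough illustration only, the sketch below scores a predicted segmentation by comparing sentence-end positions with the gold ones and then takes the median over reports. The function names (`boundary_offsets`, `boundary_f1`, `median_f1`), the whitespace-insensitive matching, and the choice to count the final end-of-text boundary are all assumptions made for this example, not the authors' evaluation protocol.

```python
from statistics import median


def boundary_offsets(sentences):
    """Return the set of sentence-end offsets, counted in non-whitespace
    characters, so that differences in spacing do not affect the match."""
    offsets, position = set(), 0
    for sentence in sentences:
        position += len("".join(sentence.split()))
        offsets.add(position)
    return offsets


def boundary_f1(gold_sentences, predicted_sentences):
    """Boundary-level F1 between a gold and a predicted segmentation
    of the same report (hypothetical scoring scheme)."""
    gold = boundary_offsets(gold_sentences)
    predicted = boundary_offsets(predicted_sentences)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def median_f1(gold_reports, predicted_reports):
    """Median of the per-report F1-scores over a collection of reports."""
    scores = [boundary_f1(g, p) for g, p in zip(gold_reports, predicted_reports)]
    return median(scores)
```

Under this scheme, a segmenter that splits one gold sentence in two gets full recall but lower precision, because the extra boundary counts as a false positive.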

Year: 2025
Organizational units: Istituto Applicazioni del Calcolo "Mauro Picone"
Keywords: Electronic Health Records; Large Language Models; Sentence Boundary Detection; Text Classification; Text Segmentation
Appears in types: 04.01 Conference proceedings contribution

Files in this record:

File	Size	Format
SBD_to_submit.pdf (not available; license: not public, private/restricted access)	349.85 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/584004
