Improving Clinical Report Classification with Sentence Boundary Detection
Santoro, Mario (last author)
2025
Abstract
The increasing availability of clinical reports offers valuable opportunities for natural language processing (NLP) applications in healthcare. Large Language Models (LLMs), including BERT-based architectures and generative models, have shown great promise in text classification, summarization, and semantic analysis. However, applying LLMs to Electronic Health Records (EHRs) is challenging because of token limits and the complexity of clinical text. Sentence Boundary Detection (SBD), which segments text into meaningful units, is a critical preprocessing step that addresses token constraints and improves model interpretability, particularly for tasks such as text classification. This study benchmarks several SBD methods, including traditional approaches (e.g., NLTK, Stanza, PySBD) and state-of-the-art transformer-based models, such as Segment Any Text (SAT) fine-tuned with low-rank adaptation (LoRA) for the clinical domain. The models were evaluated on a dataset of Italian clinical reports from the Gemelli hospital in Rome, using metrics such as the F1-score to measure segmentation quality. PySBD achieved the best performance, closely aligning with the gold standard, with a median F1-score of 83%. We also assessed the impact of segmentation on a downstream metastasis classification task by comparing a transformer-based model applied to unsegmented reports with the same approach applied to reports segmented with PySBD. Classification on segmented reports outperformed classification on entire reports, with an F1-score of 92% versus 88%, indicating that SBD improves text classification by ensuring semantic coherence, adhering to token constraints, and providing sentence-level explainability. In conclusion, this study highlights the importance of SBD in enhancing both the quality and the interpretability of downstream NLP tasks in healthcare. By benchmarking traditional and transformer-based SBD models, we validate the role of segmentation as a critical preprocessing step for clinical NLP applications, offering insights for improving performance and clinical relevance in the processing of EHRs.
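As a rough illustration of how segmentation quality can be scored against a gold standard, the sketch below computes an F1-score over sentence-end character offsets. The abstract does not state the exact matching criterion used in the study, so the boundary-offset definition here is an assumption. The helper names (`boundary_offsets`, `boundary_f1`, `naive_split`) are hypothetical, the regex splitter is a toy stand-in for a real SBD tool, and the Italian snippet is made up for the example, not taken from the study's dataset.

```python
import re


def boundary_offsets(sentences, text):
    """Return the set of character offsets at which each sentence ends in text."""
    offsets = set()
    cursor = 0
    for sentence in sentences:
        start = text.index(sentence, cursor)  # locate the sentence after the previous one
        cursor = start + len(sentence)
        offsets.add(cursor)
    return offsets


def boundary_f1(predicted, gold):
    """F1-score over sentence-end offsets (assumed metric definition)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def naive_split(text):
    """Toy splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Made-up report snippet; "Pz." (abbreviation for "paziente") is not a sentence end.
report = "Pz. in buone condizioni. Non evidenza di metastasi."
gold_sentences = ["Pz. in buone condizioni.", "Non evidenza di metastasi."]

predicted_sentences = naive_split(report)
score = boundary_f1(
    boundary_offsets(predicted_sentences, report),
    boundary_offsets(gold_sentences, report),
)
print(predicted_sentences)
print(round(score, 2))
```

In this toy case the splitter wrongly breaks after the abbreviation, which lowers precision while recall stays perfect. Abbreviation-heavy clinical text is one plausible reason why the choice of segmenter matters.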


