A Novel Real-World Dataset of Italian Clinical Notes for NLP-based Decision Support in Low Back Pain Treatment

Bonfigli, Agnese; Piperno, Ruben; Dell'Orletta, Felice; Brunato, Dominique; Merone, Mario
2025

Abstract

Low back pain is a leading cause of disability worldwide and poses a significant challenge for evidence-based clinical decision support. Because Italian-language resources covering diversified therapeutic pathways are lacking, we assembled a novel annotated dataset comprising up to three pre-treatment documents per patient (MRI report, X-ray report, and patient visit notes), together with demographic information (age and sex). The cohort consists of 176 patient records stratified into three therapeutic groups: 50 conservative, 92 regenerative, and 34 surgical. The primary aim is to investigate whether the dataset can be used to predict which of the three treatment modalities is most appropriate. To this end, we defined six document-combination scenarios, covering each document type on its own and every pairing of two types. For each scenario, we compared two modeling strategies: a traditional Support Vector Machine classifier using TF-IDF features over unigrams, bigrams, and trigrams, and an Italian BERT model fine-tuned on our corpus. Experimental results indicate that the classic n-gram-based approach achieves the highest performance (macro-F1 up to 71.3%). The BERT model outperforms the baseline but shows limitations in this low-resource scenario. These findings suggest that the dataset has the potential to catalyze the development of Italian-language clinical decision support systems that account for the distinct signatures of treatment pathways.
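As a rough illustration of the evaluation setup described in the abstract, the sketch below enumerates the six document-combination scenarios (three single document types plus three pairings) and computes macro-F1 as the unweighted mean of per-class F1 scores. It is not the authors' code: the document-type labels, the function names `build_scenarios` and `macro_f1`, and the toy predictions are all assumptions made for illustration.

```python
from itertools import combinations

# Document types named in the abstract; these short labels are hypothetical.
DOC_TYPES = ("mri_report", "xray_report", "visit_notes")
# Treatment groups named in the abstract.
LABELS = ("conservative", "regenerative", "surgical")


def build_scenarios(doc_types=DOC_TYPES):
    """Return each single document type plus every unordered pair of types."""
    singles = [(d,) for d in doc_types]
    pairs = list(combinations(doc_types, 2))
    return singles + pairs


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


scenarios = build_scenarios()

# Toy labels only, not data from the paper.
toy_true = ["conservative", "regenerative", "regenerative", "surgical"]
toy_pred = ["conservative", "regenerative", "surgical", "surgical"]
toy_score = macro_f1(toy_true, toy_pred)
```

Because macro-F1 weights the three classes equally, it is sensitive to errors on the smaller groups (34 surgical records versus 92 regenerative), which is presumably why it suits an imbalanced cohort like this one.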
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Bonfigli, Agnese en
dc.authority.people Piperno, Ruben en
dc.authority.people Bacco, Luca en
dc.authority.people Dell'Orletta, Felice en
dc.authority.people Brunato, Dominique en
dc.authority.people Crispino, Filippo en
dc.authority.people Papalia, Giuseppe Francesco en
dc.authority.people Russo, Fabrizio en
dc.authority.people Vadalà, Gianluca en
dc.authority.people Papalia, Rocco en
dc.authority.people Merone, Mario en
dc.authority.people Pecchia, Leandro en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/03/03 17:31:06 -
dc.date.available 2026/03/03 17:31:06 -
dc.date.firstsubmission 2026/03/03 17:15:48 *
dc.date.issued 2025 -
dc.date.submission 2026/03/03 17:20:43 *
dc.description.allpeople Bonfigli, Agnese; Piperno, Ruben; Bacco, Luca; Dell'Orletta, Felice; Brunato, Dominique; Crispino, Filippo; Papalia, Giuseppe Francesco; Russo, Fabrizio; Vadalà, Gianluca; Papalia, Rocco; Merone, Mario; Pecchia, Leandro -
dc.description.allpeopleoriginal Bonfigli, Agnese; Piperno, Ruben; Bacco Luca; Dell'Orletta, Felice; Brunato, Dominique; Crispino, Filippo; Papalia, Giuseppe Francesco; Russo, Fabrizio; Vadalà, Gianluca; Papalia, Rocco; Merone, Mario; Pecchia, Leandro en
dc.description.fulltext open en
dc.description.numberofauthors 12 -
dc.identifier.source manual *
dc.identifier.uri https://hdl.handle.net/20.500.14243/570763 -
dc.language.iso eng en
dc.relation.ispartofbook Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) en
dc.subject.keywords NLP in healthcare -
dc.subject.keywordseng Large Language Models (LLMs) -
dc.subject.keywordseng Italian Medical Corpus -
dc.subject.singlekeyword NLP in healthcare *
dc.subject.singlekeyword Large Language Models (LLMs) *
dc.subject.singlekeyword Italian Medical Corpus *
dc.title A Novel Real-World Dataset of Italian Clinical Notes for NLP-based Decision Support in Low Back Pain Treatment en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2026/03/04 02:52:05 *
iris.orcid.lastModifiedDate 2026/03/03 17:31:06 *
iris.orcid.lastModifiedMillisecond 1772555466506 *
iris.sitodocente.maxattempts 10 -
Appears in types: 04.01 Contributo in Atti di convegno
Files in this item:
File: 11_main_long.pdf
Access: open access
License: Creative Commons
Size: 294.14 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570763