CNR Institutional Research Information System

Chapter 7 - De-identification techniques to preserve privacy in medical records

Rosario Catelli; Massimo Esposito

2023

Abstract

Facing the COVID-19 pandemic has required a collective global effort. The need to exchange clinical information rapidly to advance medical investigations has highlighted the importance of clinical de-identification techniques, which make protected health information in electronic health records shareable and publishable in full compliance with privacy regulations. This study compares the performance of language models on Italian using the SIRM COVID-19 de-identification corpus, a data set built in accordance with the US HIPAA regulations and labeled consistently with the 2014 English i2b2 data set, so that named entity recognition can be applied to de-identify electronic medical records. The language models compared achieved state-of-the-art results in the literature and are based on deep neural architectures: bidirectional long short-term memory with conditional random field, combined with different word representations (i.e., embeddings capable of capturing morphosyntactic variations and polysemy of words), and bidirectional encoder representations from transformers. The study also tests the effectiveness of different transfer learning mechanisms for improving performance on Italian using English-language data. The analysis highlights the advantages and disadvantages of the different approaches, shows how to improve performance in low-resource scenarios such as Italian, and identifies key design points to consider in the analyzed scenarios.
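As an illustration of how named entity recognition supports de-identification, the following minimal sketch replaces tagged spans with category placeholders. It assumes the recognizer emits BIO tags, a common encoding that this record does not confirm for the chapter. The function name `redact`, the example sentence, and the `PATIENT` and `DATE` labels are all illustrative assumptions, not taken from the SIRM corpus or the chapter.

```python
from typing import List, Optional


def redact(tokens: List[str], tags: List[str]) -> str:
    """Replace each BIO-tagged entity span with a single [CATEGORY] placeholder."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    out: List[str] = []
    open_category: Optional[str] = None  # category of the span being redacted
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Token outside any entity: keep it and close any open span.
            out.append(token)
            open_category = None
            continue
        prefix, _, category = tag.partition("-")
        # Emit a placeholder on B-, or on an I- that does not continue the open span.
        if prefix == "B" or category != open_category:
            out.append(f"[{category}]")
        open_category = category
    return " ".join(out)


# Made-up sentence and hypothetical labels, for illustration only.
tokens = ["Paziente", "Mario", "Rossi", "ricoverato", "il", "12", "marzo", "2020", "."]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "B-DATE", "I-DATE", "I-DATE", "O"]
redacted = redact(tokens, tags)
print(redacted)
```

In this sketch the tags are supplied by hand. In a setting like the one the abstract describes, they would come from a trained sequence labeler such as a BiLSTM-CRF or a BERT-based model.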

Year: 2023
Organizational units: Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
ISBN: 978-0-323-90531-2
Keywords: Clinical de-identification; named entity recognition; deep learning; sub-document level analysis; COVID-19 annotated Italian data set
Appears in types: 02.01 Contribution in volume (Chapter or Essay)

Files in this product:

File	Size	Format
prod_485968-doc_201493.pdf	767.44 kB	Adobe PDF

Access: authorized users only
Description: De-identification techniques to preserve privacy in medical records
Type: Publisher's version (PDF)
License: No license declared (not attributable to products after 2023)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/462960
