Chapter 7 - De-identification techniques to preserve privacy in medical records

Massimo Esposito
2023

Abstract

Confronting the COVID-19 pandemic has required a collective global effort. The need to exchange clinical information rapidly to advance medical investigations has highlighted the importance of clinical de-identification techniques, which make the protected health information in electronic health records shareable and publishable while fully complying with privacy regulations. This study presents a comparative analysis of language model performance on the Italian language, using the SIRM COVID-19 de-identification corpus, a data set built in accordance with the US HIPAA regulations and labeled consistently with the 2014 English i2b2 data set so that named entity recognition can be applied to de-identify electronic medical records. The comparison covers language models that have achieved state-of-the-art results in the literature and are based on deep neural architectures: bidirectional long short-term memory with a conditional random field (BiLSTM-CRF), combined with different word representations, i.e., embeddings capable of capturing the morphosyntactic variations and polysemy of words, and bidirectional encoder representations from transformers (BERT). The study also tests the effectiveness of different transfer learning mechanisms for improving performance on Italian using English-language data. The analysis highlights the advantages and disadvantages of the different approaches, shows how to push performance in low-resource scenarios such as Italian, and identifies key design points to consider in the analyzed scenarios.
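To make the link between named entity recognition and de-identification concrete, the following is a minimal sketch of the final masking step only, assuming a tagger has already produced one BIO label per token. The function name `mask_phi`, the category labels `PATIENT` and `DATE`, and the sample sentence are hypothetical and chosen for illustration; they are not taken from the chapter or from the SIRM corpus.

```python
from typing import List, Optional


def mask_phi(tokens: List[str], tags: List[str]) -> str:
    """Replace each BIO-tagged entity span with one [CATEGORY] placeholder.

    Illustrative only: the tags are assumed to come from some NER model
    (for example a BiLSTM-CRF or BERT tagger); no model is run here.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    output: List[str] = []
    previous_category: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Token outside any entity: keep it unchanged.
            output.append(token)
            previous_category = None
            continue
        prefix, _, category = tag.partition("-")
        # Emit a placeholder when a span starts (B-) or when an I- tag
        # does not continue the category of the previous token.
        if prefix == "B" or category != previous_category:
            output.append(f"[{category}]")
        previous_category = category
    return " ".join(output)


# Hypothetical Italian sentence with hypothetical gold tags.
tokens = ["Paziente", "Mario", "Rossi", "ricoverato", "il", "12", "marzo", "2020"]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "B-DATE", "I-DATE", "I-DATE"]
masked = mask_phi(tokens, tags)
print(masked)  # Paziente [PATIENT] ricoverato il [DATE]
```

Collapsing each span into a single placeholder is one possible design choice; a pipeline could instead substitute realistic surrogate values so that the text remains natural to read.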
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
ISBN: 978-0-323-90531-2
Keywords: clinical de-identification; named entity recognition; deep learning; sub-document level analysis; COVID-19 annotated Italian data set
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/462960