Chapter 7 - De-identification techniques to preserve privacy in medical records
Massimo Esposito
2023
Abstract
The COVID-19 pandemic has demanded a collective global response. The need to exchange clinical information rapidly to advance medical research has highlighted the importance of clinical de-identification techniques, which make protected health information in electronic health records shareable and publishable while fully complying with privacy regulations. This study presents a comparative analysis of language model performance on Italian, using the SIRM COVID-19 de-identification corpus, a data set built in accordance with US HIPAA regulations and labeled consistently with the 2014 English i2b2 data set so that named entity recognition can be applied to de-identify electronic medical records. The models compared achieved state-of-the-art results in the literature and are based on deep neural architectures: bidirectional long short-term memory with a conditional random field (BiLSTM-CRF), combined with different word representations (embeddings capable of capturing morphosyntactic variation and polysemy), and bidirectional encoder representations from transformers (BERT). The study also tests how effectively different transfer learning mechanisms improve performance on Italian by using English-language data. The analysis highlights the advantages and disadvantages of each approach, shows how to improve performance in low-resource scenarios such as Italian, and identifies key design points to consider in the scenarios analyzed.