Clinical de-identification using sub-document analysis and ELECTRA

R Catelli;F Gargiulo;E Damiano;M Esposito;G De Pietro
2021

Abstract

Protecting privacy in the health domain is becoming a crucial task given the exponential worldwide increase in the adoption of Electronic Health Records (EHRs). Such data can be used for medical investigation and research only after all so-called Protected Health Information (PHI) has been filtered out. This paper proposes a clinical de-identification system based on deep learning techniques for Named Entity Recognition, aimed at recognizing PHI entities so that they can be replaced by surrogates in EHRs for anonymization purposes. The system is based on ELECTRA, a recent neural language model, and is enhanced through a sub-document level analysis that groups input sentences together according to a Sentences Grouping Factor (SGF), broadening the representation context and consequently enhancing the model's ability to learn. The system was experimentally tested on the official dataset distributed in 2014 by the Informatics for Integrating Biology & the Bedside (i2b2) research group, exhibiting superior performance compared to the state of the art in terms of detection at the category level, which is crucial for properly substituting PHI entities with surrogates. The contribution of each component was also confirmed by a further experimental analysis in which the BERT language model was substituted for ELECTRA and the SGF was varied within the limits imposed by the maximum input size of the language model used.
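The abstract does not spell out how sentences are grouped, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that sentences are already tokenized, that a group holds at most `sgf` consecutive sentences, and that a group is closed early when adding the next sentence would exceed a token budget `max_tokens`. The function name `group_sentences`, its parameters, and the default budget of 512 are all hypothetical.

```python
from typing import List


def group_sentences(
    sentences: List[List[str]], sgf: int, max_tokens: int = 512
) -> List[List[str]]:
    """Merge up to `sgf` consecutive tokenized sentences into one sequence.

    Assumption: `max_tokens` stands in for the language model's maximum
    input size. Real models count subword tokens, not the plain tokens
    used here, so this budget is only illustrative.
    """
    if sgf < 1:
        raise ValueError("sgf must be >= 1")

    groups: List[List[str]] = []
    current: List[str] = []
    count = 0  # number of sentences merged into `current`

    for sentence in sentences:
        group_is_full = count == sgf
        would_overflow = len(current) + len(sentence) > max_tokens
        # Close the current group before it grows past either limit.
        if current and (group_is_full or would_overflow):
            groups.append(current)
            current, count = [], 0
        current = current + sentence
        count += 1

    if current:
        groups.append(current)
    # Note: a single sentence longer than max_tokens is kept whole,
    # not truncated; handling that case is left out of this sketch.
    return groups


if __name__ == "__main__":
    record = [
        ["Patient", "seen", "by", "Dr.", "Smith", "."],
        ["Admitted", "on", "2014-03-02", "."],
        ["Discharged", "home", "."],
    ]
    for group in group_sentences(record, sgf=2):
        print(group)
```

With `sgf=1` every sentence is processed on its own, which corresponds to plain sentence-level analysis. Larger values give the model more surrounding context per input, up to the limit set by the token budget.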
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
deep learning
analytical models;privacy;conferences;information filters;electronic healthcare;task analysis

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/429197