Clinical de-identification using sub-document analysis and ELECTRA
R Catelli;F Gargiulo;E Damiano;M Esposito;G De Pietro
2021
Abstract
Privacy protection in the health context is becoming a crucial task given the exponential increase in the adoption of Electronic Health Records (EHRs) around the world. Such data can be used for medical investigation and research only after all so-called Protected Health Information (PHI) has been removed. This paper proposes a clinical de-identification system based on deep learning techniques for Named Entity Recognition, aimed at recognizing the PHI entities in EHRs that must be replaced by surrogates for anonymization purposes. The system is based on ELECTRA, a recent neural language model, and is enhanced through a sub-document level analysis that groups input sentences together according to a Sentences Grouping Factor (SGF), broadening the representation context and consequently improving the model's ability to learn. The system was experimentally tested on the official dataset distributed in 2014 by the Informatics for Integrating Biology & the Bedside (i2b2) research group, where it exhibited superior performance compared to the state of the art in terms of detection at the category level, which is crucial for properly substituting PHI entities with surrogates. The contribution of each component of the proposed system was also confirmed by a further experimental analysis, performed by replacing ELECTRA with the BERT language model and by varying the SGF within the limits imposed by the maximum input size of the language model used.
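The abstract does not specify how sentences are grouped, so the following is only a minimal sketch of the idea under stated assumptions: the SGF is taken to be the number of consecutive sentences concatenated into one input sequence, and a group is closed early when adding the next sentence would exceed the language model's maximum input size. The function name `group_sentences`, the `max_tokens` parameter and its default of 512, and the use of pre-split word tokens in place of subword tokens are all illustrative choices, not details taken from the paper.

```python
from typing import List


def group_sentences(
    sentences: List[List[str]],
    sgf: int,
    max_tokens: int = 512,  # assumed input limit; the real limit depends on the model
) -> List[List[str]]:
    """Concatenate up to `sgf` consecutive sentences into one token sequence.

    A group is closed early if adding the next sentence would push it past
    `max_tokens`. A single sentence longer than `max_tokens` is kept as its
    own group and is not truncated here.
    """
    if sgf < 1:
        raise ValueError("sgf must be at least 1")

    groups: List[List[str]] = []
    current: List[str] = []
    count = 0  # number of sentences already placed in `current`

    for sentence in sentences:
        group_is_full = count == sgf
        would_overflow = len(current) + len(sentence) > max_tokens
        if current and (group_is_full or would_overflow):
            groups.append(current)
            current, count = [], 0
        current = current + sentence  # build a new list, leave stored groups untouched
        count += 1

    if current:
        groups.append(current)
    return groups


# Hypothetical, already tokenized sentences from a clinical note.
example = [
    ["Patient", "seen", "by", "Dr.", "Smith", "."],
    ["Admitted", "on", "03/12", "."],
    ["Discharged", "home", "."],
]
grouped = group_sentences(example, sgf=2)
```

With `sgf=1` each sentence is processed on its own, which corresponds to plain sentence-level tagging; larger values give the model more surrounding context per input, up to the length limit.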