Clinical de-identification aims to identify Protected Health Information in clinical data, enabling data sharing and publication. First automatic de-identification systems were based on rules or on machine learning methods, limited by language changes, lack of context awareness and time consuming feature engineering. Newer deep learning techniques for sequence labeling have shown better results with a reduction in feature engineering efforts and the use of word representation techniques in vector space. However, they are not able to jointly represent the polysemic and context-dependent nature of words, as well as their morpho-syntactic mutations characteristic of handwriting. To address these limitations, a new de-identification approach based on deep learning techniques for Named Entity Recognition has been proposed, whose key factors are: (i) a Bidirectional Long Short-Term Memory + Conditional Random Field architecture for sequence labeling that takes advantage of the widest possible representation context; (ii) a contextualized language model, working at character level, to capture the polysemy of words and manage the morpho-syntactic variations typical of handwritten notes; (iii) more word representations stacked to better capture latent syntactic and semantic similarities. This approach has been tested on the official Informatics for Integrating Biology & the Bedside 2014 de-identification dataset, showing similar or higher performance than state of the art with respect to category and binary recognition, but without any feature engineering or handcrafted rules. The experiments demonstrate the effectiveness of the proposed approach, in particular with regard to category level recognition which is essential to correctly replace entities with surrogates for anonymization purposes.
Combining contextualized word representation and sub-document level analysis through Bi-LSTM+CRF architecture for clinical de-identification
Catelli R;De Pietro G;Esposito M
2021
Abstract
Clinical de-identification aims to identify Protected Health Information in clinical data, enabling data sharing and publication. First automatic de-identification systems were based on rules or on machine learning methods, limited by language changes, lack of context awareness and time consuming feature engineering. Newer deep learning techniques for sequence labeling have shown better results with a reduction in feature engineering efforts and the use of word representation techniques in vector space. However, they are not able to jointly represent the polysemic and context-dependent nature of words, as well as their morpho-syntactic mutations characteristic of handwriting. To address these limitations, a new de-identification approach based on deep learning techniques for Named Entity Recognition has been proposed, whose key factors are: (i) a Bidirectional Long Short-Term Memory + Conditional Random Field architecture for sequence labeling that takes advantage of the widest possible representation context; (ii) a contextualized language model, working at character level, to capture the polysemy of words and manage the morpho-syntactic variations typical of handwritten notes; (iii) more word representations stacked to better capture latent syntactic and semantic similarities. This approach has been tested on the official Informatics for Integrating Biology & the Bedside 2014 de-identification dataset, showing similar or higher performance than state of the art with respect to category and binary recognition, but without any feature engineering or handcrafted rules. The experiments demonstrate the effectiveness of the proposed approach, in particular with regard to category level recognition which is essential to correctly replace entities with surrogates for anonymization purposes.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.