Natural language processing approach for distributed health data management
Forestiero Agostino; Papuzzo Giuseppe
2020
Abstract
Today's health domain is characterized by heterogeneous, numerous, highly dynamic and geographically distributed information sources. Moreover, the increasing use of digital health data, such as electronic health records (EHRs), has led to the storage of an unprecedented amount of information. Managing this volume of data can often cause information overload, with potential negative consequences for clinical work, such as errors of omission, delays, and risks to overall patient safety. Innovative techniques, approaches and infrastructures are needed to investigate data characterized by high velocity, volume and variability. This paper introduces a distributed and self-organizing algorithm for building a management system for big data in highly dynamic environments such as the healthcare domain. Health data are represented by vectors obtained through the Doc2Vec model, a Natural Language Processing (NLP) approach that captures semantic context by representing documents as dense vectors, known as embeddings. Doc2Vec is an unsupervised algorithm that generates vectors from sentences or documents; it builds on the word2vec approach, which generates vectors for words. The servers of a distributed clinical system, by performing autonomous and local operations, organize themselves into a sorted overlay network, so that resource management operations become faster and more efficient. The effectiveness of the approach was demonstrated in a set of preliminary experiments using a purpose-built simulator.
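
As a rough illustration of the idea of a sorted overlay built from local operations only, the following Python sketch orders a few document vectors by similarity using nothing but neighbor-to-neighbor exchanges. It is not the algorithm introduced in the paper, which the abstract does not specify: odd-even transposition sort is swapped in as a simple stand-in, and the vectors, the `reference` vector, and the function names `cosine` and `local_sort` are hypothetical. In practice the vectors would come from a trained Doc2Vec model rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_sort(keys):
    """Odd-even transposition sort: only adjacent positions compare and swap.

    Stand-in for a self-organizing process; each swap models two
    neighboring servers exchanging their descriptors.
    """
    keys = list(keys)
    n = len(keys)
    for rnd in range(n):  # n rounds are enough to sort n items
        for i in range(rnd % 2, n - 1, 2):
            if keys[i] > keys[i + 1]:
                keys[i], keys[i + 1] = keys[i + 1], keys[i]
    return keys


# Hypothetical 3-dimensional document vectors (stand-ins for Doc2Vec output).
docs = {
    "record_a": [0.9, 0.1, 0.0],
    "record_b": [0.1, 0.9, 0.2],
    "record_c": [0.8, 0.3, 0.1],
    "record_d": [0.0, 0.2, 0.9],
}

# Assumption: each document is keyed by its similarity to one reference vector.
reference = [1.0, 0.0, 0.0]
keys = [(cosine(vec, reference), name) for name, vec in docs.items()]

overlay = local_sort(keys)
for similarity, name in overlay:
    print(f"{name}: {similarity:.3f}")
```

Once the descriptors are ordered this way, documents with similar keys sit on neighboring positions, which is what lets a lookup follow the ordering instead of visiting every server.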