Investigation of named entity recognition in molecular biology by data fusion
P Arrigo
2006
Abstract
Motivation: The volume of published scientific literature is expanding rapidly, and its management and processing have become burdensome. Text mining (TM) is acquiring a key role in bioinformatics; it appears to be one of the more suitable approaches for integrating heterogeneous data sources. Textual data have recently been used to support the generation of scientific hypotheses (literature-based discovery). In this work we consider the screening of previously unknown molecular [interactions] from a literature-based discovery perspective. The identification of molecular species that could interact with one another is the first step in the in silico design of molecular interaction networks. To achieve this goal, we need to extract the most standardized set of potentially interacting molecules. Published articles reflect the fragmentation of biomedical research, which could affect the reliability of this set. Over time, newly published papers can modify the knowledge about molecular interactions. Evaluating these changes in knowledge is important from a model-development perspective. The screening of potentially interacting molecules can be considered equivalent to the linguistic process of named entity recognition. In this paper we apply an ensemble of unsupervised learning machines to the selection and extraction of named entities associated with potentially interacting molecules; the analysis focuses on the changes that emerged in the PubMed repository during the period 1985-2000.

Results: A set of PubMed queries was analyzed, each of which was a molecular entity. Each corresponding set of PubMed abstracts was retrieved and processed separately; retrieval was limited to the period 1985-2000. Each set was split into three chunks, each representing a five-year sub-interval. This procedure allowed us to screen named entities, specific to each time interval, associated with potentially interacting molecules. The recognition of time-invariant named entities is essential for subsequent molecular interaction screening. A data-fusion system based on the self-organization paradigm appears able to evaluate temporal modifications in textual information. In this preliminary analysis, our system detected several named entities that can be functionally related to the original query.

Availability: http://biocomp.ge.ismac.cnr.it/
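
The temporal screening described in the Results (three five-year chunks, interval-specific versus time-invariant named entities) can be sketched as simple set bookkeeping. This is a minimal illustration, not the ensemble of unsupervised learners or the data-fusion system used in the paper: the bin edges, the function names and the placeholder entities are all assumptions made here, and the entity sets are taken as already extracted from each abstract.

```python
from collections import defaultdict

# Assumed bin edges: the source only states "three chunks", each a five-year
# sub-interval of 1985-2000; folding the year 2000 into the last bin is our choice.
INTERVALS = [(1985, 1989), (1990, 1994), (1995, 2000)]


def bin_entities(records):
    """Group entity mentions by publication interval.

    `records` is an iterable of (year, entities) pairs, one per abstract.
    Returns a dict mapping each interval to the set of entities seen in it.
    """
    bins = {interval: set() for interval in INTERVALS}
    for year, entities in records:
        for start, end in INTERVALS:
            if start <= year <= end:
                bins[(start, end)].update(entities)
                break
    return bins


def time_invariant(bins):
    """Entities that occur in every interval."""
    return set.intersection(*bins.values())


def interval_specific(bins):
    """Entities that occur in exactly one interval, keyed by that interval."""
    counts = defaultdict(int)
    for entities in bins.values():
        for entity in entities:
            counts[entity] += 1
    return {
        interval: {e for e in entities if counts[e] == 1}
        for interval, entities in bins.items()
    }


# Placeholder data for illustration only, not taken from the paper.
records = [
    (1986, {"entity_a", "entity_b"}),
    (1992, {"entity_a", "entity_c"}),
    (1998, {"entity_a", "entity_b"}),
]
bins = bin_entities(records)
print(time_invariant(bins))
print(interval_specific(bins))
```

Keeping the two views separate mirrors the distinction drawn in the abstract: interval-specific entities show how the literature changes over time, while the time-invariant set is the candidate input for subsequent interaction screening.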


