The BioLexicon: a large-scale terminological resource for biomedical text mining

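Paul Thompson; John McNaught; Simonetta Montemagni; Nicoletta Calzolari; Riccardo del Gratta; Vivian Lee; Simone Marchi; Monica Monachini; Piotr Pezik; Valeria Quochi; CJ Rupp; Yutaka Sasaki; Giulia Venturi; Dietrich Rebholz-Schuhmann; Sophia Ananiadou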
2011

Abstract

Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.

Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.

Conclusions: The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.

DC field Value Language
dc.authority.ancejournal BMC BIOINFORMATICS -
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Paul Thompson it
dc.authority.people John McNaught it
dc.authority.people Simonetta Montemagni it
dc.authority.people Nicoletta Calzolari it
dc.authority.people Riccardo del Gratta it
dc.authority.people Vivian Lee it
dc.authority.people Simone Marchi it
dc.authority.people Monica Monachini it
dc.authority.people Piotr Pezik it
dc.authority.people Valeria Quochi it
dc.authority.people CJ Rupp it
dc.authority.people Yutaka Sasaki it
dc.authority.people Giulia Venturi it
dc.authority.people Dietrich Rebholz-Schuhmann it
dc.authority.people Sophia Ananiadou it
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Journal article *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/02/21 05:50:47 -
dc.date.available 2024/02/21 05:50:47 -
dc.date.issued 2011 -
dc.description.affiliations School of Computer Science, University of Manchester; National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester; Manchester Interdisciplinary Biocentre, University of Manchester; Istituto di Linguistica Computazionale del CNR; European Bioinformatics Institute, Wellcome Trust Genome Campus; Toyota Technological Institute -
dc.description.allpeople Thompson, Paul; McNaught, John; Montemagni, Simonetta; Calzolari, Nicoletta; del Gratta, Riccardo; Lee, Vivian; Marchi, Simone; Monachini, Monica; Pezik, Piotr; Quochi, Valeria; Rupp, CJ; Sasaki, Yutaka; Venturi, Giulia; Rebholz-Schuhmann, Dietrich; Ananiadou, Sophia -
dc.description.allpeopleoriginal Paul Thompson, John McNaught, Simonetta Montemagni, Nicoletta Calzolari, Riccardo del Gratta, Vivian Lee, Simone Marchi, Monica Monachini, Piotr Pezik, Valeria Quochi, CJ Rupp, Yutaka Sasaki, Giulia Venturi, Dietrich Rebholz-Schuhmann, Sophia Ananiadou -
dc.description.fulltext none en
dc.description.note ID_PUMA: cnr.ilc/2011-A0-011 -
dc.description.numberofauthors 15 -
dc.identifier.doi 10.1186/1471-2105-12-397 -
dc.identifier.isi WOS:000297641800001 -
dc.identifier.scopus 2-s2.0-80053915290 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/175344 -
dc.identifier.url http://www.biomedcentral.com/1471-2105/12/397 -
dc.language.iso en -
dc.miur.last.status.update 2024-10-10T13:46:27Z *
dc.relation.firstpage 1 -
dc.relation.issue 397 -
dc.relation.lastpage 29 -
dc.relation.numberofpages 29 -
dc.relation.volume 12 -
dc.subject.keywords Text Mining -
dc.subject.keywords Information Extraction -
dc.subject.keywords Computational Lexicon -
dc.subject.singlekeyword Text Mining *
dc.subject.singlekeyword Information Extraction *
dc.subject.singlekeyword Computational Lexicon *
dc.title The BioLexicon: a large-scale terminological resource for biomedical text mining en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Journal contribution::01.01 Journal article it
dc.type.miur 262 -
dc.type.referee Yes, but type not specified -
dc.ugov.descaux1 205232 -
iris.orcid.lastModifiedDate 2024/04/04 17:36:55 *
iris.orcid.lastModifiedMillisecond 1712245015377 *
iris.scopus.extIssued 2011 -
iris.scopus.extTitle The BioLexicon: A large-scale terminological resource for biomedical text mining -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost publisher *
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.1186/1471-2105-12-397 *
iris.unpaywall.hosttype publisher *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj true *
iris.unpaywall.landingpage https://doi.org/10.1186/1471-2105-12-397 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 13/03/2025 05:50:20 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1741841420980 -
iris.unpaywall.oastatus gold *
iris.unpaywall.pdfurl https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/1471-2105-12-397 *
scopus.authority.ancejournal BMC BIOINFORMATICS###1471-2105 *
scopus.category 1315 *
scopus.category 1303 *
scopus.category 1312 *
scopus.category 1706 *
scopus.category 2604 *
scopus.contributor (name; surname; affiliation; subaffiliation; country; afid; auid; dptid)
scopus.contributor Paul; Thompson; University of Manchester; Manchester Interdisciplinary Biocentre; United Kingdom; 60003771; 57820641400; 103240669
scopus.contributor John; McNaught; University of Manchester; Manchester Interdisciplinary Biocentre; United Kingdom; 60003771; 22953888200; 103240669
scopus.contributor Simonetta; Montemagni; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 15056781100; -
scopus.contributor Nicoletta; Calzolari; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 8845912500; -
scopus.contributor Riccardo; del Gratta; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 34976432900; -
scopus.contributor Vivian; Lee; Wellcome Trust Genome Campus; European Bioinformatics Institute; United Kingdom; 60026124; 36602778700; -
scopus.contributor Simone; Marchi; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 27567818000; -
scopus.contributor Monica; Monachini; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 23397766600; -
scopus.contributor Piotr; Pezik; Wellcome Trust Genome Campus; European Bioinformatics Institute; United Kingdom; 60026124; 24332242800; -
scopus.contributor Valeria; Quochi; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 34977412400; -
scopus.contributor CJ; Rupp; University of Manchester; Manchester Interdisciplinary Biocentre; United Kingdom; 60003771; 37666044700; 103240669
scopus.contributor Yutaka; Sasaki; Toyota Technological Institute; -; Japan; 60006081; 35956948800; -
scopus.contributor Giulia; Venturi; Istituto di Linguistica Computazionale del CNR; -; Italy; 60008941; 27568199800; -
scopus.contributor Dietrich; Rebholz-Schuhmann; Wellcome Trust Genome Campus; European Bioinformatics Institute; United Kingdom; 60026124; 6507852707; -
scopus.contributor Sophia; Ananiadou; University of Manchester; Manchester Interdisciplinary Biocentre; United Kingdom; 60003771; 6602788919; 103240669
scopus.date.issued 2011 *
scopus.description.abstracteng © 2011 Thompson et al; licensee BioMed Central Ltd. *
scopus.description.allpeopleoriginal Thompson P.; McNaught J.; Montemagni S.; Calzolari N.; del Gratta R.; Lee V.; Marchi S.; Monachini M.; Pezik P.; Quochi V.; Rupp C.J.; Sasaki Y.; Venturi G.; Rebholz-Schuhmann D.; Ananiadou S. *
scopus.differences scopus.description.allpeopleoriginal *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.language.iso *
scopus.document.type ar *
scopus.document.types ar *
scopus.funding.funders 501100000276 - Department of Health and Social Care; 501100000265 - Medical Research Council; 501100000272 - National Institute for Health Research; 100010269 - Wellcome Trust; 501100000289 - Cancer Research UK; 501100000274 - British Heart Foundation; 501100000589 - Chief Scientist Office; 100014013 - UK Research and Innovation; 501100000268 - Biotechnology and Biological Sciences Research Council; 501100000780 - European Commission; *
scopus.funding.ids BB/G013160/1; FP6-028099; *
scopus.identifier.doi 10.1186/1471-2105-12-397 *
scopus.identifier.eissn 1471-2105 *
scopus.identifier.pmid 21992002 *
scopus.identifier.pui 51667516 *
scopus.identifier.scopus 2-s2.0-80053915290 *
scopus.journal.sourceid 17929 *
scopus.language.iso eng *
scopus.relation.article 397 *
scopus.relation.volume 12 *
scopus.title The BioLexicon: A large-scale terminological resource for biomedical text mining *
scopus.titleeng The BioLexicon: A large-scale terminological resource for biomedical text mining *
Appears in types: 01.01 Journal article
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/175344
Citations
  • PMC: ND
  • Scopus: 50
  • ISI (Web of Science): 38