Web Language Identification Testing Tool

Abrate Matteo;Bacciu Clara;Frontini Francesca;Lapolla Mariantonietta Noemi;Marchetti Andrea;Monachini Monica
2012

Abstract

A variety of tools for automatic language identification are now available. Regardless of the approach used, two features are crucial for evaluating the performance of such tools: the precision of the results and the range of languages that can be detected. In this work we focus on a subtask of written language identification that is important for preserving and enhancing multilinguality on the Web, i.e. detecting the language of a Web page given its URL. More specifically, the final aim is to verify to what extent under-represented languages are recognized by available tools. The main specificity of Web Language Identification (WLI) is that an HTML page often provides extralinguistic clues (URL domain name, metadata, encoding, etc.) that can enhance accuracy. We first provide some data and statistics on the presence of languages on the Web, then discuss existing practices and tools for language identification according to different metrics, for instance the approaches used and the number of supported languages, and finally make some proposals on how to improve current Web Language Identifiers. We also present a preliminary WLI service that builds on the Google Chromium Compact Language Detector; the WLI tool allows us to test the Google n-gram-based algorithm against an ad hoc gold standard of pages in various languages. The gold standard, based on a selection of Wikipedia projects, contains samples in languages for which no automatic recognition has been attempted; it can thus be used by specialists to develop and evaluate WLI systems.
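
The extralinguistic clues mentioned in the abstract (URL domain name, metadata, encoding) can be illustrated with a minimal Python sketch. This is not the authors' WLI service and it does not use the Compact Language Detector; the names `LanguageHintParser` and `collect_language_hints` are hypothetical and introduced only for illustration. The sketch assumes the page has already been downloaded and decoded to a string, and it only gathers hints; how to weigh them against a text-based detector is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LanguageHintParser(HTMLParser):
    """Collects language declarations found in an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.html_lang = None      # value of <html lang="...">
        self.meta_language = None  # value of <meta http-equiv="Content-Language">
        self.charset = None        # value of <meta charset="...">

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "html" and attributes.get("lang"):
            self.html_lang = attributes["lang"].strip().lower()
        elif tag == "meta":
            if attributes.get("http-equiv", "").lower() == "content-language":
                content = attributes.get("content", "").strip().lower()
                self.meta_language = content or None
            if attributes.get("charset"):
                self.charset = attributes["charset"].strip().lower()


def collect_language_hints(url, html_text):
    """Return the extralinguistic hints available for one page, without guessing a language."""
    parser = LanguageHintParser()
    parser.feed(html_text)
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return {
        # Country-code domains (e.g. "it") may suggest a language; generic ones do not.
        "top_level_domain": labels[-1] if len(labels) > 1 else None,
        # Some sites, such as Wikipedia projects, put a language code in the subdomain.
        "subdomain": labels[0] if len(labels) > 2 else None,
        "html_lang": parser.html_lang,
        "meta_language": parser.meta_language,
        "charset": parser.charset,
    }


if __name__ == "__main__":
    page = '<html lang="sc"><head><meta charset="UTF-8"></head><body></body></html>'
    print(collect_language_hints("https://sc.wikipedia.org/wiki/Sardigna", page))
```

Such hints are declarations made by page authors or site conventions and can be missing or wrong, so in this sketch they are returned as raw evidence rather than turned into a decision.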
DC Field Value Language
dc.authority.orgunit Istituto di informatica e telematica - IIT -
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Abrate Matteo it
dc.authority.people Bacciu Clara it
dc.authority.people Frontini Francesca it
dc.authority.people Lapolla Mariantonietta Noemi it
dc.authority.people Marchetti Andrea it
dc.authority.people Monachini Monica it
dc.collection.name 04.04 Presentation/Communication not published in conference proceedings *
dc.contributor.appartenenza Istituto di informatica e telematica - IIT *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.date.issued 2012 -
dc.description.affiliations [1] CNR-IIT, Pisa; [2] CNR-ILC, Pisa -
dc.description.allpeople Abrate, Matteo; Bacciu, Clara; Frontini, Francesca; Lapolla, Mariantonietta Noemi; Marchetti, Andrea; Monachini, Monica -
dc.description.allpeopleoriginal Abrate, Matteo [1]; Bacciu, Clara [1]; Frontini, Francesca [2]; Lapolla, Mariantonietta Noemi [1]; Marchetti, Andrea [1]; Monachini, Monica [2] -
dc.description.fulltext none en
dc.description.note id_puma: /cnr.ilc/2012-A3-002 -
dc.description.numberofauthors 6 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/128221 -
dc.language.iso eng -
dc.relation.conferencedate 15 - 16 March 2012 -
dc.relation.conferencename The Multilingual Web - the Way Ahead -
dc.relation.conferenceplace Luxembourg -
dc.subject.keywords Multilingual Web -
dc.subject.singlekeyword Multilingual Web *
dc.title Web Language Identification Testing Tool en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.04 Presentation/Communication not published in conference proceedings it