Web Language Identification Testing Tool

Frontini, F; Monachini, M; LaPolla, M N; Marchetti, A; Abrate, M; Bacciu, C

A variety of tools for automatic language identification are now available. Regardless of the approach used, at least two features are crucial for evaluating the performance of such tools: the precision of the results and the range of languages that can be detected. In this work we focus on a subtask of written language identification that is important for preserving and enhancing multilingualism on the Web, namely detecting the language of a Web page given its URL. More specifically, the final aim is to verify to what extent under-represented languages are recognized by available tools. What distinguishes Web Language Identification (WLI) is that an HTML page often provides useful extralinguistic clues (URL domain name, metadata, encoding, etc.) that can improve accuracy. We first provide some data and statistics on the presence of languages on the Web; second, we discuss existing practices and tools for language identification according to different criteria, for instance the approaches used and the number of supported languages; finally, we make some proposals on how to improve current Web language identifiers. We also present a preliminary WLI service that builds on the Google Chromium Compact Language Detector; the WLI tool allows us to test the Google n-gram based algorithm against an ad hoc gold standard of pages in various languages. The gold standard, based on a selection of Wikipedia projects, contains samples in languages for which no automatic recognition has been attempted; it can thus be used by specialists to develop and evaluate WLI systems.
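The two evaluation features named in the abstract, precision and language coverage, can be illustrated with a minimal sketch. It is not the authors' tool: the function `evaluate`, the stand-in detector `toy_detect`, and the four-sample `GOLD` list are all hypothetical names and data introduced here for illustration. A real test would plug in an actual detector and a gold standard such as the Wikipedia-based one described above.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def evaluate(detect: Callable[[str], str],
             gold: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Score a detector against (text, expected_language) pairs."""
    predicted = Counter()    # how often each language code was predicted
    correct = Counter()      # how often a prediction matched the gold label
    expected_langs = set()   # every language present in the gold standard
    total = 0
    for text, expected in gold:
        guess = detect(text)
        predicted[guess] += 1
        expected_langs.add(expected)
        total += 1
        if guess == expected:
            correct[guess] += 1
    # Per-language precision: correct predictions / all predictions of that language.
    precision = {lang: correct[lang] / n for lang, n in predicted.items()}
    # Gold-standard languages that were never recognized correctly.
    undetected = sorted(lang for lang in expected_langs if correct[lang] == 0)
    accuracy = sum(correct.values()) / total if total else 0.0
    return {"precision": precision, "undetected": undetected, "accuracy": accuracy}


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector based on a few function words."""
    words = set(text.lower().split())
    if "the" in words:
        return "en"
    if "il" in words or "la" in words:
        return "it"
    return "un"  # unknown


# Hypothetical miniature gold standard: (sample text, expected language code).
GOLD = [
    ("the cat sat", "en"),
    ("il gatto dorme", "it"),
    ("la casa the door", "it"),
    ("ar c'hazh a gousk", "br"),
]

if __name__ == "__main__":
    print(evaluate(toy_detect, GOLD))
```

Reporting the undetected languages alongside precision keeps under-represented languages visible, since a detector can score well on precision while never recognizing them at all.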

2012


Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di informatica e telematica - IIT	-
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	-
dc.authority.people	F Frontini	it
dc.authority.people	M Monachini	it
dc.authority.people	M N LaPolla	it
dc.authority.people	A Marchetti	it
dc.authority.people	M Abrate	it
dc.authority.people	C Bacciu	it
dc.collection.id.s	010b2614-196f-4b19-86fc-88182f427232	*
dc.collection.name	04.03 Poster in Atti di convegno	*
dc.contributor.appartenenza	Istituto di informatica e telematica - IIT	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	912	*
dc.contributor.appartenenza.mi	918	*
dc.date.accessioned	2024/02/20 12:45:52	-
dc.date.available	2024/02/20 12:45:52	-
dc.date.issued	2012	-
dc.description.affiliations	CNR-ILC, Pisa, Italy; CNR-ILC, Pisa, Italy; CNR-IIT, Pisa, Italy; CNR-IIT, Pisa, Italy; CNR-IIT, Pisa, Italy; CNR-IIT, Pisa, Italy	-
dc.description.allpeople	Frontini, F; Monachini, M; LaPolla, M N; Marchetti, A; Abrate, M; Bacciu, C	-
dc.description.allpeopleoriginal	F. Frontini, M. Monachini, M. N. LaPolla, A. Marchetti, M. Abrate, C. Bacciu	-
dc.description.fulltext	none	en
dc.description.numberofauthors	6	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/314751	-
dc.language.iso	eng	-
dc.relation.conferencedate	15-16/03/2012	-
dc.relation.conferencename	W3C Workshop, Call for Participation: The Multilingual Web - The Way Ahead	-
dc.relation.conferenceplace	Luxembourg	-
dc.relation.firstpage	1	-
dc.relation.lastpage	1	-
dc.subject.keywords	Language Identification Tools	-
dc.subject.keywords	Multilingual Web	-
dc.subject.singlekeyword	Language Identification Tools	*
dc.subject.singlekeyword	Multilingual Web	*
dc.title	Web Language Identification Testing Tool	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.03 Poster in Atti di convegno	it
dc.type.miur	275	-
dc.type.referee	Yes, but type not specified	-
dc.ugov.descaux1	348940	-
iris.orcid.lastModifiedDate	2024/04/04 15:50:43	*
iris.orcid.lastModifiedMillisecond	1712238643307	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.03 Poster in Atti di convegno

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/314751


CNR Institutional Research Information System
