WLI: a Web application for Language Identification and evaluation of available tools
A Marchetti;C Bacciu;M Abrate
2012
Abstract
Web Language Identifier (WLI) is a service that, starting from the URL of a Web page or from plain text, uses a pool of language identification tools to return a set of candidate languages, each with a confidence score. The currently embedded tools are Chromium Compact Language Detector, Lingua::Identify, and a simple identifier based on HTML attributes. The service can be accessed through a Web application or via an API. To evaluate the identifiers globally, we constructed a test set of Web pages extracted from 146 Wikipedia projects. This also allows WLI to be used as a service for comparing language identification tools in terms of supported languages and precision of the results. The charts summarizing the comparison can be viewed in the WLI Web application. We plan to extend the service so that users can add their own identifiers.
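The pooling idea described in the abstract can be illustrated with a minimal sketch. It is not the WLI implementation: the function names (`html_attribute_identifier`, `stopword_identifier`, `identify`), the toy stopword lists, and the choice to combine scores by averaging over the pool are all assumptions made for illustration, since the abstract does not state how the tools' confidences are merged. The stopword identifier is only a stand-in for a real detector such as Chromium Compact Language Detector or Lingua::Identify.

```python
from collections import defaultdict
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: an identifier maps page content to
# {language_code: confidence in [0, 1]}.
Identifier = Callable[[str], Dict[str, float]]


class _LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if present."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def html_attribute_identifier(page: str) -> Dict[str, float]:
    """Identifier based on HTML attributes (here only <html lang="...">)."""
    parser = _LangAttrParser()
    parser.feed(page)
    if not parser.lang:
        return {}
    # Keep only the primary subtag, e.g. "en-US" -> "en".
    return {parser.lang.split("-")[0].lower(): 1.0}


def stopword_identifier(page: str) -> Dict[str, float]:
    """Toy stand-in for a statistical detector (assumption, not a WLI tool)."""
    stopwords = {"en": {"the", "and", "of"}, "it": {"il", "di", "che"}}
    words = page.lower().split()
    counts = {lang: sum(w in sw for w in words) for lang, sw in stopwords.items()}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: c / total for lang, c in counts.items() if c}


def identify(page: str, pool: List[Identifier]) -> List[Tuple[str, float]]:
    """Average each language's confidence over the pool (assumed scheme)."""
    scores: Dict[str, float] = defaultdict(float)
    for identifier in pool:
        for lang, confidence in identifier(page).items():
            scores[lang] += confidence / len(pool)
    # Candidate languages, most confident first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With `pool = [html_attribute_identifier, stopword_identifier]`, a page declaring `lang="en-US"` and containing English stopwords yields `[("en", 1.0)]`, while the plain text `"il gatto di casa"` yields `[("it", 0.5)]` because only one of the two identifiers returns a candidate. Keeping each identifier behind the same callable interface is one way a user-supplied identifier could later be added to the pool.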