Towards a new generation of Language Resources in the Semantic Web Vision

Calzolari, N

In this contribution I touch on issues related to language resources (LR) and semantics, to dynamic resources that are automatically acquired, and to how to move towards a new generation of LRs compliant with the Semantic Web (SW) vision, pointing at the potential of, and the need for, cross-fertilisation between the two communities of Human Language Technology (HLT) and SW/ontologies. Many of these issues are related to Yorick's work on preferences, lexicons and semantic annotation, and more recently to his ideas on the relation between HLT and SW. Large-scale LRs are unanimously recognised as the necessary infrastructure underlying language technology (LT) (Varile and Zampolli (eds.) 1997). Discussing a few major European initiatives for building harmonised LRs, I highlight how computational lexicons and textual corpora should be considered as complementary views on the lexical space, with a view to modelling a new type of resource that is both a lexicon and a corpus at once. A "complete" computational lexicon should incorporate and represent our "knowledge of the world". I claim that it is theoretically impossible to achieve completeness within any "static" lexicon. Moreover, choices on the syntagmatic axis are pervasive in language. A sound language infrastructure must encompass both "static" lexicons, such as the traditional ones, and "dynamic" systems able to enrich the lexicon with information acquired on-line from large corpora, thus capturing the "actually realised" potentialities, the large range of variation, and the flexibility inherent in the language as it is used.
These are the challenges for semantic tagging, which is at the core of the SW vision of giving meaning, in a manner understandable by machines, to the content of Web documents. Broadening our perspective into the future, the need for more and more "knowledge intensive" large-size LRs for effective content processing requires a change of paradigm and the design of a new generation of LRs based on open content interoperability standards. The SW notion may be helpful in determining the shape of the LRs of the future, consistent with the vision of an open distributed space of sharable knowledge available on the Web for processing. Realising the necessary world-wide linguistic infrastructure requires coverage not only of a range of technical aspects but also, and maybe most critically, of a number of organisational aspects. An essential aspect for ensuring an integrated basis is to enhance the interchange and cooperation among the many communities that now act separately, such as LR and LT developers, Terminology, Semantic Web and Ontology experts, content providers, linguists and so on. This is one of the challenges for the next years, for a usable and useful "language" scenario in the global network.
