Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource
Francesca Frontini
2019
eLex 2019: Book of Abstracts, p. 36
Abstract
The Petit Larousse Illustré (PLI) is a monolingual French dictionary that has been published every year since the 1906 edition and is therefore a fundamental record of the evolution of the French language. Following the entry of the pre-1948 editions of the PLI into the public domain in 2018, the Nénufar (Nouvelle édition numérique de fac-similés de référence) project was launched at the Praxiling laboratory in Montpellier with the aim of digitizing these editions and making them available electronically. The project is ongoing: selected editions from each decade will be fully digitized (so far the 1906, 1924 and 1925 editions have been completed), and changes will be traced back and dated to a specific year. Nénufar's primary aim is to make the editions available and searchable through an advanced search interface that will not only support selective querying of the text by lemma and type of content (definitions, examples, ...), but crucially also make it possible to detect and study changes by comparing different editions. To this end, a dedicated web interface has been put in place. (A similar project, which presents data and scans from successive editions of the same legacy dictionary, has been carried out by the team behind the Swedish Academy's Wordlist; see Holmer, Malmgren, and Martens (2016) and http://spraakdata.gu.se/saolhist/.) Alongside the digitized text, the Nénufar website provides high-quality scans of each page. In compliance with current open data best practices (Wilkinson et al., 2016), the project also aims to make the source data available separately from the querying interface, both for research and for long-term preservation. The primary encoding format is TEI-XML; in our case the TEI encoding is closely inspired by the latest version of the TEI-Lex0 guidelines for encoding lexicographic resources (Bański et al., 2017; Romary & Tasovac, 2018), which are themselves based on TEI.
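To make the idea of comparing editions more concrete, the following minimal sketch parses two entries for the same lemma and reports which definitions were added or removed. It assumes a simplified TEI-Lex0-like encoding; the sample entries, their identifiers and the function names are invented for illustration and are not taken from the Nénufar data or its actual schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented placeholder entries (not real PLI content), encoded in a
# simplified TEI-Lex0-like style: <entry>, <form type="lemma">, <sense>, <def>.
ENTRY_1906 = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="pli1906.exemple" xml:lang="fr">
  <form type="lemma"><orth>exemple</orth></form>
  <sense xml:id="pli1906.exemple.1"><def>Première définition fictive.</def></sense>
</entry>
"""

ENTRY_1924 = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="pli1924.exemple" xml:lang="fr">
  <form type="lemma"><orth>exemple</orth></form>
  <sense xml:id="pli1924.exemple.1"><def>Première définition fictive.</def></sense>
  <sense xml:id="pli1924.exemple.2"><def>Seconde définition fictive.</def></sense>
</entry>
"""


def parse_entry(xml_text):
    """Return (lemma, list of definition strings) for one entry."""
    root = ET.fromstring(xml_text.strip())
    lemma = root.findtext("tei:form[@type='lemma']/tei:orth", namespaces=TEI_NS)
    definitions = [
        d.text.strip() for d in root.iterfind(".//tei:def", TEI_NS) if d.text
    ]
    return lemma, definitions


def compare_definitions(old_defs, new_defs):
    """List definitions present in only one of the two editions."""
    return {
        "removed": [d for d in old_defs if d not in new_defs],
        "added": [d for d in new_defs if d not in old_defs],
    }


lemma_old, defs_old = parse_entry(ENTRY_1906)
lemma_new, defs_new = parse_entry(ENTRY_1924)
changes = compare_definitions(defs_old, defs_new)
print(lemma_new, changes)
```

Exact string comparison is the simplest possible notion of change; a real comparison would probably need to tolerate OCR noise and minor typographic differences between editions.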
The choice of a TEI-based approach allows the Nénufar project to align itself with pre-existing initiatives and tools. By aligning with TEI-Lex0 we will be able to use digitization tools such as Grobid (Khemakhem et al., 2017), which have TEI-Lex0 as their native format and which have already been tested and used within the Nénufar project to speed up the digitization of new editions. In addition, we will be able to draw on ongoing initiatives to convert TEI-Lex0 datasets to RDF using OntoLex-Lemon (McCrae et al., 2017; Bosque-Gil et al., 2016), the W3C community standard for publishing lexicons as Linked Data, which will allow the Nénufar dataset to be published as a Linked Open Data (LOD) graph. The LOD version of the Nénufar dataset, currently under development, will be queryable through a SPARQL endpoint and will contain all available editions as a single graph, allowing expert users to perform complex queries that could detect systematic changes in the dataset. The LOD version is particularly well suited to being linked to other datasets; more recent editions, once added, could also be of interest for NLP applications.
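The single-graph idea can be sketched as follows, under stated assumptions. Entries from several editions are mapped to OntoLex-Lemon-style triples and merged into one set, which is then searched with a pattern match comparable to what a SPARQL query would do. The `http://example.org/nenufar/` namespace, the `ex:edition` property, the use of `skos:definition` on senses and the placeholder entries are all hypothetical; the project's actual RDF modelling is not described here. Plain tuples stand in for an RDF library so that the example stays self-contained.

```python
EX = "http://example.org/nenufar/"  # hypothetical namespace, not the project's


def entry_to_triples(edition, lemma, definitions):
    """Map one dictionary entry to OntoLex-Lemon-style (s, p, o) tuples.

    Predicates are written as prefixed names and literals as plain strings;
    this is a simplification of real RDF terms.
    """
    entry = f"{EX}{edition}/{lemma}"
    form = f"{entry}#form"
    triples = [
        (entry, "rdf:type", "ontolex:LexicalEntry"),
        (entry, "ex:edition", str(edition)),  # hypothetical property
        (entry, "ontolex:canonicalForm", form),
        (form, "ontolex:writtenRep", lemma),
    ]
    for i, definition in enumerate(definitions, start=1):
        sense = f"{entry}#sense{i}"
        triples.append((entry, "ontolex:sense", sense))
        triples.append((sense, "skos:definition", definition))
    return triples


def definitions_by_edition(graph, lemma):
    """Collect the definitions of a lemma per edition from the merged graph."""
    result = {}
    for entry, predicate, form in graph:
        if predicate != "ontolex:canonicalForm":
            continue
        if (form, "ontolex:writtenRep", lemma) not in graph:
            continue
        edition = next(
            o for s, p, o in graph if s == entry and p == "ex:edition"
        )
        senses = {o for s, p, o in graph if s == entry and p == "ontolex:sense"}
        result[edition] = {
            o for s, p, o in graph if s in senses and p == "skos:definition"
        }
    return result


# All editions live in one graph (placeholder content, not real PLI text).
graph = set(entry_to_triples(1906, "exemple", ["Première définition fictive."]))
graph |= set(
    entry_to_triples(
        1924,
        "exemple",
        ["Première définition fictive.", "Seconde définition fictive."],
    )
)

by_edition = definitions_by_edition(graph, "exemple")
new_in_1924 = by_edition["1924"] - by_edition["1906"]
print(new_in_1924)
```

Because every edition shares the same graph, a query of this kind can range over all lemmas at once, which is what would make systematic changes detectable.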


