Towards a preschooler corpus of Italian: an experimental journey
Chiara Bolognesi; Alessandra Cinini; Paola Cutugno; Melissa Ferretti; Davide Chiarella
2025
Abstract
The paper surveys the process and reasoning behind the written-sources section of the Corpus of Italian for Preschoolers (CIP), a corpus collecting child-directed speech targeted at Italian children aged 3–6. Beginning with an overview of the available child-speech and child-directed speech corpora, the article underlines the need for an Italian corpus focusing on children's passive vocabulary and explains how such a tool would be useful for future comparative studies on children's own production and as a resource for professionals concerned with children's needs. The CIP aims to collect 250,000 linguistic tokens across a selection of different sources (Written, Spoken, Signed) gathered with the help of schools and families. This paper focuses specifically on the selection criteria for the written sources and the first steps of their linguistic processing, explaining through a set of three experiments how three different linguistic annotation tools performed on the tasks of tokenizing, lemmatizing and POS-tagging three different children's literature texts. The last part presents the results of the experiments, with insight into the NLP tools' performance, as well as the reasons for our choice of tool for the large-scale annotation process and the still-ongoing challenges in finalizing our corpus.

| DC field | Value | Language |
|---|---|---|
| dc.authority.ancejournal | RESEARCH METHODS IN APPLIED LINGUISTICS | en |
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Chiara Bolognesi | en |
| dc.authority.people | Alessandra Cinini | en |
| dc.authority.people | Paola Cutugno | en |
| dc.authority.people | Melissa Ferretti | en |
| dc.authority.people | Davide Chiarella | en |
| dc.authority.project | 2022NPXYHH | en |
| dc.collection.id.s | b3f88f24-048a-4e43-8ab1-6697b90e068e | * |
| dc.collection.name | 01.01 Articolo in rivista | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2025/12/09 15:09:22 | - |
| dc.date.available | 2025/12/09 15:09:22 | - |
| dc.date.firstsubmission | 2025/09/02 17:08:24 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2025/09/02 17:08:24 | * |
| dc.description.allpeople | Bolognesi, Chiara; Cinini, Alessandra; Cutugno, Paola; Ferretti, Melissa; Chiarella, Davide | - |
| dc.description.allpeopleoriginal | Chiara Bolognesi; Alessandra Cinini; Paola Cutugno; Melissa Ferretti; Davide Chiarella | en |
| dc.description.fulltext | open | en |
| dc.description.international | no | en |
| dc.description.numberofauthors | 5 | - |
| dc.identifier.doi | 10.1016/j.rmal.2025.100252 | en |
| dc.identifier.scopus | 2-s2.0-105014013432 | en |
| dc.identifier.source | orcid | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/552644 | - |
| dc.identifier.url | https://www.sciencedirect.com/science/article/pii/S2772766125000734 | en |
| dc.language.iso | eng | en |
| dc.relation.issue | 3 | en |
| dc.relation.projectAcronym | CIP | en |
| dc.relation.projectAwardNumber | CUP N° B53D23014720006 | en |
| dc.relation.projectAwardTitle | Corpus of Italian language for Preschoolers. Lexicon directed to Italian preschool children from 3 to 6 years collected from heterogeneous sources in Italian and Italian Sign Language | en |
| dc.relation.projectFunderName | MUR | en |
| dc.relation.projectFundingStream | PRIN2022 | en |
| dc.relation.volume | 4 | en |
| dc.subject.keywords | Child-directed speech | - |
| dc.subject.keywords | Children's literature | - |
| dc.subject.keywords | Corpus linguistics | - |
| dc.subject.keywords | Natural language processing | - |
| dc.subject.keywords | Preschool children language acquisition | - |
| dc.subject.keywords | Written Italian | - |
| dc.subject.singlekeyword | Child-directed speech | * |
| dc.subject.singlekeyword | Children's literature | * |
| dc.subject.singlekeyword | Corpus linguistics | * |
| dc.subject.singlekeyword | Natural language processing | * |
| dc.subject.singlekeyword | Preschool children language acquisition | * |
| dc.subject.singlekeyword | Written Italian | * |
| dc.title | Towards a preschooler corpus of Italian: an experimental journey | en |
| dc.type.driver | info:eu-repo/semantics/article | - |
| dc.type.full | 01 Contributo su Rivista::01.01 Articolo in rivista | it |
| dc.type.impactfactor | si | en |
| dc.type.miur | 262 | - |
| dc.type.referee | Esperti anonimi | en |
| iris.mediafilter.data | 2025/12/10 03:53:20 | * |
| iris.orcid.lastModifiedDate | 2025/12/09 15:09:22 | * |
| iris.orcid.lastModifiedMillisecond | 1765289362516 | * |
| iris.scopus.extIssued | 2025 | - |
| iris.scopus.extTitle | Towards a preschooler corpus of Italian: an experimental journey | - |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.bestoahost | publisher | * |
| iris.unpaywall.bestoaversion | publishedVersion | * |
| iris.unpaywall.doi | 10.1016/j.rmal.2025.100252 | * |
| iris.unpaywall.hosttype | publisher | * |
| iris.unpaywall.isoa | true | * |
| iris.unpaywall.journalisindoaj | false | * |
| iris.unpaywall.landingpage | https://doi.org/10.1016/j.rmal.2025.100252 | * |
| iris.unpaywall.license | cc-by | * |
| iris.unpaywall.metadataCallLastModified | 10/12/2025 04:00:04 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1765335604653 | - |
| iris.unpaywall.oastatus | hybrid | * |
| scopus.authority.ancejournal | RESEARCH METHODS IN APPLIED LINGUISTICS###2772-7661 | * |
| scopus.category | 3301 | * |
| scopus.category | 3310 | * |
| scopus.contributor.affiliation | National Research Council Institute of Computational Linguistics | - |
| scopus.contributor.affiliation | National Research Council Institute of Computational Linguistics | - |
| scopus.contributor.affiliation | National Research Council Institute of Computational Linguistics | - |
| scopus.contributor.affiliation | National Research Council Institute of Computational Linguistics | - |
| scopus.contributor.affiliation | National Research Council Institute of Computational Linguistics | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.auid | 60059627300 | - |
| scopus.contributor.auid | 36866071100 | - |
| scopus.contributor.auid | 6505755173 | - |
| scopus.contributor.auid | 57203499432 | - |
| scopus.contributor.auid | 25930765400 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid | | - |
| scopus.contributor.dptid | | - |
| scopus.contributor.dptid | | - |
| scopus.contributor.dptid | | - |
| scopus.contributor.dptid | | - |
| scopus.contributor.name | Chiara | - |
| scopus.contributor.name | Alessandra | - |
| scopus.contributor.name | Paola | - |
| scopus.contributor.name | Melissa | - |
| scopus.contributor.name | Davide | - |
| scopus.contributor.subaffiliation | | - |
| scopus.contributor.subaffiliation | | - |
| scopus.contributor.subaffiliation | | - |
| scopus.contributor.subaffiliation | | - |
| scopus.contributor.subaffiliation | | - |
| scopus.contributor.surname | Bolognesi | - |
| scopus.contributor.surname | Cinini | - |
| scopus.contributor.surname | Cutugno | - |
| scopus.contributor.surname | Ferretti | - |
| scopus.contributor.surname | Chiarella | - |
| scopus.date.issued | 2025 | * |
| scopus.description.allpeopleoriginal | Bolognesi C.; Cinini A.; Cutugno P.; Ferretti M.; Chiarella D. | * |
| scopus.differences | scopus.subject.keywords | * |
| scopus.differences | scopus.description.allpeopleoriginal | * |
| scopus.document.type | ar | * |
| scopus.document.types | ar | * |
| scopus.funding.funders | 501100000780 - European Commission; 501100000780 - European Commission; | * |
| scopus.funding.ids | CUP B53D23014720006; | * |
| scopus.identifier.doi | 10.1016/j.rmal.2025.100252 | * |
| scopus.identifier.eissn | 2772-7661 | * |
| scopus.identifier.pui | 2040156059 | * |
| scopus.identifier.scopus | 2-s2.0-105014013432 | * |
| scopus.journal.sourceid | 21101160600 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Elsevier B.V. | * |
| scopus.relation.article | 100252 | * |
| scopus.relation.issue | 3 | * |
| scopus.relation.volume | 4 | * |
| scopus.subject.keywords | Child-directed speech; Children's literature; Corpus linguistics; Natural language processing; Preschool children language acquisition; Written Italian; | * |
| scopus.title | Towards a preschooler corpus of Italian: an experimental journey | * |
| scopus.titleeng | Towards a preschooler corpus of Italian: an experimental journey | * |

Appears in categories: 01.01 Articolo in rivista (journal article)

| File | Size | Format |
|---|---|---|
| 1-s2.0-S2772766125000734-main.pdf (open access). Description: publisher's version of the article. Type: Publisher's version (PDF). License: Creative Commons | 1.53 MB | Adobe PDF |

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


