Towards a preschooler corpus of Italian: an experimental journey

Chiara Bolognesi; Alessandra Cinini; Paola Cutugno; Melissa Ferretti; Davide Chiarella
2025

Abstract

The paper surveys the process and reasoning behind the written-sources section of the Corpus of Italian for Preschoolers (CIP), a corpus of child-directed speech aimed at Italian children aged 3–6. Starting from an overview of the available child-speech and child-directed speech corpora, the article underlines the need for an Italian corpus focused on children's passive vocabulary, and shows how such a resource would support future comparative studies of children's own production and serve professionals working with children's needs. The CIP aims to collect 250,000 linguistic tokens across a selection of sources (written, spoken, signed) gathered with the help of schools and families. This paper focuses specifically on the selection criteria for the written sources and the first steps of their linguistic processing, describing through a set of three experiments how three linguistic annotation tools performed at tokenizing, lemmatizing and POS-tagging three children's literature texts. The last part presents the results of the experiments, with insight into the NLP tools' performance, the reasons for our choice of tool for the large-scale annotation process, and the challenges that remain in finalizing the corpus.
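The abstract does not say how the three annotation tools were scored. As a minimal sketch of one common approach, the snippet below computes per-token lemma and POS accuracy of a tool's output against a manually checked gold annotation. The `Token` type, the `accuracy` function, and the toy sentence are hypothetical illustrations, not the authors' data or method, and the sketch assumes both sequences are already tokenized identically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical structure, for illustration only)."""
    form: str
    lemma: str
    pos: str


def accuracy(gold, predicted):
    """Return (lemma_accuracy, pos_accuracy) for two aligned token sequences."""
    if len(gold) != len(predicted):
        raise ValueError("token sequences differ in length; align them first")
    if not gold:
        raise ValueError("empty input")
    pairs = list(zip(gold, predicted))
    lemma_hits = sum(g.lemma == p.lemma for g, p in pairs)
    pos_hits = sum(g.pos == p.pos for g, p in pairs)
    return lemma_hits / len(pairs), pos_hits / len(pairs)


# Invented toy sentence: "Il lupo mangia la mela".
gold = [
    Token("Il", "il", "DET"),
    Token("lupo", "lupo", "NOUN"),
    Token("mangia", "mangiare", "VERB"),
    Token("la", "il", "DET"),
    Token("mela", "mela", "NOUN"),
]
# Invented tool output with two lemma errors and one POS error.
tool_output = [
    Token("Il", "il", "DET"),
    Token("lupo", "lupo", "NOUN"),
    Token("mangia", "mangia", "VERB"),
    Token("la", "la", "PRON"),
    Token("mela", "mela", "NOUN"),
]

lemma_acc, pos_acc = accuracy(gold, tool_output)
print(f"lemma accuracy: {lemma_acc:.2f}, POS accuracy: {pos_acc:.2f}")
```

Tokenization differences between tools would have to be resolved by an alignment step before a comparison like this one; that step is left out here.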
DC Field Value Language
dc.authority.ancejournal RESEARCH METHODS IN APPLIED LINGUISTICS en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Chiara Bolognesi en
dc.authority.people Alessandra Cinini en
dc.authority.people Paola Cutugno en
dc.authority.people Melissa Ferretti en
dc.authority.people Davide Chiarella en
dc.authority.project 2022NPXYHH en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2025/12/09 15:09:22 -
dc.date.available 2025/12/09 15:09:22 -
dc.date.firstsubmission 2025/09/02 17:08:24 *
dc.date.issued 2025 -
dc.date.submission 2025/09/02 17:08:24 *
dc.description.allpeople Bolognesi, Chiara; Cinini, Alessandra; Cutugno, Paola; Ferretti, Melissa; Chiarella, Davide -
dc.description.allpeopleoriginal Chiara Bolognesi; Alessandra Cinini; Paola Cutugno; Melissa Ferretti; Davide Chiarella en
dc.description.fulltext open en
dc.description.international no en
dc.description.numberofauthors 5 -
dc.identifier.doi 10.1016/j.rmal.2025.100252 en
dc.identifier.scopus 2-s2.0-105014013432 en
dc.identifier.source orcid *
dc.identifier.uri https://hdl.handle.net/20.500.14243/552644 -
dc.identifier.url https://www.sciencedirect.com/science/article/pii/S2772766125000734 en
dc.language.iso eng en
dc.relation.issue 3 en
dc.relation.projectAcronym CIP en
dc.relation.projectAwardNumber CUP N° B53D23014720006 en
dc.relation.projectAwardTitle Corpus of Italian language for Preschoolers. Lexicon directed to Italian preschool children from 3 to 6 years collected from heterogeneous sources in Italian and Italian Sign Language en
dc.relation.projectFunderName MUR en
dc.relation.projectFundingStream PRIN2022 en
dc.relation.volume 4 en
dc.subject.keywords Child-directed speech -
dc.subject.keywords Children's literature -
dc.subject.keywords Corpus linguistics -
dc.subject.keywords Natural language processing -
dc.subject.keywords Preschool children language acquisition -
dc.subject.keywords Written Italian -
dc.title Towards a preschooler corpus of Italian: an experimental journey en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.impactfactor yes en
dc.type.miur 262 -
dc.type.referee Anonymous experts en
iris.mediafilter.data 2025/12/10 03:53:20 *
iris.orcid.lastModifiedDate 2025/12/09 15:09:22 *
iris.orcid.lastModifiedMillisecond 1765289362516 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Towards a preschooler corpus of Italian: an experimental journey -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost publisher *
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.1016/j.rmal.2025.100252 *
iris.unpaywall.hosttype publisher *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.1016/j.rmal.2025.100252 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 10/12/2025 04:00:04 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1765335604653 -
iris.unpaywall.oastatus hybrid *
scopus.authority.ancejournal RESEARCH METHODS IN APPLIED LINGUISTICS###2772-7661 *
scopus.category 3301 *
scopus.category 3310 *
scopus.contributor.affiliation National Research Council Institute of Computational Linguistics -
scopus.contributor.affiliation National Research Council Institute of Computational Linguistics -
scopus.contributor.affiliation National Research Council Institute of Computational Linguistics -
scopus.contributor.affiliation National Research Council Institute of Computational Linguistics -
scopus.contributor.affiliation National Research Council Institute of Computational Linguistics -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.auid 60059627300 -
scopus.contributor.auid 36866071100 -
scopus.contributor.auid 6505755173 -
scopus.contributor.auid 57203499432 -
scopus.contributor.auid 25930765400 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.name Chiara -
scopus.contributor.name Alessandra -
scopus.contributor.name Paola -
scopus.contributor.name Melissa -
scopus.contributor.name Davide -
scopus.contributor.surname Bolognesi -
scopus.contributor.surname Cinini -
scopus.contributor.surname Cutugno -
scopus.contributor.surname Ferretti -
scopus.contributor.surname Chiarella -
scopus.date.issued 2025 *
scopus.description.allpeopleoriginal Bolognesi C.; Cinini A.; Cutugno P.; Ferretti M.; Chiarella D. *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.description.allpeopleoriginal *
scopus.document.type ar *
scopus.document.types ar *
scopus.funding.funders 501100000780 - European Commission; 501100000780 - European Commission; *
scopus.funding.ids CUP B53D23014720006; *
scopus.identifier.doi 10.1016/j.rmal.2025.100252 *
scopus.identifier.eissn 2772-7661 *
scopus.identifier.pui 2040156059 *
scopus.identifier.scopus 2-s2.0-105014013432 *
scopus.journal.sourceid 21101160600 *
scopus.language.iso eng *
scopus.publisher.name Elsevier B.V. *
scopus.relation.article 100252 *
scopus.relation.issue 3 *
scopus.relation.volume 4 *
scopus.subject.keywords Child-directed speech; Children's literature; Corpus linguistics; Natural language processing; Preschool children language acquisition; Written Italian; *
scopus.title Towards a preschooler corpus of Italian: an experimental journey *
scopus.titleeng Towards a preschooler corpus of Italian: an experimental journey *
Appears in types: 01.01 Articolo in rivista (journal article)
Files in this item:
File: 1-s2.0-S2772766125000734-main.pdf
Access: open access
Description: Publisher's version of the article
Type: Publisher's version (PDF)
License: Creative Commons
Size: 1.53 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/552644