Towards a preschooler corpus of Italian: an experimental journey

Chiara Bolognesi; Alessandra Cinini; Paola Cutugno; Melissa Ferretti; Davide Chiarella
2025

Abstract

The paper surveys the process and reasoning behind the written-sources section of the Corpus of Italian for Preschoolers (CIP), a corpus of child-directed speech aimed at Italian children aged 3–6. Starting from an overview of the available child-speech and child-directed speech corpora, the article underlines the need for an Italian corpus focusing on children's passive vocabulary and shows how such a resource would be useful for future comparative studies of children's own production and for professionals working on children's needs. The CIP aims to collect 250,000 linguistic tokens across a selection of different sources (Written, Spoken, Signed) gathered with the help of schools and families. This paper focuses specifically on the selection criteria for the written sources and the first steps of their linguistic processing, explaining through a set of three experiments how three different linguistic annotation tools performed on the tasks of tokenizing, lemmatizing and POS-tagging three different children's literature texts. The last part presents the results of the experiments, with insight into the NLP tools' performance, the reasons for our choice of tool for the large-scale annotation process, and the challenges that remain in finalizing the corpus.
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
Child-directed speech
Children's literature
Corpus linguistics
Natural language processing
Preschool children language acquisition
Written Italian
Files in this record:
File: 1-s2.0-S2772766125000734-main.pdf
Access: open access
Description: Publisher's version of the article
Type: Publisher's version (PDF)
License: Creative Commons
Size: 1.53 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/552644