CNR Institutional Research Information System

Towards a preschooler corpus of Italian: an experimental journey

Chiara Bolognesi;Alessandra Cinini;Paola Cutugno;Melissa Ferretti;Davide Chiarella

2025

Abstract

This paper describes the process and reasoning behind the written-sources section of the Corpus of Italian for Preschoolers (CIP), a corpus of child-directed speech aimed at Italian children aged 3–6. Starting from an overview of the available child-speech and child-directed speech corpora, the article underlines the need for an Italian corpus focused on children's passive vocabulary and explains how such a resource would support future comparative studies of children's own production and serve professionals who address children's needs. The CIP aims to collect 250,000 linguistic tokens from a range of sources (Written, Spoken, Signed) gathered with the help of schools and families. This paper focuses on the selection criteria for the written sources and the first steps of their linguistic processing. Through a set of three experiments, it shows how three linguistic annotation tools performed at tokenizing, lemmatizing, and POS-tagging three children's literature texts. The last part presents the results of the experiments, with insight into the NLP tools' performance, the reasons for our choice of tool for the large-scale annotation process, and the challenges that remain in finalizing the corpus.
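
As a rough illustration of the kind of comparison the abstract describes, the following Python sketch scores tool outputs against a hand-checked gold annotation on the three tasks (tokenization, lemmatization, POS tagging). The tool names, the sample sentence, the UPOS-style tags, and the one-to-one token alignment are all assumptions made for this example; the abstract does not state how the experiments were actually scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token: surface form, lemma, and part-of-speech tag."""
    form: str
    lemma: str
    pos: str


def accuracy(gold, predicted, field):
    """Share of positions where `field` matches the gold annotation.

    Simplifying assumption: both sequences are already aligned token by
    token, so a tokenization error shows up as a mismatching `form`.
    """
    if len(gold) != len(predicted):
        raise ValueError("sequences must be aligned to the same length")
    if not gold:
        return 0.0
    hits = sum(getattr(g, field) == getattr(p, field)
               for g, p in zip(gold, predicted))
    return hits / len(gold)


def compare_tools(gold, outputs):
    """Return per-tool accuracy for each of the three annotation tasks."""
    return {
        name: {field: accuracy(gold, predicted, field)
               for field in ("form", "lemma", "pos")}
        for name, predicted in outputs.items()
    }


# Hypothetical gold annotation for the sentence "Il lupo mangia".
GOLD = [
    Token("Il", "il", "DET"),
    Token("lupo", "lupo", "NOUN"),
    Token("mangia", "mangiare", "VERB"),
]

# Hypothetical outputs; "tool_a" and "tool_b" are placeholder names.
OUTPUTS = {
    "tool_a": list(GOLD),
    "tool_b": [
        Token("Il", "il", "DET"),
        Token("lupo", "lupo", "NOUN"),
        Token("mangia", "mangia", "VERB"),  # lemma left uninflected
    ],
}

scores = compare_tools(GOLD, OUTPUTS)
for tool, per_task in scores.items():
    print(tool, per_task)
```

Position-wise accuracy is the simplest possible metric. When tools split text differently, the outputs would first need to be aligned, which this sketch deliberately leaves out.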

Full record (DC)

DC field	Value	Language
dc.authority.ancejournal	RESEARCH METHODS IN APPLIED LINGUISTICS	en
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Chiara Bolognesi	en
dc.authority.people	Alessandra Cinini	en
dc.authority.people	Paola Cutugno	en
dc.authority.people	Melissa Ferretti	en
dc.authority.people	Davide Chiarella	en
dc.authority.project	2022NPXYHH	en
dc.collection.id.s	b3f88f24-048a-4e43-8ab1-6697b90e068e	*
dc.collection.name	01.01 Articolo in rivista	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2025/12/09 15:09:22	-
dc.date.available	2025/12/09 15:09:22	-
dc.date.firstsubmission	2025/09/02 17:08:24	*
dc.date.issued	2025	-
dc.date.submission	2025/09/02 17:08:24	*
dc.description.allpeople	Bolognesi, Chiara; Cinini, Alessandra; Cutugno, Paola; Ferretti, Melissa; Chiarella, Davide	-
dc.description.allpeopleoriginal	Chiara Bolognesi; Alessandra Cinini; Paola Cutugno; Melissa Ferretti; Davide Chiarella	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	5	-
dc.identifier.doi	10.1016/j.rmal.2025.100252	en
dc.identifier.isi	WOS:001571927900001	-
dc.identifier.scopus	2-s2.0-105014013432	en
dc.identifier.source	orcid	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/552644	-
dc.identifier.url	https://www.sciencedirect.com/science/article/pii/S2772766125000734	en
dc.language.iso	eng	en
dc.relation.issue	3	en
dc.relation.projectAcronym	CIP	en
dc.relation.projectAwardNumber	CUP N° B53D23014720006	en
dc.relation.projectAwardTitle	Corpus of Italian language for Preschoolers. Lexicon directed to Italian preschool children from 3 to 6 years collected from heterogeneous sources in Italian and Italian Sign Language	en
dc.relation.projectFunderName	MUR	en
dc.relation.projectFundingStream	PRIN2022	en
dc.relation.volume	4	en
dc.subject.keywords	Child-directed speech	-
dc.subject.keywords	Children's literature	-
dc.subject.keywords	Corpus linguistics	-
dc.subject.keywords	Natural language processing	-
dc.subject.keywords	Preschool children language acquisition	-
dc.subject.keywords	Written Italian	-
dc.subject.singlekeyword	Child-directed speech	*
dc.subject.singlekeyword	Children's literature	*
dc.subject.singlekeyword	Corpus linguistics	*
dc.subject.singlekeyword	Natural language processing	*
dc.subject.singlekeyword	Preschool children language acquisition	*
dc.subject.singlekeyword	Written Italian	*
dc.title	Towards a preschooler corpus of Italian: an experimental journey	en
dc.type.driver	info:eu-repo/semantics/article	-
dc.type.full	01 Contributo su Rivista::01.01 Articolo in rivista	it
dc.type.impactfactor	si	en
dc.type.miur	262	-
dc.type.referee	Esperti anonimi	en
iris.isi.ideLinkStatusDate	2026/05/27 10:19:17	*
iris.isi.ideLinkStatusMillisecond	1779869957325	*
iris.isi.metadataErrorDescription	0	-
iris.isi.metadataErrorType	ERROR_NO_MATCH	-
iris.isi.metadataStatus	ERROR	-
iris.mediafilter.data	2025/12/10 03:53:20	*
iris.orcid.lastModifiedDate	2026/05/27 10:19:17	*
iris.orcid.lastModifiedMillisecond	1779869957306	*
iris.scopus.extIssued	2025	-
iris.scopus.extTitle	Towards a preschooler corpus of Italian: an experimental journey	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.bestoahost	publisher	*
iris.unpaywall.bestoaversion	publishedVersion	*
iris.unpaywall.doi	10.1016/j.rmal.2025.100252	*
iris.unpaywall.hosttype	publisher	*
iris.unpaywall.isoa	true	*
iris.unpaywall.journalisindoaj	false	*
iris.unpaywall.landingpage	https://doi.org/10.1016/j.rmal.2025.100252	*
iris.unpaywall.license	cc-by	*
iris.unpaywall.metadataCallLastModified	29/05/2026 03:56:28	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1780019788521	-
iris.unpaywall.oastatus	hybrid	*
scopus.authority.ancejournal	RESEARCH METHODS IN APPLIED LINGUISTICS###2772-7661	*
scopus.category	3301	*
scopus.category	3310	*
scopus.contributor.affiliation	National Research Council Institute of Computational Linguistics	-
scopus.contributor.affiliation	National Research Council Institute of Computational Linguistics	-
scopus.contributor.affiliation	National Research Council Institute of Computational Linguistics	-
scopus.contributor.affiliation	National Research Council Institute of Computational Linguistics	-
scopus.contributor.affiliation	National Research Council Institute of Computational Linguistics	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.auid	60059627300	-
scopus.contributor.auid	36866071100	-
scopus.contributor.auid	6505755173	-
scopus.contributor.auid	57203499432	-
scopus.contributor.auid	25930765400	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.name	Chiara	-
scopus.contributor.name	Alessandra	-
scopus.contributor.name	Paola	-
scopus.contributor.name	Melissa	-
scopus.contributor.name	Davide	-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.surname	Bolognesi	-
scopus.contributor.surname	Cinini	-
scopus.contributor.surname	Cutugno	-
scopus.contributor.surname	Ferretti	-
scopus.contributor.surname	Chiarella	-
scopus.date.issued	2025	*
scopus.description.allpeopleoriginal	Bolognesi C.; Cinini A.; Cutugno P.; Ferretti M.; Chiarella D.	*
scopus.differences	scopus.subject.keywords	*
scopus.differences	scopus.description.allpeopleoriginal	*
scopus.document.type	ar	*
scopus.document.types	ar	*
scopus.funding.funders	501100000780 - European Commission; 501100000780 - European Commission;	*
scopus.funding.ids	CUP B53D23014720006;	*
scopus.identifier.doi	10.1016/j.rmal.2025.100252	*
scopus.identifier.eissn	2772-7661	*
scopus.identifier.pui	2040156059	*
scopus.identifier.scopus	2-s2.0-105014013432	*
scopus.journal.sourceid	21101160600	*
scopus.language.iso	eng	*
scopus.publisher.name	Elsevier B.V.	*
scopus.relation.article	100252	*
scopus.relation.issue	3	*
scopus.relation.volume	4	*
scopus.subject.keywords	Child-directed speech; Children's literature; Corpus linguistics; Natural language processing; Preschool children language acquisition; Written Italian;	*
scopus.title	Towards a preschooler corpus of Italian: an experimental journey	*
scopus.titleeng	Towards a preschooler corpus of Italian: an experimental journey	*
Appears in the following types:	01.01 Articolo in rivista

Files in this item:

File	Size	Format
1-s2.0-S2772766125000734-main.pdf (open access; description: publisher's version of the article; type: publisher's version, PDF; license: Creative Commons)	1.53 MB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/552644
