CNR Institutional Research Information System

Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian

Panuzzi A; Picchi E; Moneglia M

2004

Abstract

The automatic lemmatization and morpho-syntactic annotation of spoken language is a fairly recent and complex task for Natural Language Processing. State-of-the-art methods developed on written corpora do not provide a satisfactory level of analysis for spontaneous spoken language (Uchimoto et al., 2002; Moreno & Guirao, 2003). The Italian C-ORALROM spontaneous speech corpus has been tagged with Part of Speech (PoS) and morpho-syntactic information by using and adapting an existing tool trained on Italian written resources (PiTagger, developed by Eugenio Picchi, ILC-CNR Pisa). The impact of the spoken domain on performance is within 10% of errors, as detected in the manual evaluation procedure. Some issues specific to spoken language emerged. Significant contexts for PoS statistics need to be defined by utterance boundaries; moreover, the relevance of a series of phenomena related to prosodic parsing has been highlighted: fragmentation phenomena; a relative lack of information for all words adjacent to utterance boundaries; and under-specification of PoS for words in connection with secondary prosodic breaks and one-word utterances.

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	-
dc.authority.people	Panuzzi A	it
dc.authority.people	Picchi E	it
dc.authority.people	Moneglia M	it
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contribution in conference proceedings	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.date.accessioned	2024/02/19 17:51:19	-
dc.date.available	2024/02/19 17:51:19	-
dc.date.issued	2004	-
dc.description.abstracteng	The automatic lemmatization and morpho-syntactic annotation of spoken language is a fairly recent and complex task for Natural Language Processing. State-of-the-art methods developed on written corpora do not provide a satisfactory level of analysis for spontaneous spoken language (Uchimoto et al., 2002; Moreno & Guirao, 2003). The Italian C-ORALROM spontaneous speech corpus has been tagged with Part of Speech (PoS) and morpho-syntactic information by using and adapting an existing tool trained on Italian written resources (PiTagger, developed by Eugenio Picchi, ILC-CNR Pisa). The impact of the spoken domain on performance is within 10% of errors, as detected in the manual evaluation procedure. Some issues specific to spoken language emerged. Significant contexts for PoS statistics need to be defined by utterance boundaries; moreover, the relevance of a series of phenomena related to prosodic parsing has been highlighted: fragmentation phenomena; a relative lack of information for all words adjacent to utterance boundaries; and under-specification of PoS for words in connection with secondary prosodic breaks and one-word utterances.	-
dc.description.affiliations	CNR ILC Pisa	-
dc.description.allpeople	Panuzzi, A; Picchi, E; Moneglia, M	-
dc.description.allpeopleoriginal	Panuzzi A., Picchi E., Moneglia M.	-
dc.description.fulltext	none	en
dc.description.numberofauthors	3	-
dc.identifier.isbn	2-9517408-1-6	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/64234	-
dc.identifier.url	http://www.lrec-conf.org/lrec2004/	-
dc.language.iso	eng	-
dc.publisher.country	FRA	-
dc.publisher.name	European Language Resources Association (ELRA) - Evaluations and Language resources Distribution Agency (ELDA)	-
dc.publisher.place	Paris	-
dc.relation.conferencedate	26-27-28 May 2004	-
dc.relation.conferencename	LREC 2004: Fourth International Conference on Language Resources and Evaluation	-
dc.relation.conferenceplace	Lisbon	-
dc.relation.firstpage	563	-
dc.relation.ispartofbook	Proceedings: in LREC 2004: Fourth International Conference on Language Resources and Evaluation	-
dc.relation.lastpage	566	-
dc.relation.numberofpages	4	-
dc.subject.keywords	Lemmatization	-
dc.subject.keywords	Pos Tagging	-
dc.subject.singlekeyword	Lemmatization	*
dc.subject.singlekeyword	Pos Tagging	*
dc.title	Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Conference contribution::04.01 Contribution in conference proceedings	it
dc.type.miur	273	-
dc.type.referee	Yes, but type not specified	-
dc.ugov.descaux1	84613	-
iris.orcid.lastModifiedDate	2024/04/04 15:58:11	*
iris.orcid.lastModifiedMillisecond	1712239091490	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.01 Contribution in conference proceedings

Files in this product:

There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/64234
