Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian
Panuzzi A; Picchi E; Moneglia M
2004
Abstract
The automatic lemmatization and morpho-syntactic annotation of spoken language is a fairly recent and complex task for Natural Language Processing. State-of-the-art methods developed on written corpora do not provide a satisfactory level of analysis for spontaneous spoken language (Uchimoto et al., 2002; Moreno & Guirao, 2003). The Italian C-ORAL-ROM spontaneous speech corpus has been tagged with Part of Speech (PoS) and morpho-syntactic information by adapting an existing tool trained on Italian written resources (PiTagger, developed by Eugenio Picchi, ILC-CNR Pisa). The effect of the spoken domain on performance remains within a 10% error rate, as detected in the manual evaluation procedure. Some issues specific to spoken language emerged. Significant contexts for PoS statistics have to be defined by utterance boundaries; moreover, a series of phenomena related to prosodic parsing proved relevant: fragmentation phenomena; a relative lack of information for all words adjacent to utterance boundaries; and under-specification of PoS for words connected to secondary prosodic breaks and in one-word utterances.

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | - |
| dc.authority.people | Panuzzi A | it |
| dc.authority.people | Picchi E | it |
| dc.authority.people | Moneglia M | it |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contribution in conference proceedings | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.date.accessioned | 2024/02/19 17:51:19 | - |
| dc.date.available | 2024/02/19 17:51:19 | - |
| dc.date.issued | 2004 | - |
| dc.description.affiliations | CNR ILC Pisa | - |
| dc.description.allpeople | Panuzzi, A; Picchi, E; Moneglia, M | - |
| dc.description.allpeopleoriginal | Panuzzi A., Picchi E., Moneglia M. | - |
| dc.description.fulltext | none | en |
| dc.description.numberofauthors | 3 | - |
| dc.identifier.isbn | 2-9517408-1-6 | - |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/64234 | - |
| dc.identifier.url | http://www.lrec-conf.org/lrec2004/ | - |
| dc.language.iso | eng | - |
| dc.publisher.country | FRA | - |
| dc.publisher.name | European Language Resources Association (ELRA) - Evaluations and Language resources Distribution Agency (ELDA) | - |
| dc.publisher.place | Paris | - |
| dc.relation.conferencedate | 26-28 May 2004 | - |
| dc.relation.conferencename | LREC 2004: Fourth International Conference on Language Resources and Evaluation | - |
| dc.relation.conferenceplace | Lisbon | - |
| dc.relation.firstpage | 563 | - |
| dc.relation.ispartofbook | Proceedings of LREC 2004: Fourth International Conference on Language Resources and Evaluation | - |
| dc.relation.lastpage | 566 | - |
| dc.relation.numberofpages | 4 | - |
| dc.subject.keywords | Lemmatization | - |
| dc.subject.keywords | PoS Tagging | - |
| dc.subject.singlekeyword | Lemmatization | * |
| dc.subject.singlekeyword | PoS Tagging | * |
| dc.title | Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Conference contribution::04.01 Contribution in conference proceedings | it |
| dc.type.miur | 273 | - |
| dc.type.referee | Yes, but type not specified | - |
| dc.ugov.descaux1 | 84613 | - |
| iris.orcid.lastModifiedDate | 2024/04/04 15:58:11 | * |
| iris.orcid.lastModifiedMillisecond | 1712239091490 | * |
| iris.sitodocente.maxattempts | 1 | - |

Appears in types: 04.01 Contribution in conference proceedings