CNR Institutional Research Information System

Automatic coding of open-ended surveys using text categorization techniques

Giorgetti D;Sebastiani F;Prodanof I

2003

Abstract

Open-ended questions do not limit respondents' answers in terms of linguistic form and semantic content, but bring about severe problems in terms of cost and speed, since their coding requires trained professionals to manually identify and tag meaningful text segments. To overcome these problems, a few automatic approaches have been proposed in the past, some based on matching the answer with textual descriptions of the codes, others based on manually building rules that check the answer for the presence or absence of code-revealing words. While the former approach is scarcely effective, the major drawback of the latter approach is that the rules need to be developed manually, and before the actual observation of text data. We propose a new approach, inspired by work in information retrieval (IR), that overcomes these drawbacks. In this approach survey coding is viewed as a task of multiclass text categorization (MTC), and is tackled through techniques originally developed in the field of supervised machine learning. In MTC each text belonging to a given corpus has to be classified into exactly one from a set of predefined categories. In the supervised machine learning approach to MTC, a set of categorization rules is built automatically by learning the characteristics that a text should have in order to be classified under a given category. Such characteristics are automatically learnt from a set of training examples, i.e. a set of texts whose category is known. For survey coding, we equate the set of codes with categories, and all the collected answers to a given question with texts. Giorgetti and Sebastiani have carried out automatic coding experiments with two different supervised learning techniques, one based on a naïve Bayesian method and the other based on multiclass support vector machines. Experiments have been run on a corpus of social surveys carried out by the National Opinion Research Center, University of Chicago (NORC). These experiments show that our methods outperform, in terms of accuracy, previous automated methods tested on the same corpus.
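
To make the framing above concrete, the following is a minimal sketch of survey coding treated as multiclass text categorization. It is not the authors' implementation: it assumes a plain multinomial naive Bayes classifier with add-one smoothing and a whitespace tokenizer, and the class name `NaiveBayesCoder`, the example answers, and the codes `PAY`, `MANAGEMENT`, and `LOCATION` are all hypothetical. It only illustrates the idea that codes play the role of categories and answers play the role of texts.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


class NaiveBayesCoder:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, answers, codes):
        # Count how often each code occurs and which words occur under it.
        self.code_counts = Counter(codes)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for answer, code in zip(answers, codes):
            tokens = tokenize(answer)
            self.word_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, answer):
        # Words never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(answer) if t in self.vocab]
        total = sum(self.code_counts.values())
        best_code, best_score = None, -math.inf
        for code, n in self.code_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / total)
            denom = sum(self.word_counts[code].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[code][t] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        # Exactly one code is returned per answer (the multiclass setting).
        return best_code


# Hypothetical training examples: answers whose code is already known.
train_answers = [
    "the pay is too low",
    "low salary and no raises",
    "my boss never listens",
    "management ignores staff",
    "long commute every day",
    "the office is too far from home",
]
train_codes = ["PAY", "PAY", "MANAGEMENT", "MANAGEMENT", "LOCATION", "LOCATION"]

coder = NaiveBayesCoder().fit(train_answers, train_codes)
print(coder.predict("salary is low"))
print(coder.predict("my boss ignores everyone"))
```

In this sketch the categorization rule is learnt entirely from the coded examples, so no rule has to be written by hand before the answers are observed. A real experiment would need a much larger training set and held-out answers to measure accuracy.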

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	-
dc.authority.orgunit	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	-
dc.authority.people	Giorgetti D	it
dc.authority.people	Sebastiani F	it
dc.authority.people	Prodanof I	it
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.appartenenza.mi	973	*
dc.date.accessioned	2024/02/21 05:23:01	-
dc.date.available	2024/02/21 05:23:01	-
dc.date.issued	2003	-
dc.description.abstracteng	Open-ended questions do not limit respondents' answers in terms of linguistic form and semantic content, but bring about severe problems in terms of cost and speed, since their coding requires trained professionals to manually identify and tag meaningful text segments. To overcome these problems, a few automatic approaches have been proposed in the past, some based on matching the answer with textual descriptions of the codes, others based on manually building rules that check the answer for the presence or absence of code-revealing words. While the former approach is scarcely effective, the major drawback of the latter approach is that the rules need to be developed manually, and before the actual observation of text data. We propose a new approach, inspired by work in information retrieval (IR), that overcomes these drawbacks. In this approach survey coding is viewed as a task of multiclass text categorization (MTC), and is tackled through techniques originally developed in the field of supervised machine learning. In MTC each text belonging to a given corpus has to be classified into exactly one from a set of predefined categories. In the supervised machine learning approach to MTC, a set of categorization rules is built automatically by learning the characteristics that a text should have in order to be classified under a given category. Such characteristics are automatically learnt from a set of training examples, i.e. a set of texts whose category is known. For survey coding, we equate the set of codes with categories, and all the collected answers to a given question with texts. Giorgetti and Sebastiani have carried out automatic coding experiments with two different supervised learning techniques, one based on a naïve Bayesian method and the other based on multiclass support vector machines. Experiments have been run on a corpus of social surveys carried out by the National Opinion Research Center, University of Chicago (NORC). These experiments show that our methods outperform, in terms of accuracy, previous automated methods tested on the same corpus.	-
dc.description.affiliations	CNR-ILC, Pisa, Italy; CNR-ISTI, Pisa, Italy; CNR-ILC, Pisa, Italy	-
dc.description.allpeople	Giorgetti, D; Sebastiani, F; Prodanof, I	-
dc.description.allpeopleoriginal	Giorgetti D.; Sebastiani F.; Prodanof I.	-
dc.description.fulltext	restricted	en
dc.description.numberofauthors	3	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/57593	-
dc.language.iso	eng	-
dc.relation.conferencedate	17-19 September 2003	-
dc.relation.conferencename	The Impact of Technology on the Survey Process. Fourth International Conference on Survey and Statistical Computing	-
dc.relation.conferenceplace	The University of Warwick, England, UK	-
dc.relation.firstpage	173	-
dc.relation.lastpage	184	-
dc.subject.keywords	Automatic coding	-
dc.subject.singlekeyword	Automatic coding	*
dc.title	Automatic coding of open-ended surveys using text categorization techniques	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
dc.type.referee	Sì, ma tipo non specificato	-
dc.ugov.descaux1	91138	-
iris.mediafilter.data	2025/04/20 02:48:49	*
iris.orcid.lastModifiedDate	2024/04/04 19:32:38	*
iris.orcid.lastModifiedMillisecond	1712251958024	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

File	Size	Format
prod_91138-doc_123308.pdf (authorized users only). Description: Automatic coding of open-ended surveys using text categorization techniques. Type: Publisher's version (PDF)	142.61 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/57593
