Ain't that sweet. Reflections on scene level indexing and annotation in the House Corpus Project

Davide Taibi
2019

Abstract

This paper outlines the strategies, rationale and potential uses motivating the construction of the House Corpus, a one-million-word corpus that authorised users can access through the MWSWeb site (Taibi et al. 2015a) at http://openmws.itd.cnr.it. Part 1 illustrates the tools and techniques used to index the corpus data, namely transcriptions of all 177 episodes of the House M.D. series (original US version). In particular, it describes the commercially available Elasticsearch (https://www.elastic.co), used as an indexing, annotation and search tool. Part 2 explains that this is a multimedia corpus in which different types of scene can be viewed. The 6000-plus scenes in the corpus have been annotated in terms of their typological features: Location type (e.g. patient's hospital room, medical lab); Event type (e.g. handover, differential diagnosis, precipitating medical event, patient examination); and Character Group type (e.g. doctor/doctor, doctor/patient, doctor/caregiver, patient/caregiver). The project envisages the development of various retrieval interfaces, initially Words, Scenes and Dialogues. These will make it possible to search by type of scene, and to examine the distribution of scene types across the corpus, without necessarily involving any other form of searching. Part 3 suggests the value of multimedia corpora in encouraging students to advance their critical discourse analysis (CDA) skills. As an example, it shows how the corpus can illustrate the priority of (inter)textual over lexicogrammatical considerations when formulating tag questions in oral discourse. Finally, the Discussion section argues that a typology of scenes appears to be an essential prerequisite for constructing other types of access to the corpus data in subsequent stages of the project.
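
The scene-level annotation described in the abstract can be pictured with a minimal sketch. It assumes each scene is stored as one Elasticsearch document carrying the three typological features as exact-match fields. All field names (location_type, event_type, character_group_type, transcript, episode) and the sample records are hypothetical illustrations, not the project's actual schema or data. The sketch only builds plain dictionaries in the shape of an Elasticsearch mapping and a bool/filter query, so it runs without a server or client library.

```python
from collections import Counter

# Hypothetical index mapping: the paper does not publish its schema.
# "keyword" fields hold exact annotation labels; "text" is full-text searchable.
SCENE_MAPPING = {
    "mappings": {
        "properties": {
            "episode": {"type": "keyword"},
            "location_type": {"type": "keyword"},
            "event_type": {"type": "keyword"},
            "character_group_type": {"type": "keyword"},
            "transcript": {"type": "text"},
        }
    }
}


def scene_type_query(**scene_types):
    """Build a query body that selects scenes by annotation labels alone,
    i.e. without any word-level search on the transcript."""
    filters = [
        {"term": {field: value}} for field, value in sorted(scene_types.items())
    ]
    return {"query": {"bool": {"filter": filters}}}


def distribution(scenes, field):
    """Count how often each label of one annotation field occurs."""
    return Counter(scene[field] for scene in scenes)


# Invented sample records, for illustration only (not corpus data).
SAMPLE_SCENES = [
    {"episode": "ep-a", "location_type": "medical lab",
     "event_type": "differential diagnosis",
     "character_group_type": "doctor/doctor"},
    {"episode": "ep-a", "location_type": "patient's hospital room",
     "event_type": "patient examination",
     "character_group_type": "doctor/patient"},
    {"episode": "ep-b", "location_type": "medical lab",
     "event_type": "differential diagnosis",
     "character_group_type": "doctor/doctor"},
]

if __name__ == "__main__":
    print(scene_type_query(event_type="differential diagnosis",
                           character_group_type="doctor/doctor"))
    print(distribution(SAMPLE_SCENES, "event_type"))
```

Under these assumptions, a Scenes interface could send the dictionary returned by scene_type_query as the body of a search request, while distribution mirrors the kind of per-type counts that a terms aggregation would return on the server side.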
Istituto per le Tecnologie Didattiche - ITD - Sede Genova
Keywords: House Corpus; indexing; scene annotation; functionality planning; CDA
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/424414