
Extracting value from grey literature: Processes and technologies for aggregating and analyzing the hidden "Big Data" treasure of organizations

Motta G.; Puccinelli R.; Reggiani L.; Saccone M.
2016

Abstract

Grey literature can be a valuable source of information about organizations' activities and, to a certain extent, about their identity. Among the major problems that hinder its full exploitation are the heterogeneity of formats, the lack of structure, the unpredictability of content, and the size of the document bases, which can quickly become huge. The collection and mining of grey literature can be applied to individual organizations or to classes of organizations, thus enabling the analysis of trends in particular fields. To this end, some techniques can be inherited from best practices for the management of structured documents belonging to well-identified categories, but something more is needed in our case. The obvious steps are: identifying sources, collecting items, cleansing and de-duplicating contents, assigning unique and persistent identifiers, adding metadata, and augmenting the information using other sources. These phases are common to all digital libraries but, in our opinion, further steps are required in the case of grey literature in order to build document bases of value. In particular, we think that an iterative approach is the most suitable in this context: one that includes an assessment of what has been collected, in order to identify possible gaps and then start over with the collection phase. We think that big data technologies, together with information retrieval and data and text mining techniques, will play a key role in this sector. This "bag of tools" will facilitate the management, browsing and exploitation of large document bases belonging not only to a single organization but also, for example, to a large number of organizations working in a particular sector.
On the one hand, this opens new scenarios regarding the type of information that can be extracted; on the other hand, it introduces new problems regarding the homogenization of contents, formats and metadata, along with additional issues related to quality control and confidentiality protection. We believe that in this context an incremental, iterative approach would help address these problems gradually and on the basis of real cases. In this paper we describe the process that, in our opinion, should be put in place, together with a high-level ICT architecture for its dematerialization and the technologies that could be leveraged for its implementation.
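The iterative loop outlined above — collect, cleanse, de-duplicate, assign identifiers, then assess gaps and repeat the collection phase — can be sketched in a few lines. This is an illustrative assumption, not the architecture the paper proposes: the names `fingerprint` and `ingest` are invented, de-duplication is approximated with a whitespace-normalized content hash, and the counter-based identifier stands in for a real persistent-identifier service such as a handle or DOI registry.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used for de-duplication (whitespace-normalized, case-folded)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ingest(items, store):
    """Cleanse and de-duplicate one batch of collected items.

    `items` is an iterable of (source, raw_text) pairs; `store` maps a
    content fingerprint to a record carrying a (stand-in) persistent
    identifier. Returns the set of sources that contributed nothing new,
    so the next iteration of the collection phase can target those gaps.
    """
    gap_sources = {src for src, _ in items}
    for source, raw in items:
        text = raw.strip()          # minimal cleansing step
        if not text:
            continue                # empty item: nothing to keep
        fp = fingerprint(text)
        if fp in store:
            continue                # duplicate content: keep the first copy
        store[fp] = {
            "id": f"doc-{len(store) + 1:06d}",  # stand-in for a handle/DOI
            "source": source,
            "text": text,
        }
        gap_sources.discard(source)
    return gap_sources
```

A driver loop would call `ingest` once per collection round, enrich the new records with metadata, and use the returned gap set to decide which sources to re-crawl in the next round.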
ASR - Unità Performance
ASR - Direzione Generale
Istituto per il Lessico Intellettuale Europeo e Storia delle Idee - ILIESI
Dipartimento di Scienze Umane e Sociali, Patrimonio Culturale - DSU
Big data
grey literature
Files in this product:
File: prod_376185-doc_126999.pdf (authorized users only)
Description: Extracting value from grey literature
Type: Published version (PDF)
License: NOT PUBLIC - Private/restricted access
Size: 1.05 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/337654
Citations
  • Scopus: 6