CNR Institutional Research Information System

From unstructured documents to annotated information: an optimized pipeline to process industrial requirements

Nocente A.; Risi R.; Pannocchia G.; Rossetti G.

2024

Abstract

Managing bid documentation in large, evolving technology companies is inherently complex, often because of inconsistencies introduced by translations, file updates, and manual data extraction. These processes involve multiple departments (software, hardware, products, infrastructure, materials, and regulations) and require collaboration across geographically distributed teams with different native languages. The complexity is exacerbated by the need to trace requirements from bid offers through to code and product development, and to perform similarity analysis when needed. Unstructured information comes from diverse sources, such as scans and editable texts containing tables and images, written in various languages and using domain-specific terminology. Manual processing is error-prone, and translating the data can lose context-specific meanings or cause problems in safety-critical domains. This study combines Natural Language Processing (NLP) and Optical Character Recognition (OCR) to classify data as "information" or "requirement" while preserving multilingualism. A dual-pipeline approach is developed, featuring both a meta-classifier (an ensemble of Logistic Regression, Support Vector Machine, Multinomial Naive Bayes, and Random Forest) for robust and interpretable results, and a BERT model for capturing subtle linguistic patterns. The proposed pipeline is validated on a real-world case study in railway requirement annotation. To demonstrate the methodology's flexibility, a second case study addresses topic classification of newspaper articles using publicly accessible data. The pipeline's output is a software solution that uses pre-trained models tailored to the respective domains. Future developments will include a graphical user interface (GUI) that lets distributed users search and update their requirements easily and efficiently, and extract custom PDFs processed with a translator and OCR.
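As an illustration of the classical branch of the dual pipeline, the following is a minimal sketch of an ensemble over the four base classifiers named in the abstract, labelling sentences as "information" or "requirement". It is not the authors' implementation: the use of scikit-learn, TF-IDF features, hard (majority) voting, the hyperparameters, and the toy sentences are all assumptions made for this example, and the paper may combine the base models differently (for instance by stacking).

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up sentences in two languages (NOT from the paper's railway dataset).
texts = [
    "The braking system shall stop the train within 800 metres.",
    "The supplier must provide maintenance manuals in Italian.",
    "Il sistema deve registrare ogni evento di allarme.",
    "The door controller shall report faults to the driver display.",
    "This chapter gives an overview of the signalling architecture.",
    "The line was opened in 1998 and serves twelve stations.",
    "Questo documento descrive il contesto del progetto.",
    "Figure 3 shows the layout of the depot.",
]
labels = ["requirement"] * 4 + ["information"] * 4


def build_ensemble():
    """Build a TF-IDF + majority-vote pipeline (hypothetical configuration)."""
    voters = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC()),
        ("nb", MultinomialNB()),  # valid here because TF-IDF features are non-negative
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
    # Hard voting: each base model casts one vote per sentence.
    return make_pipeline(TfidfVectorizer(), VotingClassifier(voters, voting="hard"))


model = build_ensemble()
model.fit(texts, labels)

# No translation step: sentences are classified in their original language.
predictions = model.predict([
    "The system shall log every alarm.",
    "This section describes the depot.",
])
print(list(predictions))
```

Hard voting is used here because `LinearSVC` does not expose class probabilities; a soft-voting or stacked variant would need probability-calibrated base models.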

Year: 2024
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
ISBN: 979-8-4007-1738-3
Keywords: Multilingual Requirement Classification; Natural Language Processing; Railway Requirement Management; Tender Unstructured Documentation
Publication type: 04.01 Conference proceedings contribution

Files in this record:

File	Access	Type	License	Size	Format
3711542.3711576.pdf	Open access	Publisher's version (PDF)	Creative Commons	1.26 MB	Adobe PDF

File description: From Unstructured Documents to Annotated Information: An Optimized Pipeline to Process Industrial Requirements

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/563101
