CNR Institutional Research Information System

Boosting synthetic data generation with effective nonlinear causal discovery

Cinquini M; Giannotti F; Guidotti R

2021

Abstract

Synthetic data generation has been widely adopted in software testing, data privacy, imbalanced learning, and artificial intelligence explanation, among other areas. In all such contexts, it is important to generate plausible data samples. A common assumption of widely used data generation approaches is that the features are independent. However, the variables of a dataset typically depend on one another, and ignoring these dependencies during data generation leads to implausible records. The main problem is that the dependencies among variables are typically unknown. In this paper, we design a synthetic dataset generator for tabular data that discovers nonlinear causalities among the variables and uses them at generation time. State-of-the-art methods for nonlinear causal discovery are typically inefficient. We boost them by restricting causal discovery to the features that appear in frequent patterns, which a pattern mining algorithm retrieves efficiently. To validate our proposal, we design a framework for generating synthetic datasets with known causalities. Extensive experiments on many synthetic datasets and on real datasets with known causalities show the effectiveness of the proposed method.
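
Illustrative note (not part of the original record): the abstract does not say which pattern mining or causal discovery algorithms are used, so the following is only a minimal sketch of the pruning idea, under stated assumptions. It discretizes each column into equal-frequency bins, counts co-occurring (feature, bin) pairs by brute force as a stand-in for a real frequent pattern miner, and keeps only the feature pairs that appear together in a frequent pattern. The names `equal_frequency_bins`, `candidate_pairs`, `n_bins`, and `min_support` are hypothetical, and the downstream nonlinear causal discovery step is deliberately left out.

```python
from collections import Counter
from itertools import combinations


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index by rank (ties are split arbitrarily)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def candidate_pairs(columns, n_bins=3, min_support=0.25):
    """Return feature pairs that co-occur in at least one frequent 2-itemset.

    Assumption: brute-force pair counting stands in for a pattern mining
    algorithm; the abstract does not specify the one actually used.
    """
    names = sorted(columns)
    n_rows = len(columns[names[0]])
    binned = {name: equal_frequency_bins(columns[name], n_bins) for name in names}

    # Each row is a transaction of (feature, bin) items; count every item pair.
    counts = Counter()
    for row in range(n_rows):
        items = [(name, binned[name][row]) for name in names]
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1

    # Keep the feature pairs behind any sufficiently frequent item pair.
    pairs = set()
    for ((feat_a, _), (feat_b, _)), count in counts.items():
        if count / n_rows >= min_support:
            pairs.add((feat_a, feat_b))
    return pairs


# Toy data: y depends nonlinearly on x, while z is a scrambled copy of the index.
xs = list(range(90))
data = {"x": xs, "y": [v * v for v in xs], "z": [(37 * v) % 90 for v in xs]}
print(candidate_pairs(data))
```

In this toy setting only the pair (x, y) survives the support threshold, so an expensive causal discovery routine would be run on one pair instead of three. The threshold and the number of bins are arbitrary choices made for the illustration.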

Year: 2021
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
ISBN: 978-1-6654-1621-4
Keywords: Data generation; Causal discovery; Pattern mining; Synthetic datasets; Explainability
Appears in types: 04.01 Contribution in conference proceedings

Files in this item:

File	Access	Description	Type	Size	Format
prod_468813-doc_189653.pdf	Authorized users only	Boosting synthetic data generation with effective nonlinear causal discovery	Publisher's version (PDF)	1.51 MB	Adobe PDF
prod_468813-doc_199627.pdf	Open access	Preprint - Boosting synthetic data generation with effective nonlinear causal discovery	Publisher's version (PDF)	1.48 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/414341
