Frontiers in Data Analysis Methods: from Causality Detection to Data Driven Experimental Design

Murari, A

Until recently, science has progressed mainly in a hypothesis-driven way: based on established theories, new models have been developed mathematically and then subjected to falsification with specifically designed experiments. This methodology has been very successful in the past, but its limitations become evident when it is applied to open systems such as high temperature plasmas, owing to their complexity, uncertainties and nonlinearities. As a result, a lot of untapped knowledge remains buried in large data warehouses, a consequence of the lack of adequate mathematical tools for data mining. The present contribution provides an overview of recent developments in machine learning (ML) and statistics that address three of the most challenging issues in data analysis for the science of complex systems: causality detection, data-driven modelling and the design of new experiments. In plasma physics, the majority of the signals are time series, whose causal interrelations are very often not straightforward to determine. Recently, however, an impressive series of methodologies has been devised to quantify the strength of causal influence between time series; they range from information flow and transfer entropy to complex networks and diffeomorphisms on embedded manifolds. Their application to ICRH modulation and ELM pacing experiments has proved the role of the fast ion slowing down time in the triggering of sawteeth and has quantified the efficiency of ELM triggering with pellets. For disruption prediction, the developed tools have provided unprecedented success rates on JET (errors of a few per thousand). The extraction of mathematical models from cross-sectional data is a great challenge in the case of large databases such as that of JET, which is now approaching 0.5 Petabytes.
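As an illustration of one of the causality measures named above, transfer entropy between two time series can be estimated with a simple binned (plug-in) estimator. The following is only a minimal sketch on synthetic data, not the tools applied to the JET signals: the function name `transfer_entropy`, the choice of quantile binning with `n_bins=4`, the history length of one sample and the toy signals `x` and `y` are all assumptions made for the example.

```python
import numpy as np

def transfer_entropy(source, target, n_bins=4):
    """Plug-in estimate (in bits) of the transfer entropy source -> target.

    Assumptions of this sketch: history length of one sample and
    equal-population (quantile) binning of both signals.
    """
    def discretize(values):
        # Interior quantiles as bin edges, so labels run from 0 to n_bins - 1.
        edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
        return np.digitize(values, edges)

    s = discretize(np.asarray(source, dtype=float))
    t = discretize(np.asarray(target, dtype=float))
    y_next, y_now, x_now = t[1:], t[:-1], s[:-1]

    # Empirical joint distribution p(y_next, y_now, x_now).
    joint = np.zeros((n_bins, n_bins, n_bins))
    np.add.at(joint, (y_next, y_now, x_now), 1.0)
    joint /= joint.sum()

    p_now_src = joint.sum(axis=0)       # p(y_now, x_now)
    p_next_now = joint.sum(axis=2)      # p(y_next, y_now)
    p_now = joint.sum(axis=(0, 2))      # p(y_now)

    te = 0.0
    for i, j, k in zip(*np.nonzero(joint)):
        # p(y_next | y_now, x_now) / p(y_next | y_now), written with joint terms.
        ratio = joint[i, j, k] * p_now[j] / (p_now_src[j, k] * p_next_now[i, j])
        te += joint[i, j, k] * np.log2(ratio)
    return te

# Synthetic example (hypothetical signals): x drives y with a one-sample delay.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + 0.2 * rng.normal(size=n - 1)

te_xy = transfer_entropy(x, y)  # driving direction
te_yx = transfer_entropy(y, x)  # reverse direction, only estimator bias expected
print(f"TE x->y: {te_xy:.3f} bits, TE y->x: {te_yx:.3f} bits")
```

On this toy pair the estimate in the driving direction should come out clearly larger than in the reverse direction. With real signals, the bin count, the history length and the finite-sample bias of the estimator would all need careful treatment.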
A new data-driven theory approach, called Symbolic Regression (SR) via Genetic Programming (GP), has recently been developed to address problems for which it is difficult to derive models from first principles. Typical applications of SR via GP are the extraction of scaling laws and the identification of dimensionless quantities. The deployment of these tools to study large databases has shown that the traditional power laws are not the best mathematical forms to represent the data of the energy confinement time, motivating a complete revision of the underlying scale invariance assumptions. The developed methodologies have also allowed the scaling laws for the Stellarator configuration to be unified without recourse to any renormalization factor. This new scaling can therefore be extrapolated to the reactor domain, proving that the configuration is competitive with the Tokamak in the L mode. SR via GP has also brought into question the traditional form of the dimensionless quantities derived from the Vlasov equation, emphasizing the contribution of non-plasma physics effects (such as atomic physics) and broken symmetries. Traditional ML and statistical tools are predicated on the assumption that the data are independently sampled from completely stationary systems. A typical violation of this assumption arises in the planning of new experiments, where the available models have to be applied to new regions of the operational space not represented in the previous data. A new genetic programming procedure has been finalised to identify, from past data, the best region of the operational space in which to plan new experiments, with potential savings of the order of even 50% of the experimental time.
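The question of whether a pure power law is the best functional form for a scaling can be illustrated with a much simpler stand-in for SR via GP: fitting two hand-picked candidate forms and comparing them with an information criterion. The sketch below is not symbolic regression and uses no real confinement data. The variables `x1` and `x2`, the generating formula, the exponential correction term and the use of the Akaike Information Criterion (AIC) are all assumptions chosen for the example.

```python
import numpy as np

def fit_log_linear(design, log_y):
    """Ordinary least squares in log space; returns coefficients and AIC."""
    coeffs, *_ = np.linalg.lstsq(design, log_y, rcond=None)
    residuals = log_y - design @ coeffs
    n_samples, n_params = design.shape
    rss = float(residuals @ residuals)
    # AIC for Gaussian residuals, up to an additive constant.
    aic = n_samples * np.log(rss / n_samples) + 2 * n_params
    return coeffs, aic

# Synthetic "database" (hypothetical): the true law is NOT a pure power law,
# because of the exp(-0.3 * x2) factor.
rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(0.5, 3.0, n)
x2 = rng.uniform(0.5, 3.0, n)
noise = np.exp(0.03 * rng.normal(size=n))
y = 1.5 * x1**0.7 * x2**0.4 * np.exp(-0.3 * x2) * noise

log_y = np.log(y)
ones = np.ones(n)

# Candidate 1: power law, y = C * x1^a1 * x2^a2.
power_law_design = np.column_stack([ones, np.log(x1), np.log(x2)])
# Candidate 2: power law with an exponential correction, y = C * x1^b1 * x2^b2 * exp(b3 * x2).
extended_design = np.column_stack([ones, np.log(x1), np.log(x2), x2])

power_coeffs, power_aic = fit_log_linear(power_law_design, log_y)
ext_coeffs, ext_aic = fit_log_linear(extended_design, log_y)

print(f"power law AIC: {power_aic:.1f}, exponents: {power_coeffs[1:]}")
print(f"extended  AIC: {ext_aic:.1f}, exponents: {ext_coeffs[1:]}")
```

A lower AIC for the extended form indicates that the data favour a non-power-law expression even after penalising the extra parameter. SR via GP differs in that it searches over the space of functional forms itself, rather than comparing a fixed pair chosen in advance.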