Data Driven Theory to Support Model Formulation and the Design of New Experiments
Murari A
2019
Abstract
Until recently, science has progressed mainly in a hypothesis-driven way: starting from already established theories, new models have been developed mathematically and then tested, with the aim of falsifying them, in specifically designed experiments. This methodology has been very successful in the study of deterministic linear phenomena, but its limitations are evident in the case of complex systems such as high-temperature plasmas. In particular, when dealing with complex non-linear phenomena, the interpretation of the experimental evidence is a delicate task. The high complexity and uncertainties of Tokamaks, for example, dramatically increase the difficulties faced by traditional analysis techniques, which are rigid and have poor exploratory capability. Consequently, a lot of untapped knowledge, which could profitably be used to formulate models and plan experiments, remains buried in the large databases that have been collected. The present contribution provides an overview of recent developments in machine learning (ML) and statistics that address two of the most challenging issues in data analysis for the science of complex systems: data-driven model building and the design of new experiments. The extraction of mathematical models directly from cross-sectional data is a great challenge in the case of large databases such as that of JET, which is now approaching 0.5 petabytes. A new approach to data-driven theory, called Symbolic Regression (SR) via Genetic Programming (GP), has recently been developed to address problems for which it is difficult to develop models based on first principles. It is based on the manipulation of symbols, namely mathematical expressions, with genetic algorithms. Typical applications of SR via GP are the extraction of scaling laws and the identification of dimensionless quantities.
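The idea of manipulating mathematical expressions with genetic algorithms can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the implementation used in the work summarised here: the operator set, the tuple-based tree encoding, the parsimony penalty, the population settings and the synthetic target `x*x + x` are all hypothetical choices made only for illustration.

```python
import operator
import random

# Binary operators available to the expression trees (an illustrative choice).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_tree(rng, depth=3):
    """Grow a random expression tree over the variable 'x' and small constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.6 else float(rng.randint(-2, 2))
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    """Evaluate a tree: 'x' is the variable, floats are constants, tuples are operations."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def size(tree):
    """Number of nodes in the tree."""
    return 1 + size(tree[1]) + size(tree[2]) if isinstance(tree, tuple) else 1

def fitness(tree, xs, ys):
    """Mean squared error plus a small parsimony penalty on tree size (lower is better)."""
    mse = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + 0.001 * size(tree)

def random_subtree(rng, tree):
    """Walk down the tree at random and return the subtree where the walk stops."""
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = tree[rng.choice((1, 2))]
    return tree

def replace_random(rng, tree, new):
    """Return a copy of tree with one randomly chosen node replaced by new."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return new
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random(rng, left, new), right)
    return (op, left, replace_random(rng, right, new))

def evolve(xs, ys, pop_size=100, generations=30, seed=0):
    """Elitist GP loop: keep the best quarter, refill by crossover and mutation."""
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, xs, ys))
        elite = population[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            # Crossover: graft a random subtree of one parent into the other.
            donor = random_subtree(rng, rng.choice(elite))
            child = replace_random(rng, rng.choice(elite), donor)
            if rng.random() < 0.3:
                # Mutation: replace a random node with a fresh random subtree.
                child = replace_random(rng, child, random_tree(rng, depth=2))
            if size(child) <= 40:  # reject bloated trees
                children.append(child)
        population = elite + children
    return min(population, key=lambda t: fitness(t, xs, ys))

# Synthetic data from a hypothetical target law y = x*x + x.
xs = [-2.0 + 0.2 * i for i in range(21)]
ys = [x * x + x for x in xs]
best = evolve(xs, ys)
print(best, fitness(best, xs, ys))
```

The parsimony term in `fitness` is one simple way of trading accuracy against complexity; practical SR tools typically use richer operator sets, constant optimisation and more careful model selection criteria.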
The deployment of this approach to study large databases (devoted to the investigation of the energy confinement time, the L-H power threshold, etc.) has shown that the traditional power laws are not necessarily the best mathematical forms to represent the data, and has helped to clarify the limitations of the most widely used non-dimensional parameters. Traditional ML and statistical tools are predicated on the assumption that the data are sampled independently from the same distribution function in the training set and in the final application. Their results are therefore strictly valid only for data acquired in absolutely stationary conditions. A typical situation in which these hypotheses are violated is the planning of new experiments: the available models have to be applied to new regions of the operational space that are not represented in the previous data. A new genetic programming procedure has been developed to extract the most appropriate candidate models from past data and to identify the best region of the operational space in which to falsify theories and plan new experiments. In addition to exhaustive numerical tests to prove the generality of the techniques, specific applications to ITPA databases and to data from metallic Tokamaks will be provided.
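The two points above, that a power law may not be the best functional form and that a model fitted in one region can fail in an unexplored one, can be illustrated numerically. The sketch below uses purely synthetic data: the saturating law `x / (1 + x)`, the training range and the "new" range are hypothetical and are not taken from any Tokamak database.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**a, done as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    a = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    c = math.exp(my - a * mx)
    return c, a

def rmse(model, xs, ys):
    """Root mean squared error of model against the data."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def truth(x):
    # Hypothetical saturating law, used only to generate synthetic data.
    return x / (1.0 + x)

# "Past experiments" cover x in [0.1, 1.0]; the "new experiment" explores [5, 10].
train_x = [0.1 + 0.05 * i for i in range(19)]
new_x = [5.0 + 0.5 * i for i in range(11)]
train_y = [truth(x) for x in train_x]
new_y = [truth(x) for x in new_x]

c, a = fit_power_law(train_x, train_y)

def power_law(x):
    return c * x ** a

train_rmse = rmse(power_law, train_x, train_y)
new_rmse = rmse(power_law, new_x, new_y)
print(f"fitted exponent a = {a:.3f}")
print(f"RMSE on training region = {train_rmse:.4f}")
print(f"RMSE on unexplored region = {new_rmse:.4f}")
```

In this toy setting the power law describes the training region reasonably well but keeps growing where the underlying law saturates, so its error in the unexplored region is much larger. Regions where candidate models disagree most strongly in this way are natural places to look when the aim is to discriminate between them.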