Deterministic learning and an application in optimal control

C. Cervellera; M. Muselli
2006

Abstract

In a wide variety of real-world situations a functional dependence y = g(x) has to be estimated from a set of observations (x^L, y^L) = {(x_l, y_l), l = 0, ..., L − 1} concerning a phenomenon of interest. This is the case when the behavior of a continuous signal has to be forecast from its previous history, or when the value of an unmeasurable quantity has to be inferred from measurements of other related variables. If an insufficient amount of a priori information about the form of the functional dependence g is available, its estimation involves two different actions: 1. first, a sufficiently large class Γ of functions must be properly selected (model selection); 2. then, the best element g ∈ Γ must be retrieved by adopting a suitable optimization algorithm (training phase).

The model selection task is usually performed by adopting a very general paradigm whose complexity can be controlled by acting on a small number of constant values. For example, the usual polynomial series expansion can approximate every measurable function arbitrarily well, i.e., polynomials are universal approximators. However, by including in Γ only the functions whose polynomial series expansion contains no terms with exponent greater than a prescribed maximum k, we can control the richness of the class Γ. In particular, if k = 1 only linear functions are included in Γ, if k = 2 the expansion can realize only linear and quadratic functions, and so on. Other general paradigms have been extensively used for model selection: neural networks have been shown to possess the universal approximation property [1, 2, 3, 4] and have been successfully applied in many different fields. In this case, the complexity of the class Γ can be controlled by acting on the architecture of the network (the number of layers and the number of neurons in the feedforward structure). Once the class Γ has been chosen, the optimization algorithm to be employed in the training phase is selected accordingly. For example, the back-propagation technique (and its modifications) is often adopted to retrieve the function in Γ that best fits the collection (x^L, y^L) of observations at our disposal, usually called the training set.

However, the basic goal to be pursued is to obtain a function g that generalizes well, i.e., that behaves correctly even at points of the domain not included in the training set. How can it be guaranteed that the element of Γ that best fits our observations also generalizes well? This is a fundamental question in the context of learning theory. Since in many practical situations the input vectors x_l cannot be freely chosen and the training set (x^L, y^L) can be corrupted by noise, most results in learning theory are based on a statistical framework, which arose in the pattern recognition community [5, 6, 7, 8] and has been naturally extended to other inductive problems, such as regression estimation [9, 10, 11] and probability density reconstruction [10]. In this framework, called Statistical Learning (SL), the input vectors x_l, for l = 0, ..., L − 1, are viewed as realizations of a random variable generated according to an unknown (but fixed) probability density p. On the other hand, there are several cases where the position of the points x_l in the input space can be suitably selected for the problem at hand. If a deterministic algorithm is employed to choose the input vectors x_l, SL is no longer the most appropriate approach.
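To make the interplay between model complexity and generalization concrete, the following minimal sketch fits polynomial classes of increasing degree k to noisy observations with randomly drawn (SL-style) inputs and measures the error on points outside the training set. The target function, noise level, and sample size are illustrative assumptions, not taken from the chapter.

```python
# Minimal sketch (illustrative assumptions): the degree k of a polynomial class
# controls the richness of Gamma and, with it, the generalization behavior.
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.sin(2 * np.pi * x)                    # hypothetical "true" dependence

L = 30                                              # size of the training set
x_train = rng.uniform(0.0, 1.0, L)                  # SL-style random inputs
y_train = g(x_train) + 0.1 * rng.normal(size=L)     # noisy observations

x_test = np.linspace(0.0, 1.0, 500)                 # points outside the training set

for k in (1, 3, 10):                                # richness of the class Gamma
    coeffs = np.polyfit(x_train, y_train, deg=k)    # training phase (least squares)
    y_hat = np.polyval(coeffs, x_test)
    gen_err = np.mean((y_hat - g(x_test)) ** 2)     # empirical generalization error
    print(f"degree k={k:2d}  mean squared error on unseen points: {gen_err:.4f}")
```

With too small a k the class Γ is not rich enough to approximate g, while with too large a k the estimate starts fitting the noise; the generalization bounds discussed in the chapter quantify this trade-off.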
In this case a new framework, called Deterministic Learning (DL), is able to capture the peculiarities of the situation at hand, thus providing precise conditions on the generalization ability of the function g ∈ Γ that best fits the observations of the training set (x^L, y^L). This chapter presents a survey of DL, comparing its results with those obtained by standard SL. In particular, basic quantities, such as variation and discrepancy, are introduced, pointing out their central role in the derivation of upper bounds for the generalization error that decrease as 1/L (apart from logarithmic factors) with the size L of the training set. This behavior outperforms the corresponding result obtained by SL, where a convergence rate of 1/√L has been derived.

An important application of DL concerns system control and, specifically, the solution of multistage stochastic optimization (MOS) problems, a particular kind of Markovian decision process. In such problems, the aim is to minimize a cost that depends on the evolution of a system, affected by random disturbances, over a horizon of either a finite or an infinite number of stages. This very classic framework is widely employed in many different contexts, such as Economics, Artificial Intelligence, Engineering, etc. Since in most practical situations optimal control and cost functions cannot be obtained in analytical form, a numerical approach is needed to solve MOS problems. The standard tool is Dynamic Programming (DP), introduced by Bellman [12], as documented by the large number of studies devoted to this method over the years. The basic idea underlying the DP procedure is to define, at each stage, a function, commonly named the cost-to-go or value function, which quantifies the cost that has to be paid from that stage to the end of the time horizon. In this way, it is possible to transform the MOS problem into a sequence of simpler static optimization subproblems, which can be solved recursively. The basics of the recursive solution adopted in DP are introduced and discussed in several classic references (see, for example, [12, 13, 14]). Among the most recent surveys on DP techniques and Markov decision processes in general, two excellent monographs are [15, 16].

Although efficient variations of the DP procedure exist for the "deterministic" version of the MOS problem, such as Differential Dynamic Programming [17], the general approach followed to implement DP in a practical situation requires choosing, for each stage, a number of sampling points in the d-dimensional state space and then approximating the cost-to-go functions outside these points. In this way, solving the original MOS problem implies the reconstruction of several functional dependencies, one for each stage. The most common sampling technique used in the literature is the "full uniform" grid, i.e., the uniform discretization of each component of the state space into a fixed number of values. This clearly leads to an exponential growth of the number of points, commonly known as the curse of dimensionality: if each of the d components of the state space is discretized by means of m equally spaced values, the number of points of the grid is equal to m^d. A non-exponential complexity can be obtained by adopting finer sampling schemes, such as those proposed by SL and DL.
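As a rough illustration of the sampling-based DP scheme described above, the sketch below performs a backward recursion over a finite horizon, approximating the cost-to-go function at L sample points of a one-dimensional state space and interpolating it elsewhere. The dynamics, stage cost, disturbance distribution, and horizon are hypothetical; the chapter's actual formulation and approximators may differ.

```python
# Minimal sketch (illustrative assumptions throughout): backward DP recursion
# for a finite-horizon stochastic problem, with the cost-to-go approximated
# at L sample points of a one-dimensional state space.
import numpy as np

T = 5                                           # number of stages (assumed)
L = 50                                          # sample points per stage
x_samples = np.linspace(-1.0, 1.0, L)           # sampled state space
controls = np.linspace(-0.5, 0.5, 11)           # admissible controls (assumed)
noise_vals = np.array([-0.1, 0.0, 0.1])         # discrete disturbance (assumed)
noise_prob = np.array([0.25, 0.5, 0.25])

def dynamics(x, u, w):
    return np.clip(0.9 * x + u + w, -1.0, 1.0)  # hypothetical system

def stage_cost(x, u):
    return x ** 2 + 0.1 * u ** 2                # hypothetical cost

# cost_to_go[t] holds the approximate value function at the sample points
cost_to_go = np.zeros((T + 1, L))               # terminal cost assumed zero

for t in range(T - 1, -1, -1):                  # backward recursion
    for i, x in enumerate(x_samples):
        best = np.inf
        for u in controls:
            x_next = dynamics(x, u, noise_vals)
            # expected future cost via interpolation of next-stage samples
            future = noise_prob @ np.interp(x_next, x_samples, cost_to_go[t + 1])
            best = min(best, stage_cost(x, u) + future)
        cost_to_go[t, i] = best

print("approximate cost-to-go at x=0, stage 0:",
      np.interp(0.0, x_samples, cost_to_go[0]))
```

With d state components sampled on a full uniform grid of m values each, the inner loop would have to visit m^d points per stage, which is exactly the curse of dimensionality that the finer SL and DL sampling schemes aim to avoid.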
In the former case, a uniform probability density is employed to generate the training points x_l, l = 0, ..., L − 1; in the latter approach the points x_l are selected by a deterministic algorithm, which is able to guarantee an almost linear rate of convergence. Numerical simulations confirm the superiority of DL over SL when dealing with complex MOS problems solved through the DP procedure.
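The following sketch contrasts the two sampling strategies on a simple numerical integration task, in the spirit of Koksma-Hlawka-type bounds that relate the error to the product of the variation of the integrand and the discrepancy of the point set. A Halton sequence is used here as one common example of a deterministic low-discrepancy design; the specific deterministic algorithm analyzed in the chapter may differ. Dimension, sample size, and test function are illustrative.

```python
# Minimal sketch (not the chapter's algorithm): L uniformly random points versus
# L points of a deterministic low-discrepancy (Halton) sequence on [0,1]^d,
# compared through the error of a simple integral whose exact value is 1.
import numpy as np

def halton(n, d, primes=(2, 3, 5, 7, 11, 13)):
    """First n points of a d-dimensional Halton sequence via radical inversion."""
    pts = np.empty((n, d))
    for j in range(d):
        base = primes[j]
        for i in range(n):
            f, x, k = 1.0, 0.0, i + 1
            while k > 0:
                f /= base
                x += f * (k % base)
                k //= base
            pts[i, j] = x
    return pts

d, L = 4, 1024
# integrand with integral 1 over the unit hypercube (illustrative choice)
f = lambda x: np.prod(0.5 * np.pi * np.sin(np.pi * x), axis=1)

rng = np.random.default_rng(0)
err_random = abs(f(rng.random((L, d))).mean() - 1.0)   # SL-style uniform sampling
err_halton = abs(f(halton(L, d)).mean() - 1.0)         # DL-style deterministic sampling

print(f"|error| with {L} random points : {err_random:.5f}")
print(f"|error| with {L} Halton points : {err_halton:.5f}")
```

On smooth integrands the deterministic points typically yield a noticeably smaller error than the random ones at the same L, mirroring the almost linear versus 1/√L convergence rates compared in the chapter.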
2006
Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni - IEIIT
ISBN: 0-12-014782-3


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/139418