A Comparison between Fixed-Basis and Variable-Basis Schemes for Function Approximation and Functional Optimization
Abstract
Fixed-basis and variable-basis approximation schemes are compared for the problems of function approximation and functional optimization (also known as infinite programming). Classes of problems are investigated for which variable-basis schemes with sigmoidal computational units perform better than fixed-basis ones, in terms of the minimum number of computational units needed to achieve a desired error in function approximation or approximate optimization. Previously known bounds on the accuracy are extended, with better rates, to families of d-variable functions whose actual dependence is on a subset of d′ ≪ d variables, where the indices of these d′ variables are not known a priori.
1. Introduction
In functional optimization problems, also known as infinite programming problems, functionals have to be minimized with respect to functions belonging to subsets of function spaces. This family of problems includes function-approximation problems, the classical problems of the calculus of variations [1], and, more generally, all optimization tasks in which one has to find a function that is optimal in a sense specified by a cost functional. Such functions may express, for example, the routing strategies in communication networks, the decision functions in optimal control and economic problems, and the input/output mappings of devices that learn from examples.
Experience has shown that optimization of functionals over admissible sets of functions made up of linear combinations of relatively few basis functions with a simple structure and depending nonlinearly on a set of “inner” parameters (e.g., feedforward neural networks with one hidden layer and linear output activation units) often provides surprisingly good suboptimal solutions. In such approximation schemes, each function depends on both external parameters (the coefficients of the linear combination) and inner parameters (the ones inside the basis functions). These are examples of variable-basis approximators, since the basis functions are not fixed but their choice depends on that of the inner parameters. In contrast, classical approximation schemes (such as the Ritz method in the calculus of variations [1]) do not use inner parameters but employ fixed basis functions, and the corresponding approximators exhibit only a linear dependence on the external parameters. They are therefore called fixed-basis or linear approximators. In [2], certain variable-basis approximators were applied to obtain approximate solutions to functional optimization problems. This technique was later formalized as the extended Ritz method (ERIM) [3] and was motivated by the innovative and successful application of feedforward neural networks in the late 1980s. For experimental results and theoretical investigations about the ERIM, see [2–7] and the references therein.
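The contrast between the two parametrizations can be made concrete with a minimal sketch, given below in Python with NumPy; the choice of tanh as the sigmoid, the polynomial basis, and all numerical values are illustrative assumptions, not taken from the cited references.

    import numpy as np

    def fixed_basis(x, coeffs, basis):
        # Fixed-basis (linear) scheme: f(x) = sum_k c_k * h_k(x).
        # The basis functions h_k are chosen once and for all; the only
        # free parameters are the external coefficients c_k.
        return sum(c * h(x) for c, h in zip(coeffs, basis))

    def variable_basis(x, coeffs, A, b, c0=0.0):
        # Variable-basis scheme with sigmoidal units (one-hidden-layer
        # perceptron with linear output): f(x) = sum_k c_k * tanh(a_k . x + b_k) + c0.
        # Besides the external coefficients c_k, each unit carries inner
        # parameters a_k and b_k, so the basis itself adapts to the target.
        return np.tanh(A @ x + b) @ coeffs + c0

    x = np.array([0.2, 0.5, 0.9])
    polynomial_basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[0] * x[1]]
    print(fixed_basis(x, [0.5, -1.0, 2.0], polynomial_basis))

    rng = np.random.default_rng(0)
    A, b, c = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
    print(variable_basis(x, c, A, b))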
The basic motivation to search for suboptimal solutions of these forms is quite intuitive: when the number of basis functions becomes sufficiently large, the convergence of the sequence of suboptimal solutions to an optimal one may be ensured by suitable properties of the set of basis functions, the admissible set of functions, and the functional to be optimized [1, 5, 8]. Computational feasibility requirements (i.e., memory occupancy and time needed to find sufficiently good values for the parameters) make it crucial to estimate the minimum number of computational units needed by an approximator to guarantee that suboptimal solutions are “sufficiently close” to an optimal one. Such a number plays the role of “model complexity” of the approximator and can be studied with tools from linear and nonlinear approximation theory [9, 10].
Compared with fixed-basis approximators, in variable-basis ones the nonlinear parametrization of the basis functions may cause the loss of useful properties of best-approximation operators [11], such as uniqueness, homogeneity, and continuity, but it may allow improved rates of approximation or approximate optimization [9, 12–14]. Hence, to justify the use of variable-basis schemes instead of fixed-basis ones, it is crucial to investigate families of function-approximation and functional optimization problems for which, for a given desired accuracy, variable-basis schemes require a smaller number of computational units than fixed-basis ones. This is the aim of this work.
In this paper, the approximate solution of certain function-approximation and functional optimization problems via fixed- and variable-basis schemes is investigated. In particular, families of problems are presented for which variable-basis schemes of a certain kind perform better than any fixed-basis one, in terms of the minimum number of computational units needed to achieve a desired worst-case error. Propositions 2.4, 2.7, 2.8, and 3.2 are the main contributions; they are presented after an exposition of results available in the literature.
The paper is organized as follows. Section 2 compares variable- and fixed-basis approximation schemes for function-approximation problems, which are particular instances of functional optimization. Section 3 extends the estimates to some more general families of functional optimization problems through the concepts of modulus of continuity and modulus of convexity of a functional. Section 4 is a short discussion.
2. Comparison of Bounds for Fixed- and Variable-Basis Approximation
Here and in the following, the “big O,” “big Ω,” and “big Θ” notations [18] are used. For two functions f, g : (0, +∞) → ℝ, one writes f = O(g) if and only if there exist M > 0 and x0 > 0 such that |f(x)| ≤ M | g(x)| for all x > x0, f = Ω(g) if and only if g = O(f), and f = Θ(g) if and only if both f = O(g) and f = Ω(g) hold. In order to be able to use such notations also for multivariable functions, in the following it is assumed that all their arguments are fixed with the exception of one of them (more precisely, the argument ɛ).
Two approaches have been adopted in the literature to compare the approximation capabilities of fixed- and variable-basis approximation schemes (see also [15] for a discussion on this topic). In the first one, one fixes the family of functions to be approximated (e.g., the unit ball in a Sobolev space [16]); then one finds bounds on the worst-case approximation error for functions belonging to such a family, for various approximation schemes (fixed- and variable-basis ones). The second approach, initiated by Barron [12, 17], fixes a variable-basis approximation scheme (e.g., the set of one-hidden-layer perceptrons with a given upper bound on the number of sigmoidal computational units) and searches for families of functions that are well approximated by such a scheme. Then, for these families of functions, the approximation capability of the variable-basis approximation scheme is compared with those of fixed-basis approximation schemes. In this context, one is interested in finding cases in which, for the same number of computational units, the upper bounds on the worst-case approximation error for certain variable-basis schemes are smaller than the corresponding lower bounds for any fixed-basis one, implying that such variable-basis schemes have better approximation capabilities than every fixed-basis one.
One problem of the first approach is that, for certain families of smooth functions to be approximated, the bounds on the worst-case approximation error obtained for fixed- and variable-basis approximation schemes are very similar. In particular, typically one obtains the so-called Jackson rate of approximation [4] n = Θ(ɛ−d/m), where n is the number of computational units, ɛ > 0 is the worst-case approximation error, m is a measure of smoothness, and d is the number of variables on which such functions depend. Following the second approach, it was shown in [12, 17] that, for certain function-approximation problems, variable-basis schemes exhibit some advantages over fixed-basis ones (see Sections 2.1 and 2.2, where extensions of some results from [12, 17] are also derived).
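The difference between the two kinds of rates can be appreciated with a back-of-the-envelope computation; the smoothness m, the target error, and the dimension-independent reference rate of order ɛ^(−2) (such as the one implied by Theorem 2.1 below) are illustrative assumptions, and the two rates refer to different families of functions, so only orders of magnitude are meaningful.

    # Rough comparison of the two model-complexity requirements
    # (multiplicative constants are ignored).
    m, eps = 1, 0.1  # assumed smoothness and target worst-case error

    for d in (2, 5, 10, 20):
        n_jackson = eps ** (-d / m)   # Jackson-type rate for fixed-basis schemes
        n_dimfree = eps ** (-2)       # dimension-independent rate of order eps^(-2)
        print(f"d = {d:2d}: Jackson-type ~ {n_jackson:.1e} units, "
              f"dimension-independent ~ {n_dimfree:.1e} units")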
In Section 2.1, some bounds in the ℒ2-norm are considered, whereas Section 2.2 investigates bounds in the supnorm. Estimates in the ℒ2-norm can be applied, for example, to investigate the approximation of the optimal policies in static team optimization problems [19]. Estimates in the supnorm are required, for example, to investigate the approximation of the optimal policies in dynamic optimization problems with a finite number of stages [20]. Indeed, for such problems, the supnorm can be used to analyze the error propagation from one stage to the next one, while this is not the case for the ℒ2-norm [20]. Moreover, it provides guarantees on the approximation errors in the design of the optimal decision laws.
2.1. Bounds in the ℒ2-Norm
For a probability measure μ on B, we denote by ℒ2(B, μ) the Hilbert space of functions g : B → ℝ with inner product ⟨g1, g2⟩ℒ2(B,μ) ≔ ∫B g1 g2 dμ and induced norm ‖g‖ℒ2(B,μ) ≔ (∫B g² dμ)^(1/2). When there is no risk of confusion, the simpler notation ‖·‖ is used instead of ‖·‖ℒ2(B,μ).
Theorem 2.1 (see [12], Theorem 1.) For every f ∈ ΓB,C, every sigmoidal function σ : ℝ → ℝ, every probability measure μ on B, and every n ≥ 1, there exist ak ∈ ℝd, bk, ck ∈ ℝ, and fn : B → ℝ of the form
fn(x) = c0 + c1 σ(a1 · x + b1) + ⋯ + cn σ(an · x + bn) (2.4)
such that
‖f − fn‖ℒ2(B,μ) ≤ 2C/√n. (2.6)
In contrast to this, Theorem 2.2 from [12] shows that, when B is the unit hypercube [0,1] d and μ = μu is the uniform probability measure on [0,1] d, for the same set of functions ΓB,C the best linear approximation scheme requires Ω(ɛ−d) computational units in order to achieve the same worst-case approximation error ɛ. The set of all linear combinations of n fixed basis functions h1, h2, …, hn in a linear space is denoted by span (h1, h2, …, hn).
Theorem 2.2 (see [12], Theorem 6.)For every n ≥ 1 and every choice of fixed basis functions h1, h2, …, hn ∈ ℒ2([0,1] d, μu), one has
Remark 2.3. Inspection of the proof of [12, Theorem 6] shows that the factors 1/8 and 1/n, which appear in the original statement of the theorem, have to be replaced by 1/16 and 1/2n in (2.7), respectively.
Inspection of the proof of Theorem 2.2 in [12] also shows that the lower bound (2.7) still holds if the set is replaced by either
It should be noted that, for fixed C and ɛ, the estimate (2.6) is constant with respect to d, whereas the estimate (2.10) goes to 0 as d goes to +∞. Hence, for large d the lower bound (2.10) for fixed-basis approximation may be so small that the theoretical advantage of variable-basis approximation becomes of little practical use, since it would be guaranteed only for sufficiently small ɛ (depending on C, too). In the following, families of d-variable functions are considered for which this drawback is mitigated. These are families of d-variable functions whose actual dependence is on a subset of d′ ≪ d variables, where the indices of these d′ variables are not known a priori.
Such families are of interest, for example, in machine learning applications for problems with redundant or correlated features. In this context, each of the d real variables represents a feature (e.g., a measure of some physical property of an object), and one is interested in learning a function of these features on the basis of a set of supervised examples. As often happens in applications, only a small subset of the features is useful for the specific task (typically, classification or regression), due to the presence of redundant or correlated features. Then, one may assume that the function to be learned depends only on a subset of d′ ≪ d features, but one may not know a priori which particular subset it is. The problem of finding such a subset (or of finding a subset of features of sufficiently small cardinality d′ on which the function mostly depends, when the function depends on all the d features) is called the feature-selection problem [22].
For d′ a positive integer and d a multiple of d′, denotes the subset of functions in that depend only on d′ of their d arguments.
Proposition 2.4. For every n ≥ 1 and every choice of fixed basis functions h1, h2, …, hn ∈ ℒ2([0,1] d, μu), for n ≤ (d + 1)/2 one has
Proof. The proof is similar to the one of [12, Theorem 6]. The following is a list of the changes to that proof that are needed to derive (2.11) and (2.12). We denote by ∥l∥0 the number of nonzero components of the multi-index l. Proceeding as in the proof of [12, Theorem 6], we get
Then we get
In the following, we apply (2.14) and (2.15) for m = 1 and m > 1, respectively. For m = 1, the condition becomes
Now, as in the proof of [12, Theorem 6], for m > 1 we exploit a bound derived from Stirling’s formula, according to which , so the condition holds if we impose
Remark 2.5. The quantity d′ in Proposition 2.4 has to be interpreted as an effective number of variables for the family of functions to be approximated. Roughly speaking, the flexibility of the neural network architecture (2.4) allows one to identify, for each such function, the d′ variables on which it actually depends, whereas fixed-basis approximation schemes do not have this flexibility. Indeed, unlike the lower bound (2.10), for fixed C, ɛ, and d′ the lower bound (2.20) goes to +∞ as d goes to +∞. Finally, remarks similar to those in Remark 2.3 apply to Proposition 2.4.
2.2. Bounds in the Supnorm
The next result is from [17] and is analogous to Theorem 2.1, but it measures the worst-case approximation error in the supnorm.
Theorem 2.6 (see [17], Theorem 2.)For every f ∈ ΓB,C and every n ≥ 1, there exists fn : B → ℝ of the form (2.4) such that
Upper bounds in the supnorm similar to the one from Theorem 2.6 are given, for example, in [24, 25]. Moreover, for , the following estimate holds.
Proposition 2.7. For every and every n ≥ 1, there exists fn : [0,1] d → ℝ of the form (2.4) such that
Proof. Each function depends on d′ arguments; let be their indices. Let be defined by , where , and all the other components of x are arbitrary in . Then , so by Theorem 2.6 there exists an approximation made up of n sigmoidal computational units and a constant term such that . Finally, we observe that can be extended to a function fn : [0,1] d → ℝ of the form (2.4) such that ; then one obtains (2.22).
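The extension step at the end of the proof can be visualized with a short sketch (Python with NumPy; the dimensions, the tanh sigmoid, and the index set are illustrative assumptions): an approximant built on the d′ relevant coordinates is turned into a function of all d coordinates by placing its inner weights at the relevant indices and zeros elsewhere, which keeps it of the form (2.4) and does not change its values.

    import numpy as np

    def embed_weights(A_small, relevant_idx, d):
        # Place the inner weights (defined on the d' relevant coordinates)
        # at their positions among the d coordinates, with zeros on the
        # d - d' irrelevant ones, so the extended units ignore them.
        A_full = np.zeros((A_small.shape[0], d))
        A_full[:, relevant_idx] = A_small
        return A_full

    rng = np.random.default_rng(1)
    d, d_small, n_units = 6, 2, 3
    relevant_idx = [1, 4]                       # indices of the d' relevant variables
    A_small = rng.normal(size=(n_units, d_small))
    b, c = rng.normal(size=n_units), rng.normal(size=n_units)

    x = rng.uniform(size=d)                     # a point in [0, 1]^d
    out_restricted = np.tanh(A_small @ x[relevant_idx] + b) @ c
    out_extended = np.tanh(embed_weights(A_small, relevant_idx, d) @ x + b) @ c
    assert np.isclose(out_restricted, out_extended)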
The next proposition, combined with Theorem 2.6 and Proposition 2.7, allows one to compare the approximation capabilities of fixed- and variable-basis schemes in the supnorm, showing cases for which the upper bounds (2.21) and (2.22) are smaller than one of the corresponding lower bounds (2.24)–(2.26), at least for n sufficiently large.
Proposition 2.8. For every n ≥ 1 and every choice of fixed, bounded, and μu-measurable basis functions h1, h2, …, hn : [0,1] d → ℝ, the following hold.
- (i) For the approximation of functions in , one has the lower bound (2.24).
- (ii) For the approximation of functions in , one has the lower bound (2.25) for n ≤ (d + 1)/2 and the lower bound (2.26) for n > (d + 1)/2.
Proof. For each bounded and μu-measurable function g : [0,1] d → ℝ, we get
Similar remarks as in Remark 2.3 can be made about the bounds in the supnorm derived in this section.
3. Application to Functional Optimization Problems
The results of Section 2 can be extended, with the same rates of approximation or similar ones, to the approximate solution of certain functional optimization problems. This can be done by exploiting the concepts of modulus of continuity and modulus of convexity of a functional, provided that continuity and uniform convexity assumptions are satisfied. The basic ideas are the following (see also [5] for a similar analysis).
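To fix ideas, a standard formulation of the two notions is the following (the exact definitions and normalizations adopted in (3.7) and in Sections 3.1 and 3.2 may differ slightly; this is only a sketch of the mechanism). For a functional Φ defined on a subset X of a normed space 𝒳, a modulus of continuity is a nondecreasing function ω : [0, +∞) → [0, +∞) such that |Φ(f) − Φ(g)| ≤ ω(‖f − g‖) for all f, g ∈ X; Φ is uniformly convex on a convex set X with modulus of convexity δ : (0, +∞) → (0, +∞) if
Φ((f + g)/2) ≤ (1/2) Φ(f) + (1/2) Φ(g) − δ(‖f − g‖) for all f, g ∈ X with f ≠ g.
If f∘ minimizes Φ over X, the first notion converts approximation errors into errors in approximate optimization, since ‖fn − f∘‖ ≤ ɛ implies Φ(fn) − Φ(f∘) ≤ ω(ɛ); the second one works in the opposite direction, since taking g = f∘ and using Φ((fn + f∘)/2) ≥ Φ(f∘) gives δ(‖fn − f∘‖) ≤ (1/2)(Φ(fn) − Φ(f∘)), so a small error in approximate optimization forces fn to be close to the minimizer f∘.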
3.1. Rates of Approximate Optimization in Terms of the Modulus of Continuity
3.2. Rates of Approximate Optimization in Terms of the Modulus of Convexity
Remark 3.1. In [29], a greedy algorithm is proposed to construct a sequence of sets Xn corresponding to variable-basis schemes and functions that achieve the rate (3.15) for certain uniformly convex functional optimization problems. Such an algorithm can be interpreted as an extension to functional optimization of the greedy algorithm proposed in [12] for function approximation by sigmoidal neural networks.
Finally, it should be noted that the rate (3.15) is achieved in general by imposing some structure on the sets X and Xn. For instance, the set X in [29] is the convex hull of some set of functions G ⊂ 𝒳, that is,
X = conv(G) = {c1 g1 + ⋯ + cm gm : m a positive integer, g1, …, gm ∈ G, c1, …, cm ≥ 0, c1 + ⋯ + cm = 1}.
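The incremental construction behind such greedy schemes can be illustrated with the following sketch (Python with NumPy); it is not the algorithm of [29], and the dictionary, the objective, and the step-size rule αn = 2/(n + 1) are illustrative assumptions. At each step, the current iterate is mixed with the dictionary element that most decreases the objective, so that after n steps the iterate is a convex combination of at most n elements of G.

    import numpy as np

    def greedy_over_convex_hull(objective, dictionary, n_steps):
        # Incremental greedy scheme over conv(G): at step n, pick an element
        # g of the dictionary G and update
        #   f_n = (1 - alpha_n) * f_{n-1} + alpha_n * g,  alpha_n = 2 / (n + 1),
        # choosing g so as to minimize the objective of the mixture.
        f = dictionary[0].copy()
        for n in range(2, n_steps + 1):
            alpha = 2.0 / (n + 1)
            g = min(dictionary, key=lambda h: objective((1 - alpha) * f + alpha * h))
            f = (1 - alpha) * f + alpha * g
        return f

    # Illustrative use: functions discretized on a grid and a quadratic
    # objective Phi(f) = mean((f - target)^2), which is uniformly convex.
    grid = np.linspace(0.0, 1.0, 200)
    target = np.sin(2 * np.pi * grid)
    dictionary = [np.tanh(a * grid + b) for a in (-8, -4, 4, 8) for b in (-4, 0, 4)]
    phi = lambda f: float(np.mean((f - target) ** 2))
    print(f"objective after 20 steps: {phi(greedy_over_convex_hull(phi, dictionary, 20)):.4f}")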
3.3. Comparison between Fixed- and Variable-Basis Schemes for Functional Optimization
The following proposition is obtained by combining the results derived in Sections 2.1, 3.1, and 3.2.
Proposition 3.2. Let the functional Φ be Lipschitz continuous with Lipschitz constant KΦ and uniformly convex with modulus of convexity of the form (3.7), let X = ΓB,C, let μ be any probability measure on B, let 𝒳 = ℒ2(B, μ), and suppose that there exists a minimizer f∘. Then the following hold.
- (i) For every n ≥ 1 there exists fn of the form (2.4) satisfying (3.20). For each such fn one has (3.21), and if a function of the form (2.4) satisfies (3.22), then it also satisfies (3.23).
- (ii) For B = [0,1] d, μu equal to the uniform probability measure on [0,1] d, every n ≥ 1, and every choice of fixed basis functions h1, …, hn ∈ ℒ2([0,1] d, μu), there exists a uniformly convex functional (such a functional can also be chosen to be Lipschitz continuous, but this is not needed in the inequalities (3.24)–(3.29), since they do not involve the Lipschitz constant) with modulus of convexity of the form (3.7) and with a minimizer such that, for every 0 < χ < 1, one has (3.24) and (3.25).
- (iii) Statements (i) and (ii) still hold when the set ΓB,C is replaced by its subset of functions depending on only d′ of the d variables, for d a multiple of d′. The only difference is that the estimates (3.24) and (3.25) are replaced, respectively, by (3.26) and (3.27) for n ≤ (d + 1)/2 and by (3.28) and (3.29) for n > (d + 1)/2.
Proof. (i) The estimate (3.20) follows by Theorem 2.1. The bound (3.21) follows by (3.20), the definition of modulus of continuity, and the assumption of Lipschitz continuity of Φ. Finally, (3.23) is obtained by property (3.11) of the modulus of convexity and its expression (3.7).
(ii) The estimate (3.24) comes from Theorem 2.2: the constant χ is introduced in order to remove the supremum with respect to in formula (2.7) and replace it with the choice , where is any function that achieves the bound (2.7) up to the constant factor χ. The estimate (3.25) follows from (3.24), (3.11), and (3.7), by choosing as any functional that is uniformly convex with modulus of convexity of the form (3.7) and such that .
(iii) The estimates (3.20), (3.21), and (3.23) still hold when the set ΓB,C is replaced by , since for B = [0,1] d, whereas formulas (3.26)–(3.29) are obtained in the same way as formulas (3.24) and (3.25), by applying Proposition 2.4 instead of Theorem 2.2.
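To see how the two ingredients interact in part (i), the following worked chain of inequalities may help; it is written under the simplifying assumption of a quadratic modulus of convexity δ(t) = c t² with c > 0, so the exponents are only illustrative of the mechanism, not of the exact form (3.7). If fn of the form (2.4) satisfies the bound of Theorem 2.1 applied to f∘ ∈ ΓB,C, then the Lipschitz continuity of Φ gives
Φ(fn) − Φ(f∘) ≤ KΦ ‖fn − f∘‖ℒ2(B,μ) ≤ 2 KΦ C/√n,
whereas, for any g ∈ X, uniform convexity and the minimality of f∘ give
c ‖g − f∘‖ℒ2(B,μ)² ≤ (1/2)(Φ(g) − Φ(f∘)).
Thus an error in approximate optimization of order n^(−1/2) only guarantees that the distance from the minimizer is of order n^(−1/4), which illustrates why the rates of the error in approximate optimization and of the associated approximation error differ.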
4. Discussion
Classes of function-approximation and functional optimization problems have been investigated for which, for a given desired error, certain variable-basis approximation schemes with sigmoidal computational units require fewer parameters than fixed-basis ones. Previously known bounds on the accuracy have been extended, with better rates, to families of functions whose effective number of variables d′ is much smaller than the number of their arguments d.
Proposition 3.2 shows that there is a strict connection between certain problems of function approximation and functional optimization. Indeed, for these two classes of problems, the approximation error rates for the first class can be converted into rates of approximate optimization for the second one, and vice versa. In particular, for d > 2, , and any linear approximation scheme span {h1, h2, …, hn}, the estimates (3.21) and (3.25) show families of functional optimization problems for which the error in approximate optimization with variable-basis schemes of sigmoidal type is smaller than the one associated with the linear scheme. For d′ > 2 and , a similar remark can be made for the estimates (3.21) and (3.27) and for the bounds (3.21) and (3.29). Finally, the bound (3.23) shows that, for large n, any approximate minimizer of the form (2.4) differs only slightly from the true minimizer f∘, even though the error in approximate optimization (3.22) and the associated approximation error (3.23) have different rates. In contrast, the estimates (3.24), (3.26), and (3.28) show that, for any linear approximation scheme span {h1, h2, …, hn}, there exists a functional optimization problem whose minimizer cannot be approximated with the same accuracy by the linear scheme.
The results presented in the paper provide some theoretical justification for the use of variable-basis approximation schemes (instead of fixed-basis ones) in function approximation and functional optimization.
Acknowledgment
The author was partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Adaptive State Estimation and Optimal Control.”