A Comparison between Fixed-Basis and Variable-Basis Schemes for Function Approximation and Functional Optimization
Abstract
Fixed-basis and variable-basis approximation schemes are compared for the problems of function approximation and functional optimization (also known as infinite programming). Classes of problems are investigated for which variable-basis schemes with sigmoidal computational units perform better than fixed-basis ones, in terms of the minimum number of computational units needed to achieve a desired error in function approximation or approximate optimization. Previously known bounds on the accuracy are extended, with better rates, to families of d-variable functions whose actual dependence is on a subset of d′ ≪ d variables, where the indices of these d′ variables are not known a priori.
1. Introduction
In functional optimization problems, also known as infinite programming problems, functionals have to be minimized with respect to functions belonging to subsets of function spaces. This family of problems includes function-approximation problems, the classical problems of the calculus of variations [1], and, more generally, all optimization tasks in which one has to find a function that is optimal in a sense specified by a cost functional. Such functions may express, for example, the routing strategies in communication networks, the decision functions in optimal control and economic problems, and the input/output mappings of devices that learn from examples.
Experience has shown that optimization of functionals over admissible sets of functions made up of linear combinations of relatively few basis functions with a simple structure and depending nonlinearly on a set of “inner” parameters (e.g., feedforward neural networks with one hidden layer and linear output activation units) often provides surprisingly good suboptimal solutions. In such approximation schemes, each function depends on both external parameters (the coefficients of the linear combination) and inner parameters (the ones inside the basis functions). These are examples of variable-basis approximators, since the basis functions are not fixed but their choice depends on that of the inner parameters. In contrast, classical approximation schemes (such as the Ritz method in the calculus of variations [1]) do not use inner parameters but employ fixed basis functions, and the corresponding approximators exhibit only a linear dependence on the external parameters. They are therefore called fixed-basis or linear approximators. In [2], certain variable-basis approximators were applied to obtain approximate solutions to functional optimization problems. This technique was later formalized as the extended Ritz method (ERIM) [3] and was motivated by the innovative and successful application of feedforward neural networks in the late 1980s. For experimental results and theoretical investigations about the ERIM, see [2–7] and the references therein.
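The contrast between the two parametrizations can be made concrete with a minimal sketch, given below in Python with NumPy; the choice of tanh as the sigmoid, the polynomial basis, and all numerical values are illustrative assumptions, not taken from the cited references.

    import numpy as np

    def fixed_basis(x, coeffs, basis):
        # Fixed-basis (linear) scheme: f(x) = sum_k c_k * h_k(x).
        # The basis functions h_k are chosen once and for all; the only
        # free parameters are the external coefficients c_k.
        return sum(c * h(x) for c, h in zip(coeffs, basis))

    def variable_basis(x, coeffs, A, b, c0=0.0):
        # Variable-basis scheme with sigmoidal units (one-hidden-layer
        # perceptron with linear output): f(x) = sum_k c_k * tanh(a_k . x + b_k) + c0.
        # Besides the external coefficients c_k, each unit carries inner
        # parameters a_k and b_k, so the basis itself adapts to the target.
        return np.tanh(A @ x + b) @ coeffs + c0

    x = np.array([0.2, 0.5, 0.9])
    polynomial_basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[0] * x[1]]
    print(fixed_basis(x, [0.5, -1.0, 2.0], polynomial_basis))

    rng = np.random.default_rng(0)
    A, b, c = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
    print(variable_basis(x, c, A, b))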
The basic motivation to search for suboptimal solutions of these forms is quite intuitive: when the number of basis functions becomes sufficiently large, the convergence of the sequence of suboptimal solutions to an optimal one may be ensured by suitable properties of the set of basis functions, the admissible set of functions, and the functional to be optimized [1, 5, 8]. Computational feasibility requirements (i.e., memory occupancy and time needed to find sufficiently good values for the parameters) make it crucial to estimate the minimum number of computational units needed by an approximator to guarantee that suboptimal solutions are “sufficiently close” to an optimal one. Such a number plays the role of “model complexity” of the approximator and can be studied with tools from linear and nonlinear approximation theory [9, 10].
Compared with fixed-basis approximators, in variable-basis ones the nonlinear parametrization of the basis functions may cause the loss of useful properties of best-approximation operators [11], such as uniqueness, homogeneity, and continuity, but it may allow improved rates of approximation or approximate optimization [9, 12–14]. Hence, to justify the use of variable-basis schemes instead of fixed-basis ones, it is crucial to investigate families of function-approximation and functional optimization problems for which, for a given desired accuracy, variable-basis schemes require a smaller number of computational units than fixed-basis ones. This is the aim of this work.
In this paper, the approximate solution of certain function-approximation and functional optimization problems via fixed- and variable-basis schemes is investigated. In particular, families of problems are presented for which variable-basis schemes of a certain kind perform better than any fixed-basis one, in terms of the minimum number of computational units needed to achieve a desired worst-case error. Propositions 2.4, 2.7, 2.8, and 3.2 are the main contributions; they are presented after an exposition of results available in the literature.
The paper is organized as follows. Section 2 compares variable- and fixed-basis approximation schemes for function-approximation problems, which are particular instances of functional optimization. Section 3 extends the estimates to some more general families of functional optimization problems through the concepts of modulus of continuity and modulus of convexity of a functional. Section 4 is a short discussion.
2. Comparison of Bounds for Fixed- and Variable-Basis Approximation
Here and in the following, the “big O,” “big Ω,” and “big Θ” notations [18] are used. For two functions f, g : (0, +∞) → ℝ, one writes f = O(g) if and only if there exist M > 0 and x0 > 0 such that |f(x)| ≤ M | g(x)| for all x > x0, f = Ω(g) if and only if g = O(f), and f = Θ(g) if and only if both f = O(g) and f = Ω(g) hold. In order to be able to use such notations also for multivariable functions, in the following it is assumed that all their arguments are fixed with the exception of one of them (more precisely, the argument ɛ).
Two approaches have been adopted in the literature to compare the approximation capabilities of fixed- and variable-basis approximation schemes (see also [15] for a discussion on this topic). In the first one, one fixes the family of functions to be approximated (e.g., the unit ball in a Sobolev space [16]); then one finds bounds on the worst-case approximation error for functions belonging to such a family, for various approximation schemes (fixed- and variable-basis ones). The second approach, initiated by Barron [12, 17], fixes a variable-basis approximation scheme (e.g., the set of one-hidden-layer perceptrons with a given upper bound on the number of sigmoidal computational units) and searches for families of functions that are well approximated by such a scheme. Then, for these families of functions, the approximation capability of the variable-basis approximation scheme is compared with those of fixed-basis approximation schemes. In this context, one is interested in finding cases in which, for the same number of computational units, the upper bounds on the worst-case approximation error for certain variable-basis schemes are smaller than the corresponding lower bounds for any fixed-basis one, implying that such variable-basis schemes have better approximation capabilities than every fixed-basis one.
One problem of the first approach is that, for certain families of smooth functions to be approximated, the bounds on the worst-case approximation error obtained for fixed- and variable-basis approximation schemes are very similar. In particular, typically one obtains the so-called Jackson rate of approximation [4] n = Θ(ɛ−d/m), where n is the number of computational units, ɛ > 0 is the worst-case approximation error, m is a measure of smoothness, and d is the number of variables on which such functions depend. Following the second approach, it was shown in [12, 17] that, for certain function-approximation problems, variable-basis schemes exhibit some advantages over fixed-basis ones (see Sections 2.1 and 2.2, where extensions of some results from [12, 17] are also derived).
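The difference between the two kinds of rates can be appreciated with a back-of-the-envelope computation; the smoothness m, the target error, and the dimension-independent reference rate of order ɛ^(−2) (such as the one implied by Theorem 2.1 below) are illustrative assumptions, and the two rates refer to different families of functions, so only orders of magnitude are meaningful.

    # Rough comparison of the two model-complexity requirements
    # (multiplicative constants are ignored).
    m, eps = 1, 0.1  # assumed smoothness and target worst-case error

    for d in (2, 5, 10, 20):
        n_jackson = eps ** (-d / m)   # Jackson-type rate for fixed-basis schemes
        n_dimfree = eps ** (-2)       # dimension-independent rate of order eps^(-2)
        print(f"d = {d:2d}: Jackson-type ~ {n_jackson:.1e} units, "
              f"dimension-independent ~ {n_dimfree:.1e} units")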
In Section 2.1, some bounds in the ℒ2-norm are considered, whereas Section 2.2 investigates bounds in the supnorm. Estimates in the ℒ2-norm can be applied, for example, to investigate the approximation of the optimal policies in static team optimization problems [19]. Estimates in the supnorm are required, for example, to investigate the approximation of the optimal policies in dynamic optimization problems with a finite number of stages [20]. Indeed, for such problems, the supnorm can be used to analyze the error propagation from one stage to the next one, while this is not the case for the ℒ2-norm [20]. Moreover, it provides guarantees on the approximation errors in the design of the optimal decision laws.
2.1. Bounds in the ℒ2-Norm
For a probability measure μ on B, we denote by ℒ2(B, μ) the Hilbert space of functions g : B → ℝ with inner product ⟨g1, g2⟩ℒ2(B,μ) ≔ ∫B g1 g2 dμ and induced norm ‖g‖ℒ2(B,μ) ≔ (∫B g² dμ)^(1/2). When there is no risk of confusion, the simpler notation ‖·‖ is used instead of ‖·‖ℒ2(B,μ).
Theorem 2.1 (see [12], Theorem 1.) For every f ∈ ΓB,C, every sigmoidal function σ : ℝ → ℝ, every probability measure μ on B, and every n ≥ 1, there exist ak ∈ ℝd, bk, ck ∈ ℝ, and fn : B → ℝ of the form
fn(x) = c0 + c1 σ(a1 · x + b1) + ⋯ + cn σ(an · x + bn) (2.4)
such that
‖f − fn‖ℒ2(B,μ) ≤ 2C/√n. (2.6)
In contrast to this, Theorem 2.2 from [12] shows that, when B is the unit hypercube [0,1] d and μ = μu is the uniform probability measure on [0,1] d, for the same set of functions ΓB,C the best linear approximation scheme requires Ω(ɛ−d) computational units in order to achieve the same worst-case approximation error ɛ. The set of all linear combinations of n fixed basis functions h1, h2, …, hn in a linear space is denoted by span (h1, h2, …, hn).
Theorem 2.2 (see [12], Theorem 6.)For every n ≥ 1 and every choice of fixed basis functions h1, h2, …, hn ∈ ℒ2([0,1] d, μu), one has
Remark 2.3. Inspection of the proof of [12, Theorem 6] shows that the factors 1/8 and 1/n, which appear in the original statement of the theorem, have to be replaced by 1/16 and 1/2n in (2.7), respectively.
Inspection of the proof of Theorem 2.2 in [12] also shows that the lower bound (2.7) still holds if the set is replaced by either
It should be noted that, for fixed C and ɛ, the estimate (2.6) is constant with respect to d, whereas the estimate (2.10) goes to 0 as d goes to +∞. Hence, for large d the lower bound (2.10) for fixed-basis approximation may be so small that the theoretical advantage of variable-basis approximation becomes of little practical use, since it would be guaranteed only for sufficiently small ɛ (depending on C, too). In the following, families of d-variable functions are considered for which this drawback is mitigated. These are families of d-variable functions whose actual dependence is on a subset of d′ ≪ d variables, where the indices of these d′ variables are not known a priori.
Such families are of interest, for example, in machine learning applications for problems with redundant or correlated features. In this context, each of the d real variables represents a feature (e.g., a measure of some physical property of an object), and one is interested in learning a function of these features on the basis of a set of supervised examples. As often happens in applications, only a small subset of the features is useful for the specific task (typically, classification or regression), due to the presence of redundant or correlated features. Then, one may assume that the function to be learned depends only on a subset of d′ ≪ d features, but one may not know a priori which particular subset it is. The problem of finding such a subset (or of finding a subset of features of sufficiently small cardinality d′ on which the function mostly depends, when the function depends on all the d features) is called the feature-selection problem [22].
For d′ a positive integer and d a multiple of d′, denotes the subset of functions in that depend only on d′ of their d arguments.
Proposition 2.4. For every n ≥ 1 and every choice of fixed basis functions h1, h2, …, hn ∈ ℒ2([0,1] d, μu), for n ≤ (d + 1)/2 one has
Proof. The proof is similar to the one of [12, Theorem 6]. The following is a list of the changes to that proof that are needed to derive (2.11) and (2.12). We denote by ∥l∥0 the number of nonzero components of the multi-index l. Proceeding as in the proof of [12, Theorem 6], we get
Then we get
In the following, we apply (2.14) and (2.15) for m = 1 and m > 1, respectively. For m = 1, the condition becomes
Now, as in the proof of [12, Theorem 6], for m > 1 we exploit a bound derived from Stirling’s formula, according to which , so the condition holds if we impose
Remark 2.5. The quantity d′ in Proposition 2.4 has to be interpreted as an effective number of variables for the family of functions to be approximated. Roughly speaking, the flexibility of the neural network architecture (2.4) allows one to identify, for each such function, the d′ variables on which it actually depends, whereas fixed-basis approximation schemes do not have this flexibility. Indeed, unlike the lower bound (2.10), for fixed C, ɛ, and d′ the lower bound (2.20) goes to +∞ as d goes to +∞. Finally, remarks similar to those in Remark 2.3 apply to Proposition 2.4.
2.2. Bounds in the Supnorm
The next result is from [17] and is analogous to Theorem 2.1, but it measures the worst-case approximation error in the supnorm.
Theorem 2.6 (see [17], Theorem 2.)For every f ∈ ΓB,C and every n ≥ 1, there exists fn : B → ℝ of the form (2.4) such that
Upper bounds in the supnorm similar to the one from Theorem 2.6 are given, for example, in [24, 25]. Moreover, for , the following estimate holds.
Proposition 2.7. For every and every n ≥ 1, there exists fn : [0,1] d → ℝ of the form (2.4) such that
Proof. Each function depends on d′ arguments; let be their indices. Let be defined by , where , and all the other components of x are arbitrary in . Then , so by Theorem 2.6 there exists an approximation made up of n sigmoidal computational units and a constant term such that . Finally, we observe that can be extended to a function fn : [0,1] d → ℝ of the form (2.4) such that ; then one obtains (2.22).
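The extension step at the end of the proof can be visualized with a short sketch (Python with NumPy; the dimensions, the tanh sigmoid, and the index set are illustrative assumptions): an approximant built on the d′ relevant coordinates is turned into a function of all d coordinates by placing its inner weights at the relevant indices and zeros elsewhere, which keeps it of the form (2.4) and does not change its values.

    import numpy as np

    def embed_weights(A_small, relevant_idx, d):
        # Place the inner weights (defined on the d' relevant coordinates)
        # at their positions among the d coordinates, with zeros on the
        # d - d' irrelevant ones, so the extended units ignore them.
        A_full = np.zeros((A_small.shape[0], d))
        A_full[:, relevant_idx] = A_small
        return A_full

    rng = np.random.default_rng(1)
    d, d_small, n_units = 6, 2, 3
    relevant_idx = [1, 4]                       # indices of the d' relevant variables
    A_small = rng.normal(size=(n_units, d_small))
    b, c = rng.normal(size=n_units), rng.normal(size=n_units)

    x = rng.uniform(size=d)                     # a point in [0, 1]^d
    out_restricted = np.tanh(A_small @ x[relevant_idx] + b) @ c
    out_extended = np.tanh(embed_weights(A_small, relevant_idx, d) @ x + b) @ c
    assert np.isclose(out_restricted, out_extended)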
The next proposition, combined with Theorem 2.6 and Proposition 2.7, allows one to compare the approximation capabilities of fixed- and variable-basis schemes in the supnorm, showing cases for which the upper bounds (2.21) and (2.22) are smaller than one of the corresponding lower bounds (2.24)–(2.26), at least for n sufficiently large.
Proposition 2.8. For every n ≥ 1 and every choice of fixed, bounded, and μu-measurable basis functions h1, h2, …, hn : [0,1] d → ℝ, the following hold.
- (i) For the approximation of functions in , one has the lower bound (2.24).
- (ii) For the approximation of functions in , one has the lower bound (2.25) for n ≤ (d + 1)/2 and the lower bound (2.26) for n > (d + 1)/2.
Proof. For each bounded and μu-measurable function g : [0,1] d → ℝ, we get
Similar remarks as in Remark 2.3 can be made about the bounds in the supnorm derived in this section.
3. Application to Functional Optimization Problems
The results of Section 2 can be extended, with the same rates of approximation or similar ones, to the approximate solution of certain functional optimization problems. This can be done by exploiting the concepts of modulus of continuity and modulus of convexity of a functional, provided that continuity and uniform convexity assumptions are satisfied. The basic ideas are the following (see also [5] for a similar analysis).
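To fix ideas, a standard formulation of the two notions is the following (the exact definitions and normalizations adopted in (3.7) and in Sections 3.1 and 3.2 may differ slightly; this is only a sketch of the mechanism). For a functional Φ defined on a subset X of a normed space 𝒳, a modulus of continuity is a nondecreasing function ω : [0, +∞) → [0, +∞) such that |Φ(f) − Φ(g)| ≤ ω(‖f − g‖) for all f, g ∈ X; Φ is uniformly convex on a convex set X with modulus of convexity δ : (0, +∞) → (0, +∞) if
Φ((f + g)/2) ≤ (1/2) Φ(f) + (1/2) Φ(g) − δ(‖f − g‖) for all f, g ∈ X with f ≠ g.
If f∘ minimizes Φ over X, the first notion converts approximation errors into errors in approximate optimization, since ‖fn − f∘‖ ≤ ɛ implies Φ(fn) − Φ(f∘) ≤ ω(ɛ); the second one works in the opposite direction, since taking g = f∘ and using Φ((fn + f∘)/2) ≥ Φ(f∘) gives δ(‖fn − f∘‖) ≤ (1/2)(Φ(fn) − Φ(f∘)), so a small error in approximate optimization forces fn to be close to the minimizer f∘.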
3.1. Rates of Approximate Optimization in Terms of the Modulus of Continuity
3.2. Rates of Approximate Optimization in Terms of the Modulus of Convexity
Remark 3.1. In [29], a greedy algorithm is proposed to construct a sequence of sets Xn corresponding to variable-basis schemes and functions that achieve the rate (3.15) for certain uniformly convex functional optimization problems. Such an algorithm can be interpreted as an extension to functional optimization of the greedy algorithm proposed in [12] for function approximation by sigmoidal neural networks.
Finally, it should be noted that the rate (3.15) is achieved in general by imposing some structure on the sets X and Xn. For instance, the set X in [29] is the convex hull of some set of functions G ⊂ 𝒳, that is,
X = conv(G) = {c1 g1 + ⋯ + cm gm : m a positive integer, g1, …, gm ∈ G, c1, …, cm ≥ 0, c1 + ⋯ + cm = 1}.
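The incremental construction behind such greedy schemes can be illustrated with the following sketch (Python with NumPy); it is not the algorithm of [29], and the dictionary, the objective, and the step-size rule αn = 2/(n + 1) are illustrative assumptions. At each step, the current iterate is mixed with the dictionary element that most decreases the objective, so that after n steps the iterate is a convex combination of at most n elements of G.

    import numpy as np

    def greedy_over_convex_hull(objective, dictionary, n_steps):
        # Incremental greedy scheme over conv(G): at step n, pick an element
        # g of the dictionary G and update
        #   f_n = (1 - alpha_n) * f_{n-1} + alpha_n * g,  alpha_n = 2 / (n + 1),
        # choosing g so as to minimize the objective of the mixture.
        f = dictionary[0].copy()
        for n in range(2, n_steps + 1):
            alpha = 2.0 / (n + 1)
            g = min(dictionary, key=lambda h: objective((1 - alpha) * f + alpha * h))
            f = (1 - alpha) * f + alpha * g
        return f

    # Illustrative use: functions discretized on a grid and a quadratic
    # objective Phi(f) = mean((f - target)^2), which is uniformly convex.
    grid = np.linspace(0.0, 1.0, 200)
    target = np.sin(2 * np.pi * grid)
    dictionary = [np.tanh(a * grid + b) for a in (-8, -4, 4, 8) for b in (-4, 0, 4)]
    phi = lambda f: float(np.mean((f - target) ** 2))
    print(f"objective after 20 steps: {phi(greedy_over_convex_hull(phi, dictionary, 20)):.4f}")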
3.3. Comparison between Fixed- and Variable-Basis Schemes for Functional Optimization
The following proposition is obtained by combining the results derived in Sections 2.1, 3.1, and 3.2.
Proposition 3.2. Let the functional Φ be Lipschitz continuous with Lipschitz constant KΦ and uniformly convex with modulus of convexity of the form (3.7), let X = ΓB,C, let μ be any probability measure on B, let 𝒳 = ℒ2(B, μ), and suppose that there exists a minimizer f∘. Then the following hold.
- (i) For every n ≥ 1 there exists fn of the form (2.4) satisfying (3.20). For each such fn one has (3.21), and if a function of the form (2.4) satisfies (3.22), then it also satisfies (3.23).
- (ii) For B = [0,1] d, μu equal to the uniform probability measure on [0,1] d, every n ≥ 1, and every choice of fixed basis functions h1, …, hn ∈ ℒ2([0,1] d, μu), there exists a uniformly convex functional (such a functional can also be chosen to be Lipschitz continuous, but this is not needed in the inequalities (3.24)–(3.29), since they do not involve the Lipschitz constant) with modulus of convexity of the form (3.7) and with a minimizer such that, for every 0 < χ < 1, one has (3.24) and (3.25).
- (iii) Statements (i) and (ii) still hold when the set ΓB,C is replaced by its subset of functions depending on only d′ of the d variables, for d a multiple of d′. The only difference is that the estimates (3.24) and (3.25) are replaced, respectively, by (3.26) and (3.27) for n ≤ (d + 1)/2 and by (3.28) and (3.29) for n > (d + 1)/2.
Proof. (i) The estimate (3.20) follows by Theorem 2.1. The bound (3.21) follows by (3.20), the definition of modulus of continuity, and the assumption of Lipschitz continuity of Φ. Finally, (3.23) is obtained by property (3.11) of the modulus of convexity and its expression (3.7).
(ii) The estimate (3.24) comes from Theorem 2.2: the constant χ is introduced in order to remove the supremum with respect to in formula (2.7) and replace it with the choice , where is any function that achieves the bound (2.7) up to the constant factor χ. The estimate (3.25) follows from (3.24), (3.11), and (3.7), by choosing as any functional that is uniformly convex with modulus of convexity of the form (3.7) and such that .
(iii) The estimates (3.20), (3.21), and (3.23) still hold when the set ΓB,C is replaced by , since for B = [0,1] d, whereas formulas (3.26)–(3.29) are obtained in the same way as formulas (3.24) and (3.25), by applying Proposition 2.4 instead of Theorem 2.2.
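To see how the two ingredients interact in part (i), the following worked chain of inequalities may help; it is written under the simplifying assumption of a quadratic modulus of convexity δ(t) = c t² with c > 0, so the exponents are only illustrative of the mechanism, not of the exact form (3.7). If fn of the form (2.4) satisfies the bound of Theorem 2.1 applied to f∘ ∈ ΓB,C, then the Lipschitz continuity of Φ gives
Φ(fn) − Φ(f∘) ≤ KΦ ‖fn − f∘‖ℒ2(B,μ) ≤ 2 KΦ C/√n,
whereas, for any g ∈ X, uniform convexity and the minimality of f∘ give
c ‖g − f∘‖ℒ2(B,μ)² ≤ (1/2)(Φ(g) − Φ(f∘)).
Thus an error in approximate optimization of order n^(−1/2) only guarantees that the distance from the minimizer is of order n^(−1/4), which illustrates why the rates of the error in approximate optimization and of the associated approximation error differ.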
4. Discussion
Classes of function-approximation and functional optimization problems have been investigated for which, for a given desired error, certain variable-basis approximation schemes with sigmoidal computational units require fewer parameters than fixed-basis ones. Previously known bounds on the accuracy have been extended, with better rates, to families of functions whose effective number of variables d′ is much smaller than the number of their arguments d.
Proposition 3.2 shows that there is a strict connection between certain problems of function approximation and functional optimization. Indeed, for these two classes of problems, the approximation error rates for the first class can be converted into rates of approximate optimization for the second one, and vice versa. In particular, for d > 2, , and any linear approximation scheme span {h1, h2, …, hn}, the estimates (3.21) and (3.25) show families of functional optimization problems for which the error in approximate optimization with variable-basis schemes of sigmoidal type is smaller than the one associated with the linear scheme. For d′ > 2 and , a similar remark can be made for the estimates (3.21) and (3.27) and for the bounds (3.21) and (3.29). Finally, the bound (3.23) shows that, for large n, any approximate minimizer of the form (2.4) differs only slightly from the true minimizer f∘, even though the error in approximate optimization (3.22) and the associated approximation error (3.23) have different rates. In contrast, the estimates (3.24), (3.26), and (3.28) show that, for any linear approximation scheme span {h1, h2, …, hn}, there exists a functional optimization problem whose minimizer cannot be approximated with the same accuracy by the linear scheme.
The results presented in the paper provide some theoretical justification for the use of variable-basis approximation schemes (instead of fixed-basis ones) in function approximation and functional optimization.
Acknowledgment
The author was partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Adaptive State Estimation and Optimal Control.”