Volume 2013, Issue 1, Article ID 134727
Research Article
Open Access

Coefficient-Based Regression with Non-Identical Unbounded Sampling

Jia Cai

Corresponding author: Jia Cai
School of Mathematics and Computational Science, Guangdong University of Business Studies, Guangzhou, Guangdong 510320, China
First published: 03 June 2013
Academic Editor: Qiang Wu

Abstract

We investigate a coefficient-based least squares regression problem with indefinite kernels and non-identical unbounded sampling. Here non-identical unbounded sampling means that the samples are drawn independently, but not identically, from distributions whose output is not required to be uniformly bounded. The kernel is not necessarily symmetric or positive semi-definite, which leads to additional difficulty in the error analysis. By introducing a suitable reproducing kernel Hilbert space (RKHS) and a suitable intermediate integral operator, we present a detailed analysis of the sample error based on a novel technique and obtain satisfactory learning rates.

1. Introduction and Preliminary

We study coefficient-based least squares regression with indefinite kernels and non-identical unbounded sampling. In our setting, functions are defined on a compact subset X of ℝ^n and take values in Y = ℝ. Let ρ be a Borel probability measure on Z = X × Y. A sample z = {(x_t, y_t)}_{t=1}^T ∈ Z^T is drawn independently from different Borel probability measures ρ^{(t)} (t = 1, …, T) on Z that share the same conditional distribution, ρ^{(t)}(·∣x) = ρ(·∣x). Let ρ_X^{(t)} be the marginal distribution of ρ^{(t)} on X and ρ_X the marginal distribution of ρ on X. We assume that the sequence {ρ_X^{(t)}} converges exponentially fast in the dual of the Hölder space C^s(X). Here the Hölder space C^s(X) (0 ≤ s ≤ 1) is defined as the space of all continuous functions on X for which the following norm is finite [1]:
‖f‖_{C^s(X)} = ‖f‖_∞ + |f|_{C^s(X)}, (1)
where
|f|_{C^s(X)} := sup_{x ≠ y ∈ X} |f(x) − f(y)| / |x − y|^s. (2)

Definition 1. Let 0 ≤ s ≤ 1; we say that the sequence {ρ_X^{(t)}} converges exponentially fast in (C^s(X))* to a probability measure ρ_X on X, or converges exponentially for short, if there exist C_1 > 0 and 0 < α < 1 such that

‖ρ_X^{(t)} − ρ_X‖_{(C^s(X))*} ≤ C_1 α^t,  t = 1, 2, …. (3)

By the definition of the dual space (C^s(X))*, the decay condition (3) can be expressed as
|∫_X f dρ_X^{(t)} − ∫_X f dρ_X| ≤ C_1 α^t ‖f‖_{C^s(X)},  ∀ f ∈ C^s(X). (4)
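As a simple illustration of Definition 1 (an example added here, not taken from the original paper), let ν be any Borel probability measure on X and consider the mixtures ρ_X^{(t)} = (1 − α^t) ρ_X + α^t ν with 0 < α < 1. Then for every f ∈ C^s(X),
|∫_X f dρ_X^{(t)} − ∫_X f dρ_X| = α^t |∫_X f dν − ∫_X f dρ_X| ≤ 2 α^t ‖f‖_∞ ≤ 2 α^t ‖f‖_{C^s(X)},
so (3) holds with C_1 = 2: marginals that are geometrically vanishing perturbations of ρ_X satisfy the exponential convergence condition.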
The regression function f_ρ : X → Y is given by
f_ρ(x) = ∫_Y y dρ(y∣x),  x ∈ X, (5)
where ρ(y∣x) is the conditional distribution of y at x ∈ X. Since ρ is unknown, f_ρ cannot be obtained directly. The aim of the regression problem is to learn a good approximation of f_ρ from the sample z. This is an ill-posed problem, and a regularization scheme is needed.
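For later use it is worth recalling the standard least squares identity (well known; stated here for completeness): writing E(f) = ∫_Z (f(x) − y)² dρ for the least squares risk, one has
E(f) − E(f_ρ) = ∫_X (f(x) − f_ρ(x))² dρ_X(x),
so f_ρ is the minimizer of E, and measuring the error of an estimator in the L²_{ρ_X} norm ‖·‖_ρ used throughout the paper is the same as measuring its excess risk.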
The classical learning algorithm is a regularization scheme in a reproducing kernel Hilbert space (RKHS) [2] associated with a Mercer kernel K : X × X → ℝ, that is, a continuous, symmetric, and positive semi-definite (p.s.d.) function. The RKHS ℋ_K is defined to be the completion of the linear span of {K_x = K(·, x) : x ∈ X} with respect to the inner product ⟨K_x, K_y⟩_K = K(x, y). Define κ = sup_{x,v ∈ X} |K(x, v)| < ∞; then the regularized regression problem is given by
min_{f ∈ ℋ_K} { (1/T) Σ_{t=1}^T (f(x_t) − y_t)² + λ ‖f‖_K² },  λ > 0. (6)
This scheme is by now well understood ([3–5] and the references therein). Here we consider an indefinite kernel scheme in a hypothesis space ℋ_{K,z} depending on the sample z; this space is defined by
ℋ_{K,z} = { f = Σ_{t=1}^T α_t K(·, x_t) : α = (α_1, …, α_T) ∈ ℝ^T }. (7)
The regularization penalty is imposed on the coefficients of the function f. An indefinite kernel K means that K is only required to be continuous and bounded; it need not be symmetric or satisfy the p.s.d. condition. Define K̃(x, v) = ∫_X K(x, u) K(v, u) dρ_X(u); then K̃ is a Mercer kernel. For more background on learning with indefinite kernels, see [6–8]. For all x ∈ X and f ∈ L²_{ρ_X}, if we define (L_K f)(x) = ∫_X K(x, u) f(u) dρ_X(u) and (L_K^* f)(x) = ∫_X K(u, x) f(u) dρ_X(u), then, since X is compact and K is continuous, L_K and its adjoint L_K^* are both compact operators on L²_{ρ_X}. Hence L_{K̃} = L_K L_K^* is a positive compact operator. The learning algorithm of interest in this paper takes the following form:
f_{z,λ} = arg min_{f ∈ ℋ_{K,z}} { (1/T) Σ_{t=1}^T (f(x_t) − y_t)² + λ Ω_z(f) }. (8)
We define the following coefficient-based regularizer: for f = Σ_{t=1}^T α_t K(·, x_t) ∈ ℋ_{K,z},
Ω_z(f) = T Σ_{t=1}^T α_t². (9)

Then the minimization in (8) can be carried out over the coefficient vector α ∈ ℝ^T. By using the integral operator technique of [4], Sun and Wu [9] gave a capacity-independent estimate for the convergence rate of ‖f_{z,λ} − f_ρ‖_ρ, where ‖f‖_ρ = (∫_X |f(x)|² dρ_X(x))^{1/2}. Shi [10] investigated the error analysis in a data-dependent hypothesis space for general kernels. Sun and Guo [11] conducted an error analysis for Mercer kernels with uniformly bounded non-i.i.d. sampling. In this paper, we study learning algorithm (8) with non-identical unbounded sampling processes and indefinite kernels.
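To make the scheme concrete, the following is a minimal numerical sketch of algorithm (8) (illustrative only, not part of the original paper; the function names and the particular kernel are hypothetical). With the regularizer Ω_z(f) = T Σ_t α_t² from (9), the objective in (8) is a strictly convex quadratic in the coefficient vector α, and setting its gradient to zero gives the linear system (K_xᵀ K_x + λ T² I) α = K_xᵀ y, where (K_x)_{t,j} = K(x_t, x_j).

    import numpy as np

    def indefinite_kernel(x, v):
        # A deliberately non-symmetric, non-p.s.d. kernel (illustrative choice only).
        return np.sin(3.0 * (x - v)) + np.exp(-abs(x - 2.0 * v))

    def coefficient_regression(xs, ys, lam, kernel=indefinite_kernel):
        # Minimize (1/T) * ||K @ alpha - y||^2 + lam * T * ||alpha||^2,
        # where K[t, j] = kernel(x_t, x_j); K need not be symmetric or p.s.d.
        T = len(xs)
        K = np.array([[kernel(xt, xj) for xj in xs] for xt in xs])
        A = K.T @ K + lam * T ** 2 * np.eye(T)   # positive definite for lam > 0
        alpha = np.linalg.solve(A, K.T @ np.asarray(ys, dtype=float))
        return lambda x: float(sum(a * kernel(x, xj) for a, xj in zip(alpha, xs)))

    # Toy usage: noisy samples of a smooth regression function on [0, 1].
    rng = np.random.default_rng(0)
    xs = rng.uniform(0.0, 1.0, size=50)
    ys = np.cos(2.0 * np.pi * xs) + 0.1 * rng.standard_normal(50)
    f_hat = coefficient_regression(xs, ys, lam=1e-3)
    print(f_hat(0.25))

Because the penalty acts on the coefficients rather than on an RKHS norm, the linear system above is solvable for any continuous bounded kernel, which is precisely the flexibility that the indefinite-kernel setting exploits.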

If K is a Mercer kernel, from [3] we know that ℋ_K coincides with the range of L_K^{1/2}. For an indefinite kernel K, we rely instead on the Mercer kernel K̃ and the relation L_{K̃} = L_K L_K^*, together with the polar decomposition of compact operators ([12]).

Lemma 2. Let H be a separable Hilbert space and T a compact operator on H; then T can be factored as

T = Γ A, (10)
where A = (T*T)^{1/2} and Γ is a partial isometry on H with Γ*Γ being the orthogonal projection onto the closure of the range of A.

We immediately have the following proposition [10].

Proposition 3. Consider ℋ_{K̃} as a subspace of L²_{ρ_X}; then ℋ_{K̃} = L_{K̃}^{1/2}(L²_{ρ_X}) and L_K = L_{K̃}^{1/2} U, where U is a partial isometry on L²_{ρ_X} with U*U being the orthogonal projection onto the closure of the range of L_K^*.
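For the reader's convenience, here is the short computation behind these facts, assuming the construction K̃(x, v) = ∫_X K(x, u) K(v, u) dρ_X(u) introduced above (following [9]). For f ∈ L²_{ρ_X},
(L_K L_K^* f)(x) = ∫_X K(x, u) ∫_X K(v, u) f(v) dρ_X(v) dρ_X(u) = ∫_X K̃(x, v) f(v) dρ_X(v) = (L_{K̃} f)(x),
so L_{K̃} = L_K L_K^*. Moreover K̃(x, v) = ⟨K(x, ·), K(v, ·)⟩_{L²_{ρ_X}}, which shows directly that K̃ is symmetric and positive semi-definite, hence a Mercer kernel, and applying Lemma 2 to L_K^* produces the partial isometry U of Proposition 3.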

We use the RKHS ℋ_{K̃} to approximate f_ρ, and hence define
f_λ = arg min_{f ∈ ℋ_{K̃}} { ‖f − f_ρ‖_ρ² + λ ‖f‖_{K̃}² }. (11)
In order to estimate f_{z,λ} − f_ρ, we construct an intermediate function f̃_λ,
()
where . Then we can decompose the error term into the following three parts:
f_{z,λ} − f_ρ = (f_{z,λ} − f̃_λ) + (f̃_λ − f_λ) + (f_λ − f_ρ). (13)

We will conduct the error analysis in several steps. Our major contribution is the sample error estimate; the main difficulty is the non-identical unbounded sampling, which we overcome by introducing a suitable intermediate operator.

2. Key Analysis and Main Results

In order to carry out the error analysis, we assume that the kernel K̃ satisfies the following kernel condition [1, 11].

Definition 4. We say that the Mercer kernel K̃ satisfies the kernel condition of order s if, for some constant C_κ > 0 and for all u, v ∈ X,

‖K̃_u − K̃_v‖_{K̃} ≤ C_κ |u − v|^s, (14)
where K̃_u := K̃(·, u).

Since the sample z is drawn from unbounded sampling processes, we will assume the following moment hypothesis condition [13].

Moment Hypothesis. There exist constants M > 0 and C_2 > 0 such that
∫_Y |y|^ℓ dρ(y∣x) ≤ C_2 ℓ! M^ℓ,  ∀ ℓ ∈ ℕ, ∀ x ∈ X. (15)

There is a large literature on error analysis for learning algorithm (6); see, for example, [4, 5, 14–17]. Most of these results, however, are obtained under the standard assumption that |y| ≤ M almost surely (M a constant), which excludes the case of Gaussian noise. The moment hypothesis condition is a natural generalization of the condition |y| ≤ M. Wang and Zhou [13] considered the error analysis of algorithm (6) under condition (15). Our main results give learning rates for algorithm (8) under conditions (3) and (14), stated in terms of the approximation ability of ℋ_{K̃} with respect to f_ρ.
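To see concretely that Gaussian noise is covered (a worked check added here under the extra assumption that the regression function is uniformly bounded, ‖f_ρ‖_∞ ≤ B), let y = f_ρ(x) + ε with ε ~ N(0, σ²) independent of x. Then for every ℓ ∈ ℕ,
∫_Y |y|^ℓ dρ(y∣x) = 𝔼|f_ρ(x) + ε|^ℓ ≤ 2^{ℓ−1} (B^ℓ + 𝔼|ε|^ℓ) ≤ 2^{ℓ−1} (B^ℓ + σ^ℓ ℓ!) ≤ ℓ! (2 max{B, σ})^ℓ,
so the moment hypothesis (15) holds with C_2 = 1 and M = 2 max{B, σ}, although |y| ≤ M almost surely clearly fails.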

Now we can state our general results on learning rates for algorithm (8).

Theorem 5. Assume the moment hypothesis condition (15), that {ρ_X^{(t)}} satisfies condition (3), that K̃ satisfies condition (14), and that f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} with 1/2 < r ≤ 3/2. Take λ = T^{−θ} with 0 < θ < 1/3; then

()
where the constant depends on κ, s, and α, but not on T or δ, and will be given explicitly in Section 3.3.

Remark 6. If we take λ = T^{−1/(2(r+1))}, then our rate is O(T^{−(2r−1)/(4(r+1))}); note that θ = 1/(2(r+1)) < 1/3 exactly when r > 1/2, so this choice is admissible in Theorem 5. The proof of Theorem 5 will be given in Section 3, where the error term is decomposed into three parts. In [11], the authors considered coefficient-based regression with Mercer kernels and uniformly bounded non-i.i.d. sampling; there the best rate obtained is of order O(T^{−2r/(1+2r)}).

When the samples are drawn i.i.d. from the measure ρ, we have the following result.

Theorem 7. Assume the moment hypothesis condition (15), that K̃ satisfies condition (14), and that f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} with r > 0. If 0 < r ≤ 1, take λ = T^{−1/(2r+3)}; then

()
where the constant depends on κ, s, and α but not on T or δ. If r > 1, take λ = T^{−1/5}; then
()

Here we get the same learning rate as the one in [9], but our rate is derived under a relaxed condition on the sampling output.

3. Error Analysis

In this section, we will state the error analysis in several steps.

3.1. Regularization Error Estimation

In this subsection, we give a bound for the regularization error f_λ − f_ρ. This error has been investigated extensively in the learning theory literature ([4, 18] and the references therein); we omit the proof and quote the result directly.

Proposition 8. Assume f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} and r > 0; then the following bound for the approximation error holds:

‖f_λ − f_ρ‖_ρ ≤ C_q λ^q, (19)
where C_q is a constant depending only on g_ρ and K̃ and q = min{r, 1}, and when 1/2 < r ≤ 3/2,
‖f_λ − f_ρ‖_{K̃} ≤ C_r λ^{r−1/2}, (20)
where C_r = ‖g_ρ‖_ρ.
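For completeness, we sketch the standard argument behind these bounds (using the formula f_λ = (L_{K̃} + λI)^{−1} L_{K̃} f_ρ recorded in the proof of Proposition 10). Under the assumption f_ρ = L_{K̃}^r g_ρ,
f_λ − f_ρ = −λ (L_{K̃} + λI)^{−1} L_{K̃}^r g_ρ,
and by the spectral theorem, ‖λ (L_{K̃} + λI)^{−1} L_{K̃}^u‖ = sup_{σ ≥ 0} λσ^u/(σ + λ) ≤ λ^u for 0 ≤ u ≤ 1. Taking u = r gives (19) for r ≤ 1 (for r > 1 one uses σ^r/(σ + λ) ≤ σ^{r−1} instead, which yields the rate λ), while taking u = r − 1/2 together with ‖f‖_{K̃} = ‖L_{K̃}^{−1/2} f‖_ρ on ℋ_{K̃} gives (20) for 1/2 < r ≤ 3/2.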

3.2. Estimate for the Measure Error

This subsection is devoted to the analysis of the term f̃_λ − f_λ caused by the difference of measures, which we call the measure error. The ideas of the proof are from [1]. Before giving the result, we first state a crucial lemma.

Lemma 9. Assume K̃ satisfies condition (14); then

()

Proof. For any hCs(X), we see that

()
where . Now we need to estimate ‖g‖_∞ and |g|_{C^s(X)}, respectively. For the term ‖g‖_∞, it is easy to see that
()
The estimation of |g|_{C^s(X)} is more involved:
()
Since
()
then
()
Therefore
()
Combining the estimates of ‖g‖_∞ and |g|_{C^s(X)}, we get
()
When condition (14) is satisfied, it was proved in [19] that ℋ_{K̃} is included in C^s(X), with a bound for the inclusion:
()
Then
()
This completes the proof.

Proposition 10. Assume f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} with 1/2 < r ≤ 3/2, and that K̃ satisfies condition (14); then the following bound for the measure error holds:

()
where C_3 = C_1 C_κ C_r α/(1 − α).

Proof. From (11), a simple calculation shows that f_λ = (L_{K̃} + λI)^{−1} L_{K̃} f_ρ. Recalling (12), we can see that

()
Applying Lemma 9 to the case , we get
()
By the definition of and noticing (3), we can see
()

This in connection with Proposition 8 yields the conclusion.
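It is worth noting where the factor α/(1 − α) in the constant C_3 comes from: summing the decay condition (3) over the sample and using 0 < α < 1 gives
Σ_{t=1}^T ‖ρ_X^{(t)} − ρ_X‖_{(C^s(X))*} ≤ C_1 Σ_{t=1}^T α^t ≤ C_1 α/(1 − α),
a geometric series bound that is uniform in the sample size T.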

3.3. Sample Error Estimation

In this subsection we estimate the sample error term f_{z,λ} − f̃_λ. First, we fix some notation. Let C(X) be the space of bounded continuous functions on X with the supremum norm ‖·‖_∞. Define the sampling operator S_x : C(X) → ℝ^T by S_x(f) = (f(x_1), …, f(x_T)) ([18]). For β = (β_1, …, β_T) ∈ ℝ^T, let U_x and Ũ_x be operators from ℝ^T to C(X) defined as
()
It is easy to see that both U_x and Ũ_x are bounded operators. Recall the definition of f_{z,λ}; then
()
Computing the gradient of the above equation, we immediately have [9]
()
Hence . Employing the method as shown in [9], we can decompose the sample error into two parts:
()

Now we state our estimate for the sample error. The estimates are more involved because the sample is drawn from non-identical unbounded sampling processes. We overcome this difficulty by introducing a stepping-stone integral operator which plays an intermediate role in the estimates; its definition will be given later.

Theorem 11. Let f_{z,λ} be given by (8); assume the moment hypothesis condition (15) and that the marginal distribution sequence {ρ_X^{(t)}} satisfies condition (3); then

()
where , .

Proof. We will estimate I and II, respectively:

()
Then
()
where and is the ℓ² norm on ℝ^T, for ; noticing (36), we can have
()
This means
()
According to the definition of U_x, for any β ∈ ℝ^T, ; this implies that . Therefore
()
Hence
()
For the term , let
()
and . Then |η_{t,j,l}| ≤ 2κ³ and
()
Applying the same method as in the proof of Lemma 4.1 in [9], we can see that when all the indices t, j, w, τ, and l are pairwise different, there holds 𝔼_x(η_{t,j,l}(x) η_{w,τ,l}(x)) = 0; therefore
()

This together with (45) yields

()

The term II is more involved; recall that

()
Hence
()

If we define η_{tj}(x) = y_t K(x_t, x_j) K(x, x_j) and

()
therefore
()
If t, j, w, and τ are pairwise distinct, then 𝔼_z(ξ_{tj}(x) ξ_{wτ}(x)) = 0. If t = j or w = τ,
()
By the Cauchy–Schwarz inequality, for any t, j, w, τ = 1, …, T,
()
Hence we only need to give a bound for . A simple calculation shows
()
By the same method, we know that
()

Applying the conclusion of [9] together with the above bound, we can see that

()
Hence
()

This yields

()
then
()
This together with (49) yields the conclusion.

Now we are in a position to give the proofs of Theorems 5 and 7.

Proof of Theorem 5. Theorem 11 ensures that

()

For 1/2 < r ≤ 3/2, Proposition 10 tells us that

()
and Proposition 8 shows that
()
since
()

Combining all the bounds and noting that λ = T^{−θ} with 0 < θ < 1/3, we obtain the conclusion of Theorem 5 by taking .

Proof of Theorem 7. When the samples are drawn i.i.d. from the measure ρ, all the marginal distributions ρ_X^{(t)} coincide with ρ_X, so f̃_λ = f_λ and the measure error vanishes. Hence

()
Let λ = T^{−θ}; then
()

The conclusion follows by distinguishing the cases 0 < r ≤ 1 and r > 1.

Acknowledgments

The author would like to thank Professor Hongwei Sun for useful discussions which have helped to improve the presentation of the paper. The work described in this paper is supported partially by National Natural Science Foundation of China (Grant no. 11001247) and Doctor Grants of Guangdong University of Business Studies (Grant no. 11BS11001).
