Volume 2013, Issue 1, Article ID 134727
Research Article
Open Access

Coefficient-Based Regression with Non-Identical Unbounded Sampling

Jia Cai

Corresponding author: Jia Cai
School of Mathematics and Computational Science, Guangdong University of Business Studies, Guangzhou, Guangdong 510320, China
First published: 03 June 2013
Academic Editor: Qiang Wu

Abstract

We investigate a coefficient-based least squares regression problem with indefinite kernels and non-identical unbounded sampling. Here non-identical unbounded sampling means that the samples are drawn independently, but not identically, from distributions whose output is not required to be uniformly bounded. The kernel is not necessarily symmetric or positive semi-definite, which leads to additional difficulty in the error analysis. By introducing a suitable reproducing kernel Hilbert space (RKHS) and a suitable intermediate integral operator, we present a detailed analysis of the sample error based on a novel technique and obtain satisfactory learning rates.

1. Introduction and Preliminary

We study coefficient-based least squares regression with indefinite kernels and non-identical unbounded sampling. In our setting, functions are defined on a compact subset X of ℝ^n and take values in Y = ℝ. Let ρ be a Borel probability measure on Z = X × Y. A sample z = {(x_t, y_t)}_{t=1}^T ∈ Z^T is drawn independently from different Borel probability measures ρ^{(t)} (t = 1, …, T) on Z that share the same conditional distribution, ρ^{(t)}(·∣x) = ρ(·∣x). Let ρ_X^{(t)} be the marginal distribution of ρ^{(t)} on X and ρ_X the marginal distribution of ρ on X. We assume that the sequence {ρ_X^{(t)}} converges exponentially fast in the dual of the Hölder space C^s(X). Here the Hölder space C^s(X) (0 ≤ s ≤ 1) is defined as the space of all continuous functions on X for which the following norm is finite [1]:
‖f‖_{C^s(X)} = ‖f‖_∞ + |f|_{C^s(X)}, (1)
where
|f|_{C^s(X)} := sup_{x ≠ y ∈ X} |f(x) − f(y)| / |x − y|^s. (2)

Definition 1. Let 0 ≤ s ≤ 1; we say that the sequence {ρ_X^{(t)}} converges exponentially fast in (C^s(X))* to a probability measure ρ_X on X, or converges exponentially for short, if there exist C_1 > 0 and 0 < α < 1 such that

‖ρ_X^{(t)} − ρ_X‖_{(C^s(X))*} ≤ C_1 α^t,  t = 1, 2, …. (3)

By the definition of the dual space (C^s(X))*, the decay condition (3) can be expressed as
|∫_X f dρ_X^{(t)} − ∫_X f dρ_X| ≤ C_1 α^t ‖f‖_{C^s(X)},  ∀ f ∈ C^s(X). (4)
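As a simple illustration of Definition 1 (an example added here, not taken from the original paper), let ν be any Borel probability measure on X and consider the mixtures ρ_X^{(t)} = (1 − α^t) ρ_X + α^t ν with 0 < α < 1. Then for every f ∈ C^s(X),
|∫_X f dρ_X^{(t)} − ∫_X f dρ_X| = α^t |∫_X f dν − ∫_X f dρ_X| ≤ 2 α^t ‖f‖_∞ ≤ 2 α^t ‖f‖_{C^s(X)},
so (3) holds with C_1 = 2: marginals that are geometrically vanishing perturbations of ρ_X satisfy the exponential convergence condition.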
The regression function f_ρ : X → Y is given by
f_ρ(x) = ∫_Y y dρ(y∣x),  x ∈ X, (5)
where ρ(y∣x) is the conditional distribution of y at x ∈ X. Since ρ is unknown, f_ρ cannot be obtained directly. The aim of the regression problem is to learn a good approximation of f_ρ from the sample z. This is an ill-posed problem, and a regularization scheme is needed.
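For later use it is worth recalling the standard least squares identity (well known; stated here for completeness): writing E(f) = ∫_Z (f(x) − y)² dρ for the least squares risk, one has
E(f) − E(f_ρ) = ∫_X (f(x) − f_ρ(x))² dρ_X(x),
so f_ρ is the minimizer of E, and measuring the error of an estimator in the L²_{ρ_X} norm ‖·‖_ρ used throughout the paper is the same as measuring its excess risk.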
The classical learning algorithm is a regularization scheme in a reproducing kernel Hilbert space (RKHS) [2] associated with a Mercer kernel K : X × X → ℝ, that is, a continuous, symmetric, and positive semi-definite (p.s.d.) function. The RKHS ℋ_K is defined to be the completion of the linear span of {K_x = K(·, x) : x ∈ X} with respect to the inner product ⟨K_x, K_y⟩_K = K(x, y). Define κ = sup_{x,v ∈ X} |K(x, v)| < ∞; then the regularized regression problem is given by
min_{f ∈ ℋ_K} { (1/T) Σ_{t=1}^T (f(x_t) − y_t)² + λ ‖f‖_K² },  λ > 0. (6)
This scheme is by now well understood ([3–5] and the references therein). Here we consider an indefinite kernel scheme in a hypothesis space ℋ_{K,z} depending on the sample z; this space is defined by
ℋ_{K,z} = { f = Σ_{t=1}^T α_t K(·, x_t) : α = (α_1, …, α_T) ∈ ℝ^T }. (7)
The regularization penalty is imposed on the coefficients of the function f. An indefinite kernel K means that K is only required to be continuous and bounded; it need not be symmetric or satisfy the p.s.d. condition. Define K̃(x, v) = ∫_X K(x, u) K(v, u) dρ_X(u); then K̃ is a Mercer kernel. For more background on learning with indefinite kernels, see [6–8]. For all x ∈ X and f ∈ L²_{ρ_X}, if we define (L_K f)(x) = ∫_X K(x, u) f(u) dρ_X(u) and (L_K^* f)(x) = ∫_X K(u, x) f(u) dρ_X(u), then, since X is compact and K is continuous, L_K and its adjoint L_K^* are both compact operators on L²_{ρ_X}. Hence L_{K̃} = L_K L_K^* is a positive compact operator. The learning algorithm of interest in this paper takes the following form:
f_{z,λ} = arg min_{f ∈ ℋ_{K,z}} { (1/T) Σ_{t=1}^T (f(x_t) − y_t)² + λ Ω_z(f) }. (8)
We define the following coefficient-based regularizer: for f = Σ_{t=1}^T α_t K(·, x_t) ∈ ℋ_{K,z},
Ω_z(f) = T Σ_{t=1}^T α_t². (9)

Then the minimization in (8) can be carried out over the coefficient vector α ∈ ℝ^T. By using the integral operator technique of [4], Sun and Wu [9] gave a capacity-independent estimate for the convergence rate of ‖f_{z,λ} − f_ρ‖_ρ, where ‖f‖_ρ = (∫_X |f(x)|² dρ_X(x))^{1/2}. Shi [10] investigated the error analysis in a data-dependent hypothesis space for general kernels. Sun and Guo [11] conducted an error analysis for Mercer kernels with uniformly bounded non-i.i.d. sampling. In this paper, we study learning algorithm (8) with non-identical unbounded sampling processes and indefinite kernels.
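To make the scheme concrete, the following is a minimal numerical sketch of algorithm (8) (illustrative only, not part of the original paper; the function names and the particular kernel are hypothetical). With the regularizer Ω_z(f) = T Σ_t α_t² from (9), the objective in (8) is a strictly convex quadratic in the coefficient vector α, and setting its gradient to zero gives the linear system (K_xᵀ K_x + λ T² I) α = K_xᵀ y, where (K_x)_{t,j} = K(x_t, x_j).

    import numpy as np

    def indefinite_kernel(x, v):
        # A deliberately non-symmetric, non-p.s.d. kernel (illustrative choice only).
        return np.sin(3.0 * (x - v)) + np.exp(-abs(x - 2.0 * v))

    def coefficient_regression(xs, ys, lam, kernel=indefinite_kernel):
        # Minimize (1/T) * ||K @ alpha - y||^2 + lam * T * ||alpha||^2,
        # where K[t, j] = kernel(x_t, x_j); K need not be symmetric or p.s.d.
        T = len(xs)
        K = np.array([[kernel(xt, xj) for xj in xs] for xt in xs])
        A = K.T @ K + lam * T ** 2 * np.eye(T)   # positive definite for lam > 0
        alpha = np.linalg.solve(A, K.T @ np.asarray(ys, dtype=float))
        return lambda x: float(sum(a * kernel(x, xj) for a, xj in zip(alpha, xs)))

    # Toy usage: noisy samples of a smooth regression function on [0, 1].
    rng = np.random.default_rng(0)
    xs = rng.uniform(0.0, 1.0, size=50)
    ys = np.cos(2.0 * np.pi * xs) + 0.1 * rng.standard_normal(50)
    f_hat = coefficient_regression(xs, ys, lam=1e-3)
    print(f_hat(0.25))

Because the penalty acts on the coefficients rather than on an RKHS norm, the linear system above is solvable for any continuous bounded kernel, which is precisely the flexibility that the indefinite-kernel setting exploits.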

If K is a Mercer kernel, from [3] we know that ℋ_K coincides with the range of L_K^{1/2}. For an indefinite kernel K, we rely instead on the Mercer kernel K̃ and the relation L_{K̃} = L_K L_K^*, together with the polar decomposition of compact operators ([12]).

Lemma 2. Let H be a separable Hilbert space and T a compact operator on H; then T can be factored as

T = Γ A, (10)
where A = (T*T)^{1/2} and Γ is a partial isometry on H with Γ*Γ being the orthogonal projection onto the closure of the range of A.

We immediately have the following proposition [10].

Proposition 3. Consider ℋ_{K̃} as a subspace of L²_{ρ_X}; then ℋ_{K̃} = L_{K̃}^{1/2}(L²_{ρ_X}) and L_K = L_{K̃}^{1/2} U, where U is a partial isometry on L²_{ρ_X} with U*U being the orthogonal projection onto the closure of the range of L_K^*.
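For the reader's convenience, here is the short computation behind these facts, assuming the construction K̃(x, v) = ∫_X K(x, u) K(v, u) dρ_X(u) introduced above (following [9]). For f ∈ L²_{ρ_X},
(L_K L_K^* f)(x) = ∫_X K(x, u) ∫_X K(v, u) f(v) dρ_X(v) dρ_X(u) = ∫_X K̃(x, v) f(v) dρ_X(v) = (L_{K̃} f)(x),
so L_{K̃} = L_K L_K^*. Moreover K̃(x, v) = ⟨K(x, ·), K(v, ·)⟩_{L²_{ρ_X}}, which shows directly that K̃ is symmetric and positive semi-definite, hence a Mercer kernel, and applying Lemma 2 to L_K^* produces the partial isometry U of Proposition 3.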

We use the RKHS ℋ_{K̃} to approximate f_ρ, and hence define
f_λ = arg min_{f ∈ ℋ_{K̃}} { ‖f − f_ρ‖_ρ² + λ ‖f‖_{K̃}² }. (11)
In order to estimate f_{z,λ} − f_ρ, we construct an intermediate function f̃_λ,
()
where . Then we can decompose the error term into the following three parts:
f_{z,λ} − f_ρ = (f_{z,λ} − f̃_λ) + (f̃_λ − f_λ) + (f_λ − f_ρ). (13)

We will conduct the error analysis in several steps. Our major contribution is the sample error estimate; the main difficulty is the non-identical unbounded sampling, which we overcome by introducing a suitable intermediate operator.

2. Key Analysis and Main Results

In order to carry out the error analysis, we assume that the kernel K̃ satisfies the following kernel condition [1, 11].

Definition 4. We say that the Mercer kernel K̃ satisfies the kernel condition of order s if, for some constant C_κ > 0 and for all u, v ∈ X,

‖K̃_u − K̃_v‖_{K̃} ≤ C_κ |u − v|^s, (14)
where K̃_u := K̃(·, u).

Since the sample z is drawn from unbounded sampling processes, we will assume the following moment hypothesis condition [13].

Moment Hypothesis. There exist constants M > 0 and C_2 > 0 such that
∫_Y |y|^ℓ dρ(y∣x) ≤ C_2 ℓ! M^ℓ,  ∀ ℓ ∈ ℕ, ∀ x ∈ X. (15)

There is a large literature on error analysis for learning algorithm (6); see, for example, [4, 5, 14–17]. Most of these results, however, are obtained under the standard assumption that |y| ≤ M almost surely (M a constant), which excludes the case of Gaussian noise. The moment hypothesis condition is a natural generalization of the condition |y| ≤ M. Wang and Zhou [13] considered the error analysis of algorithm (6) under condition (15). Our main results give learning rates for algorithm (8) under conditions (3) and (14), stated in terms of the approximation ability of ℋ_{K̃} with respect to f_ρ.
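To see concretely that Gaussian noise is covered (a worked check added here under the extra assumption that the regression function is uniformly bounded, ‖f_ρ‖_∞ ≤ B), let y = f_ρ(x) + ε with ε ~ N(0, σ²) independent of x. Then for every ℓ ∈ ℕ,
∫_Y |y|^ℓ dρ(y∣x) = 𝔼|f_ρ(x) + ε|^ℓ ≤ 2^{ℓ−1} (B^ℓ + 𝔼|ε|^ℓ) ≤ 2^{ℓ−1} (B^ℓ + σ^ℓ ℓ!) ≤ ℓ! (2 max{B, σ})^ℓ,
so the moment hypothesis (15) holds with C_2 = 1 and M = 2 max{B, σ}, although |y| ≤ M almost surely clearly fails.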

Now we can state our general results on learning rates for algorithm (8).

Theorem 5. Assume the moment hypothesis condition (15), that {ρ_X^{(t)}} satisfies condition (3), that K̃ satisfies condition (14), and that f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} with 1/2 < r ≤ 3/2. Take λ = T^{−θ} with 0 < θ < 1/3; then

()
where the constant depends on κ, s, and α, but not on T or δ, and will be given explicitly in Section 3.3.

Remark 6. If we take λ = T^{−1/(2(r+1))}, then our rate is O(T^{−(2r−1)/(4(r+1))}); note that θ = 1/(2(r+1)) < 1/3 exactly when r > 1/2, so this choice is admissible in Theorem 5. The proof of Theorem 5 will be given in Section 3, where the error term is decomposed into three parts. In [11], the authors considered coefficient-based regression with Mercer kernels and uniformly bounded non-i.i.d. sampling; there the best rate obtained is of order O(T^{−2r/(1+2r)}).

When the samples are drawn i.i.d. from the measure ρ, we have the following result.

Theorem 7. Assume the moment hypothesis condition (15), that K̃ satisfies condition (14), and that f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} with r > 0. If 0 < r ≤ 1, take λ = T^{−1/(2r+3)}; then

()
where the constant depends on κ, s, and α but not on T or δ. If r > 1, take λ = T^{−1/5}; then
()

Here we get the same learning rate as the one in [9], but our rate is derived under a relaxed condition on the sampling output.

3. Error Analysis

In this section, we will state the error analysis in several steps.

3.1. Regularization Error Estimation

In this subsection, we give a bound for the regularization error f_λ − f_ρ. This error has been investigated extensively in the learning theory literature ([4, 18] and the references therein); we omit the proof and quote the result directly.

Proposition 8. Assume f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} and r > 0; then the following bound for the approximation error holds:

‖f_λ − f_ρ‖_ρ ≤ C_q λ^q, (19)
where C_q is a constant depending only on g_ρ and K̃ and q = min{r, 1}, and when 1/2 < r ≤ 3/2,
‖f_λ − f_ρ‖_{K̃} ≤ C_r λ^{r−1/2}, (20)
where C_r = ‖g_ρ‖_ρ.
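For completeness, we sketch the standard argument behind these bounds (using the formula f_λ = (L_{K̃} + λI)^{−1} L_{K̃} f_ρ recorded in the proof of Proposition 10). Under the assumption f_ρ = L_{K̃}^r g_ρ,
f_λ − f_ρ = −λ (L_{K̃} + λI)^{−1} L_{K̃}^r g_ρ,
and by the spectral theorem, ‖λ (L_{K̃} + λI)^{−1} L_{K̃}^u‖ = sup_{σ ≥ 0} λσ^u/(σ + λ) ≤ λ^u for 0 ≤ u ≤ 1. Taking u = r gives (19) for r ≤ 1 (for r > 1 one uses σ^r/(σ + λ) ≤ σ^{r−1} instead, which yields the rate λ), while taking u = r − 1/2 together with ‖f‖_{K̃} = ‖L_{K̃}^{−1/2} f‖_ρ on ℋ_{K̃} gives (20) for 1/2 < r ≤ 3/2.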

3.2. Estimate for the Measure Error

This subsection is devoted to the analysis of the term f̃_λ − f_λ caused by the difference of measures, which we call the measure error. The ideas of the proof are from [1]. Before giving the result, we first state a crucial lemma.

Lemma 9. Assume K̃ satisfies condition (14); then

()

Proof. For any hCs(X), we see that

()
where . Now we need to estimate ‖g‖_∞ and |g|_{C^s(X)}, respectively. For the term ‖g‖_∞, it is easy to see that
()
The estimation of |g|_{C^s(X)} is more involved:
()
Since
()
then
()
Therefore
()
Combining the estimates of ‖g‖_∞ and |g|_{C^s(X)}, we get
()
When condition (14) is satisfied, it was proved in [19] that ℋ_{K̃} is included in C^s(X), with a bound for the inclusion:
()
Then
()
This completes the proof.

Proposition 10. Assume f_ρ = L_{K̃}^r g_ρ for some g_ρ ∈ L²_{ρ_X} with 1/2 < r ≤ 3/2, and that K̃ satisfies condition (14); then the following bound for the measure error holds:

()
where C_3 = C_1 C_κ C_r α/(1 − α).

Proof. From (11), a simple calculation shows that f_λ = (L_{K̃} + λI)^{−1} L_{K̃} f_ρ. Recalling (12), we can see that

()
Applying Lemma 9 to the case , we get
()
By the definition of and noticing (3), we can see
()

This in connection with Proposition 8 yields the conclusion.
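It is worth noting where the factor α/(1 − α) in the constant C_3 comes from: summing the decay condition (3) over the sample and using 0 < α < 1 gives
Σ_{t=1}^T ‖ρ_X^{(t)} − ρ_X‖_{(C^s(X))*} ≤ C_1 Σ_{t=1}^T α^t ≤ C_1 α/(1 − α),
a geometric series bound that is uniform in the sample size T.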

3.3. Sample Error Estimation

In this subsection we estimate the sample error term f_{z,λ} − f̃_λ. First, we fix some notation. Let C(X) be the space of bounded continuous functions on X with the supremum norm ‖·‖_∞. Define the sampling operator S_x : C(X) → ℝ^T by S_x(f) = (f(x_1), …, f(x_T)) ([18]). For β = (β_1, …, β_T) ∈ ℝ^T, let U_x and Ũ_x be operators from ℝ^T to C(X) defined as
()
It is easy to see that both U_x and Ũ_x are bounded operators. Recall the definition of f_{z,λ}; then
()
Computing the gradient of the above equation, we immediately have [9]
()
Hence . Employing the method as shown in [9], we can decompose the sample error into two parts:
()

Now we state our estimate for the sample error. The estimates are more involved because the sample is drawn from non-identical unbounded sampling processes. We overcome this difficulty by introducing a stepping-stone integral operator which plays an intermediate role in the estimates; its definition will be given later.

Theorem 11. Let f_{z,λ} be given by (8); assume the moment hypothesis condition (15) and that the marginal distribution sequence {ρ_X^{(t)}} satisfies condition (3); then

()
where , .

Proof. We will estimate I and II, respectively:

()
Then
()
where and is the ℓ² norm on ℝ^T, for ; noticing (36), we can have
()
This means
()
According to the definition of U_x, for any β ∈ ℝ^T, ; this implies that . Therefore
()
Hence
()
For the term , let
()
and . Then |η_{t,j,l}| ≤ 2κ³ and
()
Applying the same method as in the proof of Lemma 4.1 in [9], we can see that when all the indices t, j, w, τ, and l are pairwise different, there holds 𝔼_x(η_{t,j,l}(x) η_{w,τ,l}(x)) = 0; therefore
()

This together with (45) yields

()

The term II is more involved; recall that

()
Hence
()

If we define η_{tj}(x) = y_t K(x_t, x_j) K(x, x_j) and

()
therefore
()
If t, j, w, and τ are pairwise distinct, then 𝔼_z(ξ_{tj}(x) ξ_{wτ}(x)) = 0. If t = j or w = τ,
()
By the Cauchy–Schwarz inequality, for any t, j, w, τ = 1, …, T,
()
Hence we only need to give a bound for . A simple calculation shows
()
By the same method, we know that
()

Applying the conclusion of [9] together with the above bound, we can see that

()
Hence
()

This yields

()
then
()
This together with (49) yields the conclusion.

Now we are in a position to give the proofs of Theorems 5 and 7.

Proof of Theorem 5. Theorem 11 ensures that

()

For 1/2 < r ≤ 3/2, Proposition 10 tells us that

()
and Proposition 8 shows that
()
since
()

Combining all the bounds and noting that λ = T^{−θ} with 0 < θ < 1/3, we obtain the conclusion of Theorem 5 by taking .

Proof of Theorem 7. When the samples are drawn i.i.d. from the measure ρ, all the marginal distributions ρ_X^{(t)} coincide with ρ_X, so f̃_λ = f_λ and the measure error vanishes. Hence

()
Let λ = T^{−θ}; then
()

The conclusion follows by distinguishing the cases 0 < r ≤ 1 and r > 1.

Acknowledgments

The author would like to thank Professor Hongwei Sun for useful discussions which have helped to improve the presentation of the paper. The work described in this paper is supported partially by National Natural Science Foundation of China (Grant no. 11001247) and Doctor Grants of Guangdong University of Business Studies (Grant no. 11BS11001).
