ERM Scheme for Quantile Regression
Abstract
This paper considers the ERM scheme for quantile regression. We conduct an error analysis for this learning algorithm by means of a variance-expectation bound which holds when a noise condition is satisfied by the underlying probability measure. Learning rates are derived by applying concentration techniques involving ℓ2-empirical covering numbers.
1. Introduction
The ERM scheme (3) is very different from the kernel-based regularized scheme (4). The output function fz produced by the ERM scheme has a uniform bound: under our assumption, ∥fz∥C(X) ≤ 1. However, we cannot expect this for fz,λ. Indeed, taking f = 0 as a competitor in (4) yields only a λ-dependent bound of order λ^{−1/2} on ∥fz,λ∥K, and it often happens that ∥fz,λ∥K → ∞ as λ → 0. The lack of a uniform bound for fz,λ has a serious negative impact on the learning rates. For this reason, in the literature on kernel-based regularized schemes for quantile regression, the values of the output function fz,λ are always projected onto the interval [−1,1], and the error analysis is conducted for the projected function rather than for fz,λ itself.
In this paper, we aim at establishing convergence and learning rates for the error ∥fz − fτ,ρ∥ measured in the space L^r_{ρX}. Here r > 0 depends on the pair (ρ, τ) and will be specified in Section 2, and ρX is the marginal distribution of ρ on X. In the rest of this paper, we assume Y = [−1,1], which in turn implies that the values of the target function fτ,ρ lie in the same interval.
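To make the scheme concrete, the following minimal Python sketch (purely illustrative and not part of the analysis, with our own variable names) performs empirical risk minimization of the pinball loss Lτ(u) = τu for u ≥ 0 and (τ − 1)u for u < 0 over a toy hypothesis class of clipped linear functions, so that the output automatically satisfies ∥f∥C(X) ≤ 1.

import numpy as np

def pinball_loss(u, tau):
    # L_tau(u) = tau * u if u >= 0 and (tau - 1) * u if u < 0
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def erm_quantile(x, y, tau, grid_size=101):
    # ERM over a toy hypothesis class of clipped linear functions
    # f_{a,b}(x) = clip(a * x + b, -1, 1), so that ||f||_{C(X)} <= 1.
    a_grid = np.linspace(-2.0, 2.0, grid_size)
    b_grid = np.linspace(-1.0, 1.0, grid_size)
    best, best_risk = None, np.inf
    for a in a_grid:
        for b in b_grid:
            f_vals = np.clip(a * x + b, -1.0, 1.0)
            risk = np.mean(pinball_loss(y - f_vals, tau))  # empirical risk
            if risk < best_risk:
                best, best_risk = (a, b), risk
    return best, best_risk

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = np.clip(0.5 * x + rng.uniform(-0.5, 0.5, 500), -1, 1)  # median of y given x is about 0.5 * x
print(erm_quantile(x, y, tau=0.5))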
2. Noise Condition and Main Results
There is a large literature in learning theory (see [6] and the references therein) devoted to least squares regression, which aims at learning the regression function fρ(x) = ∫_Y y dρ(y∣x). The identity for the generalization error ℰls(f) = ∫_Z (y − f(x))² dρ leads to a variance-expectation bound of the form 𝔼ξ² ≤ 4𝔼ξ, where ξ = (y − f(x))² − (y − fρ(x))² on (Z, ρ). This bound plays an essential role in the error analysis of kernel-based regularized schemes.
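For the reader's convenience, the standard computation behind such a bound is the following (with a generic constant c₀ in place of the particular value above): 𝔼ξ = ℰls(f) − ℰls(fρ) = ∫_X (f(x) − fρ(x))² dρX, while pointwise ξ = (fρ(x) − f(x))(2y − f(x) − fρ(x)), so that ξ² ≤ c₀ (f(x) − fρ(x))² with c₀ := sup_{(x,y)∈Z} (2y − f(x) − fρ(x))²; taking expectations gives 𝔼ξ² ≤ c₀ 𝔼ξ.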
However, this identity and the corresponding variance-expectation bound fail in the setting of quantile regression, because the pinball loss lacks strong convexity. If we impose a noise condition on the distribution ρ, called a τ-quantile of p-average type q (see Definition 1), we can recover a similar identity, which in turn yields a variance-expectation bound, stated below and proved by Steinwart and Christmann [1].
Definition 1. Let p ∈ (0, ∞] and q ∈ [1, ∞). A distribution ρ on X × [−1,1] is said to have a τ-quantile of p-average type q if for ρX-almost every x ∈ X, there exist a τ-quantile t* ∈ ℝ and constants αρ(·∣x) ∈ (0,2], bρ(·∣x) > 0 such that for each s ∈ [0, αρ(·∣x)],

ρ({y : t* − s < y < t*} ∣ x) ≥ bρ(·∣x) s^{q−1}  and  ρ({y : t* < y < t* + s} ∣ x) ≥ bρ(·∣x) s^{q−1},

and, in addition, the function γ on X defined by γ(x) = bρ(·∣x) (αρ(·∣x))^{q−1} satisfies γ^{−1} ∈ L^p_{ρX}.
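As a simple illustration (our example, not from [1]): suppose that for ρX-almost every x the conditional distribution ρ(·∣x) has a density h(·∣x) satisfying h(y∣x) ≥ c > 0 for all y in an interval (t* − α, t* + α) ⊂ [−1,1] around the τ-quantile t*, with c > 0 and α ∈ (0,2] independent of x. Then both inequalities above hold with q = 2, bρ(·∣x) = c, and αρ(·∣x) = α, the function γ ≡ cα is bounded away from zero, and ρ has a τ-quantile of p-average type 2 for every p ∈ (0, ∞].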
We also need to quantify the capacity of the hypothesis space ℋ for our learning rates. In this paper, we measure the capacity by empirical covering numbers.
Definition 2. Let (𝔐, d) be a pseudometric space and S a subset of 𝔐. For every ε > 0, the covering number 𝒩(S, ε, d) of S with respect to ε and d is defined as the minimal number of balls of radius ε whose union covers S, that is,

𝒩(S, ε, d) = min{ l ∈ ℕ : S ⊂ ⋃_{j=1}^{l} B(sⱼ, ε) for some {sⱼ}_{j=1}^{l} ⊂ 𝔐 },

where B(sⱼ, ε) = {s ∈ 𝔐 : d(s, sⱼ) ≤ ε}.
Definition 3. Let ℱ be a set of functions on X, k ∈ ℕ, and x = (xᵢ)_{i=1}^{k} ∈ X^k. Write ℱ∣x = {(f(x₁), …, f(x_k)) : f ∈ ℱ} ⊂ ℝ^k, equipped with the normalized ℓ2-metric d₂(a, b) = ((1/k) ∑_{i=1}^{k} |aᵢ − bᵢ|²)^{1/2}. Set 𝒩2,x(ℱ, ε) = 𝒩(ℱ∣x, ε, d₂). The ℓ2-empirical covering number of ℱ is defined by

𝒩2(ℱ, ε) = sup_{k ∈ ℕ} sup_{x ∈ X^k} 𝒩2,x(ℱ, ε),  ε > 0.
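As a computational illustration (ours, not part of the analysis), the following Python sketch upper-bounds 𝒩2,x(ℱ, ε) for a fixed sample x by greedily selecting centers from ℱ∣x; the function class and the sample are illustrative.

import numpy as np

def d2(a, b):
    # normalized l2 metric on R^k used for F|x
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_covering_number(points, eps):
    # Greedily pick centers among the rows of `points`; every discarded row is
    # within eps of some chosen center, so the count is an upper bound on the
    # covering number of this finite set with respect to d2.
    remaining = list(range(len(points)))
    centers = 0
    while remaining:
        c = remaining[0]
        centers += 1
        remaining = [i for i in remaining if d2(points[i], points[c]) > eps]
    return centers

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)                       # sample x = (x_1, ..., x_k), k = 50
ts = np.linspace(-1, 1, 400)                     # toy class {f_t(u) = clip(t - |u|, -1, 1)}
F_on_x = np.array([np.clip(t - np.abs(x), -1, 1) for t in ts])
print(greedy_covering_number(F_on_x, eps=0.1))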
Theorem 4. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Denote r = pq/(p + 1). One further assumes that fτ,ρ is uniquely defined. If fτ,ρ ∈ ℋ and ℋ satisfies (9) with ι ∈ (0,2), then for any 0 < δ < 1, with confidence 1 − δ, one has
Remark 5. In the ERM scheme, we can choose a hypothesis space ℋ containing fτ,ρ, which in turn makes the approximation error described by (23) equal to zero. However, this is impossible for the kernel-based regularized scheme because of the appearance of the penalty term λ∥f∥K².
If q ≤ 2, all conditional distributions behave similarly to the uniform distribution around the quantile. In this case r = pq/(p + 1) ≤ 2 and θ = min{2/q, p/(p + 1)} = p/(p + 1) for all p > 0. Hence ϑ = 2(p + 1)/((2 + ι)pq + 4q). Furthermore, when p is large enough, the parameter r tends to q and the power index ϑ of the above learning rate approaches 2/((2 + ι)q) arbitrarily closely, which shows that the learning rate power index for the excess generalization error ℰτ(fz) − ℰτ(fτ,ρ) is arbitrarily close to 2/(2 + ι), independent of q. In particular, ι can be arbitrarily small when ℋ is smooth enough. In this case the power index 2/(2 + ι) of the learning rate can be arbitrarily close to 1, which is the optimal learning rate for least squares regression.
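Concretely, the limit can be read off from the formula for ϑ: ϑ = 2(p + 1)/((2 + ι)pq + 4q) = 2(1 + 1/p)/((2 + ι)q + 4q/p) → 2/((2 + ι)q) as p → ∞. Since the comparison result of [1] recalled in Proposition 8 below bounds the q-th power of the L^r_{ρX}-error by the excess generalization error, a power index of 2/(2 + ι) for the excess error corresponds to the power index 2/((2 + ι)q) for the L^r_{ρX}-error itself.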
Let us give some examples to illustrate the above main result.
Example 6. Let ℋ be the unit ball of the Sobolev space H^s with s > 0 on the input space X ⊂ ℝⁿ. Observe that the ℓ2-empirical covering number is bounded above by the uniform covering number, that is, the covering number of Definition 2 taken with respect to the metric of C(X). Hence we have (see [6, 7])

log 𝒩2(ℋ, ε) ≤ c_s (1/ε)^{n/s},  ε > 0,

for some constant c_s independent of ε.
Under the same assumptions as in Theorem 4, replacing ι by n/s, we get that for any 0 < δ < 1, with confidence 1 − δ,
Carrying out the same discussion as in Remark 5 for the case q ≤ 2 and p large enough, the power index of the learning rate for the excess generalization error ℰτ(fz) − ℰτ(fτ,ρ) is arbitrarily close to 2/(2 + n/s), independent of q. Furthermore, n/s becomes arbitrarily small when s is large, that is, when the Sobolev space is smooth enough. In this special case the learning rate power index approaches 1 arbitrarily closely.
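For instance, with purely illustrative numbers: if X ⊂ ℝ³ (so n = 3) and s = 6, then n/s = 1/2 and the power index is 2/(2 + 1/2) = 0.8; taking s = 30 gives 2/(2 + 0.1) ≈ 0.952, and the index approaches 1 as s/n grows.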
Example 7. Let ℋ be the unit ball of the reproducing kernel Hilbert space ℋσ generated by a Gaussian kernel (see [5]). Reference [7] tells us that
So from Theorem 4, we can get different learning rates with power index
3. Error Analysis
We need the following results from [1] for our error analysis.
Proposition 8. Let Lτ be the pinball loss. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Then for all f : X → [−1,1] one has

∥f − fτ,ρ∥^q_{L^r_{ρX}} ≤ cρ (ℰτ(f) − ℰτ(fτ,ρ)),

where r = pq/(p + 1) and cρ is a positive constant depending only on p, q, and ρ.
The above result implies that we can derive convergence rates for fz in the space L^r_{ρX} by bounding the excess generalization error ℰτ(fz) − ℰτ(fτ,ρ).
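As a simple illustration of how this is used (with m denoting the sample size): if with confidence 1 − δ one has ℰτ(fz) − ℰτ(fτ,ρ) ≤ C m^{−ζ} for some ζ > 0, then Proposition 8 in the form stated above gives, with the same confidence, ∥fz − fτ,ρ∥_{L^r_{ρX}} ≤ (cρ C)^{1/q} m^{−ζ/q}.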
To bound ℰτ(fz) − ℰτ(fτ,ρ), we need a standard error decomposition procedure [6] and a concentration inequality.
3.1. Error Decomposition
Proposition 9. Let fℋ ∈ ℋ and let fz be produced by the ERM scheme (3). Then

ℰτ(fz) − ℰτ(fτ,ρ) ≤ {[ℰτ(fz) − ℰτ(fτ,ρ)] − [ℰz,τ(fz) − ℰz,τ(fτ,ρ)]} + {[ℰz,τ(fℋ) − ℰz,τ(fτ,ρ)] − [ℰτ(fℋ) − ℰτ(fτ,ρ)]} + {ℰτ(fℋ) − ℰτ(fτ,ρ)},

where ℰz,τ(f) = (1/m) ∑_{i=1}^{m} Lτ(yᵢ − f(xᵢ)) denotes the empirical risk of f over the sample z = {(xᵢ, yᵢ)}_{i=1}^{m}.

Proof. The excess generalization error can be written as

ℰτ(fz) − ℰτ(fτ,ρ) = {[ℰτ(fz) − ℰτ(fτ,ρ)] − [ℰz,τ(fz) − ℰz,τ(fτ,ρ)]} + {ℰz,τ(fz) − ℰz,τ(fℋ)} + {[ℰz,τ(fℋ) − ℰz,τ(fτ,ρ)] − [ℰτ(fℋ) − ℰτ(fτ,ρ)]} + {ℰτ(fℋ) − ℰτ(fτ,ρ)}.

Since fz minimizes the empirical risk ℰz,τ over ℋ and fℋ ∈ ℋ, the term ℰz,τ(fz) − ℰz,τ(fℋ) is at most zero, which gives the stated bound.
We call the term (23), namely ℰτ(fℋ) − ℰτ(fτ,ρ), the approximation error. It has been studied in [9].
3.2. Concentration Inequality and Sample Error
Let us recall the one-sided Bernstein inequality as follows.
Lemma 10. Let ξ be a random variable on a probability space Z with variance σ² satisfying |ξ − 𝔼(ξ)| ≤ Mξ for some constant Mξ. Then for any 0 < δ < 1, with confidence 1 − δ, one has

(1/m) ∑_{i=1}^{m} ξ(zᵢ) − 𝔼(ξ) ≤ (2Mξ log(1/δ))/(3m) + √((2σ² log(1/δ))/m),

where z₁, …, z_m are drawn independently according to the distribution on Z.
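As a quick numerical sanity check (ours, purely illustrative), the following Python sketch verifies empirically that the deviation bound recalled above is exceeded with probability well below δ for a bounded random variable.

import numpy as np

# Check the one-sided Bernstein bound above for xi ~ Uniform[-1, 1]:
# E(xi) = 0, variance sigma^2 = 1/3, and |xi - E(xi)| <= M = 1.
rng = np.random.default_rng(0)
m, delta, trials = 200, 0.05, 20000
M, sigma2 = 1.0, 1.0 / 3.0
bound = 2 * M * np.log(1 / delta) / (3 * m) + np.sqrt(2 * sigma2 * np.log(1 / delta) / m)
samples = rng.uniform(-1, 1, size=(trials, m))
deviations = samples.mean(axis=1)          # empirical mean minus true mean (which is 0)
print("bound:", bound)
print("empirical exceedance probability:", np.mean(deviations > bound), "<= delta =", delta)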
Proposition 11. Let fℋ ∈ ℋ. Assume that ρ on X × [−1,1] satisfies the variance bound (20) with index θ indicated in (19). For any 0 < δ < 1, with confidence 1 − δ/2, the second term on the right-hand side of the bound in Proposition 9 can be bounded as
Proof. Let ξ(z) = Lτ(y − fℋ(x)) − Lτ(y − fτ,ρ(x)), which satisfies |ξ| ≤ 2; since 𝔼(ξ) = ℰτ(fℋ) − ℰτ(fτ,ρ) ≥ 0, this also gives ξ − 𝔼(ξ) ≤ 2. The variance bound (20) implies that σ² ≤ 𝔼(ξ²) ≤ Cθ (𝔼(ξ))^θ.
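A standard way to conclude is the following sketch of the remaining steps: applying Lemma 10 (in its one-sided form, which only requires ξ − 𝔼(ξ) ≤ Mξ) with Mξ = 2 and σ² ≤ Cθ(𝔼(ξ))^θ at confidence level 1 − δ/2 gives

(1/m) ∑_{i=1}^{m} ξ(zᵢ) − 𝔼(ξ) ≤ (4 log(2/δ))/(3m) + √((2Cθ (𝔼(ξ))^θ log(2/δ))/m),

and by Young's inequality ab ≤ (θ/2) a^{2/θ} + ((2 − θ)/2) b^{2/(2−θ)}, applied with a = (𝔼(ξ))^{θ/2} and b = (2Cθ log(2/δ)/m)^{1/2}, the last term is at most (θ/2)𝔼(ξ) + ((2 − θ)/2)(2Cθ log(2/δ)/m)^{1/(2−θ)}. Since θ ≤ 1, this yields a bound of the form (1/2)𝔼(ξ) + C′((log(2/δ)/m)^{1/(2−θ)} + log(2/δ)/m) with C′ depending only on Cθ and θ.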
Let us turn to estimating the first term on the right-hand side of the bound in Proposition 9, the sample error involving the function fz. Since z is itself a random sample, fz runs over a set of functions, so we estimate this term by a concentration inequality involving empirical covering numbers [10–12].
Lemma 12. Let ℱ be a class of measurable functions on Z. Assume that there are constants B, c > 0 and α ∈ [0,1] such that ∥f∥∞ ≤ B and 𝔼f² ≤ c(𝔼f)^α for every f ∈ ℱ. If (7) holds, then there exists a constant depending only on ι such that for any t > 0, with probability at least 1 − e^{−t}, there holds
Proposition 13. Assume ρ on X × [−1,1] satisfies the variance bound (20) with index θ indicated in (19). If ℋ satisfies (9) with ι ∈ (0,2), then for any 0 < δ < 1, with confidence 1 − δ/2, one has
Proof. Take g ∈ ℱ of the form g(z) = Lτ(y − f(x)) − Lτ(y − fτ,ρ(x)) with f ∈ ℋ. Hence 𝔼g = ℰτ(f) − ℰτ(fτ,ρ) and (1/m) ∑_{i=1}^{m} g(zᵢ) = ℰz,τ(f) − ℰz,τ(fτ,ρ).
The Lipschitz property of the pinball loss Lτ (its Lipschitz constant is max{τ, 1 − τ} ≤ 1) implies that |g₁(z) − g₂(z)| = |Lτ(y − f₁(x)) − Lτ(y − f₂(x))| ≤ |f₁(x) − f₂(x)| for all z = (x, y) ∈ Z, and hence 𝒩2(ℱ, ε) ≤ 𝒩2(ℋ, ε) for every ε > 0.
Applying Lemma 12 with B = 2, α = θ, and c = Cθ, we know that for any 0 < δ < 1, with confidence 1 − δ/2, there holds
Proposition 14. Assume ρ on X × [−1,1] satisfies the variance bound (20) with index θ indicated in (19). If ℋ satisfies (9) with ι ∈ (0,2), then for any 0 < δ < 1, with confidence 1 − δ, there holds
The above bound follows directly from Propositions 11 and 13, together with the fact that fz ∈ ℋ, by combining the two events, each of confidence 1 − δ/2.
3.3. Bounding the Total Error
Now we are in a position to present our general result on error analysis for algorithm (3).
Theorem 15. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Denote r = pq/(p + 1). Further assume that ℋ satisfies (9) with ι ∈ (0,2) and that fτ,ρ is uniquely defined. Then for any 0 < δ < 1, with confidence 1 − δ, one has
Proof of Theorem 4. The assumption fτ,ρ ∈ ℋ implies that the approximation error (23) vanishes, since we may take fℋ = fτ,ρ.
Therefore, our desired result comes directly from Theorem 15.
4. Further Discussions
In this paper, we studied the ERM algorithm (3) for quantile regression and provided convergence results and learning rates. We showed some essential differences between the ERM scheme and the kernel-based regularized scheme for quantile regression. We also pointed out the main difficulty in dealing with quantile regression: the lack of strong convexity of the pinball loss. To overcome this difficulty, a noise condition on ρ is imposed which enables us to obtain a variance-expectation bound similar to the one for least squares regression.
In our analysis we only consider f ∈ ℋ with ∥f∥C(X) ≤ 1. The case ∥f∥C(X) ≤ R for some R ≥ 1 would be interesting for future work; the approximation error involving R could be estimated using knowledge of interpolation spaces.
In our setting, the sample is drawn independently from the distribution ρ. However, in many practical problems the i.i.d. condition is rather restrictive, so it would be interesting to investigate the ERM scheme for quantile regression with nonidentical distributions [13, 14] or with dependent sampling [15].
Acknowledgment
The work described in this paper is supported by the NSF of China under Grants 11001247 and 61170109.