ERM Scheme for Quantile Regression
Abstract
This paper considers the ERM scheme for quantile regression. We conduct an error analysis for this learning algorithm by means of a variance-expectation bound which holds when a noise condition is satisfied by the underlying probability measure. Learning rates are derived by applying concentration techniques involving ℓ2-empirical covering numbers.
1. Introduction
The ERM scheme (3) is very different from the kernel-based regularized scheme (4). The output function fz produced by the ERM scheme has a uniform bound: under our assumption, ∥fz∥C(X) ≤ 1. However, we cannot expect this for fz,λ. Indeed, taking f = 0 as a competitor in (4) yields only a λ-dependent bound of order λ^{−1/2} on ∥fz,λ∥K, and it often happens that ∥fz,λ∥K → ∞ as λ → 0. The lack of a uniform bound for fz,λ has a serious negative impact on the learning rates. For this reason, in the literature on kernel-based regularized schemes for quantile regression, the values of the output function fz,λ are always projected onto the interval [−1,1], and the error analysis is conducted for the projected function rather than for fz,λ itself.
In this paper, we aim at establishing convergence and learning rates for the error ∥fz − fτ,ρ∥ measured in the space L^r_{ρX}. Here r > 0 depends on the pair (ρ, τ) and will be specified in Section 2, and ρX is the marginal distribution of ρ on X. In the rest of this paper, we assume Y = [−1,1], which in turn implies that the values of the target function fτ,ρ lie in the same interval.
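To make the scheme concrete, the following minimal Python sketch (purely illustrative and not part of the analysis, with our own variable names) performs empirical risk minimization of the pinball loss Lτ(u) = τu for u ≥ 0 and (τ − 1)u for u < 0 over a toy hypothesis class of clipped linear functions, so that the output automatically satisfies ∥f∥C(X) ≤ 1.

import numpy as np

def pinball_loss(u, tau):
    # L_tau(u) = tau * u if u >= 0 and (tau - 1) * u if u < 0
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def erm_quantile(x, y, tau, grid_size=101):
    # ERM over a toy hypothesis class of clipped linear functions
    # f_{a,b}(x) = clip(a * x + b, -1, 1), so that ||f||_{C(X)} <= 1.
    a_grid = np.linspace(-2.0, 2.0, grid_size)
    b_grid = np.linspace(-1.0, 1.0, grid_size)
    best, best_risk = None, np.inf
    for a in a_grid:
        for b in b_grid:
            f_vals = np.clip(a * x + b, -1.0, 1.0)
            risk = np.mean(pinball_loss(y - f_vals, tau))  # empirical risk
            if risk < best_risk:
                best, best_risk = (a, b), risk
    return best, best_risk

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = np.clip(0.5 * x + rng.uniform(-0.5, 0.5, 500), -1, 1)  # median of y given x is about 0.5 * x
print(erm_quantile(x, y, tau=0.5))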
2. Noise Condition and Main Results
There is a large literature in learning theory (see [6] and the references therein) devoted to least squares regression, which aims at learning the regression function fρ(x) = ∫_Y y dρ(y∣x). The identity for the generalization error ℰls(f) = ∫_Z (y − f(x))² dρ leads to a variance-expectation bound of the form 𝔼ξ² ≤ 4𝔼ξ, where ξ = (y − f(x))² − (y − fρ(x))² on (Z, ρ). This bound plays an essential role in the error analysis of kernel-based regularized schemes.
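For the reader's convenience, the standard computation behind such a bound is the following (with a generic constant c₀ in place of the particular value above): 𝔼ξ = ℰls(f) − ℰls(fρ) = ∫_X (f(x) − fρ(x))² dρX, while pointwise ξ = (fρ(x) − f(x))(2y − f(x) − fρ(x)), so that ξ² ≤ c₀ (f(x) − fρ(x))² with c₀ := sup_{(x,y)∈Z} (2y − f(x) − fρ(x))²; taking expectations gives 𝔼ξ² ≤ c₀ 𝔼ξ.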
However, this identity and the corresponding variance-expectation bound fail in the setting of quantile regression, because the pinball loss lacks strong convexity. If we impose a noise condition on the distribution ρ, called a τ-quantile of p-average type q (see Definition 1), we can recover a similar identity, which in turn yields a variance-expectation bound, stated below and proved by Steinwart and Christmann [1].
Definition 1. Let p ∈ (0, ∞] and q ∈ [1, ∞). A distribution ρ on X × [−1,1] is said to have a τ-quantile of p-average type q if for ρX-almost every x ∈ X, there exist a τ-quantile t* ∈ ℝ and constants αρ(·∣x) ∈ (0,2], bρ(·∣x) > 0 such that for each s ∈ [0, αρ(·∣x)],

ρ({y : t* − s < y < t*} ∣ x) ≥ bρ(·∣x) s^{q−1}  and  ρ({y : t* < y < t* + s} ∣ x) ≥ bρ(·∣x) s^{q−1},

and, in addition, the function γ on X defined by γ(x) = bρ(·∣x) (αρ(·∣x))^{q−1} satisfies γ^{−1} ∈ L^p_{ρX}.
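As a simple illustration (our example, not from [1]): suppose that for ρX-almost every x the conditional distribution ρ(·∣x) has a density h(·∣x) satisfying h(y∣x) ≥ c > 0 for all y in an interval (t* − α, t* + α) ⊂ [−1,1] around the τ-quantile t*, with c > 0 and α ∈ (0,2] independent of x. Then both inequalities above hold with q = 2, bρ(·∣x) = c, and αρ(·∣x) = α, the function γ ≡ cα is bounded away from zero, and ρ has a τ-quantile of p-average type 2 for every p ∈ (0, ∞].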
We also need to quantify the capacity of the hypothesis space ℋ for our learning rates. In this paper, we measure the capacity by empirical covering numbers.
Definition 2. Let (𝔐, d) be a pseudometric space and S a subset of 𝔐. For every ε > 0, the covering number 𝒩(S, ε, d) of S with respect to ε and d is defined as the minimal number of balls of radius ε whose union covers S, that is,

𝒩(S, ε, d) = min{ l ∈ ℕ : S ⊂ ⋃_{j=1}^{l} B(sⱼ, ε) for some {sⱼ}_{j=1}^{l} ⊂ 𝔐 },

where B(sⱼ, ε) = {s ∈ 𝔐 : d(s, sⱼ) ≤ ε}.
Definition 3. Let ℱ be a set of functions on X, k ∈ ℕ, and x = (xᵢ)_{i=1}^{k} ∈ X^k. Write ℱ∣x = {(f(x₁), …, f(x_k)) : f ∈ ℱ} ⊂ ℝ^k, equipped with the normalized ℓ2-metric d₂(a, b) = ((1/k) ∑_{i=1}^{k} |aᵢ − bᵢ|²)^{1/2}. Set 𝒩2,x(ℱ, ε) = 𝒩(ℱ∣x, ε, d₂). The ℓ2-empirical covering number of ℱ is defined by

𝒩2(ℱ, ε) = sup_{k ∈ ℕ} sup_{x ∈ X^k} 𝒩2,x(ℱ, ε),  ε > 0.
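As a computational illustration (ours, not part of the analysis), the following Python sketch upper-bounds 𝒩2,x(ℱ, ε) for a fixed sample x by greedily selecting centers from ℱ∣x; the function class and the sample are illustrative.

import numpy as np

def d2(a, b):
    # normalized l2 metric on R^k used for F|x
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_covering_number(points, eps):
    # Greedily pick centers among the rows of `points`; every discarded row is
    # within eps of some chosen center, so the count is an upper bound on the
    # covering number of this finite set with respect to d2.
    remaining = list(range(len(points)))
    centers = 0
    while remaining:
        c = remaining[0]
        centers += 1
        remaining = [i for i in remaining if d2(points[i], points[c]) > eps]
    return centers

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)                       # sample x = (x_1, ..., x_k), k = 50
ts = np.linspace(-1, 1, 400)                     # toy class {f_t(u) = clip(t - |u|, -1, 1)}
F_on_x = np.array([np.clip(t - np.abs(x), -1, 1) for t in ts])
print(greedy_covering_number(F_on_x, eps=0.1))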
Theorem 4. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Denote r = pq/(p + 1). One further assumes that fτ,ρ is uniquely defined. If fτ,ρ ∈ ℋ and ℋ satisfies (9) with ι ∈ (0,2), then for any 0 < δ < 1, with confidence 1 − δ, one has
Remark 5. In the ERM scheme, we can choose a hypothesis space ℋ containing fτ,ρ, which in turn makes the approximation error described by (23) equal to zero. However, this is impossible for the kernel-based regularized scheme because of the appearance of the penalty term λ∥f∥K².
If q ≤ 2, all conditional distributions behave similarly to the uniform distribution around the quantile. In this case r = pq/(p + 1) ≤ 2 and θ = min{2/q, p/(p + 1)} = p/(p + 1) for all p > 0. Hence ϑ = 2(p + 1)/((2 + ι)pq + 4q). Furthermore, when p is large enough, the parameter r tends to q and the power index ϑ of the above learning rate approaches 2/((2 + ι)q) arbitrarily closely, which shows that the learning rate power index for the excess generalization error ℰτ(fz) − ℰτ(fτ,ρ) is arbitrarily close to 2/(2 + ι), independent of q. In particular, ι can be arbitrarily small when ℋ is smooth enough. In this case the power index 2/(2 + ι) of the learning rate can be arbitrarily close to 1, which is the optimal learning rate for least squares regression.
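Concretely, the limit can be read off from the formula for ϑ: ϑ = 2(p + 1)/((2 + ι)pq + 4q) = 2(1 + 1/p)/((2 + ι)q + 4q/p) → 2/((2 + ι)q) as p → ∞. Since the comparison result of [1] recalled in Proposition 8 below bounds the q-th power of the L^r_{ρX}-error by the excess generalization error, a power index of 2/(2 + ι) for the excess error corresponds to the power index 2/((2 + ι)q) for the L^r_{ρX}-error itself.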
Let us give some examples to illustrate the above main result.
Example 6. Let ℋ be the unit ball of the Sobolev space H^s with s > 0 on the input space X ⊂ ℝⁿ. Observe that the ℓ2-empirical covering number is bounded above by the uniform covering number, that is, the covering number of Definition 2 taken with respect to the metric of C(X). Hence we have (see [6, 7])

log 𝒩2(ℋ, ε) ≤ c_s (1/ε)^{n/s},  ε > 0,

for some constant c_s independent of ε.
Under the same assumptions as in Theorem 4, replacing ι by n/s, we get that for any 0 < δ < 1, with confidence 1 − δ,
Carrying out the same discussion as in Remark 5 for the case q ≤ 2 and p large enough, the power index of the learning rate for the excess generalization error ℰτ(fz) − ℰτ(fτ,ρ) is arbitrarily close to 2/(2 + n/s), independent of q. Furthermore, n/s becomes arbitrarily small when s is large, that is, when the Sobolev space is smooth enough. In this special case the learning rate power index approaches 1 arbitrarily closely.
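For instance, with purely illustrative numbers: if X ⊂ ℝ³ (so n = 3) and s = 6, then n/s = 1/2 and the power index is 2/(2 + 1/2) = 0.8; taking s = 30 gives 2/(2 + 0.1) ≈ 0.952, and the index approaches 1 as s/n grows.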
Example 7. Let ℋ be the unit ball of the reproducing kernel Hilbert space ℋσ generated by a Gaussian kernel (see [5]). Reference [7] tells us that
So from Theorem 4, we can get different learning rates with power index
3. Error Analysis
We need the following results from [1] for our error analysis.
Proposition 8. Let Lτ be the pinball loss. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Then for all f : X → [−1,1] one has

∥f − fτ,ρ∥^q_{L^r_{ρX}} ≤ cρ (ℰτ(f) − ℰτ(fτ,ρ)),

where r = pq/(p + 1) and cρ is a positive constant depending only on p, q, and ρ.
The above result implies that we can derive convergence rates for fz in the space L^r_{ρX} by bounding the excess generalization error ℰτ(fz) − ℰτ(fτ,ρ).
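As a simple illustration of how this is used (with m denoting the sample size): if with confidence 1 − δ one has ℰτ(fz) − ℰτ(fτ,ρ) ≤ C m^{−ζ} for some ζ > 0, then Proposition 8 in the form stated above gives, with the same confidence, ∥fz − fτ,ρ∥_{L^r_{ρX}} ≤ (cρ C)^{1/q} m^{−ζ/q}.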
To bound ℰτ(fz) − ℰτ(fτ,ρ), we need a standard error decomposition procedure [6] and a concentration inequality.
3.1. Error Decomposition
Proposition 9. Let fℋ ∈ ℋ and let fz be produced by the ERM scheme (3). Then

ℰτ(fz) − ℰτ(fτ,ρ) ≤ {[ℰτ(fz) − ℰτ(fτ,ρ)] − [ℰz,τ(fz) − ℰz,τ(fτ,ρ)]} + {[ℰz,τ(fℋ) − ℰz,τ(fτ,ρ)] − [ℰτ(fℋ) − ℰτ(fτ,ρ)]} + {ℰτ(fℋ) − ℰτ(fτ,ρ)},

where ℰz,τ(f) = (1/m) ∑_{i=1}^{m} Lτ(yᵢ − f(xᵢ)) denotes the empirical risk of f over the sample z = {(xᵢ, yᵢ)}_{i=1}^{m}.

Proof. The excess generalization error can be written as

ℰτ(fz) − ℰτ(fτ,ρ) = {[ℰτ(fz) − ℰτ(fτ,ρ)] − [ℰz,τ(fz) − ℰz,τ(fτ,ρ)]} + {ℰz,τ(fz) − ℰz,τ(fℋ)} + {[ℰz,τ(fℋ) − ℰz,τ(fτ,ρ)] − [ℰτ(fℋ) − ℰτ(fτ,ρ)]} + {ℰτ(fℋ) − ℰτ(fτ,ρ)}.

Since fz minimizes the empirical risk ℰz,τ over ℋ and fℋ ∈ ℋ, the term ℰz,τ(fz) − ℰz,τ(fℋ) is at most zero, which gives the stated bound.
We call the term (23), namely ℰτ(fℋ) − ℰτ(fτ,ρ), the approximation error. It has been studied in [9].
3.2. Concentration Inequality and Sample Error
Let us recall the one-sided Bernstein inequality as follows.
Lemma 10. Let ξ be a random variable on a probability space Z with variance σ² satisfying |ξ − 𝔼(ξ)| ≤ Mξ for some constant Mξ. Then for any 0 < δ < 1, with confidence 1 − δ, one has

(1/m) ∑_{i=1}^{m} ξ(zᵢ) − 𝔼(ξ) ≤ (2Mξ log(1/δ))/(3m) + √((2σ² log(1/δ))/m),

where z₁, …, z_m are drawn independently according to the distribution on Z.
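As a quick numerical sanity check (ours, purely illustrative), the following Python sketch verifies empirically that the deviation bound recalled above is exceeded with probability well below δ for a bounded random variable.

import numpy as np

# Check the one-sided Bernstein bound above for xi ~ Uniform[-1, 1]:
# E(xi) = 0, variance sigma^2 = 1/3, and |xi - E(xi)| <= M = 1.
rng = np.random.default_rng(0)
m, delta, trials = 200, 0.05, 20000
M, sigma2 = 1.0, 1.0 / 3.0
bound = 2 * M * np.log(1 / delta) / (3 * m) + np.sqrt(2 * sigma2 * np.log(1 / delta) / m)
samples = rng.uniform(-1, 1, size=(trials, m))
deviations = samples.mean(axis=1)          # empirical mean minus true mean (which is 0)
print("bound:", bound)
print("empirical exceedance probability:", np.mean(deviations > bound), "<= delta =", delta)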
Proposition 11. Let fℋ ∈ ℋ. Assume that ρ on X × [−1,1] satisfies the variance bound (20) with index θ indicated in (19). For any 0 < δ < 1, with confidence 1 − δ/2, the second term on the right-hand side of the bound in Proposition 9 can be bounded as
Proof. Let ξ(z) = Lτ(y − fℋ(x)) − Lτ(y − fτ,ρ(x)), which satisfies |ξ| ≤ 2; since 𝔼(ξ) = ℰτ(fℋ) − ℰτ(fτ,ρ) ≥ 0, this also gives ξ − 𝔼(ξ) ≤ 2. The variance bound (20) implies that σ² ≤ 𝔼(ξ²) ≤ Cθ (𝔼(ξ))^θ.
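A standard way to conclude is the following sketch of the remaining steps: applying Lemma 10 (in its one-sided form, which only requires ξ − 𝔼(ξ) ≤ Mξ) with Mξ = 2 and σ² ≤ Cθ(𝔼(ξ))^θ at confidence level 1 − δ/2 gives

(1/m) ∑_{i=1}^{m} ξ(zᵢ) − 𝔼(ξ) ≤ (4 log(2/δ))/(3m) + √((2Cθ (𝔼(ξ))^θ log(2/δ))/m),

and by Young's inequality ab ≤ (θ/2) a^{2/θ} + ((2 − θ)/2) b^{2/(2−θ)}, applied with a = (𝔼(ξ))^{θ/2} and b = (2Cθ log(2/δ)/m)^{1/2}, the last term is at most (θ/2)𝔼(ξ) + ((2 − θ)/2)(2Cθ log(2/δ)/m)^{1/(2−θ)}. Since θ ≤ 1, this yields a bound of the form (1/2)𝔼(ξ) + C′((log(2/δ)/m)^{1/(2−θ)} + log(2/δ)/m) with C′ depending only on Cθ and θ.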
Let us turn to estimating the first term on the right-hand side of the bound in Proposition 9, the sample error involving the function fz. Since z is itself a random sample, fz runs over a set of functions, so we estimate this term by a concentration inequality involving empirical covering numbers [10–12].
Lemma 12. Let ℱ be a class of measurable functions on Z. Assume that there are constants B, c > 0 and α ∈ [0,1] such that ∥f∥∞ ≤ B and 𝔼f² ≤ c(𝔼f)^α for every f ∈ ℱ. If (7) holds, then there exists a constant depending only on ι such that for any t > 0, with probability at least 1 − e^{−t}, there holds
Proposition 13. Assume ρ on X × [−1,1] satisfies the variance bound (20) with index θ indicated in (19). If ℋ satisfies (9) with ι ∈ (0,2), then for any 0 < δ < 1, with confidence 1 − δ/2, one has
Proof. Take g ∈ ℱ of the form g(z) = Lτ(y − f(x)) − Lτ(y − fτ,ρ(x)) with f ∈ ℋ. Hence 𝔼g = ℰτ(f) − ℰτ(fτ,ρ) and (1/m) ∑_{i=1}^{m} g(zᵢ) = ℰz,τ(f) − ℰz,τ(fτ,ρ).
The Lipschitz property of the pinball loss Lτ (its Lipschitz constant is max{τ, 1 − τ} ≤ 1) implies that |g₁(z) − g₂(z)| = |Lτ(y − f₁(x)) − Lτ(y − f₂(x))| ≤ |f₁(x) − f₂(x)| for all z = (x, y) ∈ Z, and hence 𝒩2(ℱ, ε) ≤ 𝒩2(ℋ, ε) for every ε > 0.
Applying Lemma 12 with B = 2, α = θ, and c = Cθ, we know that for any 0 < δ < 1, with confidence 1 − δ/2, there holds
Proposition 14. Assume ρ on X × [−1,1] satisfies the variance bound (20) with index θ indicated in (19). If ℋ satisfies (9) with ι ∈ (0,2), then for any 0 < δ < 1, with confidence 1 − δ, there holds
The above bound follows directly from Propositions 11 and 13, together with the fact that fz ∈ ℋ, by combining the two events, each of confidence 1 − δ/2.
3.3. Bounding the Total Error
Now we are in a position to present our general result on error analysis for algorithm (3).
Theorem 15. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Denote r = pq/(p + 1). Further assume that ℋ satisfies (9) with ι ∈ (0,2) and that fτ,ρ is uniquely defined. Then for any 0 < δ < 1, with confidence 1 − δ, one has
Proof of Theorem 4. The assumption fτ,ρ ∈ ℋ implies that the approximation error (23) vanishes, since we may take fℋ = fτ,ρ.
Therefore, our desired result comes directly from Theorem 15.
4. Further Discussions
In this paper, we studied the ERM algorithm (3) for quantile regression and provided convergence results and learning rates. We showed some essential differences between the ERM scheme and the kernel-based regularized scheme for quantile regression. We also pointed out the main difficulty in dealing with quantile regression: the lack of strong convexity of the pinball loss. To overcome this difficulty, a noise condition on ρ is imposed which enables us to obtain a variance-expectation bound similar to the one for least squares regression.
In our analysis we only consider f ∈ ℋ with ∥f∥C(X) ≤ 1. The case ∥f∥C(X) ≤ R for some R ≥ 1 would be interesting for future work; the approximation error involving R could be estimated using knowledge of interpolation spaces.
In our setting, the sample is drawn independently from the distribution ρ. However, in many practical problems the i.i.d. condition is rather restrictive, so it would be interesting to investigate the ERM scheme for quantile regression with nonidentical distributions [13, 14] or with dependent sampling [15].
Acknowledgment
The work described in this paper is supported by the NSF of China under Grants 11001247 and 61170109.