Volume 2013, Issue 1, Article ID 715275
Research Article
Open Access

Least Square Regularized Regression for Multitask Learning

Yong-Li Xu (Corresponding Author)
Department of Mathematics, Beijing University of Chemical Technology, Beijing 100029, China

Di-Rong Chen
Department of Mathematics, Beijing University of Aeronautics and Astronautics, Beijing 100091, China

Han-Xiong Li
Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong
First published: 21 December 2013
Academic Editor: Yiming Ying

Abstract

The study of multitask learning algorithms is an important issue in learning theory. This paper proposes a least-square regularized regression algorithm for multitask learning whose hypothesis space is the union of a sequence of Hilbert spaces. The algorithm consists of two steps: selecting the optimal Hilbert space and searching for the optimal function. We assume that the distributions of the different tasks are related by a set of transformations under which every Hilbert space in the hypothesis space is norm invariant. We prove that, under this assumption, the optimal prediction functions of all tasks lie in the same Hilbert space. Based on this result, a pivotal error decomposition is established, which uses samples of the related tasks to bound the excess error of the target task. We obtain an upper bound for the sample error of the related tasks and, based on this bound, derive potentially faster learning rates than those of single-task learning algorithms.

1. Introduction

Multitask learning [1] is a learning paradigm which seeks to improve the generalization performance of a learning task with the help of some other related tasks. This learning paradigm is inspired by human learning activities in that people often apply the knowledge gained from previous learning tasks to help learn a new task [2]. Different multitask learning algorithms have been designed, such as multitask support vector machine (SVM) [3], multitask feature learning [4, 5], multitask clustering approach [6], multitask structure learning [7], and multitask gradients learning [8].

Multitask learning can be formulated under two different settings: symmetric and asymmetric [9]. Symmetric multitask learning tries to improve the performance of all tasks simultaneously, while asymmetric multitask learning tries to improve the performance of a target task using information from related tasks. Asymmetric multitask learning is related to transfer learning [10]; the major difference is that the source tasks are still learned simultaneously in asymmetric multitask learning, whereas they are learned independently in transfer learning [2]. Much experimental work has achieved the goal of improving the prediction performance of a learning task with the help of other related tasks [1, 11–15]. However, there has been relatively little progress on the theoretical analysis of these results.

Baxter presented a general framework for model selection in a multitask learning environment [16]. He showed that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment, and he proved that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task. Ando and Zhang considered learning predictive structures on hypothesis spaces from multiple tasks [17]. They presented a general framework in which the structural learning problem can be formulated and analyzed theoretically and related it to learning with unlabeled data. Ben-David and Borbely defined relatedness of tasks on the basis of similarity between the example-generating distributions for classification [18] and gave precise conditions under which their bounds guarantee generalization from smaller sample sizes than the standard single-task approach. Solnon et al. studied multitask regression using penalization techniques [19]. They showed that the key element for an optimal calibration is the covariance matrix of the noise between the different tasks; they presented a new algorithm to estimate this covariance matrix and proved that the estimator converges toward the true covariance matrix.

In this paper, we propose a least-square regularized regression algorithm for multitask learning whose hypothesis space is the union of a sequence of Hilbert spaces. The relatedness of tasks is described through the distributions that underlie these tasks and a property of the hypothesis space. We assume that the distributions are related by a set of transformations under which the norm of every Hilbert space in the hypothesis space is invariant. We design a multitask learning algorithm with two steps: first, samples of the other related tasks are used to select an approximately optimal Hilbert space in the hypothesis space; second, in the selected Hilbert space, we run the standard least-square regularized regression algorithm for the target task. It is proved that, under the above assumption, the optimal prediction functions of all tasks lie in the same Hilbert space. For the error analysis, we decompose the excess error of the prediction function of the target task into a regularization error and a sample error, in which the difference between the error and the empirical error of the prediction function of the target task is estimated by the average of the corresponding differences over the related tasks. This leads to a potentially faster learning rate than that of the standard regularized regression algorithm for a single task.

The rest of the paper is organized as follows. In Section 2, we introduce some notions and definitions and then propose the least-square regularized regression algorithm for multitask learning. In Section 3, we decompose the excess error of the target task into a regularization error and a sample error, in which samples of the other related tasks can be used to estimate the difference between the error and the empirical error of the prediction function of the target task. The main result is presented in Section 4: an upper bound for the sample error of the multiple tasks is derived from Hoeffding's inequality and an estimate of a covering number adapted to multitask learning, and, based on this bound, potentially faster learning rates than those of single-task learning algorithms are obtained.

2. Preliminaries

To propose regularized regression algorithm for multitask learning, we introduce some definitions and notations. Let (X, d) be a compact metric space and let Y = [−M, M]. Let ρ be a probability distribution on Z : = X × Y. The regression function is defined as
f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X, ()
where ρ(y∣x) is the conditional probability measure at x induced by ρ. Given a set of samples drawn from the probability distribution ρ, our goal is to find a good approximation of fρ.

For multitask learning, we define the relatedness of the probability distributions of multiple tasks.

Definition 1. For a function f : X → X, let f[ρ] be the probability distribution over X × Y defined by f[ρ](T) = ρ({(f(x), b) ∣ (x, b) ∈ T}), for T ⊆ X × Y. Let ℱ be a set of transformations f : X → X and let ρ1 and ρ2 be probability distributions over X × Y. We say that ρ1 and ρ2 are ℱ-related if there exists some f ∈ ℱ such that ρ1 = f[ρ2] or ρ2 = f[ρ1].
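To make Definition 1 concrete, the following sketch simulates two ℱ-related regression tasks, assuming (as in Example 4 below) that X = ℝ² and that the transformation f is a rotation acting on the inputs only; the particular regression function and all function names are hypothetical choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def regression_fn(x):
    # an arbitrary bounded regression function for task 1 (illustrative only)
    return np.sin(3.0 * x[:, 0]) + 0.5 * x[:, 1]

def sample_task1(m):
    # (x, y) drawn from rho_1: uniform inputs, noisy targets
    x = rng.uniform(-1.0, 1.0, size=(m, 2))
    y = regression_fn(x) + 0.1 * rng.standard_normal(m)
    return x, y

def rotate(x, theta):
    A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return x @ A.T

def sample_task2(m, theta=0.7):
    # samples of the related task: apply the transformation f (a rotation)
    # to the inputs of task 1 while keeping the conditional law of y unchanged,
    # so the two distributions are F-related in the sense of Definition 1
    x1, y1 = sample_task1(m)
    return rotate(x1, theta), y1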

Next, we describe the relationship between the above transformation set and a hypothesis space.

Definition 2. Let ℱ be a set of transformations f : X → X and let ℋ be a set of functions from X to Y. We say that ℱ acts as a group over ℋ if

  • (1)

    for every f ∈ ℱ and every h ∈ ℋ, the composition h∘f also belongs to ℋ;

  • (2)

    for every f, g ∈ ℱ, the inverse transformation f−1 and the composition f∘g are also members of ℱ.

Definition 3. Let ℋσ be a Hilbert space with norm ∥·∥σ and let ℱ act as a group over ℋσ. We say that ℋσ is norm invariant under ℱ if, for any h ∈ ℋσ and any f ∈ ℱ, there holds

\| h \circ f \|_\sigma = \| h \|_\sigma. ()
To explain the above definitions, we give an example.

Example 4. Let X = ℝ², Y = [−M, M], and Γ = ℝ₊, let Kσ(x, x′) := exp(−∥x − x′∥²/σ²), for x, x′ ∈ X, and let ℋσ be the closure of the linear span of the set {Kσ,x := Kσ(x, ·) : x ∈ X} with inner product satisfying ⟨Kσ,x, Kσ,x′⟩σ = Kσ(x, x′), for any σ ∈ Γ. Let the norm ∥·∥σ of functions in ℋσ be induced by this inner product.

Assume ℱ is a set of translation and rotation transformations acting on {ℋσ : σ ∈ Γ}: for any f ∈ ℱ and any h ∈ ℋσ with σ ∈ Γ, either (h∘f)(x) := h(x + xf), for some xf ∈ X, or (h∘f)(x) := h(Af x), for

A_f = \begin{pmatrix} \cos\theta_f & -\sin\theta_f \\ \sin\theta_f & \cos\theta_f \end{pmatrix}, ()
where θf is an angle depending on f. One can verify that ℱ acts as a group over ℋσ and that ℋσ is norm invariant under ℱ, for any σ ∈ Γ.
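Assuming the Gaussian kernel of Example 4, the norm invariance can also be checked numerically: for h = Σi ci Kσ(xi, ·), the squared RKHS norm equals cᵀGc with G the Gram matrix of the expansion centers, and composing h with a rotation or translation only moves the centers by an isometry, which leaves G unchanged. The sketch below (all names hypothetical) verifies this.

import numpy as np

def gaussian_gram(x, sigma):
    # G_ij = K_sigma(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(1)
centers = rng.standard_normal((5, 2))   # expansion centers x_i of h
coeff = rng.standard_normal(5)          # coefficients c_i of h
sigma = 0.8

theta, shift = 0.9, np.array([2.0, -1.0])
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
moved = centers @ A.T + shift           # isometric image of the centers (composition with f)

norm_h = coeff @ gaussian_gram(centers, sigma) @ coeff
norm_hf = coeff @ gaussian_gram(moved, sigma) @ coeff
assert np.allclose(norm_h, norm_hf)     # ||h o f||_sigma^2 = ||h||_sigma^2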

Now we introduce the standard least-square regularized regression algorithm. We define the error and the empirical error of a function f : X → Y with the squared loss as follows. For a distribution ρ on X × Y, the error of f is defined as
\mathcal{E}(f) = \int_Z \left( f(x) - y \right)^2 d\rho. ()
It is well known that the regression function minimizes the error. Indeed,
\mathcal{E}(f) - \mathcal{E}(f_\rho) = \int_X \left( f(x) - f_\rho(x) \right)^2 d\rho_X = \| f - f_\rho \|_{L^2_{\rho_X}}^2, ()
where ρX is the marginal distribution of ρ on X and ∥·∥L²ρX denotes the norm of the space of square-integrable functions with respect to ρX. The above difference is called the excess error of f. For a sample set S = {(xi, yi) : i = 1, 2, …, m} drawn independently according to ρ, the empirical error of f is defined as
\mathcal{E}_S(f) = \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2. ()
In this paper, we consider a sequence of related learning tasks. The goal is to use information from related tasks to improve the learning performance of one particular learning task. Let ℱ be a transformation set and let ℋσ be a Hilbert space with norm ∥·∥σ, for any σ ∈ Γ, where Γ is an index set. We assume that there is a constant κ > 0 such that
\| f \|_\infty \le \kappa \| f \|_\sigma, \qquad \text{for any } f \in \mathcal{H}_\sigma, \ \sigma \in \Gamma. ()

Let ρi denote the probability distribution of the ith task, for i = 1, …, n. We assume that {ρi : i = 1, …, n} is pairwise ℱ-related, that ℱ acts as a group over ℋσ, and that ℋσ is norm invariant under ℱ, for any σ ∈ Γ. Let Si be a set of m samples drawn independently according to ρi.

The standard least-square regularized regression algorithm associated with {ℋσ : σ ∈ Γ} for the ith single task is defined by the minimizer
f_{S_i} := \arg\min_{\sigma \in \Gamma} \, \min_{f \in \mathcal{H}_\sigma} \left\{ \mathcal{E}_{S_i}(f) + \lambda \| f \|_\sigma^2 \right\}, \qquad \lambda > 0. ()
The above optimization consists of two steps: first, for any fixed σ ∈ Γ, find the optimal function in ℋσ; second, minimize over σ ∈ Γ to find the globally optimal function f_{S_i}. Since the n tasks are related, we try to use the samples of all n tasks to improve the learning performance of the target task. Without loss of generality, we choose the first task as the target task.
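The two-step optimization above can be sketched concretely for the Gaussian hypothesis spaces of Example 4, with Γ replaced by a finite grid of widths (a hypothetical discretization). For each fixed σ, the inner minimizer is kernel ridge regression, obtained by solving a linear system for the expansion coefficients; the outer step keeps the σ with the smallest regularized empirical error. This is only a minimal sketch under those assumptions, not the authors' implementation.

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def single_task_rlr(x, y, sigma_grid, lam):
    # returns the selected width sigma and the corresponding predictor
    m = len(y)
    best = None
    for sigma in sigma_grid:
        K = gaussian_kernel(x, x, sigma)
        # inner step: minimizer of the regularized empirical error in H_sigma
        alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
        obj = np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha
        if best is None or obj < best[0]:
            best = (obj, sigma, alpha)
    _, sigma, alpha = best
    return sigma, lambda xt: gaussian_kernel(xt, x, sigma) @ alpha

For instance, single_task_rlr(x, y, [0.25, 0.5, 1.0, 2.0], 1e-2) fits one task over four candidate spaces and returns the best width together with its predictor.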

Now, we propose least square regularized regression algorithm for multitask learning.

Step 1. Use the samples of the other n − 1 tasks to select an approximately optimal Hilbert space in {ℋσ : σ ∈ Γ} as follows:

()

Step 2. In the Hilbert space selected in Step 1, search for an approximation to the regression function of the target task as follows:

()
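Since the displayed formulas (9) and (10) are not legible in this copy, the following sketch shows one plausible reading of the two steps, again for Gaussian hypothesis spaces over a hypothetical finite grid of widths: Step 1 scores each candidate σ by the average regularized empirical error over the n − 1 related tasks and keeps the minimizer, and Step 2 runs the standard regularized regression for the target task in the selected space. The scoring rule in Step 1 is an assumption, not a transcription of (9).

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def regularized_err(x, y, sigma, lam):
    # min over f in H_sigma of the regularized empirical error (kernel ridge solution)
    m = len(y)
    K = gaussian_kernel(x, x, sigma)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
    return np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha

def multitask_rlr(target, related, sigma_grid, lam):
    # target = (x1, y1); related = [(x2, y2), ..., (xn, yn)]
    # Step 1: choose the width from the related tasks only
    scores = [np.mean([regularized_err(x, y, s, lam) for x, y in related])
              for s in sigma_grid]
    sigma_hat = sigma_grid[int(np.argmin(scores))]
    # Step 2: standard least-square regularized regression for the target task
    x1, y1 = target
    m1 = len(y1)
    K1 = gaussian_kernel(x1, x1, sigma_hat)
    alpha1 = np.linalg.solve(K1 + lam * m1 * np.eye(m1), y1)
    return sigma_hat, lambda xt: gaussian_kernel(xt, x1, sigma_hat) @ alpha1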

3. Error Decomposition

To estimate the excess error of the prediction function produced by Step 2, we introduce some notation. For i = 1, …, n, let σi* be the optimal σ ∈ Γ for the ith task:
()

Lemma 5. Let ρi denote the probability distribution of the ith task, for i = 1, …, n, let ℋσ be a Hilbert space with norm ∥·∥σ for each σ ∈ Γ, and let ℱ be a transformation set. Assume that {ρi : i = 1, …, n} is pairwise ℱ-related, that ℱ acts as a group over ℋσ, and that ℋσ is norm invariant under ℱ, for any σ ∈ Γ. Then σ1*, …, σn* defined in (11) satisfy

\sigma_1^* = \sigma_2^* = \cdots = \sigma_n^*. ()

Proof. By the assumption that ρi and ρj are ℱ-related, it is easy to show that

()
for any i, j ∈ {1, …, n} and σ ∈ Γ. Notice that ℋσ is norm invariant under ℱ for any σ ∈ Γ. Then, there holds
()
for any σ ∈ Γ. Then the lemma follows.

To decompose the excess error, we introduce the following notation. Denote
()

Proposition 6. Let the prediction function of the target task be defined in (10). Then, under the assumptions of Lemma 5, its excess error can be bounded by

()

Proof. Write the regularization error as

()
By Lemma 5, for any i, j ∈ {1, …, n}, there holds σi* = σj*. Then we obtain
()
By the definition, we have that
()
Therefore, we have
()
Then, the proposition follows.

In (16), there are five terms. The first term is called the regularization error; it depends on the approximation ability of the hypothesis space to the regression function of the target task. The estimation of this term has been discussed in [20] for reproducing kernel Hilbert spaces with Gaussian kernels with flexible variances.

The other four terms constitute the sample error. In the second and third terms, the functions involved are selected from the Hilbert space chosen in Step 1, which depends on Si, for i = 2, …, n, and is independent of S1. Therefore, when we take the expectation with respect to S1, this space can be regarded as a fixed function space. Consequently, these two terms can be estimated by the same method as in the proofs of Propositions 2.1 and 3.1 in [21]. In the fourth term, the function involved is fixed for each i = 1, …, n, so this term can also be estimated as in the proof of Proposition 2.1 in [21]. The last term is more difficult to handle, because the functions for i = 2, …, n cannot be regarded as lying in ℋσ for any fixed σ ∈ Γ. Consequently, as the sample size m grows, the convergence rate of the sample error is determined by that of the last term. Therefore, in the following section, we focus on estimating a bound for the last term in (16).

4. Error Analysis

In this section, we estimate a bound for the last term in (16). To bound this term, we need to estimate the capacity of {ℋσ : σ ∈ Γ}. Here, the capacity is measured by the covering number.

Definition 7. For a subset 𝒮 of a metric space (X, d) and η > 0, the covering number 𝒩(𝒮, η, d) is defined to be the minimal integer l such that there exist l disks of radius η covering 𝒮.
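As an illustration of Definition 7 (not used in the proofs), the covering number of a finite set of points in Euclidean space can be bounded from above by a simple greedy construction of an η-net; the helper below is a hypothetical sketch for building intuition.

import numpy as np

def greedy_covering_number(points, eta):
    # points: array of shape (N, d); metric: Euclidean distance.
    # Greedily picks an uncovered point as the center of a new eta-ball,
    # so the returned count is an upper bound on the covering number.
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        dist = np.linalg.norm(points[remaining] - points[c], axis=1)
        remaining = [i for i, r in zip(remaining, dist) if r > eta]
    return len(centers)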

For n ∈ ℕ and ℋσ,R := {f ∈ ℋσ : ∥f∥σ ≤ R}, define
()
For two such sets, define
()
Then, define the distance d from one such set to another as
()
Let 𝒩R(η, Γ, d) denote the minimal integer l such that there exist l parameters σ1, …, σl ∈ Γ such that
()
Then we have the following lemma.

Lemma 8. Consider the following:

()

Proof. For any , there is σ ∈ Γ such that . By the definition of , there is such that . Then, by the definition of , we have . And by the definition of d, there is satisfying .

By the definition of , there is such that . Therefore, we can obtain

()
Note that, for any σ ∈ Γ, there holds . Then the lemma follows.

Proposition 9. For any ε > 0 and the functions, for i = 2, …, n, defined in (10), there holds

()

Proof. By the definition of , we have , for i = 1,2, …, n. Then, there holds

()
For , denote . Note that, for , there holds
()
Therefore, we can find 𝒩 balls such that the center of each ball and any point fk in the same ball satisfy the above relation. Consequently, the probability in (28) can be bounded by the following expression:
()
For the covering number 𝒩, by Lemma 8, we have the estimate
()
Using Hoeffding's inequality, for any fixed function, we have
()
Then, the proposition follows.
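For reference, the form of Hoeffding's inequality invoked in the last step is the standard one for independent and identically distributed random variables ξ1, …, ξl taking values in an interval [a, b]; in the present setting each ξj would be a squared loss, which lies in [0, (κR + M)²] for functions with ∥f∥σ ≤ R, by the uniform bound κ above and |y| ≤ M:

\Pr\left\{ \frac{1}{l} \sum_{j=1}^{l} \xi_j - \mathbb{E}\,\xi_1 \ge \varepsilon \right\}
\le \exp\left( - \frac{2 l \varepsilon^2}{(b - a)^2} \right), \qquad \varepsilon > 0.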

Finally, we can obtain a bound for the last term of (16) by Proposition 9.

Proposition 10. Let the functions, for i = 2, …, n, be defined in (10). Assume that, for all σ ∈ Γ and some s > 0, there holds

\ln \mathcal{N}\left( \eta, \mathcal{H}_{\sigma,1}, l, 1 \right) \le C_0 \left( \frac{1}{\eta} \right)^{s}, \qquad \eta > 0. ()
Then with confidence at least 1 − δ, there holds
()
where ε0 is the solution of the following equation:
()

Proof. Let

()
By the condition ln 𝒩(η, ℋσ,1, l, 1) ≤ C0(1/η)^s, we have
()
Then, ε1 is not larger than ε0 in the following equation:
()
Then by Proposition 9, the proposition follows.

Remark 11. We compare multitask learning with multiple Hilbert spaces to single-task learning with multiple Hilbert spaces.

Recall that the least-square regularized regression algorithm in {ℋσ : σ ∈ Γ} for a single task is defined as

f_{S_1} := \arg\min_{\sigma \in \Gamma} \, \min_{f \in \mathcal{H}_\sigma} \left\{ \mathcal{E}_{S_1}(f) + \lambda \| f \|_\sigma^2 \right\}. ()
Its excess error can be bounded by the sum of a regularization error and a sample error with a method similar to that of Proposition 6. In the sample error, the most difficult term to estimate is the one in which the function, which changes with S1, runs over the whole set {ℋσ : σ ∈ Γ}. By the same method as in Proposition 10, with confidence 1 − δ, this term can be bounded by the solution of the following equation:
()
Obviously, the solution of the above equation is not smaller than the ε0 of Proposition 10. Therefore, the multitask learning algorithm has a potentially faster learning rate.

Remark 12. We compare multitask learning with multiple Hilbert spaces to single-task learning with a single Hilbert space.

In this paper, we set the hypothesis space as a set of Hilbert spaces. It is well known that a hypothesis space with more functions has stronger approximation ability and larger complexity. Therefore, the regularization error may be smaller and the sample error may be larger than those of algorithms with a single Hilbert space as the hypothesis space.

For least-square regularized regression with a single Hilbert space ℋσ for a single task, with confidence 1 − δ, the largest term in the sample error can be bounded by the solution of the following equation:

()

If we assume n = 𝒪(m^ζ) for some ζ > 0 large enough, the term (8(κR + M)²/((n − 1)m)) ln 𝒩R(ε0/(16(κR + M)), Γ, d) in Proposition 10 converges to 0 fast. Then the sample error bound of the multitask algorithm is essentially the same as that of single-task learning with a single Hilbert space, while the regularization error of multitask learning with multiple Hilbert spaces is smaller. Trading off the regularization error and the sample error, we can obtain a potentially faster learning rate than that of single-task learning.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant ZY1118, National Technology Support Program under Grant 2012BAH05B01, and the National Science Foundation of China under Grants 11101024 and 11171014.
