Volume 2013, Issue 1, Article ID 715275
Research Article
Open Access

Least Square Regularized Regression for Multitask Learning

Yong-Li Xu (Corresponding Author)
Department of Mathematics, Beijing University of Chemical Technology, Beijing 100029, China

Di-Rong Chen
Department of Mathematics, Beijing University of Aeronautics and Astronautics, Beijing 100091, China

Han-Xiong Li
Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong
First published: 21 December 2013
Academic Editor: Yiming Ying

Abstract

The study of multitask learning algorithms is an important issue in learning theory. This paper proposes a least-square regularized regression algorithm for multitask learning whose hypothesis space is the union of a sequence of Hilbert spaces. The algorithm consists of two steps: selecting the optimal Hilbert space and searching for the optimal function. We assume that the distributions of the different tasks are related by a set of transformations under which every Hilbert space in the hypothesis space is norm invariant. We prove that, under this assumption, the optimal prediction functions of all tasks lie in the same Hilbert space. Based on this result, a pivotal error decomposition is established, which uses samples of the related tasks to bound the excess error of the target task. We obtain an upper bound for the sample error of the related tasks and, based on this bound, derive potentially faster learning rates than those of single-task learning algorithms.

1. Introduction

Multitask learning [1] is a learning paradigm which seeks to improve the generalization performance of a learning task with the help of some other related tasks. This learning paradigm is inspired by human learning activities in that people often apply the knowledge gained from previous learning tasks to help learn a new task [2]. Different multitask learning algorithms have been designed, such as multitask support vector machine (SVM) [3], multitask feature learning [4, 5], multitask clustering approach [6], multitask structure learning [7], and multitask gradients learning [8].

Multitask learning can be formulated under two different settings: symmetric and asymmetric [9]. Symmetric multitask learning tries to improve the performance of all tasks simultaneously, while asymmetric multitask learning tries to improve the performance of a target task using information from related tasks. Asymmetric multitask learning is related to transfer learning [10]; the major difference is that the source tasks are still learned simultaneously in asymmetric multitask learning, whereas they are learned independently in transfer learning [2]. Much experimental work has achieved the goal of improving the prediction performance of a learning task with the help of other related tasks [1, 11–15]. However, there has been relatively little progress on the theoretical analysis of these results.

Baxter presented a general framework for model selection in a multitask learning environment [16]. He showed that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment, and he proved that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task. Ando and Zhang considered learning predictive structures on hypothesis spaces from multiple tasks [17]. They presented a general framework in which the structural learning problem can be formulated and analyzed theoretically and related it to learning with unlabeled data. Ben-David and Borbely defined relatedness of tasks on the basis of similarity between the example-generating distributions for classification [18] and gave precise conditions under which their bounds guarantee generalization from smaller sample sizes than the standard single-task approach. Solnon et al. studied multitask regression using penalization techniques [19]. They showed that the key element for an optimal calibration is the covariance matrix of the noise between the different tasks; they presented a new algorithm to estimate this covariance matrix and proved that the estimator converges toward the true covariance matrix.

In this paper, we propose a least-square regularized regression algorithm for multitask learning whose hypothesis space is the union of a sequence of Hilbert spaces. The relatedness of tasks is described through the distributions that underlie these tasks and a property of the hypothesis space. We assume that the distributions are related by a set of transformations under which the norm of every Hilbert space in the hypothesis space is invariant. We design a multitask learning algorithm with two steps: first, samples of the other related tasks are used to select an approximately optimal Hilbert space in the hypothesis space; second, in the selected Hilbert space, we run the standard least-square regularized regression algorithm for the target task. It is proved that, under the above assumption, the optimal prediction functions of all tasks lie in the same Hilbert space. For the error analysis, we decompose the excess error of the prediction function of the target task into a regularization error and a sample error, in which the difference between the error and the empirical error of the prediction function of the target task is estimated by the average of the corresponding differences over the related tasks. This leads to a potentially faster learning rate than that of the standard regularized regression algorithm for a single task.

The rest of the paper is organized as follows. In Section 2, we introduce some notions and definitions and then propose the least-square regularized regression algorithm for multitask learning. In Section 3, we decompose the excess error of the target task into a regularization error and a sample error, in which samples of the other related tasks can be used to estimate the difference between the error and the empirical error of the prediction function of the target task. The main result is presented in Section 4: an upper bound for the sample error of the multiple tasks is derived from Hoeffding's inequality and an estimate of a covering number adapted to multitask learning, and, based on this bound, potentially faster learning rates than those of single-task learning algorithms are obtained.

2. Preliminaries

To propose regularized regression algorithm for multitask learning, we introduce some definitions and notations. Let (X, d) be a compact metric space and let Y = [−M, M]. Let ρ be a probability distribution on Z : = X × Y. The regression function is defined as
f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X, ()
where ρ(y∣x) is the conditional probability measure at x induced by ρ. Given a set of samples drawn from the probability distribution ρ, our goal is to find a good approximation of fρ.

For multitask learning, we define the relatedness of the probability distributions of multiple tasks.

Definition 1. For a function f : X → X, let f[ρ] be the probability distribution over X × Y defined by f[ρ](T) = ρ({(f(x), b) ∣ (x, b) ∈ T}), for T ⊆ X × Y. Let ℱ be a set of transformations f : X → X and let ρ1 and ρ2 be probability distributions over X × Y. We say that ρ1 and ρ2 are ℱ-related if there exists some f ∈ ℱ such that ρ1 = f[ρ2] or ρ2 = f[ρ1].
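To make Definition 1 concrete, the following sketch simulates two ℱ-related regression tasks, assuming (as in Example 4 below) that X = ℝ² and that the transformation f is a rotation acting on the inputs only; the particular regression function and all function names are hypothetical choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def regression_fn(x):
    # an arbitrary bounded regression function for task 1 (illustrative only)
    return np.sin(3.0 * x[:, 0]) + 0.5 * x[:, 1]

def sample_task1(m):
    # (x, y) drawn from rho_1: uniform inputs, noisy targets
    x = rng.uniform(-1.0, 1.0, size=(m, 2))
    y = regression_fn(x) + 0.1 * rng.standard_normal(m)
    return x, y

def rotate(x, theta):
    A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return x @ A.T

def sample_task2(m, theta=0.7):
    # samples of the related task: apply the transformation f (a rotation)
    # to the inputs of task 1 while keeping the conditional law of y unchanged,
    # so the two distributions are F-related in the sense of Definition 1
    x1, y1 = sample_task1(m)
    return rotate(x1, theta), y1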

Next, we describe the relationship between the above transformation set and a hypothesis space.

Definition 2. Let ℱ be a set of transformations f : X → X and let ℋ be a set of functions from X to Y. We say that ℱ acts as a group over ℋ if

  • (1)

    for every f ∈ ℱ and every h ∈ ℋ, the composition h∘f also belongs to ℋ;

  • (2)

    for every f, g ∈ ℱ, the inverse transformation f−1 and the composition f∘g are also members of ℱ.

Definition 3. Let ℋσ be a Hilbert space with norm ∥·∥σ and let ℱ act as a group over ℋσ. We say that ℋσ is norm invariant under ℱ if, for any h ∈ ℋσ and any f ∈ ℱ, there holds

\| h \circ f \|_\sigma = \| h \|_\sigma. ()
To explain the above definitions, we give an example.

Example 4. Let X = ℝ², Y = [−M, M], and Γ = ℝ₊, let Kσ(x, x′) := exp(−∥x − x′∥²/σ²), for x, x′ ∈ X, and let ℋσ be the closure of the linear span of the set {Kσ,x := Kσ(x, ·) : x ∈ X} with inner product satisfying ⟨Kσ,x, Kσ,x′⟩σ = Kσ(x, x′), for any σ ∈ Γ. Let the norm ∥·∥σ of functions in ℋσ be induced by this inner product.

Assume ℱ is a set of translation and rotation transformations acting on {ℋσ : σ ∈ Γ}: for any f ∈ ℱ and any h ∈ ℋσ with σ ∈ Γ, either (h∘f)(x) := h(x + xf), for some xf ∈ X, or (h∘f)(x) := h(Af x), for

A_f = \begin{pmatrix} \cos\theta_f & -\sin\theta_f \\ \sin\theta_f & \cos\theta_f \end{pmatrix}, ()
where θf is an angle depending on f. One can verify that ℱ acts as a group over ℋσ and that ℋσ is norm invariant under ℱ, for any σ ∈ Γ.
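Assuming the Gaussian kernel of Example 4, the norm invariance can also be checked numerically: for h = Σi ci Kσ(xi, ·), the squared RKHS norm equals cᵀGc with G the Gram matrix of the expansion centers, and composing h with a rotation or translation only moves the centers by an isometry, which leaves G unchanged. The sketch below (all names hypothetical) verifies this.

import numpy as np

def gaussian_gram(x, sigma):
    # G_ij = K_sigma(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(1)
centers = rng.standard_normal((5, 2))   # expansion centers x_i of h
coeff = rng.standard_normal(5)          # coefficients c_i of h
sigma = 0.8

theta, shift = 0.9, np.array([2.0, -1.0])
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
moved = centers @ A.T + shift           # isometric image of the centers (composition with f)

norm_h = coeff @ gaussian_gram(centers, sigma) @ coeff
norm_hf = coeff @ gaussian_gram(moved, sigma) @ coeff
assert np.allclose(norm_h, norm_hf)     # ||h o f||_sigma^2 = ||h||_sigma^2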

Now we introduce the standard least-square regularized regression algorithm. We define the error and the empirical error of a function f : X → Y with the squared loss as follows. For a distribution ρ on X × Y, the error of f is defined as
\mathcal{E}(f) = \int_Z \left( f(x) - y \right)^2 d\rho. ()
It is well known that the regression function minimizes the error. Indeed,
\mathcal{E}(f) - \mathcal{E}(f_\rho) = \int_X \left( f(x) - f_\rho(x) \right)^2 d\rho_X = \| f - f_\rho \|_{L^2_{\rho_X}}^2, ()
where ρX is the marginal distribution of ρ on X and ∥·∥L²ρX denotes the norm of the space of square-integrable functions with respect to ρX. The above difference is called the excess error of f. For a sample set S = {(xi, yi) : i = 1, 2, …, m} drawn independently according to ρ, the empirical error of f is defined as
\mathcal{E}_S(f) = \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2. ()
In this paper, we consider a sequence of related learning tasks. The goal is to use information from related tasks to improve the learning performance of one particular learning task. Let ℱ be a transformation set and let ℋσ be a Hilbert space with norm ∥·∥σ, for any σ ∈ Γ, where Γ is an index set. We assume that there is a constant κ > 0 such that
\| f \|_\infty \le \kappa \| f \|_\sigma, \qquad \text{for any } f \in \mathcal{H}_\sigma, \ \sigma \in \Gamma. ()

Let ρi denote the probability distribution of the ith task, for i = 1, …, n. We assume that {ρi : i = 1, …, n} is pairwise ℱ-related, that ℱ acts as a group over ℋσ, and that ℋσ is norm invariant under ℱ, for any σ ∈ Γ. Let Si be a set of m samples drawn independently according to ρi.

The standard least-square regularized regression algorithm associated with {ℋσ : σ ∈ Γ} for the ith single task is defined by the minimizer
f_{S_i} := \arg\min_{\sigma \in \Gamma} \, \min_{f \in \mathcal{H}_\sigma} \left\{ \mathcal{E}_{S_i}(f) + \lambda \| f \|_\sigma^2 \right\}, \qquad \lambda > 0. ()
The above optimization consists of two steps: first, for any fixed σ ∈ Γ, find the optimal function in ℋσ; second, minimize over σ ∈ Γ to find the globally optimal function f_{S_i}. Since the n tasks are related, we try to use the samples of all n tasks to improve the learning performance of the target task. Without loss of generality, we choose the first task as the target task.
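The two-step optimization above can be sketched concretely for the Gaussian hypothesis spaces of Example 4, with Γ replaced by a finite grid of widths (a hypothetical discretization). For each fixed σ, the inner minimizer is kernel ridge regression, obtained by solving a linear system for the expansion coefficients; the outer step keeps the σ with the smallest regularized empirical error. This is only a minimal sketch under those assumptions, not the authors' implementation.

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def single_task_rlr(x, y, sigma_grid, lam):
    # returns the selected width sigma and the corresponding predictor
    m = len(y)
    best = None
    for sigma in sigma_grid:
        K = gaussian_kernel(x, x, sigma)
        # inner step: minimizer of the regularized empirical error in H_sigma
        alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
        obj = np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha
        if best is None or obj < best[0]:
            best = (obj, sigma, alpha)
    _, sigma, alpha = best
    return sigma, lambda xt: gaussian_kernel(xt, x, sigma) @ alpha

For instance, single_task_rlr(x, y, [0.25, 0.5, 1.0, 2.0], 1e-2) fits one task over four candidate spaces and returns the best width together with its predictor.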

Now, we propose least square regularized regression algorithm for multitask learning.

Step 1. Use the samples of the other n − 1 tasks to select an approximately optimal Hilbert space in {ℋσ : σ ∈ Γ} as follows:

()

Step 2. In the Hilbert space selected in Step 1, search for an approximation to the regression function of the target task as follows:

()
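Since the displayed formulas (9) and (10) are not legible in this copy, the following sketch shows one plausible reading of the two steps, again for Gaussian hypothesis spaces over a hypothetical finite grid of widths: Step 1 scores each candidate σ by the average regularized empirical error over the n − 1 related tasks and keeps the minimizer, and Step 2 runs the standard regularized regression for the target task in the selected space. The scoring rule in Step 1 is an assumption, not a transcription of (9).

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def regularized_err(x, y, sigma, lam):
    # min over f in H_sigma of the regularized empirical error (kernel ridge solution)
    m = len(y)
    K = gaussian_kernel(x, x, sigma)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
    return np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha

def multitask_rlr(target, related, sigma_grid, lam):
    # target = (x1, y1); related = [(x2, y2), ..., (xn, yn)]
    # Step 1: choose the width from the related tasks only
    scores = [np.mean([regularized_err(x, y, s, lam) for x, y in related])
              for s in sigma_grid]
    sigma_hat = sigma_grid[int(np.argmin(scores))]
    # Step 2: standard least-square regularized regression for the target task
    x1, y1 = target
    m1 = len(y1)
    K1 = gaussian_kernel(x1, x1, sigma_hat)
    alpha1 = np.linalg.solve(K1 + lam * m1 * np.eye(m1), y1)
    return sigma_hat, lambda xt: gaussian_kernel(xt, x1, sigma_hat) @ alpha1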

3. Error Decomposition

To estimate the excess error of the prediction function produced by Step 2, we introduce some notation. For i = 1, …, n, let σi* be the optimal σ ∈ Γ for the ith task:
()

Lemma 5. Let ρi denote the probability distribution of the ith task, for i = 1, …, n, let ℋσ be a Hilbert space with norm ∥·∥σ for each σ ∈ Γ, and let ℱ be a transformation set. Assume that {ρi : i = 1, …, n} is pairwise ℱ-related, that ℱ acts as a group over ℋσ, and that ℋσ is norm invariant under ℱ, for any σ ∈ Γ. Then σ1*, …, σn* defined in (11) satisfy

\sigma_1^* = \sigma_2^* = \cdots = \sigma_n^*. ()

Proof. By the assumption that ρi and ρj are ℱ-related, it is easy to show that

()
for any i, j ∈ {1, …, n} and σ ∈ Γ. Notice that ℋσ is norm invariant under ℱ for any σ ∈ Γ. Then, there holds
()
for any σ ∈ Γ. Then the lemma follows.

To decompose the excess error, we introduce the following notation. Denote
()

Proposition 6. Let the prediction function of the target task be defined in (10). Then, under the assumptions of Lemma 5, its excess error can be bounded by

()

Proof. Write the regularization error as

()
By Lemma 5, for any i, j ∈ {1, …, n}, there holds σi* = σj*. Then we obtain
()
By the definition, we have that
()
Therefore, we have
()
Then, the proposition follows.

In (16), there are five terms. The first term is called the regularization error; it depends on the approximation ability of the hypothesis space to the regression function of the target task. The estimation of this term has been discussed in [20] for reproducing kernel Hilbert spaces with Gaussian kernels with flexible variances.

The other four terms constitute the sample error. In the second and third terms, the functions involved are selected from the Hilbert space chosen in Step 1, which depends on Si, for i = 2, …, n, and is independent of S1. Therefore, when we take the expectation with respect to S1, this space can be regarded as a fixed function space. Consequently, these two terms can be estimated by the same method as in the proofs of Propositions 2.1 and 3.1 in [21]. In the fourth term, the function involved is fixed for each i = 1, …, n, so this term can also be estimated as in the proof of Proposition 2.1 in [21]. The last term is more difficult to handle, because the functions for i = 2, …, n cannot be regarded as lying in ℋσ for any fixed σ ∈ Γ. Consequently, as the sample size m grows, the convergence rate of the sample error is determined by that of the last term. Therefore, in the following section, we focus on estimating a bound for the last term in (16).

4. Error Analysis

In this section, we estimate a bound for the last term in (16). To bound this term, we need to estimate the capacity of {ℋσ : σ ∈ Γ}. Here, the capacity is measured by the covering number.

Definition 7. For a subset 𝒮 of a metric space (X, d) and η > 0, the covering number 𝒩(𝒮, η, d) is defined to be the minimal integer l such that there exist l disks of radius η covering 𝒮.
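As an illustration of Definition 7 (not used in the proofs), the covering number of a finite set of points in Euclidean space can be bounded from above by a simple greedy construction of an η-net; the helper below is a hypothetical sketch for building intuition.

import numpy as np

def greedy_covering_number(points, eta):
    # points: array of shape (N, d); metric: Euclidean distance.
    # Greedily picks an uncovered point as the center of a new eta-ball,
    # so the returned count is an upper bound on the covering number.
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        dist = np.linalg.norm(points[remaining] - points[c], axis=1)
        remaining = [i for i, r in zip(remaining, dist) if r > eta]
    return len(centers)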

For n ∈ ℕ and ℋσ,R := {f ∈ ℋσ : ∥f∥σ ≤ R}, define
()
For two such sets, define
()
Then, define the distance d from one such set to another as
()
Let 𝒩R(η, Γ, d) denote the minimal integer l such that there exist l parameters σ1, …, σl ∈ Γ such that
()
Then we have the following lemma.

Lemma 8. Consider the following:

()

Proof. For any , there is σ ∈ Γ such that . By the definition of , there is such that . Then, by the definition of , we have . And by the definition of d, there is satisfying .

By the definition of , there is such that . Therefore, we can obtain

()
Note that, for any σ ∈ Γ, there holds . Then the lemma follows.

Proposition 9. For any ε > 0 and the functions, for i = 2, …, n, defined in (10), there holds

()

Proof. By the definition of , we have , for i = 1,2, …, n. Then, there holds

()
For , denote . Note that, for , there holds
()
Therefore, we can find 𝒩 balls such that the center of each ball and any point fk in the same ball satisfy the above relation. Consequently, the probability in (28) can be bounded by the following expression:
()
For the covering number 𝒩, by Lemma 8, we have the estimate
()
Using Hoeffding's inequality, for any fixed function, we have
()
Then, the proposition follows.
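For reference, the form of Hoeffding's inequality invoked in the last step is the standard one for independent and identically distributed random variables ξ1, …, ξl taking values in an interval [a, b]; in the present setting each ξj would be a squared loss, which lies in [0, (κR + M)²] for functions with ∥f∥σ ≤ R, by the uniform bound κ above and |y| ≤ M:

\Pr\left\{ \frac{1}{l} \sum_{j=1}^{l} \xi_j - \mathbb{E}\,\xi_1 \ge \varepsilon \right\}
\le \exp\left( - \frac{2 l \varepsilon^2}{(b - a)^2} \right), \qquad \varepsilon > 0.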

Finally, we can obtain a bound for the last term of (16) by Proposition 9.

Proposition 10. Let the functions, for i = 2, …, n, be defined in (10). Assume that, for all σ ∈ Γ and some s > 0, there holds

\ln \mathcal{N}\left( \eta, \mathcal{H}_{\sigma,1}, l, 1 \right) \le C_0 \left( \frac{1}{\eta} \right)^{s}, \qquad \eta > 0. ()
Then with confidence at least 1 − δ, there holds
()
where ε0 is the solution of the following equation:
()

Proof. Let

()
By the condition ln 𝒩(η, ℋσ,1, l, 1) ≤ C0(1/η)^s, we have
()
Then, ε1 is not larger than ε0 in the following equation:
()
Then by Proposition 9, the proposition follows.

Remark 11. We compare multitask learning with multiple Hilbert spaces to single-task learning with multiple Hilbert spaces.

Recall that the least-square regularized regression algorithm in {ℋσ : σ ∈ Γ} for a single task is defined as

f_{S_1} := \arg\min_{\sigma \in \Gamma} \, \min_{f \in \mathcal{H}_\sigma} \left\{ \mathcal{E}_{S_1}(f) + \lambda \| f \|_\sigma^2 \right\}. ()
Its excess error can be bounded by the sum of a regularization error and a sample error with a method similar to that of Proposition 6. In the sample error, the most difficult term to estimate is the one in which the function, which changes with S1, runs over the whole set {ℋσ : σ ∈ Γ}. By the same method as in Proposition 10, with confidence 1 − δ, this term can be bounded by the solution of the following equation:
()
Obviously, the solution of the above equation is not smaller than the ε0 of Proposition 10. Therefore, the multitask learning algorithm has a potentially faster learning rate.

Remark 12. We compare multitask learning with multiple Hilbert spaces to single-task learning with a single Hilbert space.

In this paper, we set the hypothesis space as a set of Hilbert spaces. It is well known that a hypothesis space with more functions has stronger approximation ability and larger complexity. Therefore, the regularization error may be smaller and the sample error may be larger than those of algorithms with a single Hilbert space as the hypothesis space.

For least-square regularized regression with a single Hilbert space ℋσ for a single task, with confidence 1 − δ, the largest term in the sample error can be bounded by the solution of the following equation:

()

If we assume n = 𝒪(m^ζ) for some ζ > 0 large enough, the term (8(κR + M)²/((n − 1)m)) ln 𝒩R(ε0/(16(κR + M)), Γ, d) in Proposition 10 converges to 0 fast. Then the sample error bound of the multitask algorithm is essentially the same as that of single-task learning with a single Hilbert space, while the regularization error of multitask learning with multiple Hilbert spaces is smaller. Trading off the regularization error and the sample error, we can obtain a potentially faster learning rate than that of single-task learning.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant ZY1118, National Technology Support Program under Grant 2012BAH05B01, and the National Science Foundation of China under Grants 11101024 and 11171014.
