Remodeling and Estimation for Sparse Partially Linear Regression Models
Abstract
When the dimension of the covariates in a regression model is high, one usually takes a submodel containing the significant variables as the working model. However, this submodel may be severely biased, and the resulting estimator of the parameter of interest may be very poor when the coefficients of the removed variables are not exactly zero. In this paper, based on the selected submodel, we introduce a two-stage remodeling method to obtain a consistent estimator of the parameter of interest. More precisely, in the first stage we reconstruct an unbiased model through a multistep adjustment that exploits the correlation information among the covariates; in the second stage we further reduce the adjusted model by a semiparametric variable selection method and simultaneously obtain a new estimator of the parameter of interest. Its convergence rate and asymptotic normality are also established. The simulation results further illustrate that the new estimator outperforms those obtained from the submodel and the full model in terms of the mean square errors of point estimation and the mean square prediction errors of model prediction.
1. Introduction
A feature of the model is that the parametric part contains both the parameter vector of interest and a nuisance parameter vector. The reason for this separation of coefficients is as follows. In practice we often use such a model to distinguish the main treatment variables of interest from the state variables. For instance, in a clinical trial, X consists of treatment variables and can be easily controlled, while Z is a vector of many clinical variables, such as patient ages and body weights. The variables in Z may have an impact on Y but are not of primary interest, and their effects may be small. In order to account for potentially nonnegligible effects on the response Y, the nuisance covariates Z are introduced into model (1); see Shen et al. [1]. Model (1) contains all relevant covariates, and in this paper we call it the full model.
However, when many components of Z are correlated with (X, U), the number of nonparametric functions added to the above working model is large. Such a model is impractical. Thus, in the second stage, we further simplify the adjusted model by the semiparametric variable selection procedure proposed by Zhao and Xue [4]. Their method can select significant parametric and nonparametric components simultaneously under a sparsity condition for semiparametric varying-coefficient partially linear models. Related papers include Fan and Li [5] and Wang et al. [6, 7], among others. After the two-stage remodeling, the final model is conditionally unbiased. Based on this model, the estimation and model prediction are significantly improved.
The rest of this paper is organized as follows. In Section 2, the multistep adjustment and the remodeled models are first proposed; the models are then further simplified via the semiparametric SCAD variable selection procedure. A new estimator of the parameter of interest based on the simplified model is derived, and its convergence rate and asymptotic normality are obtained. Simulations are given in Section 3. A short conclusion and some remarks are contained in Section 4. Some regularity conditions and theoretical proofs are presented in the appendix.
2. New Estimator for the Parameter of Interest
In this paper, we suppose that the covariate Z has zero mean, p is finite with p ≪ q, E(ε∣X, Z, U) = 0 and Var(ε∣X, Z, U) = σ². We also assume that the covariates X and U and the parameter β are prespecified, so that the submodel (2) is a fixed model.
2.1. Multistep-Adjustment by Correlation
In this subsection, we first adjust the submodel to be conditionally unbiased by a multistep-adjustment.
When Z is normally distributed, the principal component analysis (PCA) method will be used. Let ΣZ be the covariance matrix of Z; then there exists an orthogonal q × q matrix Q such that QΣZQᵀ = Λ, where Λ is the diagonal matrix diag(λ1, λ2, …, λq) with λ1 ≥ λ2 ≥ ⋯ ≥ λq ≥ 0 being the eigenvalues of ΣZ. Denote Qᵀ = (τ1, τ2, …, τq).
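To fix ideas, this PCA step can be sketched numerically as follows. It is a minimal illustration, assuming Z is stored as an n × q matrix of centered observations; the variable names, the simulated covariance matrix, and the use of NumPy are our own choices, not part of the paper.

```python
import numpy as np

# Minimal sketch of the PCA step for (approximately) normal Z.
rng = np.random.default_rng(0)
n, q = 200, 10
Z = rng.multivariate_normal(np.zeros(q), np.eye(q) + 0.3, size=n)  # illustrative design
Z = Z - Z.mean(axis=0)                      # center, as the paper assumes E(Z) = 0

Sigma_Z = np.cov(Z, rowvar=False)           # sample covariance of Z
eigvals, eigvecs = np.linalg.eigh(Sigma_Z)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # sort so that lambda_1 >= ... >= lambda_q
Lam = eigvals[order]                        # Lambda = diag(lambda_1, ..., lambda_q)
Q_T = eigvecs[:, order]                     # columns tau_1, ..., tau_q of Q^T

Z_tilde = Z @ Q_T                           # decorrelated components tau_j^T Z
print(np.round(np.cov(Z_tilde, rowvar=False), 2))  # approximately diagonal
```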
When Z is centered but not normally distributed, we shall apply the independent component analysis (ICA) method. Assume that Z is generated by a nonlinear combination of independent components, that is, Z is obtained by applying F(·) to an unknown random vector with independent components, where F(·) is an unknown nonlinear mapping from R^q to R^q. By imposing some constraints on the nonlinear mixing mapping F or on the independent components, the independent components can be properly estimated. See Simas Filho and Seixas [8] for an overview of the main statistical principles and some algorithms for estimating the independent components. For simplicity, in this paper we suppose that Z = (Z⁽¹⁾, …, Z⁽q⁾)ᵀ, with each component Z⁽ˡ⁾, l = 1, …, q, built componentwise from scalar functions Flj(·).
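As an illustration of the ICA step, the sketch below uses FastICA (the algorithm of Hyvärinen and Oja cited in Section 3.2) as implemented in scikit-learn. The simulated nonlinear mixture and all variable names are assumptions made only for this example; linear FastICA is used here as a working approximation to the nonlinear setting.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Minimal sketch: recover (approximately) independent components from nonnormal Z.
rng = np.random.default_rng(1)
n, q = 500, 4
S = rng.laplace(size=(n, q))            # independent non-Gaussian sources (assumption)
A = rng.normal(size=(q, q))             # unknown mixing matrix (assumption)
Z = np.tanh(S @ A.T)                    # a simple nonlinear mixture, for illustration only
Z = Z - Z.mean(axis=0)                  # center Z, as assumed in the text

ica = FastICA(n_components=q, random_state=0)
Z_tilde = ica.fit_transform(Z)          # estimated independent components
print(Z_tilde.shape)                    # (500, 4): these play the role of the adjustment
                                        # variables entering the nonparametric parts
```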
In the above two cases, the resulting components are independent of each other. Set K0 to be the size of the selected index set. Without loss of generality, let M0 = {1, …, K0}.
The adjusted model (3) is an additive partially linear model, in which βᵀX is the parametric part, f(U) and the gj, j = 1, …, K0, are the nonparametric parts, and the remaining term is the random error. Compared with the submodel (2), the nonparametric parts gj, j = 1, …, K0, may be regarded as bias-corrected terms for the random error η. For centered Z, the nonparametric components gj, j = 1, …, K0, can be properly identified. In fact, the centering of Z can be relaxed to any Z satisfying γᵀE(Z) = 0.
2.2. Model Simplification
When most of the features in the full model are correlated, K0 is very large and may even be close to q. In this case, the adjusted model (3) is impractical, so we shall use the group SCAD regression procedure proposed by Wang et al. [6] and the semiparametric variable selection procedure proposed by Zhao and Xue [4] to further simplify the model.
Denote by β̂, θ̂, and ν̂ the least squares estimators based on the penalized function (10). Let ĝj be the basis expansion with coefficients θ̂j and f̂ the basis expansion with coefficients ν̂; then ĝj is an estimator of gj and f̂ is an estimator of f(U).
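The penalized function (10) is not reproduced above, but its key ingredient, the SCAD penalty of Fan and Li [5] applied to group norms ‖θj‖, can be sketched as follows. The local quadratic approximation (LQA) solver is only one standard way to minimize such a criterion and is our own simplification, not the authors' algorithm; all names are illustrative.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) of Fan and Li (2001), evaluated at t >= 0."""
    t = np.abs(t)
    flat = lam ** 2 * (a + 1) / 2.0
    quad = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t, np.where(t <= a * lam, quad, flat))

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lambda(t): lam on [0, lam], (a*lam - t)/(a - 1) on (lam, a*lam], 0 beyond."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def group_scad_lqa(D, y, groups, lam, a=3.7, n_iter=50, eps=1e-6):
    """Sketch of group-SCAD penalized least squares
       0.5 * ||y - D c||^2 + n * sum_j p_lambda(||c_j||)
    solved by local quadratic approximation; groups[j] holds the column indices of
    the j-th coefficient block (e.g., the basis coefficients theta_j)."""
    n, p = D.shape
    c = np.linalg.lstsq(D, y, rcond=None)[0]          # unpenalized starting value
    for _ in range(n_iter):
        w = np.zeros(p)
        for idx in groups:
            nrm = max(np.linalg.norm(c[idx]), eps)
            w[idx] = scad_deriv(nrm, lam, a) / nrm    # quadratic weight for the group
        c = np.linalg.solve(D.T @ D + n * np.diag(w), D.T @ y)
    for idx in groups:                                # set negligible groups exactly to zero
        if np.linalg.norm(c[idx]) < 1e-4:
            c[idx] = 0.0
    return c
```

In this context, D would collect the columns of X together with the basis evaluations of the transformed components and of U; columns not listed in `groups` (for instance, those of X) receive zero weight and are therefore left unpenalized.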
2.3. Asymptotic Property of Point Estimator
Let β0, θ0, ν0, and gj0(·), f0(·) be the true values of β, θ, ν, and gj(·), f(·), respectively, in model (3). Without loss of generality, we assume that gj0(·) ≡ 0 for j = s + 1, …, K0, and that gj0(·), j = 1, …, s, are all nonzero components.
We suppose that each gj(·), j = 1, …, K0, can be expressed as a linear combination of L basis functions with coefficient vector θj, and that f(U) can be expressed similarly with coefficient vector ν, where θj and ν belong to the Sobolev ellipsoid of smoothness order r.
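For concreteness, a truncated Fourier basis (the basis used in the simulations of Section 3.2) can be generated as follows; the exact normalization and the example coefficients are assumptions made for illustration only.

```python
import numpy as np

def fourier_basis(t, L):
    """First L standard Fourier basis functions evaluated at points t in [0, 1].
    Illustrative sketch; the paper's exact normalization may differ."""
    t = np.asarray(t, dtype=float)
    B = np.empty((t.size, L))
    for k in range(1, L + 1):
        if k % 2 == 1:
            B[:, k - 1] = np.sqrt(2) * np.cos((k + 1) // 2 * 2 * np.pi * t)
        else:
            B[:, k - 1] = np.sqrt(2) * np.sin(k // 2 * 2 * np.pi * t)
    return B

# g_j(z) is approximated by B(z) @ theta_j with a length-L coefficient vector theta_j
# (and f(u) by B(u) @ nu); the coefficients below are purely hypothetical.
L_terms = 5
z_grid = np.linspace(0, 1, 100)
B = fourier_basis(z_grid, L_terms)
theta_j = np.array([1.0, 0.5, -0.3, 0.2, 0.1])
g_j = B @ theta_j
print(B.shape, g_j.shape)                 # (100, 5) (100,)
```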
The following theorem gives the consistency of the penalized SCAD estimators.
Theorem 1. Suppose that the regularity conditions (C1)–(C5) in the appendix hold and the number of terms L = Op(n^{1/(2r+1)}). Then,
- (i) ‖β̂ − β0‖ = Op(n^{−r/(2r+1)} + an),
- (ii) ‖ĝj − gj0‖ = Op(n^{−r/(2r+1)} + an), j = 1, …, K0,
- (iii) ‖f̂ − f0‖ = Op(n^{−r/(2r+1)} + an),
where an = max j{p′λj(‖θj0‖) : θj0 ≠ 0}.
From the last paragraph of Section 2.2 we know that, for the linear regression model with normally distributed Z, the multistep adjusted model (5) is a linear model. By orthogonal basis functions, such as power series, we have r = ∞; then the rate n^{−r/(2r+1)} reduces to n^{−1/2}, implying that the estimator has the same convergence rate as that of the SCAD estimator in Fan and Li [5].
Theorem 2. Suppose that the regularity conditions (C1)–(C6) in the appendix hold and the number of terms L = Op(n^{1/(2r+1)}). Let λmax = max j{λj} and λmin = min j{λj}. If λmax → 0 and n^{r/(2r+1)}λmin → ∞ as n → ∞, then, with probability tending to 1, ĝj(·) ≡ 0, j = s + 1, …, K0.
Remark 3. By Remark 1 of Fan and Li [5], we have that, if λmax → 0 as n → ∞, then an → 0. Hence from Theorems 1 and 2, by choosing proper tuning parameters, the variable selection method is consistent and the estimators of nonparametric components achieve the optimal convergence rate as if the subset of true zero coefficients was already known; see Stone [10].
Theorem 4. Suppose that the regularity conditions (C1)–(C6) in the appendix hold and the number of terms L = Op(n^{1/(2r+1)}). If Σ is invertible, then √n(β̂ − β0) converges in distribution to N(0, σ²Σ⁻¹).
2.4. Some Issues on Implementation
In the adjusted model (4), the directions τj, j = 1, …, K0, are used. When the population distribution is not available, they need to be approximated by estimators. When Z is normally distributed and the eigenvalues of the covariance matrix ΣZ are different from each other, the suitably centered and scaled sample eigenvector uj is asymptotically N(0, Vj), where uj is the jth eigenvector of the sample covariance matrix; see Anderson [11]. For the case when the dimension is large and comparable with the sample size, if the covariance matrix is sparse, we can use the method in Rütimann and Bühlmann [12] or Cai and Liu [13] to estimate the covariance matrix. So we can use uj to approximate τj. When the τj in model (4) are replaced by these consistent estimators, the approximation error can be neglected without changing the asymptotic properties.
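When the dimension is comparable to the sample size, even a simple (nonadaptive) hard-thresholding of the sample covariance illustrates the idea; Cai and Liu [13] use a more refined entry-adaptive threshold. The threshold level and all names below are assumptions made for this sketch.

```python
import numpy as np

def thresholded_covariance(Z, thresh):
    """Hard-threshold small off-diagonal entries of the sample covariance (a simplified
    stand-in for the sparse covariance estimators cited above)."""
    S = np.cov(Z, rowvar=False)
    S_thr = np.where(np.abs(S) >= thresh, S, 0.0)
    np.fill_diagonal(S_thr, np.diag(S))          # keep the variances untouched
    return S_thr

rng = np.random.default_rng(7)
n, q = 100, 80
Z = rng.normal(size=(n, q))                      # illustrative high-dimensional sample
S_thr = thresholded_covariance(Z, thresh=np.sqrt(np.log(q) / n))
eigvals, U = np.linalg.eigh(S_thr)
u = U[:, ::-1]                                   # u_j, ordered by decreasing eigenvalue,
print(u.shape)                                   # used as approximations to tau_j
```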
The nonparametric parts in the adjusted model depend on univariate variables, one for each l = 1, …, K0, so the number of steps K0 needs to be chosen first. In real implementations, we compute all q multiple correlation coefficients of the transformed components (l = 1, …, q) with X and U. Then we choose the components whose multiple correlation coefficient exceeds a given small number δ > 0, where mcorr(u, V) denotes the multiple correlation coefficient between u and V and can be approximated by its sample version; see Anderson [11].
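A sample version of this screening rule might look like the following sketch; the regression-based computation of mcorr and the default value of δ are our assumptions, and Z_tilde would hold the PCA or ICA components computed earlier.

```python
import numpy as np

def mcorr(u, V):
    """Sample multiple correlation coefficient between a scalar variable u and a set of
    variables V (columns): the R of the least squares regression of u on V with intercept."""
    V1 = np.column_stack([np.ones(len(u)), V])
    coef, *_ = np.linalg.lstsq(V1, u, rcond=None)
    resid = u - V1 @ coef
    r2 = 1.0 - resid.var() / u.var()
    return np.sqrt(max(r2, 0.0))

def select_K0(Z_tilde, X, U, delta=0.05):
    """Keep the transformed components whose multiple correlation with (X, U) exceeds a
    small threshold delta; K0 is the number kept (one way to operationalize the rule)."""
    XU = np.column_stack([X, np.reshape(U, (len(U), -1))])
    kept = [l for l in range(Z_tilde.shape[1])
            if mcorr(Z_tilde[:, l], XU) > delta]
    return kept, len(kept)
```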
Several tuning parameters need to be chosen in order to implement the two-stage remodeling procedure. Fan and Li [5] showed that the SCAD penalty with a = 3.7 performs well in a variety of situations, and we follow their suggestion throughout this paper. We still need to choose the positive integer L for the basis functions and the tuning parameters λj of the penalty functions. Similar to the adaptive lasso of Zou [14], we suggest taking λj = λ/‖θ̃j‖, where θ̃j is an initial estimator of θj obtained by the ordinary least squares method based on the first term in (10). The two remaining parameters L and λ can then be selected simultaneously using the leave-one-out CV or GCV method; see Zhao and Xue [4] for more details.
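The structure of this tuning step can be sketched as follows: adaptive penalty levels λj = λ/‖θ̃j‖ in the spirit of Zou [14], with (L, λ) chosen by leave-one-out cross-validation. The function fit_and_predict below is a hypothetical stand-in for the penalized estimator based on (10); only the shape of the search is shown.

```python
import numpy as np

def adaptive_lambdas(theta_init_groups, lam):
    """Per-group penalty levels lambda_j = lambda / ||theta_j_init|| (adaptive-lasso style)."""
    return [lam / max(np.linalg.norm(th), 1e-8) for th in theta_init_groups]

def loo_cv_score(data, y, fit_and_predict, L_terms, lam):
    """Leave-one-out CV prediction error for one (L, lambda) pair. `fit_and_predict` is a
    hypothetical routine: given training rows of `data`, the responses, (L, lambda), and one
    held-out row, it returns the prediction for that row."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        y_hat = fit_and_predict(data[keep], y[keep], L_terms, lam, data[i])
        errs[i] = (y[i] - y_hat) ** 2
    return errs.mean()

def select_tuning(data, y, fit_and_predict, L_grid, lam_grid):
    """Grid search over (L, lambda) by leave-one-out CV, as suggested in the text."""
    scores = {(L, lam): loo_cv_score(data, y, fit_and_predict, L, lam)
              for L in L_grid for lam in lam_grid}
    return min(scores, key=scores.get)
```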
3. Simulation Studies
In this section, we investigate the behavior of the newly proposed method by simulation studies.
3.1. Linear Model with Normally Distributed Covariates
The dimensions of the full model (1) and the submodel (2) are chosen to be 100 and 5, respectively. We set β = (0.5, 3.5, 2.5, 1.5, 4.0)ᵀ, and the nuisance coefficient vector is built from γ1 and γ2, where γ2 ~ Unif[−0.5, 0.5]³⁰, a 30-dimensional uniform distribution on [−0.5, 0.5], and γ1 is chosen in one of the following two ways:
Case (I). γ1 ~ Unif[0.5, 1.0]¹⁰.
Case (II). γ1 = (1.0,1.0,1.0,1.5,1.5,1.5,2.0,2.0,2.0,2.0).
Here we denote the submodel (2) as model (I), the multistep adjusted linear model (5) as model (II), the two-stage model (12) as model (III), and the full model (1) as model (IV). We compare mean square errors (MSEs) of the new two-stage estimator based on model (III) with the estimator based on model (I), the multistep estimator based on model (II), the SCAD estimator and the least squares estimator based on model (IV). We also compare mean square prediction errors (MSPEs) of the above mentioned models with corresponding estimators.
The data are simulated from the full model (1) with sample size n = 100 and m = 1000 simulation replications. We use the sample-based PCA approximations to substitute for the τj's. The parameter a in the SCAD penalty function is set to 3.7, and λ is selected by the leave-one-out CV method.
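The Monte Carlo comparison itself has a simple structure, sketched below: for each replication, data are generated from the full model (1), each competing estimator is fitted, and componentwise squared errors for β and squared prediction errors on fresh data are accumulated. The simulate_full_model and fitter callables are hypothetical placeholders for the designs and estimators described above.

```python
import numpy as np

def monte_carlo(m, n, beta_true, simulate_full_model, fitters):
    """MSEs (componentwise, for beta) and MSPEs over m replications.
    `simulate_full_model(n)` returns (X, Z, U, y); each entry of `fitters` maps a method
    name to a function returning (beta_hat, predict); both are placeholders."""
    sq_err = {name: [] for name in fitters}
    sq_pred = {name: [] for name in fitters}
    for _ in range(m):
        X, Z, U, y = simulate_full_model(n)
        Xt, Zt, Ut, yt = simulate_full_model(n)          # independent test sample
        for name, fit in fitters.items():
            beta_hat, predict = fit(X, Z, U, y)
            sq_err[name].append((beta_hat - beta_true) ** 2)
            sq_pred[name].append(np.mean((yt - predict(Xt, Zt, Ut)) ** 2))
    mse = {k: np.mean(np.vstack(v), axis=0) for k, v in sq_err.items()}
    mspe = {k: float(np.mean(v)) for k, v in sq_pred.items()}
    return mse, mspe
```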
Table 1 reports the MSEs of the point estimators of the parameter β and the MSPEs of the model predictions. From the table we have the following findings. (1) The least squares estimator based on the full model has by far the largest MSEs, and in nearly all cases the new estimator has the smallest MSEs. (2) The relative ordering of the remaining estimators changes with c: the comparison that holds at c = 0.5 is reversed at c = 0.8. This shows that when the correlation between the covariates is strong the multistep adjustment is necessary, and the estimation and model prediction based on the two-stage model are then significantly improved. (3) Cases (I) and (II) show similar performance. (4) In line with the trend of the MSEs of the five estimators, the MSPE of the two-stage adjusted model is the smallest among the five models considered.
No. | Item | | | | |
---|---|---|---|---|---|---
Case (I), c = 0.5 | MSEs | 0.3079 | 0.0457 | 0.0660 | 0.0571 | 1.6105 × 10³
 | | 0.1763 | 0.0206 | 0.0346 | 0.0176 | 1.0940 × 10³
 | | 0.1396 | 0.0481 | 0.0631 | 0.0461 | 4.2049 × 10³
 | | 0.1870 | 0.0196 | 0.0349 | 0.0186 | 5.0183 × 10³
 | | 0.1131 | 0.0517 | 0.0609 | 0.0430 | 6.2615 × 10³
 | MSPEs | 3.4780 | 1.1896 | 1.6512 | 1.0679 | 3.0499 × 10²
Case (I), c = 0.8 | MSEs | 0.1568 | 0.6191 | 0.0934 | 0.0826 | 1.2494 × 10³
 | | 0.6239 | 0.1060 | 0.0090 | 0.0083 | 1.0456 × 10²
 | | 0.8829 | 0.8173 | 0.0895 | 0.1039 | 2.6368 × 10²
 | | 0.5882 | 0.0919 | 0.0107 | 0.0100 | 7.6452 × 10¹
 | | 1.0799 | 0.9829 | 0.0961 | 0.0929 | 1.1610 × 10³
 | MSPEs | 4.7930 | 2.6700 | 0.8354 | 0.7771 | 1.3223 × 10²
Case (II), c = 0.5 | MSEs | 0.4272 | 0.0660 | 0.0849 | 0.0557 | 4.3002 × 10²
 | | 0.6371 | 0.0318 | 0.0499 | 0.0295 | 3.7893 × 10³
 | | 0.4560 | 0.0715 | 0.0927 | 0.0588 | 1.2784 × 10³
 | | 0.5926 | 0.0306 | 0.0491 | 0.0287 | 6.7354 × 10³
 | | 0.9052 | 0.0734 | 0.0874 | 0.0583 | 2.5047 × 10²
 | MSPEs | 6.8634 | 1.5096 | 2.0780 | 1.2077 | 5.0464 × 10³
Case (II), c = 0.8 | MSEs | 0.6764 | 0.4263 | 0.1212 | 0.0960 | 1.3904 × 10³
 | | 0.9721 | 0.1060 | 0.0107 | 0.0102 | 4.0743 × 10²
 | | 0.6242 | 0.4756 | 0.1146 | 0.1003 | 1.0498 × 10³
 | | 1.0282 | 0.0954 | 0.0112 | 0.0098 | 5.6031 × 10²
 | | 1.3420 | 0.5474 | 0.1341 | 0.1124 | 9.9632 × 10²
 | MSPEs | 7.9928 | 2.1165 | 0.9514 | 0.8469 | 2.3110 × 10²
In summary, Table 1 indicates that the two-stage adjusted linear model (12) performs much better than the full model, and better than the submodel, the SCAD-penalized model and the multistep adjusted model.
3.2. Partially Linear Model with Nonnormally Distributed Covariates
- γ1 = (0.5, 0.1, 0.8, 0.2, 0.5, 0.2, 0.6, 0.5, 0.1, 0.9),
- γ2 ~ Unif[−0.3, 0.3]¹⁰, a 10-dimensional uniform distribution on [−0.3, 0.3].
We assume that the covariates are distributed in the following two ways.
Case (II). X = (1/(1 + c))(W1 + cV), Z1 = (1/(1 + c))(W2 + cV), Z2 = W3, Z3 = (1/(1 + c))(W4 + cV), Z4 = W5, where W1, W2, W3, W4 ~ Unif[−1.0, 1.0]⁵, W5 ~ Unif[−1.0, 1.0]³⁰, V ~ Unif[−1.0, 1.0]⁵, uniform distributions on [−1.0, 1.0], and the constant c = 0.1. All of W1, W2, W3, W4, W5, and V are independent.
The error term ε is assumed to be normally distributed as N(0, 0.3²).
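Since the Case (II) design above is fully specified, its covariates and the error term can be generated directly; the code below follows that description. The random seed, variable names, and the stacking of Z1–Z4 into one matrix are our own choices, and the remaining ingredients of the full model (1) (β, γ, and the nonparametric part) are not reproduced here.

```python
import numpy as np

# Case (II) covariates of Section 3.2:
# X = (W1 + c*V)/(1 + c), Z1 = (W2 + c*V)/(1 + c), Z2 = W3, Z3 = (W4 + c*V)/(1 + c), Z4 = W5,
# with all W's and V independent uniforms on [-1, 1] and c = 0.1.
rng = np.random.default_rng(2024)   # seed is an arbitrary choice
n, c = 100, 0.1

W1, W2, W3, W4 = (rng.uniform(-1.0, 1.0, size=(n, 5)) for _ in range(4))
W5 = rng.uniform(-1.0, 1.0, size=(n, 30))
V = rng.uniform(-1.0, 1.0, size=(n, 5))

X = (W1 + c * V) / (1 + c)
Z1 = (W2 + c * V) / (1 + c)
Z2 = W3
Z3 = (W4 + c * V) / (1 + c)
Z4 = W5
Z = np.hstack([Z1, Z2, Z3, Z4])             # nuisance covariates, stacked for convenience
eps = rng.normal(0.0, 0.3, size=n)          # error term, N(0, 0.3^2)
print(X.shape, Z.shape, eps.shape)
```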
Here we denote the submodel (2) as model (I)′, the multistep adjusted additive partially linear model (3) as model (II)′, the two-stage model (11) as model (III)′ and the full model (1) as model (IV)′. We compare mean square errors (MSEs) of the new two-stage estimator based on model (III)′ with the estimator based on model (I)′, the estimator based on model (II)′ and the least squares estimator based on model (IV)′. We also compare the mean average square errors (MASEs) of the nonparametric estimators of f(·) and the mean square prediction errors (MSPEs) of different models with corresponding estimators.
The data are simulated from the full model (1) with sample size n = 100 and m = 500 simulation replications. We use sample-based ICA approximations; see Hyvärinen and Oja [15]. The parameter a in the SCAD penalty function is set to 3.7, and the number of terms L and the parameter λ are selected by the GCV method. We use the standard Fourier orthogonal basis as the basis functions.
Table 2 reports the MSEs of the point estimators of the parameter β, the MASEs of f(·), and the MSPEs of the model predictions. From the table, we have the following results. (1) The worst-performing estimator has MSEs much larger than those of the other estimators, and the new estimator always has the smallest MSEs. (2) The MASEs of f(·) show a trend similar to that of the MSEs of the four estimators, although the differences are not very noticeable. (3) As with the MSEs, the MSPEs of the two-stage adjusted model are the smallest among the four models. (4) In Case (II), models (I)′, (II)′, and (III)′ perform a little better than in Case (I) because of the correlation structure among the covariates.
No. | Item | | | |
---|---|---|---|---|---
Case (I) | MSEs | 0.4352 | 5.0403 | 0.3267 | 2.9753 × 10¹
 | | 0.6859 | 1.2820 × 10¹ | 0.3328 | 1.4593 × 10¹
 | | 1.1152 | 8.1542 | 0.3723 | 1.4391 × 10¹
 | | 1.8489 | 7.2055 | 1.3194 | 2.4036 × 10¹
 | | 3.3079 | 1.6144 × 10¹ | 1.9989 | 4.8575 × 10¹
 | MASEs | 3.0887 | 5.9814 | 3.0175 | 3.0633
 | MSPEs | 4.6047 | 7.0331 × 10¹ | 3.5536 | 3.9648
Case (II) | MSEs | 0.0377 | 0.6144 | 0.0191 | —¹
 | | 0.0449 | 1.0876 | 0.0305 | —
 | | 0.0332 | 3.7510 | 0.0246 | —
 | | 0.0396 | 0.4324 | 0.0238 | —
 | | 0.0512 | 1.1995 | 0.0335 | —
 | MASEs | 0.4722 | 0.5220 | 0.4126 | 0.4380
 | MSPEs | 0.9221 | 9.3068 | 0.8053 | —
- ¹“—” denotes that the algorithm collapsed and returned no value.
In summary, Table 2 indicates that the two-stage adjusted model (11) performs much better than the full model and the multistep adjusted model, and better than the submodel.
4. Some Remarks
In this paper, the main objective is to estimate the parameter of interest β consistently. When estimating the parameter of interest, its bias is mainly determined by the relevant variables, while its variance may be inflated by the other variables. Because variable selection relies heavily on the sparsity of the parameter, when we work directly with the partially linear model, some variables that are irrelevant to the parameter of interest but have nonzero coefficients may be selected into the final model. This may affect the efficiency and stability of the estimator of β. Thus, based on the prespecified submodel, a two-stage remodeling method is proposed. In the new remodeling procedure, the correlation among the covariates (X, Z) and the sparsity of the regression structure are fully used, so the final model is sufficiently simplified and conditionally unbiased. Based on the simplified model, the estimation and model prediction are significantly improved. Generally, after the first stage the adjusted model is an additive partially linear model. Therefore, the remodeling method can be applied to partially linear regression models, with the linear regression model as a special case.
From the remodeling procedure, we can see that it can be directly applied to the additive partially linear model, in which the nonparametric function f(U) has a component-wise additive form. For a general partially linear model with a multivariate nonparametric function, we would have to resort to multivariate nonparametric estimation methods; if the dimension of the covariate U is high, this may suffer from the "curse of dimensionality".
In the model simplification procedure, the orthogonal series estimation method is used. This is mainly for technical convenience, because the semiparametric penalized least squares (6) can easily be transformed into the parametric penalized least squares (10), from which the theoretical results are obtained. Other nonparametric methods, such as kernel and spline methods, can be used without any essential difficulty, but they cannot achieve this transformation directly. Compared with the kernel method, it is somewhat difficult for the series method to establish the asymptotic normality of the nonparametric component f(U) under primitive conditions.
Acknowledgment
Lin and Zeng's research is supported by NNSF projects (11171188, 10921101, and 11231005) of China, NSF and SRRF projects (ZR2010AZ001 and BS2011SF006) of Shandong Province of China, and the K. C. Wong-HKBU Fellowship Programme for Mainland China Scholars 2010-11. Wang's research is supported by NSF project (ZR2011AQ007) of Shandong Province of China.
Appendix
A. Some Conditions and Proofs
A.1. Regularity Conditions (C1)–(C6)
- (C1) has finite nondegenerate compact support, denoted as .
- (C2) The density functions rj(t) of and r0(t) of U satisfy 0 < L1 ≤ rj(t) ≤ L2 < ∞ on their supports for 0 ≤ j ≤ K0, for some constants L1 and L2, and they are continuously differentiable.
- (C3) and are continuous. For given and u, is positive definite, and its eigenvalues are bounded.
- (C4) , Ef(U) = 0, and the first two derivatives of f(·) are Lipschitz continuous of order one.
- (C5) as n → ∞.
- (C6) for j = s + 1, …, K0, where s satisfies for 1 ≤ j ≤ s and for s < j ≤ K0.
Conditions (C1)–(C3) are regular constraints on the covariates, and condition (C4) imposes constraints on the regression structure, as in Härdle et al. [16]. Conditions (C5) and (C6) are assumptions on the penalty functions similar to those used in Fan and Li [5] and Wang et al. [7].
A.2. Proof for Theorem 1
Let δ = n^{−r/(2r+1)} + an, β = β0 + δT1, θ = θ0 + δT2, ν = ν0 + δT3, and T = (T1ᵀ, T2ᵀ, T3ᵀ)ᵀ. Firstly, we shall prove that, for every ϵ > 0, there exists C > 0 such that P{inf_{∥T∥=C} F(β, θ, ν) > F(β0, θ0, ν0)} ≥ 1 − ϵ.
Similarly, there exists a local minimizer that satisfies this bound, from which the stated convergence rates follow.
A.3. Proof for Theorem 2
Under the stated conditions and λj n^{r/(2r+1)} ≥ λmin n^{r/(2r+1)} → ∞, we have ∂F(β, θ, ν)/∂θj = Op(nλj(θj/∥θj∥₂)). So the sign of the derivative is determined by θj.
So, with probability tending to 1, θ̂j = 0 for j = s + 1, …, K0. Consequently, ĝj(·) ≡ 0 for j = s + 1, …, K0.
A.4. Proof for Theorem 4
By the law of large numbers, the relevant sample matrix converges in probability to Σ. Then, using the Slutsky theorem, we obtain the conclusion of Theorem 4.