Mixed Spline Smoothing and Kernel Estimator in Biresponse Nonparametric Regression
Abstract
Mixed estimators in nonparametric regression have so far been developed for models with one response. Biresponse cases in which the predictor variables follow different patterns, and which therefore call for mixed estimators, are often encountered. Therefore, in this article, we propose a biresponse nonparametric regression model with mixed spline smoothing and kernel estimators. This mixed estimator is suitable for modeling biresponse data in which some patterns (response vs. predictors) tend to change at certain subintervals, as is typical of spline smoothing, while other patterns tend to be random and are commonly modeled using kernel regression. The mixed estimator is obtained through two-stage estimation, i.e., penalized weighted least squares (PWLS) followed by weighted least squares (WLS). Furthermore, the proposed biresponse modeling with mixed estimators is validated using simulation data. This estimator is also applied to the percentage of the poor population and human development index data. The results show that the proposed model can be appropriately implemented and gives satisfactory results.
1. Introduction
Regression analysis is one of the most popular statistical methods used for prediction. It is commonly used to determine the functional relationship between independent variables (predictors) and dependent variables (responses) [1]. Functional relationships between predictor variables and response variables can have known or unknown patterns; if these relationships have unknown patterns, the appropriate type of regression analysis is nonparametric regression [2]. In nonparametric regression, the regression curve is assumed to be smooth. This approach is highly flexible because the data themselves drive the estimate of the regression curve, without subjective input from the researcher [3]. Researchers have proposed methods for estimating nonparametric regression functions such as spline, kernel, and Fourier series functions. Spline nonparametric regression has been developed by Eubank [3], Becher et al. [4], and Wang et al. [5]. Hall and Huang [6], Okumura and Naito [7], Du et al. [8], Chamidah and Saifudin [9], and Erçelik and Nadar [10] developed kernel nonparametric regression. Bilodeau [11] and Amato et al. [12] estimated the nonparametric regression function with the Fourier series function.
In applying nonparametric regression modeling, researchers sometimes assume that each predictor variable has the same pattern. In real cases, however, the pattern between the response and each predictor often differs. If the researcher still insists on applying one type of estimator to all predictor variables, the estimation results can be inaccurate and produce a large error. Researchers have begun to develop nonparametric regression with a mixed estimator, including Hidayat et al. [13], Mariati et al. [14], and Octavanny et al. [15]. These mixed estimators are formed by referring to the idea of semiparametric regression. The semiparametric regression model is an additive regression model that consists of a parametric component and a nonparametric component; see the work of Green and Yandell [16], Roozbeh and Arashi [17], and Roozbeh and Najarian [18]. In these mixed estimators, the additive model concept in semiparametric regression is adapted by using two different nonparametric components. However, these mixed estimators only use one response variable, even though some biresponse cases also have different patterns among the predictor variables. To date, mixed estimators for biresponse nonparametric regression have not been developed. Therefore, this study develops a new theory about the mixed estimator of spline smoothing and kernel in biresponse nonparametric regression. This mixed estimator extends the one proposed by Hidayat et al. [13], in which the kernel estimator is treated as fixed, whereas the kernel function in this paper is estimated. In addition, this new mixed estimator can be applied to biresponse cases. This mixed estimator is obtained through a two-stage estimation. 
The first stage of estimation uses penalized weighted least squares (PWLS) to obtain the spline smoothing component, followed by the second stage, which employs the weighted least squares (WLS) estimation method to estimate the kernel component.
The spline smoothing estimator depends strongly on the smoothing parameter, while the kernel estimator depends strongly on the bandwidth parameter. These smoothing and bandwidth parameters are tuning parameters, and their optimal values produce the best regression model. In nonparametric and semiparametric regression, there are several methods to determine the optimal parameter values. Two popular methods are cross-validation (CV) and generalized cross-validation (GCV). The CV method selects the best model based on the best predictive ability across all the different datasets. This method is widely used, but its calculation becomes more complex as the number of datasets increases. In addition, for partial linear models including mixed estimator models, the leave-one-out cross-validation method tends to be time consuming even for moderate sample sizes [19]. Craven and Wahba [20] modified the CV method to make the calculation simpler, and the result of this modification is called the GCV method. This method is widely used by researchers because it has several advantages: it is simple and efficient to calculate, it is invariant to transformation, and it does not require variance information. This method also has the advantage of optimal asymptotic properties over other methods [21, 22]. Some researchers have developed specific GCV methods according to the model in their research, such as the GCV for semiparametric ridge regression with kernel smoothing [23–25]; also, several types of GCV were developed for uniresponse nonparametric regression with mixed estimators, including the mixed estimator of spline smoothing and kernel [13], the mixed estimator of spline smoothing and Fourier series [14], and the mixed estimator of truncated spline and Fourier series for longitudinal data [15]. 
Therefore, in this study, the determination of the best model was carried out using the GCV method which was developed specifically for the mixed estimator of spline smoothing and kernel in biresponse nonparametric regression.
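As a rough illustration of the GCV idea described above, the following minimal Python sketch scores a linear smoother by GCV = (RSS/n) / (1 − tr(A)/n)², where A is the smoother (hat) matrix. It is not the biresponse GCV developed in this paper: it assumes a single response, uses a Gaussian Nadaraya-Watson kernel smoother as a stand-in for A, and the function names, toy data, and bandwidth grid are all hypothetical.

```python
import math


def gaussian_kernel_weights(t, t0, bandwidth):
    """Normalized Gaussian kernel weights of the points t around t0."""
    raw = [math.exp(-0.5 * ((ti - t0) / bandwidth) ** 2) for ti in t]
    total = sum(raw)
    return [w / total for w in raw]


def smoother_matrix(t, bandwidth):
    """Hat matrix A of a Nadaraya-Watson smoother, so that fitted = A y."""
    return [gaussian_kernel_weights(t, t0, bandwidth) for t0 in t]


def gcv(y, hat):
    """GCV score (RSS / n) / (1 - trace(A) / n)^2 for a linear smoother."""
    n = len(y)
    fitted = [sum(a * yj for a, yj in zip(row, y)) for row in hat]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    trace = sum(hat[i][i] for i in range(n))
    return (rss / n) / (1.0 - trace / n) ** 2


# Toy data (illustrative only): a smooth curve plus an alternating disturbance.
t = [i / 10 for i in range(21)]
y = [math.sin(3 * ti) + 0.1 * (-1) ** i for i, ti in enumerate(t)]

# Hypothetical bandwidth grid; the bandwidth with the smallest GCV is selected.
grid = [0.05, 0.1, 0.2, 0.5, 1.0]
scores = {h: gcv(y, smoother_matrix(t, h)) for h in grid}
best_bandwidth = min(scores, key=scores.get)
```

The same pattern extends to several tuning parameters by evaluating the score over a grid of parameter combinations and keeping the combination with the smallest value.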
Next, the proposed mixed estimator is applied to simulation data. The formula for generating the data contains two different functions to represent two different patterns of predictor variables. This estimator is also applied to model the percentage of the poor population and human development index data in Papua Province, Indonesia. Empirical results indicate that the proposed mixed estimator performs very well for modeling data with two different patterns. One pattern (response vs. predictors) tends to change at certain subintervals, and the other pattern appears to be random, which is commonly modeled using kernel regression.
The rest of this paper is organized as follows: In Section 2, we present the materials and methods for the two-stage estimation method, i.e., the PWLS followed by the WLS. The proposed mixed spline smoothing and kernel estimator in biresponse nonparametric regression is explained in Section 3.1. The selection of smoothing and bandwidth parameters using generalized cross-validation (GCV) is described in Section 3.2. The simulation study and real data analysis that illustrate the performance of the proposed biresponse mixed estimator are presented in Sections 3.3 and 3.4. The conclusions and further research are presented in the last section.
2. Materials and Methods
3. Results and Discussion
3.1. Mixed Spline Smoothing and Kernel Estimator in Biresponse Nonparametric Regression
Before conducting the estimation to obtain biresponse mixed spline smoothing and kernel estimator, it is necessary to obtain the function form for each component of this mixed estimator. The structure of the spline smoothing component is explained in Lemma 1, while the kernel component is explained in Lemma 2.
Lemma 1. If the regression curve g is assumed to be smooth and contained in the Sobolev space [22], then the function form of the spline smoothing component in the biresponse nonparametric regression can be stated as
{τh1, τh2, …, τhm} and {βh1, βh2, …, βhn} are bases of spaces in the Sobolev space.
Proof. If gh; h = 1,2 is a function lying in Hilbert space H, the H space can be decomposed into a direct sum of two spaces H0 and H1 where H = H0 ⊕ H1 and H0 ⊥ H1. If {τh1, τh2, …, τhm} is the basis in H0 and {βh1, βh2, …, βhn} is the basis in H1, according to Wahba [22], for each function gh ∈ H with uh ∈ H0 and vh ∈ H1, we obtain
Equation (10) is a bounded linear functional on the space H and gh ∈ H; therefore, equation (10) can be stated as
Therefore, for all responses (h = 1,2), we can obtain the function form of the spline smoothing component in the biresponse regression curve as follows:
Lemma 2. If the regression curve f is approximated by the kernel function, then the function form of the kernel component in the biresponse nonparametric regression can be expressed as
The function fh(ti); h = 1,2; i = 1,2, …, n is approximated by the Taylor series expansion in t around t0.
Proof. The form of the regression function f is derived from the component fh(ti) in equation (2). The form of the function is unknown and is approximated using the kernel estimator. The function fh(ti) for h = 1,2 can be approximated by the Taylor series expansion in t around t0 as follows [9]:
The kernel estimator can be obtained when the polynomial order mh = 0; h = 1,2. Therefore, the function form for each response involving all observations can be stated as
Then, we can obtain the function form of component f in biresponse nonparametric regression as follows:
Theorem 1. The biresponse nonparametric regression model is given in equation (1), where each component of the regression curve is additive as stated in equation (2). The function form of the g component is presented in Lemma 1, and the function form of the f component is presented in Lemma 2. Using PWLS in the first-stage estimation and WLS in the second-stage estimation, the mixed spline smoothing and kernel estimator in biresponse nonparametric regression is obtained as follows:
Proof. The first-stage estimation on the mixed spline smoothing and kernel estimator in biresponse nonparametric regression is performed by estimating the spline smoothing component using the PWLS in equation (5). The penalty component in equation (5) can be obtained through the following decomposition [26]:
The solution for the PWLS optimization can be obtained from the partial derivatives of Q(c, d) with respect to c and d. The partial derivative of Q(c, d) with respect to c gives the following result:
The partial derivative of Q(c, d) with respect to d gives the result
Equation (33) is substituted into equation (32), which is then solved to give the following result:
Furthermore, by substituting equation (34) into equation (30), we obtain
The resulting estimates of c and d are substituted into the function form of the spline smoothing component in equation (16), and then, the following spline smoothing estimator component in the biresponse nonparametric regression model is obtained:
Because z = y − f, the first-stage estimation results can be stated as
In the second stage of estimation, the function f as the kernel component on a mixed spline smoothing and kernel estimator in biresponse nonparametric regression is estimated using the WLS method with the following formula:
By substituting the results of the first-stage estimation (38) and the function form f (Lemma 2) into the model of the mixed estimator (equation (3)), the error of this mixed estimator model can be written as follows:
Furthermore, by supposing D = M−1Kα(t0) and substituting equation (42) into equation (40), we can get the following equation for WLS optimization:
The solution for the WLS optimization can be obtained from the partial derivative of Q(t0) with respect to ω0(t0). The optimization result is obtained as follows:
Therefore, the estimation for kernel estimator component can be written as
Based on the first-stage estimation results in equation (38) and the second-stage estimation results in equation (45), the estimation of the additive regression curve in equation (2) with the mixed spline smoothing and kernel estimator in biresponse nonparametric regression can be stated as
3.2. Selection of Smoothing and Bandwidth Parameters
3.3. Simulation Study
The predictors are generated from xi ~ U(0,2) and ti ~ U(0,2) with the sample size n = 100, and the random errors εhi are generated from bivariate normal distributions with , and ρ = 0.6. The scatterplots of the simulated data are shown in Figure 1. It can be seen that the pattern between y1, y2 against x tends to change at certain subintervals such as the spline smoothing pattern, while the scatterplot between y1, y2 against t tends to have a random pattern that is commonly modeled with kernel regression.
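A data-generating scheme of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the mean functions `g1`, `g2`, `f1`, and `f2` are hypothetical placeholders and are not the paper's generating equation, and the error standard deviations default to 1 because their values are not recoverable from the text above. Only the uniform predictors on (0, 2), the sample size n = 100, and the error correlation ρ = 0.6 are taken from the description.

```python
import math
import random


# Hypothetical mean functions (assumptions, not the paper's generating equation).
def g1(x):
    return 5.0 * math.sin(math.pi * x)


def g2(x):
    return 3.0 * math.cos(math.pi * x)


def f1(t):
    return 2.0 * t


def f2(t):
    return t ** 2


def simulate_biresponse(n=100, rho=0.6, sigma1=1.0, sigma2=1.0, seed=0):
    """Draw n observations of an additive biresponse model with correlated errors.

    sigma1 and sigma2 are assumed values; rho is the error correlation.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0)
        t = rng.uniform(0.0, 2.0)
        # Two independent standard normals are combined so that corr(e1, e2) = rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        e1 = sigma1 * z1
        e2 = sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        rows.append({
            "x": x, "t": t, "e1": e1, "e2": e2,
            "y1": g1(x) + f1(t) + e1,
            "y2": g2(x) + f2(t) + e2,
        })
    return rows


data = simulate_biresponse()
```

Fixing the seed makes the draw reproducible, which is convenient when comparing estimators on the same simulated sample.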

In this simulation, the Gaussian kernel was employed. Based on the empirical results from the two-stage estimation, we obtain combinations of the smoothing parameters (λ1, λ2) and bandwidth parameters (α1, α2) around the optimal values (Table 1). Because of limited space, only a few of the combinations are shown. The best model is the one with the smallest GCV value, which is obtained with the optimal smoothing parameters λ1(opt) = 0.0002413 and λ2(opt) = 0.0000899 along with the optimal bandwidth parameters α1(opt) = 2.396 and α2(opt) = 2.416. This model produces the lowest GCV = 3.239 with R2 = 99.91% and RMSE = 0.1045.
λ1 | λ2 | α1 | α2 | GCV |
---|---|---|---|---|
0.002413 | 0.000899 | 23.96 | 24.16 | 13.123 |
0.0002413 | 0.0000899 | 23.96 | 24.16 | 3.397 |
0.002413 | 0.000899 | 2.396 | 2.416 | 12.143 |
0.0002413 | 0.0000899 | 2.396 | 2.416 | 3.239 |
0.00002413 | 0.00000899 | 2.396 | 2.416 | 5.739 |
0.0002413 | 0.0000899 | 0.2396 | 0.2416 | 11494.49 |
0.00002413 | 0.00000899 | 0.2396 | 0.2416 | 3130.825 |
The result of modeling the simulation data using the mixed spline smoothing and kernel estimator is compared with modeling using either a spline smoothing estimator or a kernel estimator only. The results of these models are presented in Table 2. Table 2 shows that the model with the smallest GCV value is the model with the mixed spline smoothing and kernel estimator. In addition, this model has the largest R2 and the lowest MSE value.
Model | Minimum GCV |
---|---|
Mixed spline smoothing and kernel | 3.2392 |
Spline smoothing | 4.6587 |
Kernel | 4.4085 |
The plot of the estimation results against the original simulation data is presented in Figure 2, where the estimated values (red triangles) are very close to the original data (black squares). Thus, the proposed model and estimation procedure predict the simulated responses accurately. Furthermore, on the left side of Figure 3, the surface plots are formed using equation (50), which is the equation for generating the simulation data, whereas on the right side of Figure 3, there are two surface plots, one for each response, which are formed from equation (48) with its parameters estimated using two-stage estimation, i.e., the PWLS and WLS. The two sides of Figure 3 show that the plots have similar surface shapes. This evidence indicates that the estimation procedure proposed in equation (48) can be used appropriately to estimate the function generated from the simulation.






3.4. Data Application
The biresponse mixed spline smoothing and kernel estimator proposed in this paper is applied to regress the percentage of the poor population (PPP), as the first response, and the human development index (HDI), as the second response, on several predictors. These two response variables are important because they are indicators of the success level of a country’s development. Biresponse modeling is adopted for these two variables because initial studies indicate a negative correlation between PPP and HDI: the lower the PPP in a region, the higher the HDI in that region [27, 28].
Several variables that typically affect the two response variables are the gross regional domestic product (GRDP) and the population growth rate. Some researchers have pointed out several factors that can affect the PPP and HDI, including Grubaugh [29], who stated that the variables that influence the growth of HDI in developing countries are population, population growth, and the initial level of the gross domestic product (GDP). Meanwhile, Mallick and Ghani [30] found that high population growth is the cause of poverty in Pakistan. In North Sumatra Province, Indonesia, the GRDP and university-level education have a positive and significant influence in reducing the PPP [31]. Based on Malthus’s theory, poverty is considered to be the impact of high population growth rates [32]. In addition, the growth of life-support resources is considered slower than population growth. A large increase in population growth will reduce the quality of natural resources and the opportunity for people to access life-support facilities. This situation can reduce the quality of human life and make it harder for people to live in prosperity. Rapid economic growth is one way to alleviate poverty [33]. The GDP or GRDP has a close relationship with economic growth because the economic growth of a region is related to an increase in production or an increase in income per capita. Moreover, if the GRDP is higher, the income per capita in the region will increase, which improves the ability of the community to meet their needs and improve their quality of life.
The application of this mixed estimator model to the PPP and HDI in this study follows a preliminary study conducted by the authors. The preliminary study shows that the GRDP has a pattern that changes at certain subintervals, as in the spline pattern. In contrast, the population growth rate has a random pattern that is usually modeled by the kernel. Therefore, the PPP and HDI of Papua Province in 2017 are used as response variables, while the predictor variables are the GRDP and the population growth rate. The data were obtained from Statistics Indonesia (Badan Pusat Statistik–BPS), Papua Province, with 29 regencies/cities as the observation units. The biresponse mixed spline smoothing and kernel estimator in equation (12) was applied to the data. This modeling produces a minimum GCV value of 62.75, an R2 of 96.54%, and an RMSE of 3.166. The R2 value indicates that the model explains 96.54% of the variation in the responses. This finding shows that the biresponse mixed spline smoothing and kernel estimator is suitable for modeling the PPP and HDI in Papua Province. Also, the bar charts in Figures 4 and 5 show that the estimated values of PPP and HDI in Papua Province are close to their actual values.


Furthermore, to assess the predictive ability of this modeling, we use the model obtained from the 2017 data to predict the PPP and HDI values of Papua Province in 2018 and 2019. This prediction is carried out by applying the fitted model to the GRDP and the population growth rate of Papua Province in 2018 and 2019. One criterion for the predictive ability of a model is the Mean Absolute Percentage Error (MAPE). The MAPE obtained from the predictions for 2018 is 3.8042%, which corresponds to an accuracy level of 96.1958%, and the MAPE from the predictions for 2019 is 5.1658%, which corresponds to an accuracy level of 94.8342%. These MAPE values are less than 10%, which indicates that the biresponse nonparametric regression model with mixed spline smoothing and kernel estimators has a good ability to predict PPP and HDI in Papua Province.
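For reference, the MAPE and the corresponding accuracy level can be computed as in the following minimal Python sketch. It uses the standard definition of MAPE; how the paper pools the errors across the two responses is not stated, so the function below simply takes one list of actual values and one list of predictions. The numbers in the example are made up for illustration and are not BPS data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up example values (not taken from the study).
actual = [20.0, 50.0, 80.0]
predicted = [22.0, 48.0, 80.0]

error = mape(actual, predicted)   # about 4.67 percent
accuracy = 100.0 - error          # accuracy level as used in the text: 100 - MAPE
```

With this definition, the accuracy levels reported above follow directly from the MAPE values, e.g., 100 − 3.8042 = 96.1958.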
4. Conclusions
This paper presents the biresponse nonparametric regression model with mixed spline smoothing and kernel estimators. This new mixed estimator is obtained through two-stage estimation, i.e., the first stage uses the PWLS to obtain the spline smoothing component, followed by the second stage, which employs the WLS to estimate the kernel component. This mixed estimator is formed to handle the different data patterns of each predictor in the biresponse case, so it can provide better estimation results. The best model for the proposed estimator is the one that produces the minimum GCV value. The simulation results show that the biresponse mixed spline smoothing and kernel estimator provides better results than the biresponse spline smoothing or biresponse kernel estimator. Furthermore, the proposed estimator can be appropriately applied to model the percentage of the poor population (PPP) and human development index (HDI) in Papua Province and gives satisfactory results. A limitation of this study is that only one predictor variable is used for each component of the estimator, both spline smoothing and kernel. For future work, this biresponse mixed estimator can be developed with more than one predictor for each estimator component. Apart from this limitation, this study contributes insight into mixed estimators in biresponse nonparametric regression.
Conflicts of Interest
The authors declare that there are no conflicts of interest in this paper.
Acknowledgments
The authors thank the Ministry of Research, Technology, and Higher Education (Ristekdikti), Republic of Indonesia, for supporting this research through the PMDSU scholarship.
Open Research
Data Availability
The data in this article are available in the BPS of the Papua Province repository (https://papua.bps.go.id).