Comparison of Semiparametric Models in the Presence of Noise and Outliers
Abstract
Various studies have examined generalized additive models (GAMs), comparing thin plate splines (tp), P-splines (ps), cubic regression splines (cr), and Gaussian processes (gp) for discrete choice data, function approximation, and in the presence of multicollinearity and outliers. Some studies have applied ps to models with correlated and heteroscedastic errors, while others have reviewed multiple smoothing term packages for modeling GAMs. This study examines the performance of semiparametric models in the presence of different types of noise and outliers within the framework of GAMs through simulation. Four GAM smoothers, cr, ps, tp, and gp, were fitted to simulated data with different types of noise and outliers across varying sample sizes. According to our investigation, the cr model performs well in terms of deviance for the majority of sample sizes and all types of noise. The ps model frequently performs well at larger sample sizes, particularly in terms of AIC and GCV under heteroscedastic and Gaussian noise. The gp model excels at the smallest sample size under Gaussian and lognormal noise in terms of GCV, and the tp model frequently performs best under exponential and lognormal noise for larger samples in terms of AIC and GCV. For data containing outliers, the cr and tp models are effective with smaller sample sizes, while the gp model excels with larger sample sizes based on AIC and GCV. Regarding deviance, the cr model consistently performs best across all sample sizes. Our results show that the sample size and the type of noise in the data have a significant impact on a smoothing model’s performance. No single model consistently outperforms the others for all noise types and sample sizes, suggesting that the choice of model should be based on the specific goal of a study.
1. Introduction
The method of ordinary least squares (OLS) is a statistical approach commonly used for parameter estimation, mainly due to its historical popularity and straightforward computation [1]. It is highly regarded for its ability to produce unbiased and efficient parameter estimates, provided key foundational assumptions are met. These assumptions include normally distributed error terms, equal variance of the error terms, and the absence of outliers, leverage points, multicollinearity, and deviations from linearity [2]. The assumption of linearity is a crucial prerequisite; however, attaining linearity in practical scenarios can be challenging for a multitude of reasons. Statistical models are inevitably simplifications, and even the most intricate model can offer only a limited representation of reality. Consequently, the task of estimating statistical models may appear futile at first glance.
However, Box [3] succinctly captured the essence of statistical research when he stated that “all models are wrong, but some are useful.” Despite their inherent inaccuracies, statistical models afford us significant insights into the political, economic, and sociological dimensions of our world. Statistical models are inherently simplifications of reality, as they necessitate making assumptions about various aspects of the real world. The practice of statistics acknowledges the need for assumptions, and a significant part of statistical modeling involves conducting diagnostic tests to ensure that these assumptions are met. In the social sciences, considerable effort is dedicated to scrutinizing assumptions concerning the nature of the error term, such as heteroskedasticity or serial correlation. However, social science researchers often tend to be less rigorous in testing assumptions regarding the functional form of the model [4]. The predominant choice is the linear functional form, typically characterized by additivity, yet researchers often overlook the crucial task of validating the linearity assumption. While substantial attention is directed toward specification and avoiding misspecification, there is a noticeable dearth of exploration regarding alternative functional forms when the chosen form proves to be incorrect, essentially resulting in a specification error [4].
Regression analysis, a widely used statistical tool for modeling relationships between variables and investigating dependence, has found applications in numerous fields of study such as social science, health science, engineering, and physical science. For instance, banks use regression analysis to determine factors that positively or negatively impact their profits, while medical doctors employ regression to assess total body fat in patients by considering variables that influence its increase or decrease. Similarly, statisticians in hospitals use regression analysis to examine lifestyle factors that may be causative agents for certain conditions, such as high blood pressure [2]. Hence, regression analysis holds significant importance as a statistical tool. It is used to answer questions like the following: Does increasing class size affect students’ success? Can the duration of the most recent eruption of Old Faithful Geyser be used to predict the time of the next eruption? Do dietary changes affect cholesterol levels, and if so, do the effects depend on other factors like age, sex, and level of exercise? Do nations with higher incomes per person have lower birth rates than nations with lower incomes? Many research efforts focus on regression analysis as a key component [5].
The use of linear functional forms is more widespread than commonly believed. Many analysts concentrate largely on the linearity of the outcome variable within a statistical model. Analysts typically use least squares estimation of a linear regression model when the outcome variable is continuous, whereas generalized linear models (GLMs) such as logistic or Poisson regression are often used for discrete outcomes. Researchers frequently believe they have abandoned the linear functional form by using logistic or Poisson regression models; these models nevertheless retain a crucial linear property in their functional form [4]. The introduction of the GLM notation by [6] offers valuable clarity regarding the linearity assumption in models that are commonly perceived as nonlinear by many researchers.
Hastie and Tibshirani [7] proposed the class of generalized additive models (GAMs), a broadened version of GLMs. Semiparametric regression models encompass both additive models and GAMs [8]. They combine the use of local estimation methods, such as lowess and splines, with standard linear models and GLMs. These models give analysts the flexibility to incorporate nonparametric regression for certain predictor variables while estimating other predictors parametrically [4]. While GAMs relax the assumption of a global fit, they still maintain the assumption of additive effects. The additivity assumption in GAMs enhances interpretability compared to other methods such as neural networks, support vector machines, projection pursuit, and tree-based approaches. A crucial advantage of GAMs is their ability to diagnose nonlinearity. Within the GAM framework, simple linear and power transformations are nested, allowing local estimates from a GAM to be compared against linear, quadratic, or other transformations using statistical tests such as the F-test or likelihood ratio test [4]. Semiparametric models offer more flexibility than parametric models because they do not require a strict assumption about the data distribution. Asfha [9] assessed and compared the performance of GAMs, focusing on their tolerance to the effects of outliers, multicollinearity, and the combined impact of both. The study by [10], titled “Selection of Splines Models in the Presence of Multicollinearity and Outliers,” aimed to address challenges related to multicollinearity and outliers, evaluating cubic regression splines (cr), P-splines (ps), and thin plate splines (tp) for handling these challenges.
In this study, however, our objective is to perform simulation experiments to evaluate and analyze the effectiveness of semiparametric models when subjected to various types of noise and outliers within the context of GAMs. Noise refers to random variations in the data that arise from measurement errors, environmental conditions, or inherent randomness in the data collection process; these variations are assumed to follow a probabilistic distribution, typically normal, and do not systematically bias the dataset. Outliers, in contrast, are observations that deviate significantly from the majority of the data and do not conform to the expected distribution or pattern. They may indicate data entry errors, rare events, or underlying phenomena not captured by the main data model.
2. Methods and Materials
Smoothing is a straightforward method for fitting a curve to the pattern of data points in a scatter plot, and it serves as a foundational building block of GAMs. In this study, four smoothing techniques based on GAMs are adopted, namely, cr, ps, tp, and Gaussian process (gp).
2.1. cr
- λ ≥ 0 is a smoothing parameter controlling the trade-off between fidelity to the data and roughness of the function estimate. It is often estimated by generalized cross validation (GCV) or by restricted maximum likelihood (REML) [4, 11].
- As λ ⟶ 0, the curve-fitting algorithm imposes no penalty on the model for fitting the data too closely. This can result in a very accurate fit to the data, but it can also lead to a very noisy curve that follows every minor fluctuation in the data [4, 11].
- As λ ⟶ ∞, the roughness penalty becomes paramount and the estimate converges to the linear least squares estimate [4, 11].
- The second component of the criterion, the roughness penalty, measures the curvature or roughness of the estimate throughout the range of x values [4].
- k denotes the number of knots.
- 0_{2×2} is the 2 × 2 matrix of zeroes.
- I_{k×k} is the k × k identity matrix.

These quantities enter the penalized least squares criterion sketched below.
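The criterion itself did not survive extraction, so the following is a plausible reconstruction rather than the authors’ original equation. It assumes the standard cubic regression spline setup, in which the fidelity term is penalized by the integrated squared second derivative (the “second component” described above) and, in basis form, the first two coefficients (intercept and linear term) are left unpenalized, as the block matrices 0_{2×2} and I_{k×k} suggest:

```latex
\min_{f}\ \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2
  \;+\; \lambda \int f''(x)^2 \, dx ,
\qquad
\min_{\beta}\ \lVert y - X\beta \rVert^{2}
  \;+\; \lambda\, \beta^{\top} D \beta ,
\quad
D = \begin{pmatrix} 0_{2\times 2} & 0 \\ 0 & I_{k\times k} \end{pmatrix}
```

Here X and β (basis matrix and coefficient vector) are assumed notation; the block-diagonal D shrinks the k knot coefficients while leaving the intercept and linear coefficients unpenalized.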
2.2. ps
- P is a first-order finite difference matrix.

This matrix enters the P-spline criterion sketched below.
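A hedged reconstruction of the P-spline objective consistent with this definition, assuming a B-spline basis matrix B with coefficient vector β (both symbols are introduced here for illustration):

```latex
\min_{\beta}\ \lVert y - B\beta \rVert^{2} \;+\; \lambda \lVert P\beta \rVert^{2}
\;=\; \lVert y - B\beta \rVert^{2} \;+\; \lambda\, \beta^{\top} P^{\top} P\, \beta
```

Penalizing first-order differences of adjacent B-spline coefficients discourages abrupt jumps between neighboring basis functions, which is the defining idea of P-splines.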
2.3. tp
- J_{md}(f) is the penalty function, which measures the wiggliness of the smoother f.
- λ is the smoothing parameter.
- d is the number of predictors in the smooth function.
- m = floor((d + 1)/2) + 1.
- The summation runs over all possible combinations of nonnegative integer values γ1, γ2, ⋯, γd that satisfy γ1 + γ2 + ⋯ + γd = m.
- m!/(γ1! ⋯ γd!) is the multinomial coefficient representing the number of ways to partition m identical objects into d distinct groups, where group i contains γi objects.
- ∂^m f/(∂x1^γ1 ⋯ ∂xd^γd) is the mth-order partial derivative of the function f with respect to each of the d variables, where the orders of differentiation are determined by γ1, γ2, ⋯, γd.
- δ and α are unknown parameter vectors to be estimated, in which δ is subject to the linear constraint T^⊤δ = 0, where T_{ij} = ϕj(x_i).
- ϕj is the jth basis function.
- M = (m + d − 1)!/(d! (m − 1)!).

These terms enter the thin plate spline penalty and smoother sketched below.
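The equations these symbols belong to were lost in extraction; the standard thin plate spline forms they imply are sketched here (η_{md}, the thin plate radial basis function, is assumed notation):

```latex
J_{md}(f) \;=\; \int_{\mathbb{R}^{d}}
  \sum_{\gamma_{1}+\cdots+\gamma_{d}=m}
  \frac{m!}{\gamma_{1}! \cdots \gamma_{d}!}
  \left( \frac{\partial^{m} f}
              {\partial x_{1}^{\gamma_{1}} \cdots \partial x_{d}^{\gamma_{d}}}
  \right)^{2} dx_{1} \cdots dx_{d} ,
\qquad
\hat{f}(x) \;=\; \sum_{i=1}^{n} \delta_{i}\,
  \eta_{md}\bigl( \lVert x - x_{i} \rVert \bigr)
  \;+\; \sum_{j=1}^{M} \alpha_{j}\, \phi_{j}(x),
\quad \text{subject to } T^{\top}\delta = 0
```

The smoother minimizes the fidelity term plus λ J_{md}(f), so λ again governs the trade-off between fit and wiggliness.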
2.4. gp
- x = [x1, ⋯, xn]
- f(x) = [f(x1), ⋯, f(xn)]
- μ = [m(x1), ⋯, m(xn)]
- K_{ij} = k(xi, xj)
- m(·) = 0
- Cov[f(xi), f(xj)] = k(xi, xj)
- l is the length scale parameter
- σ_f^2 is the signal variance parameter
The gp spline combines the flexibility of Gaussian processes with the smoothness of splines. This is done by modeling each spline piece as a separate gp segment and combining the segments to form a more flexible model.
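Given the definitions above, a hedged reconstruction of the gp prior follows, with the squared exponential covariance that a length scale l and signal variance σ_f^2 conventionally parameterize (the specific kernel is an assumption here):

```latex
f \sim \mathcal{GP}\bigl( m(x),\, k(x, x') \bigr),
\qquad
\bigl[ f(x_{1}), \ldots, f(x_{n}) \bigr]^{\top} \sim \mathcal{N}(\mu, K),
\qquad
k(x_{i}, x_{j}) = \sigma_{f}^{2}
  \exp\!\left( -\frac{(x_{i} - x_{j})^{2}}{2 l^{2}} \right)
```

With m(·) = 0, the prior mean vector μ is zero and the fit is driven entirely by the covariance structure K.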
3. Model Selection
Assessing models in regression analysis is crucial for comparing their effectiveness. Once different models are fitted to the data, it becomes essential to thoroughly evaluate their overall fit and the quality of that fit. cr, ps, tp, and gp regression models were applied to the simulated data, where the dependent variable was “y” and the independent variable “x.” In each simulation scenario, the models’ performance was evaluated using AIC, deviance, and GCV to assess their effectiveness and suitability. These model selection techniques were chosen because the AIC helps in balancing model fit and complexity [20], the deviance provides a direct measure of model fit [9], and the GCV helps find a model that balances accuracy and simplicity, making it reliable for future predictions [21].
3.1. Model AIC
- Lmodel is the likelihood of the GAM model
- k is the number of estimated parameters in the model

These quantities define the criterion sketched below.
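The formula itself was lost in extraction; its standard form, which these two quantities determine, is:

```latex
\mathrm{AIC} \;=\; 2k \;-\; 2 \log\bigl( L_{\mathrm{model}} \bigr)
```

Lower values indicate a better balance between goodness of fit and model complexity.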
3.2. Model Deviance
- LNULL is the log-likelihood of a null model (a model with no predictors)
- LMODEL is the log-likelihood of the GAM model

These quantities enter the deviance comparison sketched below.
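The formula did not survive extraction. Assuming the likelihood ratio form that a null-model baseline implies, a hedged reconstruction is:

```latex
D \;=\; 2\,\bigl( L_{\mathrm{MODEL}} \;-\; L_{\mathrm{NULL}} \bigr)
```

Note that GAM software such as mgcv instead reports the residual deviance via deviance() (for Gaussian models, the residual sum of squares), where lower values indicate a closer fit; the tables in this study are consistent with that lower-is-better convention.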
3.3. Model GCV
- n is the number of observations
- tr(S) is the trace of the smoothing matrix S
- RSS is the residual sum of squares

These quantities define the GCV score sketched below.
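With these definitions, the GCV score takes its standard form:

```latex
\mathrm{GCV} \;=\; \frac{n \cdot \mathrm{RSS}}{\bigl[\, n - \mathrm{tr}(S) \,\bigr]^{2}}
```

The denominator discounts the effective degrees of freedom tr(S), so models that spend many degrees of freedom to reduce the RSS are penalized.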
4. Simulation
- The values of the independent variable were generated as a sequence of equally spaced values ranging from 0 to 1. The values of the dependent variable were generated based on the sine function sin(2πx), which produces a periodic wave pattern. To simulate the presence of noise, random fluctuations were added to the dependent variable.
- These fluctuations were generated from the Gaussian distribution with a mean of 0 and a standard deviation of 0.1, from the exponential distribution with a rate of 1, and from the lognormal distribution with a mean log of 0 and a standard deviation log of 1.
- For correlated errors, a sequence of correlated noise was generated using an autoregressive process with an autoregressive coefficient of 0.8 and a standard deviation of 0.1; for heteroscedastic errors, the random fluctuations were generated from the Gaussian distribution with mean 0 and standard deviation 0.1x², resulting in heteroscedastic noise.
- To introduce outliers, 20% of the dependent variable values were randomly chosen as maximum vertical outliers. To simulate the presence of both noise and outliers, random fluctuations (Gaussian, exponential, lognormal, heteroscedastic, and correlated noise) and 20% outliers were added to the response variable for each generated dataset.
- To investigate the models’ behavior, 1000 iterations were simulated across sample sizes of 50, 100, 200, 300, 400, and 500, recording average AIC, deviance, and GCV values for all sample sizes in each model. A fixed number of knots (k = 25) was employed consistently when fitting the GAMs across all scenarios; a minimal R sketch of one scenario follows this list.
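The sketch below covers one scenario (Gaussian noise) and assumes the mgcv package; the object names, the outlier rule in the comments, and the exact bookkeeping are illustrative rather than the authors’ code:

```r
library(mgcv)

set.seed(1)
n_sim <- 1000                     # iterations per scenario
n     <- 100                      # one of 50, 100, 200, 300, 400, 500
bases <- c("cr", "ps", "tp", "gp")

# running totals of AIC, deviance, and GCV for each smoother
totals <- matrix(0, nrow = length(bases), ncol = 3,
                 dimnames = list(bases, c("AIC", "Dev", "GCV")))

for (iter in seq_len(n_sim)) {
  x <- seq(0, 1, length.out = n)
  y <- sin(2 * pi * x) + rnorm(n, mean = 0, sd = 0.1)  # Gaussian noise
  # other scenarios swap in analogously, e.g.:
  #   rexp(n, rate = 1)                        # exponential
  #   rlnorm(n, meanlog = 0, sdlog = 1)        # lognormal
  #   arima.sim(list(ar = 0.8), n, sd = 0.1)   # correlated, AR(1)
  #   rnorm(n, 0, 0.1 * x^2)                   # heteroscedastic
  #   y[sample(n, 0.2 * n)] <- max(y)          # one reading of
  #                                            # "maximum vertical outliers"
  for (b in bases) {
    fit <- gam(y ~ s(x, bs = b, k = 25))       # fixed k = 25
    totals[b, ] <- totals[b, ] + c(AIC(fit), deviance(fit), fit$gcv.ubre)
  }
}
totals / n_sim   # average AIC, deviance, and GCV per smoother
```

fit$gcv.ubre holds the GCV score under mgcv’s default smoothing parameter selection; with other selection methods it holds the corresponding criterion instead.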
5. Empirical Study
In the empirical study, we used data from an anthropometric study of 892 females under 50 years in three Gambian villages in West Africa, a data frame with 892 observations on the following three variables: age (age of respondents), lntriceps (log of the triceps skinfold thickness), and triceps (triceps skinfold thickness). For the purposes of our study, we extracted age as our dependent variable and triceps as our independent variable.
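A minimal sketch of the empirical fits, assuming the triceps data frame from the MultiKink package and the variable roles stated above (age as response, triceps as predictor); carrying k = 25 over from the simulations is an assumption:

```r
library(mgcv)
library(MultiKink)

data("triceps")   # 892 observations: age, lntriceps, triceps

bases <- c("cr", "ps", "tp", "gp")
fits <- lapply(bases, function(b)
  gam(age ~ s(triceps, bs = b, k = 25), data = triceps))
names(fits) <- bases

# AIC, deviance, and GCV for each smoother, as in Table 3
sapply(fits, function(fit)
  c(AIC = AIC(fit), Dev = deviance(fit), GCV = fit$gcv.ubre))
```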
6. Results
6.1. Simulation Study
6.1.1. Performance of the Smoothing Models in the Presence of Different Noises
Table 1 shows the average AIC, deviance, and GCV values for different smoothing models applied across varying sample sizes for each type of noise. In the presence of Gaussian noise, the cr model exhibits the lowest AIC values for sample sizes of 50 and 100, while the ps model shows the lowest AIC values for sample sizes ranging from 200 to 500. Regarding deviance, the cr model consistently has the lowest values across all sample sizes. For GCV, the gp model has the lowest value for a sample size of 50, and the ps model shows the lowest GCV values for sample sizes from 100 to 500.
n | cr AIC | cr Dev | cr GCV | ps AIC | ps Dev | ps GCV | tp AIC | tp Dev | tp GCV | gp AIC | gp Dev | gp GCV
---|---|---|---|---|---|---|---|---|---|---|---|---
Gaussian noise | ||||||||||||
50 | 93.90621 | 14.44691 | 0.3890253 | 94.14079 | 14.69773 | 0.3882821 | 94.55149 | 15.05676 | 0.3886149 | 94.1508 | 14.67584 | 0.3882202 |
100 | 186.80599 | 32.76760 | 0.3781714 | 186.81444 | 32.83055 | 0.3781079 | 186.93655 | 33.08236 | 0.3781899 | 186.8825 | 32.88542 | 0.3782136 |
200 | 369.72954 | 68.95374 | 0.3704007 | 369.70498 | 69.04901 | 0.3703276 | 369.76360 | 69.13356 | 0.3704022 | 369.8064 | 69.03241 | 0.3705073 |
300 | 551.71050 | 104.87276 | 0.3672722 | 551.65768 | 105.01119 | 0.3671869 | 551.70564 | 105.02585 | 0.3672412 | 551.7817 | 104.96504 | 0.3673434 |
400 | 731.73180 | 140.15956 | 0.3638832 | 731.64478 | 140.25801 | 0.3637970 | 731.73285 | 140.28544 | 0.3638730 | 731.7972 | 140.21116 | 0.3639373 |
500 | 915.71571 | 176.87235 | 0.3648749 | 915.64695 | 176.94916 | 0.3648214 | 915.73794 | 177.04573 | 0.3648818 | 915.8034 | 176.90096 | 0.3649363 |
Exponential noise | ||||||||||||
50 | 319.6493 | 1541.939 | 36.90835 | 319.8258 | 1550.800 | 36.92509 | 320.0093 | 1596.270 | 36.76120 | 319.8061 | 1556.810 | 36.85055 |
100 | 641.5452 | 3360.978 | 36.60641 | 641.5665 | 3368.524 | 36.60132 | 641.5513 | 3413.017 | 36.53660 | 641.5693 | 3376.194 | 36.58600 |
200 | 1284.0189 | 6961.620 | 36.28710 | 1284.0321 | 6963.671 | 36.28869 | 1283.9462 | 6997.981 | 36.26228 | 1283.9719 | 6970.786 | 36.27413 |
300 | 1927.5254 | 10,615.351 | 36.38911 | 1927.4890 | 10,618.675 | 36.38393 | 1927.4236 | 10,655.292 | 36.37128 | 1927.4673 | 10,620.715 | 36.38121 |
400 | 2569.1207 | 14,172.382 | 36.23391 | 2569.1669 | 14,174.472 | 36.23756 | 2569.0406 | 14,215.202 | 36.22311 | 2569.0665 | 14,184.858 | 36.22755 |
500 | 3209.4782 | 17,700.128 | 36.04654 | 3209.4911 | 17,707.708 | 36.04666 | 3209.3939 | 17,738.320 | 36.03869 | 3209.4334 | 17,711.490 | 36.04224 |
Lognormal noise | ||||||||||||
50 | 384.4017 | 7150.621 | 168.2466 | 384.3376 | 7240.909 | 167.6750 | 384.7875 | 7356.442 | 167.7671 | 384.4797 | 7263.681 | 167.6115 |
100 | 779.8758 | 16,429.065 | 177.2600 | 779.8747 | 16,433.674 | 177.0760 | 779.9213 | 16,629.556 | 177.0538 | 779.9200 | 16,493.364 | 177.1642 |
200 | 1569.2552 | 32,154.810 | 167.0437 | 1569.1530 | 32,118.925 | 166.9821 | 1569.1911 | 32,333.629 | 166.9545 | 1569.2190 | 32,189.221 | 167.0100 |
300 | 2361.9916 | 49,284.522 | 168.4242 | 2361.9513 | 49,291.997 | 168.3912 | 2361.8930 | 49,442.418 | 168.3490 | 2361.9263 | 49,332.242 | 168.3787 |
400 | 3156.6207 | 66,352.375 | 169.0108 | 3156.6205 | 66,351.810 | 169.0132 | 3156.5415 | 66,481.631 | 168.9717 | 3156.5977 | 66,374.968 | 169.0020 |
500 | 3959.0341 | 83,602.138 | 169.8227 | 3959.0740 | 83,581.961 | 169.8377 | 3958.9157 | 83,805.733 | 169.7727 | 3958.9908 | 83,649.899 | 169.8046 |
Correlated noise | ||||||||||||
50 | 53.8370 | 3.7386 | 0.2050191 | 58.5327 | 4.3383 | 0.2176801 | 56.6344 | 4.0860 | 0.2123397 | 55.3564 | 3.9727 | 0.2073078 |
100 | 125.4313 | 12.7127 | 0.2141385 | 137.6718 | 15.2028 | 0.2382971 | 133.5194 | 14.1658 | 0.2303664 | 141.4205 | 16.2391 | 0.2457095 |
200 | 365.6562 | 57.1011 | 0.3661364 | 371.1149 | 59.1274 | 0.3758966 | 363.7029 | 56.5237 | 0.3625970 | 366.9008 | 58.2453 | 0.3677800 |
300 | 633.7779 | 123.5519 | 0.4840338 | 627.7072 | 119.9756 | 0.4747138 | 620.8280 | 117.9380 | 0.4637159 | 630.5637 | 123.2383 | 0.4785559 |
400 | 848.3205 | 171.9077 | 0.4876447 | 906.9662 | 200.2215 | 0.5644374 | 856.1566 | 175.3607 | 0.4972824 | 891.7622 | 193.9790 | 0.5431788 |
500 | 1107.0419 | 242.2257 | 0.5350661 | 1125.8548 | 253.0260 | 0.5554178 | 1120.0091 | 248.9138 | 0.5490881 | 1125.2149 | 251.4332 | 0.5548444 |
Heteroscedastic noise | ||||||||||||
50 | 10.0792 | 2.6791 | 0.07831279 | 10.1197 | 2.7320 | 0.07785383 | 11.4557 | 2.8614 | 0.07876284 | 10.7735 | 2.7636 | 0.07842460 |
100 | 22.9451 | 6.2389 | 0.07524141 | 22.6376 | 6.2392 | 0.07498336 | 23.2708 | 6.3525 | 0.07530713 | 23.1298 | 6.2918 | 0.07528758 |
200 | 45.6395 | 13.4907 | 0.07406204 | 45.3220 | 13.5069 | 0.07393325 | 45.7957 | 13.5843 | 0.07408459 | 45.8535 | 13.5477 | 0.07411817 |
300 | 65.4210 | 20.5738 | 0.07310522 | 65.0601 | 20.6037 | 0.07301035 | 65.5530 | 20.6647 | 0.07312095 | 65.6481 | 20.6454 | 0.07314678 |
400 | 86.4178 | 27.7676 | 0.07280445 | 86.1340 | 27.7854 | 0.07275172 | 86.5214 | 27.8327 | 0.07281793 | 86.7035 | 27.8201 | 0.07285208 |
500 | 106.7539 | 34.9374 | 0.07265729 | 106.3950 | 34.9547 | 0.07260265 | 106.8468 | 35.0011 | 0.07266690 | 107.0292 | 34.9983 | 0.07269362 |
- Abbreviations: AIC, Akaike information criterion; cr, cubic regression spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.
For exponential noise, the cr model achieves the lowest AIC values for sample sizes of 50 and 100, whereas the tp model shows the lowest AIC values for sample sizes from 200 to 500. The cr model has the lowest deviance values across all sample sizes, and the tp model consistently records the lowest GCV values regardless of sample size.
In the case of lognormal noise, the ps model yields the lowest AIC values for sample sizes of 50, 100, and 200, while the tp model has the lowest AIC values for sample sizes from 300 to 500. For deviance, the cr model has the lowest values for sample sizes of 50, 100, and 300, while the ps model shows the lowest deviance values for sample sizes of 200, 400, and 500. Regarding GCV, the gp model has the lowest value for a sample size of 50, and the tp model consistently shows the lowest GCV values for sample sizes from 100 to 500.
For correlated noise, the cr model demonstrates the lowest AIC, deviance, and GCV values for sample sizes of 50, 100, 400, and 500. However, for sample sizes of 200 and 300, the tp model exhibits the lowest AIC, deviance, and GCV values.
In the presence of heteroscedastic noise, the cr model records the lowest AIC value for a sample size of 50, while the ps model shows the lowest AIC values for sample sizes from 100 to 500. The cr model consistently has the lowest deviance values across all sample sizes. For GCV, the ps model records the lowest values for all sample sizes. The simulations in this study were performed using R; the corresponding code is provided in the appendix for reproducibility.
6.1.2. Performance of the Smoothing Models in the Presence of Outliers
Table 2 presents the average AIC, deviance, and GCV metrics showcasing the performance of the four smoothing models under outliers. For a sample size of 50, the cr model records the lowest AIC value, and for sample sizes of 100–500, the gp model records the lowest AIC values. The cr model records the lowest deviance values for all sample sizes. For GCV, the tp model records the lowest values for sample sizes of 50 and 100, while the gp model records the lowest values for sample sizes of 200–500.
n | cr AIC | cr Dev | cr GCV | ps AIC | ps Dev | ps GCV | tp AIC | tp Dev | tp GCV | gp AIC | gp Dev | gp GCV
---|---|---|---|---|---|---|---|---|---|---|---|---
50 | 221.7973 | 201.8335 | 4.911279 | 221.8378 | 202.9597 | 4.903755 | 222.1760 | 208.6935 | 4.893145 | 221.8933 | 204.3111 | 4.894905 |
100 | 441.4750 | 432.2088 | 4.787199 | 441.5084 | 432.5889 | 4.788514 | 441.5425 | 437.4634 | 4.783731 | 441.4503 | 433.3559 | 4.784178 |
200 | 877.2997 | 884.4995 | 4.674621 | 877.2997 | 884.7719 | 4.674582 | 877.3502 | 888.5398 | 4.674604 | 877.2779 | 885.6246 | 4.673631 |
300 | 1312.5408 | 1335.6427 | 4.630237 | 1312.5522 | 1336.4536 | 4.630282 | 1312.5607 | 1339.6206 | 4.629994 | 1312.5080 | 1336.5940 | 4.629525 |
400 | 1748.9729 | 1793.9210 | 4.623047 | 1748.9572 | 1794.7829 | 4.622803 | 1748.9688 | 1797.8367 | 4.622700 | 1748.9442 | 1795.2525 | 4.622575 |
500 | 2183.8988 | 2246.4415 | 4.604702 | 2183.8796 | 2246.9413 | 4.604503 | 2183.9102 | 2249.8641 | 4.604644 | 2183.8733 | 2247.1608 | 4.604411 |
- Abbreviations: AIC, Akaike information criterion; cr, cubic regression spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.
6.2. Empirical Study
Table 3 presents the performance of the models using the empirical data. Based on the results, the gp model has the lowest AIC and GCV values, while the tp model has the lowest deviance.
cr AIC | cr Dev | cr GCV | ps AIC | ps Dev | ps GCV | tp AIC | tp Dev | tp GCV | gp AIC | gp Dev | gp GCV
---|---|---|---|---|---|---|---|---|---|---
4898.589 | 12,397.56 | 14.17771 | 4897.559 | 12,359.89 | 14.16162 | 4897.047 | 12,170.46 | 14.15661 | 4897.014 | 12,182.06 | 14.15584 |
- Abbreviations: AIC, Akaike information criterion; cr, cubic regression spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.
7. Discussion
In this study, we aimed to evaluate the performance of semiparametric models in the presence of Gaussian, exponential, lognormal, correlated, and heteroscedastic noise and outliers within the framework of GAMs through simulation. Generally, our analysis shows that the cr model performs strongly in terms of deviance across all noise types and most sample sizes. The ps model often performs well with larger sample sizes, especially in terms of AIC and GCV under Gaussian and heteroscedastic noise. The gp model excels with the smallest sample size under Gaussian and lognormal noise in terms of GCV, and the tp model frequently performs best under exponential and lognormal noise for larger samples in terms of AIC and GCV. For data with outliers, the cr and tp models perform well with smaller sample sizes, while the gp model excels with larger sample sizes based on AIC and GCV; in terms of deviance, the cr model performs best across all sample sizes. Our findings reveal that the performance of smoothing models is highly dependent on the type of noise present in the data and on the sample size, and no single model consistently outperforms the others across all noise types and sample sizes, which implies that model selection should be tailored to the specific objective of a study. In the empirical study, the gp model has the lowest AIC and GCV values, while the tp model has the lowest deviance.
While this study provides valuable insights, it has several limitations. It focused on a limited set of smoothing models, and other models not considered here may perform differently under similar conditions; it also considered only two variables. Future research can address these limitations and expand on the current study by including a wider array of smoothing models, to see whether the findings hold or whether other models perform better under certain conditions, and by applying the methodology to more varied datasets to assess the generalizability of the findings.
The methodology employed in this study, which assesses smoothing models across diverse noise types and sample sizes through the metrics of AIC, deviance, and GCV, holds potential for broader application. It can readily be adapted to analyze datasets resembling sine functions, a common occurrence in fields dealing with periodic or cyclical phenomena. Such datasets are prevalent in physics, where oscillatory systems frequently yield data resembling sine functions. Also, engineers encounter sine-like data in various applications, including sound waves, electrical signals, and mechanical vibrations.
8. Conclusion
In conclusion, this study sheds light on the performance of smoothing models under different noise types and sample sizes, employing metrics such as AIC, deviance, and GCV for evaluation. The findings indicate that the effectiveness of smoothing models varies greatly depending on the noise type and sample size. Since no single model consistently excels across all conditions, it is essential to choose models based on the specific goals of research. By showcasing the variability in model performance across various noise types and sample sizes, this research provides valuable insights for practitioners and researchers in fields where smoothing techniques are applied. The study highlights the importance of considering noise characteristics and sample size when selecting the most suitable smoothing model. However, it is important to acknowledge the study’s limitations. Future research should aim at addressing these limitations by exploring a broader range of models and datasets to enhance the generalizability and applicability of the findings.
Nomenclature
- GLM: generalized linear model
- GAM: generalized additive model
- OLS: ordinary least squares
- GCV: generalized cross validation
- REML: restricted maximum likelihood
- RSS: residual sum of squares
- AIC: Akaike information criterion
- Dev: deviance
- cr: cubic regression spline
- ps: P-spline
- tp: thin plate spline
- gp: Gaussian process
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Daniel Edinam Wormenor was responsible for the conceptualization, methodology, numerical simulations, editing, and analysis of results. Sampson Twumasi-Ankrah was responsible for supervision, review, and editing. Accam Burnett Tetteh was responsible for supervision, review, numerical simulations, editing, and analysis of results.
Funding
The authors did not receive support from any organization for the submitted work.
Appendix
The simulations in this study were performed using R Version 4.3.0. The corresponding codes are provided as an electronic appendix for reproducibility (https://github.com/Daniel-Edinam/Codes.git).
Open Research
Data Availability Statement
This study utilized the Triceps dataset from the MultiKink package in R for the empirical analysis. The dataset is publicly available and can be accessed by loading the MultiKink package. Additionally, the R code used to generate the simulated data is provided in the appendix.