Volume 2025, Issue 1 3904251
Research Article
Open Access

Comparison of Semiparametric Models in the Presence of Noise and Outliers

Daniel Edinam Wormenor (Corresponding Author), Sampson Twumasi-Ankrah, and Accam Burnett Tetteh

Department of Statistics and Actuarial Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
First published: 14 February 2025
Academic Editor: Tudor Barbu

Abstract

Various studies have examined generalized additive models (GAMs), comparing thin plate splines (tp), P-splines (ps), cubic regression splines (cr), and Gaussian processes (gp) for discrete choice data, function approximation, and in the presence of multicollinearity and outliers. Some studies have applied ps to models with correlated and heteroscedastic errors, while others have reviewed multiple smoothing term packages for modeling GAMs. This study examines, through simulation, the performance of semiparametric models in the presence of different types of noise and outliers within the framework of GAMs. Four GAM smoothers, cr, ps, tp, and gp, were fitted to data simulated with different types of noise and outliers across varying sample sizes. According to our investigation, the cr model performs well in terms of deviance for the majority of sample sizes and all types of noise. With larger sample sizes, the ps model frequently performs well, particularly in terms of AIC and GCV under heteroscedastic and Gaussian noise. The gp model excels at the smallest sample size under Gaussian and lognormal noise in terms of GCV, and the tp model frequently performs best under exponential and lognormal noise for larger samples in terms of AIC and GCV. For data containing outliers, the cr and tp models are effective with smaller sample sizes, while the gp model excels with larger sample sizes based on AIC and GCV. Regarding deviance, the cr model consistently performs best across all sample sizes. Our results show that the sample size and the type of noise in the data have a significant impact on a smoothing model’s performance. No single model consistently outperforms the others for all noise types and sample sizes, suggesting that the choice of model should be based on the specific goal of a study.

1. Introduction

The method of ordinary least squares (OLS) is a statistical approach commonly used for parameter estimation, mainly due to its historical popularity and straightforward computation [1]. It is highly regarded for its ability to produce unbiased and efficient parameter estimates, provided its key foundational assumptions are fulfilled. These assumptions include normally distributed error terms, equal variance of the error terms, and the absence of outliers, leverage points, multicollinearity, and deviations from linearity [2]. The assumption of linearity is a crucial prerequisite; however, attaining linearity in practical scenarios can be challenging for a multitude of reasons. Statistical models are inevitably simplifications: even the most intricate model can only offer a limited representation of reality. Consequently, the task of estimating statistical models may appear futile at first glance.

However, Box [3] succinctly captured the essence of statistical research when he stated that “all models are wrong, but some are useful.” Despite their inherent inaccuracies, statistical models afford us significant insights into the political, economic, and sociological dimensions of our world. Statistical models are inherently simplifications of reality, as they necessitate making assumptions about various aspects of the real world. The practice of statistics acknowledges the need for assumptions, and a significant part of statistical modeling involves conducting diagnostic tests to ensure that these assumptions are met. In the social sciences, considerable effort is dedicated to scrutinizing assumptions concerning the nature of the error term, such as heteroskedasticity or serial correlation; however, researchers often tend to be less rigorous in testing assumptions regarding the functional form of the model [4]. The predominant choice is the linear functional form, typically characterized by additivity, yet researchers often overlook the crucial task of validating the linearity assumption. While substantial attention is directed towards specification and avoiding misspecification, there is a noticeable dearth of exploration regarding alternative functional forms when the chosen form proves to be incorrect, which essentially results in a specification error [4].

Regression analysis, a widely used statistical tool for establishing linear relationships between variables or investigating dependence, has found applications in numerous fields of study such as social science, health science, engineering, and physical science. For instance, banks utilize regression analysis to determine factors that positively or negatively impact their profits, while medical doctors employ regression to assess total body fat in patients by considering variables that influence its increase or decrease. Similarly, statisticians in hospitals use regression analysis to examine lifestyle factors that may be causative agents for certain conditions, such as high blood pressure [2]. Hence, regression analysis holds significant importance as a statistical tool. It is used to answer questions such as: Does increasing class size affect students’ success? Can the duration of the most recent eruption of Old Faithful Geyser be used to predict the time of the next eruption? Do dietary changes affect cholesterol levels, and if so, do the effects depend on other factors like age, sex, and level of exercise? Do nations with higher incomes per person have lower birth rates than nations with lower incomes? Many research efforts focus on regression analysis as a key component [5].

The use of linear functional forms is more widespread than commonly believed. Many analysts concentrate largely on the linearity of the outcome variable within a statistical model. Analysts typically use least squares estimation of a linear regression model when the outcome variable is continuous, whereas generalized linear models (GLMs), such as logistic or Poisson regression, are often used for discrete outcomes. Researchers frequently think they have abandoned the linear functional form by using logistic or Poisson regression models; these models nevertheless retain a crucial linear property in their functional form [4]. The introduction of the GLM notation by [6] offers valuable clarity regarding the linearity assumption in models that are commonly perceived as nonlinear by many researchers.

Hastie and Tibshirani [7] proposed the category of generalized additive models (GAMs), which represents a broadened version of GLMs. Semiparametric regression models encompass both additive models and GAMs [8]. They combine the use of local estimation models like lowess and splines with standard linear models and GLMs. These models provide analysts with the flexibility to incorporate nonparametric regression for certain predictor variables while estimating other predictors parametrically [4]. While GAMs relax the assumption of a global fit, they still maintain the assumption of additive effects. The additivity assumption in GAMs enhances interpretability compared to other methods such as neural networks, support vector machines, projection pursuit, and tree-based approaches. A crucial advantage of GAMs is their ability to diagnose nonlinearity. Within the GAM framework, simple linear and power transformations are nested, allowing local estimates from a GAM to be compared against linear, quadratic, or other transformations using statistical tests such as the F-test or likelihood ratio test [4]. Semiparametric models offer more flexibility than parametric models because they do not require a strict assumption about the data distribution. Asfha [9] assessed and compared the performance of GAMs, focusing on their tolerance to the effects of outliers, multicollinearity, and the combined impact of both. The study by [10], titled “Selection of Splines Models in the Presence of Multicollinearity and Outliers,” aimed at addressing challenges related to multicollinearity and outliers, focusing on the evaluation of cubic regression splines (cr), P-splines (ps), and thin plate splines (tp) in handling these challenges.

In this study, our objective is to perform simulation experiments to evaluate and analyze the effectiveness of semiparametric models when subjected to various types of noise and outliers within the context of GAMs. Noise refers to random variations in the data that arise from measurement errors, environmental conditions, or inherent randomness in the data collection process; these variations are assumed to follow a probabilistic distribution, typically normal, and do not systematically bias the dataset. Outliers, by contrast, are observations that deviate significantly from the majority of the data and do not conform to the expected distribution or pattern. They may be indicative of data entry errors, rare events, or underlying phenomena not captured by the main data model.

2. Methods and Materials

Smoothing is a straightforward method for fitting a curve to the distribution of data points in a scatter plot, and it serves as a foundational cornerstone of GAMs. In this study, four smoothing techniques based on GAMs are adopted, namely, cr, ps, tp, and Gaussian process (gp).

2.1. cr

Given:
(1) yi = f(xi) + εi,  i = 1, ⋯, n
where
(2) εi ~ N(0, σ²)
In Equation (3), the objective is to minimize the sum of squares between the observed values y and the nonparametric estimate f(x), while also incorporating a constraint known as a roughness penalty. This roughness penalty consists of two components. The first component is denoted by λ, which is commonly referred to as the smoothing or tuning parameter. The second component of the roughness penalty involves calculating the integrated squared second derivative of the function f(x):
(3) Σi (yi − f(xi))² + λ ∫ (f″(x))² dx
where
  • λ ≥ 0 is a smoothing parameter, controlling the trade-off between fidelity to the data and roughness of the function estimate. It is often estimated by generalized cross validation (GCV) or by restricted maximum likelihood (REML) [4, 11].

  • As λ⟶0, the curve fitting algorithm will not impose any penalty on the model for fitting the data too closely. This can result in a very accurate fit to the data, but it can also lead to a very noisy curve, as it will follow every minor fluctuation in the data [4, 11].

  • As λ⟶∞, the roughness penalty becomes paramount and the estimate converges to the linear least squares estimate [4, 11].

  • The second component of the roughness penalty measures the curvature or roughness of the estimate throughout the range of x values [4].

By exploiting the equivalence between spline models and linear regression models, we can represent the first term in Equation (3) as a linear regression model in matrix form. Additionally, it has been demonstrated that the penalty term from Equation (3) can be expressed as a quadratic form in β [12]. As a result, the penalty can be represented in matrix form as follows:
(4) λ ∫ (f″(x))² dx = λβᵀDβ
where D is a matrix of the following form:
(5) D = diag(02×2, Ik×k)
where
  • k denotes the number of knots.

  • 02×2 is the 2 × 2 matrix of zeros.

  • Ik×k is the k × k identity matrix.

By expressing the penalty in matrix form, we can formulate the penalized spline regression model in a concise matrix representation as
(6) minβ ‖y − Xβ‖² + λβᵀDβ, where X is the design matrix of the spline basis
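As an illustration, the following is a minimal R sketch (not the study’s appendix code) of fitting a cubic regression spline with the mgcv package, with the smoothing parameter λ estimated by GCV (the package default) or by REML; the simulated data are an assumed stand-in.

```r
# Minimal sketch: fit a cubic regression spline with mgcv and let the
# software estimate the smoothing parameter lambda (toy data assumed).
library(mgcv)

set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

fit_gcv  <- gam(y ~ s(x, bs = "cr", k = 25))                  # lambda via GCV
fit_reml <- gam(y ~ s(x, bs = "cr", k = 25), method = "REML") # lambda via REML

fit_gcv$sp  # estimated smoothing parameter(s)
```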

2.2. ps

Given:
(7) yi = f(xi) + εi
where
(8) f(t) = Σi Pi βi,k(t), i = 0, ⋯, n
where (P0, P1, ⋯, Pn) are control points and βi,k(t) are the basis functions defined using the Cox-de Boor recursion formula [13]:
(9) βi,0(t) = 1 if ti ≤ t < ti+1, and 0 otherwise
(10) βi,k(t) = ((t − ti)/(ti+k − ti)) βi,k−1(t) + ((ti+k+1 − t)/(ti+k+1 − ti+1)) βi+1,k−1(t)
where
(11)
(12)
(13)
(14)
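The recursion translates directly into code. Below is a minimal R sketch of the Cox-de Boor recursion; the function name is an illustrative assumption, and terms with zero-width knot intervals are dropped by convention.

```r
# Sketch of the Cox-de Boor recursion (hypothetical helper, 1-based indexing).
# t: knot vector; i: basis index; k: degree; x: evaluation point(s).
coxdeboor <- function(i, k, x, t) {
  if (k == 0) {
    return(as.numeric(t[i] <= x & x < t[i + 1]))
  }
  d1 <- t[i + k] - t[i]
  d2 <- t[i + k + 1] - t[i + 1]
  w1 <- if (d1 > 0) (x - t[i]) / d1 else 0          # convention: 0/0 = 0
  w2 <- if (d2 > 0) (t[i + k + 1] - x) / d2 else 0
  w1 * coxdeboor(i, k - 1, x, t) + w2 * coxdeboor(i + 1, k - 1, x, t)
}

# Evaluate a cubic (k = 3) basis function on evenly spaced knots.
knots <- seq(0, 1, length.out = 10)
coxdeboor(3, 3, 0.5, knots)
```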
ps, introduced by [14], are B-splines with an incorporated penalty, designed to provide greater stability, especially in situations where lower rank smoothing is required, given by
(15) f(x) = Σi βi Bi(x)
They are usually defined on evenly spaced knots, with a difference penalty applied directly to the parameters, βi, to control function wiggliness. The penalty is as follows if we choose to penalize the squared difference between adjacent βi values:
(16) P = Σi (βi+1 − βi)², summing over i = 1, ⋯, k − 1
(17)
(18)
In matrix form, it can be written as
(19) P = ‖Pβ‖² = βᵀPᵀPβ
(20)
where
  • P is a first-order finite difference matrix.
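In R, the first-order difference matrix and the implied quadratic penalty can be built in a few lines; this is a sketch under the assumption of k coefficients, not code from the study’s appendix.

```r
# Sketch: first-order difference penalty for k B-spline coefficients.
k <- 10
P <- diff(diag(k), differences = 1)  # (k - 1) x k finite difference matrix
S <- t(P) %*% P                      # penalty matrix

beta <- rnorm(k)
all.equal(drop(t(beta) %*% S %*% beta), sum(diff(beta)^2))  # TRUE
```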

2.3. tp

tp are used to estimate smooth functions of multiple predictor variables from noisy observations of the function at particular values of those predictors [15]. Consider:
(21) yi = f(xi) + εi
where εi represents a random error term and xi is a vector with d dimensions.
tp smoothing estimates the function by seeking an f that minimizes Equation (22), given as
(22) ‖y − f‖² + λJmd(f), with f = (f(x1), ⋯, f(xn))ᵀ
where
  • Jmd(f) is the penalty function which measures the wiggliness of the smoother f.

  • λ is the smoothing parameter.

  • d is the number of predictors in the smooth function.

  • m = floor((d + 1)/2) + 1.

The wiggliness penalty is defined as
(23) Jmd(f) = ∫ ⋯ ∫ Σ (m!/(γ1! ⋯ γd!)) (∂^m f/(∂x1^γ1 ⋯ ∂xd^γd))² dx1 ⋯ dxd
where
  • Σ denotes the sum over all possible combinations of nonnegative integer values γ1, γ2, ⋯, γd that satisfy γ1 + γ2 + ⋯ + γd = m.

  • m!/(γ1! ⋯ γd!) is the multinomial coefficient, representing the number of ways to partition m identical objects into d distinct groups, where group i contains γi objects.

  • ∂^m f/(∂x1^γ1 ⋯ ∂xd^γd) is the mth-order partial derivative of the function f with respect to each of the d variables, where the orders of differentiation are determined by γ1, γ2, ⋯, γd.

For the smoothing of two predictors (d = 2), m = 2 and the measure of wiggliness is written as
(24) J22(f) = ∫∫ [(∂²f/∂x1²)² + 2(∂²f/∂x1∂x2)² + (∂²f/∂x2²)²] dx1 dx2
(25) f(x) = Σi δi ηmd(‖x − xi‖) + Σj αj ϕj(x)
(26)
where
  • δ and α are unknown parameter vectors to be estimated, in which δ is subject to the linear constraint Tᵀδ = 0, where Tij = ϕj(xi).

  • ϕj is the basis function.

  • M = (m + d − 1)!/(d!(m − 1)!).

For d = 2 and m = 2, M = 3, and the basis of size 3 is given by ϕ1(x) = 1, ϕ2(x) = x1, and ϕ3(x) = x2. Furthermore,
(27) ηmd(‖ri‖) = (1/(8π)) ‖ri‖² log(‖ri‖)
where ‖ri‖ = ‖x − xi‖.
Let the matrix E be defined by Eij = ηmd(‖xi − xj‖). The objective of the tp fitting problem is to minimize the expression:
(28) ‖y − Eδ − Tα‖² + λδᵀEδ
subject to the constraint Tᵀδ = 0, where δ and α are the variables to be determined. Note that tp can be used with any number of predictors [9].

2.4. gp

A gp is a stochastic process that defines a distribution over functions [16]. In other words, it allows us to model a set of functions, where each function is treated as a random variable. A Gaussian distribution is specified by a mean vector μ and a covariance matrix Σ. A gp model is given as
(29) f(x) ~ GP(m(x), k(x, x′))
(30) f ~ N(μ, K)
where
  • x = [x1, ⋯, xn]

  • f = [f(x1), ⋯, f(xn)]

  • μ = [m(x1), ⋯, m(xn)]

  • Kij = k(xi, xj)

x denotes the observed data points. A gp is specified by a mean function m(·) and a covariance function k(·, ·), where
  • m(·) = 0

  • Cov[f(xi), f(xj)] = k(xi, xj)

The most commonly used covariance function is the squared exponential kernel [17, 18] given as
(31) k(xi, xj) = σf² exp(−(xi − xj)²/(2l²))
where
  • l is the length scale parameter

  • σf² is the signal variance parameter

l scales the distance between the x values: a larger l yields a smoother function, while a smaller l yields a wigglier one. Larger σf² values allow for more variation, and smaller values allow for less. l and σf² are commonly called hyperparameters. The optimal hyperparameters Θ are determined by maximizing the log marginal likelihood [19], given as
(32) log p(y | x, Θ) = −(1/2) yᵀK⁻¹y − (1/2) log|K| − (n/2) log(2π)
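A short R sketch of the squared exponential kernel follows; the function name and hyperparameter defaults are illustrative assumptions.

```r
# Sketch: squared exponential kernel. A larger l gives smoother functions;
# a larger sigma_f allows more variation around the mean.
se_kernel <- function(x1, x2, l = 1, sigma_f = 1) {
  sigma_f^2 * exp(-outer(x1, x2, "-")^2 / (2 * l^2))
}

x <- seq(0, 1, length.out = 5)
K <- se_kernel(x, x, l = 0.3)  # 5 x 5 covariance matrix over the inputs
```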

gp spline combines the flexibility of gp with the smoothness of splines. This is done by modeling each spline piece as a separate gp segment and combining them to form a more flexible model.

3. Model Selection

Assessing models in regression analysis is crucial for comparing their effectiveness. Once different models are fitted to the data, it becomes essential to thoroughly evaluate their overall fit and the quality of that fit. cr, ps, tp, and gp regression models were applied to the simulated data, where the dependent variable was “y” and the independent variable “x.” In each simulation scenario, the models’ performance was evaluated using AIC, deviance, and GCV to assess their effectiveness and suitability. These model selection techniques were chosen because the AIC helps in balancing model fit and complexity [20], the deviance provides a direct measure of model fit [9], and the GCV helps find a model that balances accuracy and simplicity, making it reliable for future predictions [21].

3.1. Model AIC

The AIC [20] is a statistical measure used for model selection. It quantifies the trade-off between the goodness of fit of a model and its complexity, penalizing models with more parameters. The AIC is calculated using the following mathematical formulation:
(33) AIC = 2k − 2 log(Lmodel)
where
  • Lmodel is the likelihood of the GAM model

  • k is the number of estimated parameters in the model

3.2. Model Deviance

Model deviance is calculated as negative twice the difference in log-likelihood between the null model and the full model, which represents the model of interest [9], given by
(34) D = −2(LNULL − LMODEL)
where
  • LNULL is the log-likelihood of a null model (a model with no predictors)

  • LMODEL is the log-likelihood of the GAM model

3.3. Model GCV

GCV, as proposed by [21], is given as
(35) GCV = n · RSS/(n − tr(S))²
where
  • n is the number of observations

  • tr(S) is the trace of the smoothing matrix S

  • RSS is the residual sum of squares
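For concreteness, a minimal R sketch of extracting the three criteria from a fitted mgcv GAM is given below; the toy data are an assumption, and the gcv.ubre slot holds the GCV score under the package’s default GCV.Cp smoothness selection.

```r
# Sketch: extracting AIC, deviance, and GCV from a fitted mgcv GAM.
library(mgcv)

set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

fit <- gam(y ~ s(x, bs = "cr", k = 25))  # default method = "GCV.Cp"

AIC(fit)       # Akaike information criterion
deviance(fit)  # model deviance
fit$gcv.ubre   # GCV score
```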

4. Simulation

In this segment, a simulation study is conducted to demonstrate how the four GAMs, namely, cr, ps, tp, and gp, are fitted to datasets containing errors from Gaussian, exponential, and lognormal distributions. Additionally, the study explores scenarios involving correlated and heteroscedastic noise, along with the presence of outliers, to evaluate the models’ effectiveness. The simulation was set up as follows.
  • The values of the independent variable were generated as a sequence of equally spaced values ranging from 0 to 1. The values of the dependent variable were generated based on a sine function sin(2πx), which produces a periodic wave pattern. To simulate the presence of noise, random fluctuations were added to the dependent variable.

  • These fluctuations were generated from the Gaussian distributions with mean of 0 and a standard deviation of 0.1, from the exponential distribution with a rate of 1, and the lognormal distribution with a mean log of 0 and a standard deviation log of 1.

  • For correlated errors, a sequence of correlated noise was generated using an autoregressive process with an autoregressive coefficient of 0.8 and a standard deviation of 0.1; for heteroscedastic errors, the random fluctuations were generated from the Gaussian distribution with mean 0 and standard deviation 0.1x², resulting in heteroscedastic noise.

  • To introduce outliers, 20% of the dependent variable values were randomly chosen as maximum vertical outliers. To simulate the presence of both noise and outliers, random fluctuations (Gaussian, exponential, lognormal, heteroscedastic, and correlated noise) and 20% of outliers were added to the response variable for each data generated.

  • To investigate the models’ behavior, 1000 iterations were simulated across sample sizes of 50, 100, 200, 300, 400, and 500, recording the average AIC, deviance, and GCV values for each model at every sample size. A fixed number of knots (k = 25) was employed consistently when fitting the GAMs across all scenarios (see the sketch following this list).
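The sketch illustrates one scenario (Gaussian noise) under the settings above; the helper function name is an assumption, and the other noise types and the outlier scheme follow the same pattern.

```r
# Sketch of one simulation scenario: sine signal plus Gaussian noise, four
# smoothers, k = 25, criteria averaged over 1000 iterations (n = 50 shown).
library(mgcv)

set.seed(123)
simulate_once <- function(n) {
  x <- seq(0, 1, length.out = n)
  y <- sin(2 * pi * x) + rnorm(n, mean = 0, sd = 0.1)  # Gaussian noise
  sapply(c("cr", "ps", "tp", "gp"), function(b) {
    fit <- gam(y ~ s(x, bs = b, k = 25))
    c(AIC = AIC(fit), Dev = deviance(fit), GCV = unname(fit$gcv.ubre))
  })
}

res <- replicate(1000, simulate_once(50))  # 3 x 4 x 1000 array
apply(res, c(1, 2), mean)                  # average criteria per model
```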

5. Empirical Study

In the empirical study, we used data from an anthropometric study of 892 females under 50 years of age in three Gambian villages in West Africa: a data frame with 892 observations on three variables, namely, age (age of the respondent), lntriceps (log of the triceps skinfold thickness), and triceps (triceps skinfold thickness). For the purposes of our study, we used age as the dependent variable and triceps as the independent variable.
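A minimal R sketch of this comparison is given below, assuming the triceps data are loaded from the MultiKink package (per the Data Availability Statement) and that the same basis dimension as in the simulations is used.

```r
# Sketch of the empirical comparison (assumptions: triceps data from the
# MultiKink package; k = 25 carried over from the simulation settings).
library(mgcv)
data("triceps", package = "MultiKink")

sapply(c("cr", "ps", "tp", "gp"), function(b) {
  fit <- gam(age ~ s(triceps, bs = b, k = 25), data = triceps)
  c(AIC = AIC(fit), Dev = deviance(fit), GCV = unname(fit$gcv.ubre))
})
```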

6. Results

6.1. Simulation Study

6.1.1. Performance of the Smoothing Models in the Presence of Different Noises

Table 1 shows the average AIC, deviance, and GCV values for different smoothing models applied across varying sample sizes for each type of noise. In the presence of Gaussian noise, the cr model exhibits the lowest AIC values for sample sizes of 50 and 100, while the ps model shows the lowest AIC values for sample sizes ranging from 200 to 500. Regarding deviance, the cr model consistently has the lowest values across all sample sizes. For GCV, the gp model has the lowest value for a sample size of 50, and the ps model shows the lowest GCV values for sample sizes from 100 to 500.

Table 1. Simulation results for different types of noise.
n cr ps tp gp
AIC Dev GCV AIC Dev GCV AIC Dev GCV AIC Dev GCV
Gaussian noise
50 93.90621 14.44691 0.3890253 94.14079 14.69773 0.3882821 94.55149 15.05676 0.3886149 94.1508 14.67584 0.3882202
100 186.80599 32.76760 0.3781714 186.81444 32.83055 0.3781079 186.93655 33.08236 0.3781899 186.8825 32.88542 0.3782136
200 369.72954 68.95374 0.3704007 369.70498 69.04901 0.3703276 369.76360 69.13356 0.3704022 369.8064 69.03241 0.3705073
300 551.71050 104.87276 0.3672722 551.65768 105.01119 0.3671869 551.70564 105.02585 0.3672412 551.7817 104.96504 0.3673434
400 731.73180 140.15956 0.3638832 731.64478 140.25801 0.3637970 731.73285 140.28544 0.3638730 731.7972 140.21116 0.3639373
500 915.71571 176.87235 0.3648749 915.64695 176.94916 0.3648214 915.73794 177.04573 0.3648818 915.8034 176.90096 0.3649363
  
Exponential noise
50 319.6493 1541.939 36.90835 319.8258 1550.800 36.92509 320.0093 1596.270 36.76120 319.8061 1556.810 36.85055
100 641.5452 3360.978 36.60641 641.5665 3368.524 36.60132 641.5513 3413.017 36.53660 641.5693 3376.194 36.58600
200 1284.0189 6961.620 36.28710 1284.0321 6963.671 36.28869 1283.9462 6997.981 36.26228 1283.9719 6970.786 36.27413
300 1927.5254 10,615.351 36.38911 1927.4890 10,618.675 36.38393 1927.4236 10,655.292 36.37128 1927.4673 10,620.715 36.38121
400 2569.1207 14,172.382 36.23391 2569.1669 14,174.472 36.23756 2569.0406 14,215.202 36.22311 2569.0665 14,184.858 36.22755
500 3209.4782 17,700.128 36.04654 3209.4911 17,707.708 36.04666 3209.3939 17,738.320 36.03869 3209.4334 17,711.490 36.04224
  
Lognormal noise
50 384.4017 7150.621 168.2466 384.3376 7240.909 167.6750 384.7875 7356.442 167.7671 384.4797 7263.681 167.6115
100 779.8758 16,429.065 177.2600 779.8747 16,433.674 177.0760 779.9213 16,629.556 177.0538 779.9200 16,493.364 177.1642
200 1569.2552 32,154.810 167.0437 1569.1530 32,118.925 166.9821 1569.1911 32,333.629 166.9545 1569.2190 32,189.221 167.0100
300 2361.9916 49,284.522 168.4242 2361.9513 49,291.997 168.3912 2361.8930 49,442.418 168.3490 2361.9263 49,332.242 168.3787
400 3156.6207 66,352.375 169.0108 3156.6205 66,351.810 169.0132 3156.5415 66,481.631 168.9717 3156.5977 66,374.968 169.0020
500 3959.0341 83,602.138 169.8227 3959.0740 83,581.961 169.8377 3958.9157 83,805.733 169.7727 3958.9908 83,649.899 169.8046
  
Correlated noise
50 53.8370 3.7386 0.2050191 58.5327 4.3383 0.2176801 56.6344 4.0860 0.2123397 55.3564 3.9727 0.2073078
100 125.4313 12.7127 0.2141385 137.6718 15.2028 0.2382971 133.5194 14.1658 0.2303664 141.4205 16.2391 0.2457095
200 365.6562 57.1011 0.3661364 371.1149 59.1274 0.3758966 363.7029 56.5237 0.3625970 366.9008 58.2453 0.3677800
300 633.7779 123.5519 0.4840338 627.7072 119.9756 0.4747138 620.8280 117.9380 0.4637159 630.5637 123.2383 0.4785559
400 848.3205 171.9077 0.4876447 906.9662 200.2215 0.5644374 856.1566 175.3607 0.4972824 891.7622 193.9790 0.5431788
500 1107.0419 242.2257 0.5350661 1125.8548 253.0260 0.5554178 1120.0091 248.9138 0.5490881 1125.2149 251.4332 0.5548444
  
Heteroscedastic noise
50 10.0792 2.6791 0.07831279 10.1197 2.7320 0.07785383 11.4557 2.8614 0.07876284 10.7735 2.7636 0.07842460
100 22.9451 6.2389 0.07524141 22.6376 6.2392 0.07498336 23.2708 6.3525 0.07530713 23.1298 6.2918 0.07528758
200 45.6395 13.4907 0.07406204 45.3220 13.5069 0.07393325 45.7957 13.5843 0.07408459 45.8535 13.5477 0.07411817
300 65.4210 20.5738 0.07310522 65.0601 20.6037 0.07301035 65.5530 20.6647 0.07312095 65.6481 20.6454 0.07314678
400 86.4178 27.7676 0.07280445 86.1340 27.7854 0.07275172 86.5214 27.8327 0.07281793 86.7035 27.8201 0.07285208
500 106.7539 34.9374 0.07265729 106.3950 34.9547 0.07260265 106.8468 35.0011 0.07266690 107.0292 34.9983 0.07269362
  • Abbreviations: AIC, Akaike information criterion; cr, cubic spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.

For exponential noise, the cr model achieves the lowest AIC values for sample sizes of 50 and 100, whereas the tp model shows the lowest AIC values for sample sizes from 200 to 500. The cr model also has the lowest deviance values across all sample sizes, and the tp model consistently records the lowest GCV values regardless of sample size.

In the case of lognormal noise, the ps model yields the lowest AIC values for sample sizes of 50, 100, and 200, while the tp model has the lowest AIC values for sample sizes from 300 to 500. For deviance, the cr model has the lowest values for sample sizes of 50, 100, and 300, while the ps model shows the lowest deviance values for sample sizes of 200, 400, and 500. Regarding GCV, the gp model has the lowest value for a sample size of 50, and the tp model consistently shows the lowest GCV values for sample sizes from 100 to 500.

For correlated noise, the cr model demonstrates the lowest AIC, deviance, and GCV values for sample sizes of 50, 100, 400, and 500. However, for sample sizes of 200 and 300, the tp model exhibits the lowest AIC, deviance, and GCV values.

In the presence of heteroscedastic noise, the cr model records the lowest AIC value for a sample size of 50, while the ps model shows the lowest AIC values for sample sizes from 100 to 500. The cr model consistently has the lowest deviance values across all sample sizes. For GCV, the ps model records the lowest values for all sample sizes. The simulations in this study were performed using R; the corresponding codes are provided in the appendix for reproducibility.

6.1.2. Performance of the Smoothing Models in the Presence of Outliers

Table 2 presents the average AIC, deviance, and GCV metrics showcasing the performance of the four different smoothing models under outliers. For sample size 50, the cr model records the lowest AIC value, and for sample sizes 100–500, the gp model records the lowest AIC values. For deviance, the cr model records the lowest values for all sample sizes; for GCV, the tp model records the lowest values for sample sizes 50 and 100, while the gp model records the lowest values for sample sizes 200–500.

Table 2. Simulation results for outliers.
n cr ps tp gp
AIC Dev GCV AIC Dev GCV AIC Dev GCV AIC Dev GCV
50 221.7973 201.8335 4.911279 221.8378 202.9597 4.903755 222.1760 208.6935 4.893145 221.8933 204.3111 4.894905
100 441.4750 432.2088 4.787199 441.5084 432.5889 4.788514 441.5425 437.4634 4.783731 441.4503 433.3559 4.784178
200 877.2997 884.4995 4.674621 877.2997 884.7719 4.674582 877.3502 888.5398 4.674604 877.2779 885.6246 4.673631
300 1312.5408 1335.6427 4.630237 1312.5522 1336.4536 4.630282 1312.5607 1339.6206 4.629994 1312.5080 1336.5940 4.629525
400 1748.9729 1793.9210 4.623047 1748.9572 1794.7829 4.622803 1748.9688 1797.8367 4.622700 1748.9442 1795.2525 4.622575
500 2183.8988 2246.4415 4.604702 2183.8796 2246.9413 4.604503 2183.9102 2249.8641 4.604644 2183.8733 2247.1608 4.604411
  • Abbreviations: AIC, Akaike information criterion; cr, cubic spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.

6.2. Empirical Study

Table 3 presents the performance of the models using the empirical data. Based on the results, the gp model has the lowest AIC and GCV, while the tp model has the lowest deviance.

Table 3. Empirical results.
cr ps tp gp
AIC Dev GCV AIC Dev GCV AIC Dev GCV AIC Dev GCV
4898.589 12,397.56 14.17771 4897.559 12,359.89 14.16162 4897.047 12,170.46 14.15661 4897.014 12,182.06 14.15584
  • Abbreviations: AIC, Akaike information criterion; cr, cubic spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.

7. Discussion

In this study, we aimed to evaluate the performance of semiparametric models in the presence of Gaussian, exponential, lognormal, correlated, and heteroscedastic noise and outliers within the framework of GAMs through simulation. Generally, our analysis shows that the cr model performs strongly in terms of deviance across all noise types and most sample sizes. The ps model often performs well with larger sample sizes, especially in terms of AIC and GCV under Gaussian and heteroscedastic noise. The gp model excels with the smallest sample size under Gaussian and lognormal noise in terms of GCV, and the tp model frequently performs best under exponential and lognormal noise for larger samples in terms of AIC and GCV. For data with outliers, the cr and tp models perform well with smaller sample sizes, while the gp model excels with larger sample sizes based on AIC and GCV; in terms of deviance, the cr model performs best across all sample sizes. Our findings reveal that the performance of smoothing models is highly dependent on the type of noise present in the data and on the sample size. No single model consistently outperforms the others across all noise types and sample sizes, which implies that model selection should be tailored to the specific objective of a study. In the empirical study, the gp model has the lowest AIC and GCV, while the tp model has the lowest deviance.

While this study provides valuable insights, it has several limitations. The study focused on a limited set of smoothing models; other models not considered here may perform differently under similar conditions. In addition, the study considered only two variables. Future research can address these limitations and expand on the current study by including a wider array of smoothing models, to see whether the findings hold or whether other models perform better under certain conditions, and by applying the methodology to more varied datasets to assess the generalizability of the findings.

The methodology employed in this study, which assesses smoothing models across diverse noise types and sample sizes through the metrics of AIC, deviance, and GCV, holds potential for broader application. It can readily be adapted to analyze datasets resembling sine functions, a common occurrence in fields dealing with periodic or cyclical phenomena. Such datasets are prevalent in physics, where oscillatory systems frequently yield data resembling sine functions. Also, engineers encounter sine-like data in various applications, including sound waves, electrical signals, and mechanical vibrations.

8. Conclusion

In conclusion, this study sheds light on the performance of smoothing models under different noise types and sample sizes, employing metrics such as AIC, deviance, and GCV for evaluation. The findings indicate that the effectiveness of smoothing models varies greatly depending on the noise type and sample size. Since no single model consistently excels across all conditions, it is essential to choose models based on the specific goals of research. By showcasing the variability in model performance across various noise types and sample sizes, this research provides valuable insights for practitioners and researchers in fields where smoothing techniques are applied. The study highlights the importance of considering noise characteristics and sample size when selecting the most suitable smoothing model. However, it is important to acknowledge the study’s limitations. Future research should aim at addressing these limitations by exploring a broader range of models and datasets to enhance the generalizability and applicability of the findings.

Nomenclature

  • GLM: generalized linear model
  • GAM: generalized additive model
  • OLS: ordinary least squares
  • GCV: generalized cross validation
  • REML: restricted maximum likelihood
  • RSS: residual sum of squares
  • AIC: Akaike information criterion
  • Dev: deviance
  • cr: cubic regression spline
  • ps: P-spline
  • tp: thin plate spline
  • gp: Gaussian process
Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

Daniel Edinam Wormenor was responsible for the conceptualization, methodology, numerical simulations, editing, and analysis of results. Sampson Twumasi-Ankrah was responsible for supervision, review, and editing. Accam Burnett Tetteh was responsible for supervision, review, numerical simulations, editing, and analysis of results.

Funding

The authors did not receive support from any organization for the submitted work.

Appendix

The simulations in this study were performed using R Version 4.3.0. The corresponding codes are provided as an electronic appendix for reproducibility (https://github.com/Daniel-Edinam/Codes.git).

Data Availability Statement

This study utilized the Triceps dataset from the MultiKink package in R for the empirical analysis. The dataset is publicly available and can be accessed by loading the MultiKink package. Additionally, the R code used to generate the simulated data is provided in the appendix.
