Comparison of Semiparametric Models in the Presence of Noise and Outliers
Abstract
Various studies have examined generalized additive models (GAMs), comparing thin plate splines (tp), P-splines (ps), cubic regression splines (cr), and Gaussian processes (gp) for discrete choice data, function approximation, and in the presence of multicollinearity and outliers. Some studies have applied ps to models with correlated and heteroscedastic errors, while others have reviewed multiple smoothing term packages for modeling GAMs. This study examines the performance of semiparametric models in the presence of different types of noise and outliers within the framework of GAMs through simulation. Four GAM smoothers, cr, ps, tp, and gp, were fitted to simulated data with different types of noise and outliers across varying sample sizes. According to our investigation, the cr model performs well in terms of deviance for the majority of sample sizes and all types of noise. The ps model frequently performs well at larger sample sizes, particularly in terms of AIC and GCV under heteroscedastic and Gaussian noise. The gp model excels at the smallest sample size under Gaussian and lognormal noise in terms of GCV, and the tp model frequently performs best under exponential and lognormal noise for larger samples in terms of AIC and GCV. For data containing outliers, the cr and tp models are effective with smaller sample sizes, while the gp model excels with larger sample sizes based on AIC and GCV. Regarding deviance, the cr model consistently performs best across all sample sizes. Our results show that the sample size and the type of noise in the data have a significant impact on a smoothing model’s performance. No single model consistently outperforms the others for all noise types and sample sizes, suggesting that the choice of model should be based on the specific goal of a study.
1. Introduction
The method of ordinary least squares (OLS) is a statistical approach commonly used for parameter estimation, mainly due to its historical popularity and straightforward computation [1]. It is highly regarded for its ability to produce unbiased and efficient parameter estimates, provided key foundational assumptions are met. These assumptions include normally distributed error terms, equal variance of the error terms, and the absence of outliers, leverage points, multicollinearity, and deviations from linearity [2]. The assumption of linearity is a crucial prerequisite; however, attaining linearity in practical scenarios can be challenging for a multitude of reasons. Statistical models are inevitably simplifications, and even the most intricate model can offer only a limited representation of reality. Consequently, the task of estimating statistical models may appear futile at first glance.
However, Box [3] succinctly captured the essence of statistical research when he stated that “all models are wrong, but some are useful.” Despite their inherent inaccuracies, statistical models afford us significant insights into the political, economic, and sociological dimensions of our world. Statistical models are inherently simplifications of reality, as they necessitate making assumptions about various aspects of the real world. The practice of statistics acknowledges the need for assumptions, and a significant part of statistical modeling involves conducting diagnostic tests to ensure that these assumptions are met. In the social sciences, considerable effort is dedicated to scrutinizing assumptions concerning the nature of the error term, such as heteroskedasticity or serial correlation. However, social science researchers often tend to be less rigorous in testing assumptions regarding the functional form of the model [4]. The predominant choice is the linear functional form, typically characterized by additivity, yet researchers often overlook the crucial task of validating the linearity assumption. While substantial attention is directed toward specification and avoiding misspecification, there is a noticeable dearth of exploration regarding alternative functional forms when the chosen form proves to be incorrect, essentially resulting in a specification error [4].
Regression analysis, a widely used statistical tool for modeling relationships between variables and investigating dependence, has found applications in numerous fields of study such as social science, health science, engineering, and physical science. For instance, banks use regression analysis to determine factors that positively or negatively impact their profits, while medical doctors employ regression to assess total body fat in patients by considering variables that influence its increase or decrease. Similarly, statisticians in hospitals use regression analysis to examine lifestyle factors that may be causative agents for certain conditions, such as high blood pressure [2]. Hence, regression analysis holds significant importance as a statistical tool. It is used to answer questions like the following: Does increasing class size affect students’ success? Can the duration of the most recent eruption of Old Faithful Geyser be used to predict the time of the next eruption? Do dietary changes affect cholesterol levels, and if so, do the effects depend on other factors like age, sex, and level of exercise? Do nations with higher incomes per person have lower birth rates than nations with lower incomes? Many research efforts focus on regression analysis as a key component [5].
The use of linear functional forms is more widespread than commonly believed. Many analysts concentrate largely on the linearity of the outcome variable within a statistical model. Analysts typically use least squares estimation of a linear regression model when the outcome variable is continuous, whereas generalized linear models (GLMs) such as logistic or Poisson regression are often used for discrete outcomes. Researchers frequently believe they have abandoned the linear functional form by using logistic or Poisson regression models; these models nevertheless retain a crucial linear property in their functional form [4]. The introduction of the GLM notation by [6] offers valuable clarity regarding the linearity assumption in models that are commonly perceived as nonlinear by many researchers.
Hastie and Tibshirani [7] proposed the class of generalized additive models (GAMs), a broadened version of GLMs. Semiparametric regression models encompass both additive models and GAMs [8]. They combine the use of local estimation methods, such as lowess and splines, with standard linear models and GLMs. These models give analysts the flexibility to incorporate nonparametric regression for certain predictor variables while estimating other predictors parametrically [4]. While GAMs relax the assumption of a global fit, they still maintain the assumption of additive effects. The additivity assumption in GAMs enhances interpretability compared to other methods such as neural networks, support vector machines, projection pursuit, and tree-based approaches. A crucial advantage of GAMs is their ability to diagnose nonlinearity. Within the GAM framework, simple linear and power transformations are nested, allowing local estimates from a GAM to be compared against linear, quadratic, or other transformations using statistical tests such as the F-test or likelihood ratio test [4]. Semiparametric models offer more flexibility than parametric models because they do not require a strict assumption about the data distribution. Asfha [9] assessed and compared the performance of GAMs, focusing on their tolerance to the effects of outliers, multicollinearity, and the combined impact of both. The study by [10], titled “Selection of Splines Models in the Presence of Multicollinearity and Outliers,” aimed to address challenges related to multicollinearity and outliers, evaluating cubic regression splines (cr), P-splines (ps), and thin plate splines (tp) for handling these challenges.
In this study, however, our objective is to perform simulation experiments to evaluate and analyze the effectiveness of semiparametric models when subjected to various types of noise and outliers within the context of GAMs. Noise refers to random variations in the data that arise from measurement errors, environmental conditions, or inherent randomness in the data collection process; these variations are assumed to follow a probabilistic distribution, typically normal, and do not systematically bias the dataset. Outliers, in contrast, are observations that deviate significantly from the majority of the data and do not conform to the expected distribution or pattern. They may indicate data entry errors, rare events, or underlying phenomena not captured by the main data model.
2. Methods and Materials
Smoothing is a straightforward method for fitting a curve to the pattern of data points in a scatter plot, and it serves as a foundational building block of GAMs. In this study, four smoothing techniques based on GAMs are adopted, namely, cr, ps, tp, and Gaussian process (gp).
2.1. cr
- λ ≥ 0 is a smoothing parameter controlling the trade-off between fidelity to the data and roughness of the function estimate. It is often estimated by generalized cross validation (GCV) or by restricted maximum likelihood (REML) [4, 11].
- As λ ⟶ 0, the curve-fitting algorithm imposes no penalty on the model for fitting the data too closely. This can result in a very accurate fit to the data, but it can also lead to a very noisy curve that follows every minor fluctuation in the data [4, 11].
- As λ ⟶ ∞, the roughness penalty becomes paramount and the estimate converges to the linear least squares estimate [4, 11].
- The second component of the criterion, the roughness penalty, measures the curvature or roughness of the estimate throughout the range of x values [4].
- k denotes the number of knots.
- 0_{2×2} is the 2 × 2 matrix of zeroes.
- I_{k×k} is the k × k identity matrix.

These quantities enter the penalized least squares criterion sketched below.
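The criterion itself did not survive extraction, so the following is a plausible reconstruction rather than the authors’ original equation. It assumes the standard cubic regression spline setup, in which the fidelity term is penalized by the integrated squared second derivative (the “second component” described above) and, in basis form, the first two coefficients (intercept and linear term) are left unpenalized, as the block matrices 0_{2×2} and I_{k×k} suggest:

```latex
\min_{f}\ \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2
  \;+\; \lambda \int f''(x)^2 \, dx ,
\qquad
\min_{\beta}\ \lVert y - X\beta \rVert^{2}
  \;+\; \lambda\, \beta^{\top} D \beta ,
\quad
D = \begin{pmatrix} 0_{2\times 2} & 0 \\ 0 & I_{k\times k} \end{pmatrix}
```

Here X and β (basis matrix and coefficient vector) are assumed notation; the block-diagonal D shrinks the k knot coefficients while leaving the intercept and linear coefficients unpenalized.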
2.2. ps
- P is a first-order finite difference matrix.

This matrix enters the P-spline criterion sketched below.
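A hedged reconstruction of the P-spline objective consistent with this definition, assuming a B-spline basis matrix B with coefficient vector β (both symbols are introduced here for illustration):

```latex
\min_{\beta}\ \lVert y - B\beta \rVert^{2} \;+\; \lambda \lVert P\beta \rVert^{2}
\;=\; \lVert y - B\beta \rVert^{2} \;+\; \lambda\, \beta^{\top} P^{\top} P\, \beta
```

Penalizing first-order differences of adjacent B-spline coefficients discourages abrupt jumps between neighboring basis functions, which is the defining idea of P-splines.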
2.3. tp
- J_{md}(f) is the penalty function, which measures the wiggliness of the smoother f.
- λ is the smoothing parameter.
- d is the number of predictors in the smooth function.
- m = floor((d + 1)/2) + 1.
- The summation runs over all possible combinations of nonnegative integer values γ1, γ2, ⋯, γd that satisfy γ1 + γ2 + ⋯ + γd = m.
- m!/(γ1! ⋯ γd!) is the multinomial coefficient representing the number of ways to partition m identical objects into d distinct groups, where group i contains γi objects.
- ∂^m f/(∂x1^γ1 ⋯ ∂xd^γd) is the mth-order partial derivative of the function f with respect to each of the d variables, where the orders of differentiation are determined by γ1, γ2, ⋯, γd.
- δ and α are unknown parameter vectors to be estimated, in which δ is subject to the linear constraint T^⊤δ = 0, where T_{ij} = ϕj(x_i).
- ϕj is the jth basis function.
- M = (m + d − 1)!/(d! (m − 1)!).

These terms enter the thin plate spline penalty and smoother sketched below.
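The equations these symbols belong to were lost in extraction; the standard thin plate spline forms they imply are sketched here (η_{md}, the thin plate radial basis function, is assumed notation):

```latex
J_{md}(f) \;=\; \int_{\mathbb{R}^{d}}
  \sum_{\gamma_{1}+\cdots+\gamma_{d}=m}
  \frac{m!}{\gamma_{1}! \cdots \gamma_{d}!}
  \left( \frac{\partial^{m} f}
              {\partial x_{1}^{\gamma_{1}} \cdots \partial x_{d}^{\gamma_{d}}}
  \right)^{2} dx_{1} \cdots dx_{d} ,
\qquad
\hat{f}(x) \;=\; \sum_{i=1}^{n} \delta_{i}\,
  \eta_{md}\bigl( \lVert x - x_{i} \rVert \bigr)
  \;+\; \sum_{j=1}^{M} \alpha_{j}\, \phi_{j}(x),
\quad \text{subject to } T^{\top}\delta = 0
```

The smoother minimizes the fidelity term plus λ J_{md}(f), so λ again governs the trade-off between fit and wiggliness.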
2.4. gp
- x = [x1, ⋯, xn]
- f(x) = [f(x1), ⋯, f(xn)]
- μ = [m(x1), ⋯, m(xn)]
- K_{ij} = k(xi, xj)
- m(·) = 0
- Cov[f(xi), f(xj)] = k(xi, xj)
- l is the length scale parameter
- σ_f^2 is the signal variance parameter
The gp spline combines the flexibility of Gaussian processes with the smoothness of splines. This is done by modeling each spline piece as a separate gp segment and combining the segments to form a more flexible model.
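Given the definitions above, a hedged reconstruction of the gp prior follows, with the squared exponential covariance that a length scale l and signal variance σ_f^2 conventionally parameterize (the specific kernel is an assumption here):

```latex
f \sim \mathcal{GP}\bigl( m(x),\, k(x, x') \bigr),
\qquad
\bigl[ f(x_{1}), \ldots, f(x_{n}) \bigr]^{\top} \sim \mathcal{N}(\mu, K),
\qquad
k(x_{i}, x_{j}) = \sigma_{f}^{2}
  \exp\!\left( -\frac{(x_{i} - x_{j})^{2}}{2 l^{2}} \right)
```

With m(·) = 0, the prior mean vector μ is zero and the fit is driven entirely by the covariance structure K.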
3. Model Selection
Assessing models in regression analysis is crucial for comparing their effectiveness. Once different models are fitted to the data, it becomes essential to thoroughly evaluate their overall fit and the quality of that fit. cr, ps, tp, and gp regression models were applied to the simulated data, where the dependent variable was “y” and the independent variable “x.” In each simulation scenario, the models’ performance was evaluated using AIC, deviance, and GCV to assess their effectiveness and suitability. These model selection techniques were chosen because the AIC helps in balancing model fit and complexity [20], the deviance provides a direct measure of model fit [9], and the GCV helps find a model that balances accuracy and simplicity, making it reliable for future predictions [21].
3.1. Model AIC
- Lmodel is the likelihood of the GAM model
- k is the number of estimated parameters in the model

These quantities define the criterion sketched below.
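The formula itself was lost in extraction; its standard form, which these two quantities determine, is:

```latex
\mathrm{AIC} \;=\; 2k \;-\; 2 \log\bigl( L_{\mathrm{model}} \bigr)
```

Lower values indicate a better balance between goodness of fit and model complexity.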
3.2. Model Deviance
- LNULL is the log-likelihood of a null model (a model with no predictors)
- LMODEL is the log-likelihood of the GAM model

These quantities enter the deviance comparison sketched below.
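The formula did not survive extraction. Assuming the likelihood ratio form that a null-model baseline implies, a hedged reconstruction is:

```latex
D \;=\; 2\,\bigl( L_{\mathrm{MODEL}} \;-\; L_{\mathrm{NULL}} \bigr)
```

Note that GAM software such as mgcv instead reports the residual deviance via deviance() (for Gaussian models, the residual sum of squares), where lower values indicate a closer fit; the tables in this study are consistent with that lower-is-better convention.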
3.3. Model GCV
- n is the number of observations
- tr(S) is the trace of the smoothing matrix S
- RSS is the residual sum of squares

These quantities define the GCV score sketched below.
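With these definitions, the GCV score takes its standard form:

```latex
\mathrm{GCV} \;=\; \frac{n \cdot \mathrm{RSS}}{\bigl[\, n - \mathrm{tr}(S) \,\bigr]^{2}}
```

The denominator discounts the effective degrees of freedom tr(S), so models that spend many degrees of freedom to reduce the RSS are penalized.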
4. Simulation
- The values of the independent variable were generated as a sequence of equally spaced values ranging from 0 to 1. The values of the dependent variable were generated based on the sine function sin(2πx), which produces a periodic wave pattern. To simulate the presence of noise, random fluctuations were added to the dependent variable.
- These fluctuations were generated from the Gaussian distribution with a mean of 0 and a standard deviation of 0.1, from the exponential distribution with a rate of 1, and from the lognormal distribution with a mean log of 0 and a standard deviation log of 1.
- For correlated errors, a sequence of correlated noise was generated using an autoregressive process with an autoregressive coefficient of 0.8 and a standard deviation of 0.1; for heteroscedastic errors, the random fluctuations were generated from the Gaussian distribution with mean 0 and standard deviation 0.1x², resulting in heteroscedastic noise.
- To introduce outliers, 20% of the dependent variable values were randomly chosen as maximum vertical outliers. To simulate the presence of both noise and outliers, random fluctuations (Gaussian, exponential, lognormal, heteroscedastic, and correlated noise) and 20% outliers were added to the response variable for each generated dataset.
- To investigate the models’ behavior, 1000 iterations were simulated across sample sizes of 50, 100, 200, 300, 400, and 500, recording average AIC, deviance, and GCV values for all sample sizes in each model. A fixed number of knots (k = 25) was employed consistently when fitting the GAMs across all scenarios; a minimal R sketch of one scenario follows this list.
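The sketch below covers one scenario (Gaussian noise) and assumes the mgcv package; the object names, the outlier rule in the comments, and the exact bookkeeping are illustrative rather than the authors’ code:

```r
library(mgcv)

set.seed(1)
n_sim <- 1000                     # iterations per scenario
n     <- 100                      # one of 50, 100, 200, 300, 400, 500
bases <- c("cr", "ps", "tp", "gp")

# running totals of AIC, deviance, and GCV for each smoother
totals <- matrix(0, nrow = length(bases), ncol = 3,
                 dimnames = list(bases, c("AIC", "Dev", "GCV")))

for (iter in seq_len(n_sim)) {
  x <- seq(0, 1, length.out = n)
  y <- sin(2 * pi * x) + rnorm(n, mean = 0, sd = 0.1)  # Gaussian noise
  # other scenarios swap in analogously, e.g.:
  #   rexp(n, rate = 1)                        # exponential
  #   rlnorm(n, meanlog = 0, sdlog = 1)        # lognormal
  #   arima.sim(list(ar = 0.8), n, sd = 0.1)   # correlated, AR(1)
  #   rnorm(n, 0, 0.1 * x^2)                   # heteroscedastic
  #   y[sample(n, 0.2 * n)] <- max(y)          # one reading of
  #                                            # "maximum vertical outliers"
  for (b in bases) {
    fit <- gam(y ~ s(x, bs = b, k = 25))       # fixed k = 25
    totals[b, ] <- totals[b, ] + c(AIC(fit), deviance(fit), fit$gcv.ubre)
  }
}
totals / n_sim   # average AIC, deviance, and GCV per smoother
```

fit$gcv.ubre holds the GCV score under mgcv’s default smoothing parameter selection; with other selection methods it holds the corresponding criterion instead.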
5. Empirical Study
In the empirical study, we used data from an anthropometric study of 892 females under 50 years in three Gambian villages in West Africa, a data frame with 892 observations on the following three variables: age (age of respondents), lntriceps (log of the triceps skinfold thickness), and triceps (triceps skinfold thickness). For the purposes of our study, we extracted age as our dependent variable and triceps as our independent variable.
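A minimal sketch of the empirical fits, assuming the triceps data frame from the MultiKink package and the variable roles stated above (age as response, triceps as predictor); carrying k = 25 over from the simulations is an assumption:

```r
library(mgcv)
library(MultiKink)

data("triceps")   # 892 observations: age, lntriceps, triceps

bases <- c("cr", "ps", "tp", "gp")
fits <- lapply(bases, function(b)
  gam(age ~ s(triceps, bs = b, k = 25), data = triceps))
names(fits) <- bases

# AIC, deviance, and GCV for each smoother, as in Table 3
sapply(fits, function(fit)
  c(AIC = AIC(fit), Dev = deviance(fit), GCV = fit$gcv.ubre))
```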
6. Results
6.1. Simulation Study
6.1.1. Performance of the Smoothing Models in the Presence of Different Noises
Table 1 shows the average AIC, deviance, and GCV values for different smoothing models applied across varying sample sizes for each type of noise. In the presence of Gaussian noise, the cr model exhibits the lowest AIC values for sample sizes of 50 and 100, while the ps model shows the lowest AIC values for sample sizes ranging from 200 to 500. Regarding deviance, the cr model consistently has the lowest values across all sample sizes. For GCV, the gp model has the lowest value for a sample size of 50, and the ps model shows the lowest GCV values for sample sizes from 100 to 500.
n | cr AIC | cr Dev | cr GCV | ps AIC | ps Dev | ps GCV | tp AIC | tp Dev | tp GCV | gp AIC | gp Dev | gp GCV
---|---|---|---|---|---|---|---|---|---|---|---|---
Gaussian noise | ||||||||||||
50 | 93.90621 | 14.44691 | 0.3890253 | 94.14079 | 14.69773 | 0.3882821 | 94.55149 | 15.05676 | 0.3886149 | 94.1508 | 14.67584 | 0.3882202 |
100 | 186.80599 | 32.76760 | 0.3781714 | 186.81444 | 32.83055 | 0.3781079 | 186.93655 | 33.08236 | 0.3781899 | 186.8825 | 32.88542 | 0.3782136 |
200 | 369.72954 | 68.95374 | 0.3704007 | 369.70498 | 69.04901 | 0.3703276 | 369.76360 | 69.13356 | 0.3704022 | 369.8064 | 69.03241 | 0.3705073 |
300 | 551.71050 | 104.87276 | 0.3672722 | 551.65768 | 105.01119 | 0.3671869 | 551.70564 | 105.02585 | 0.3672412 | 551.7817 | 104.96504 | 0.3673434 |
400 | 731.73180 | 140.15956 | 0.3638832 | 731.64478 | 140.25801 | 0.3637970 | 731.73285 | 140.28544 | 0.3638730 | 731.7972 | 140.21116 | 0.3639373 |
500 | 915.71571 | 176.87235 | 0.3648749 | 915.64695 | 176.94916 | 0.3648214 | 915.73794 | 177.04573 | 0.3648818 | 915.8034 | 176.90096 | 0.3649363 |
Exponential noise | ||||||||||||
50 | 319.6493 | 1541.939 | 36.90835 | 319.8258 | 1550.800 | 36.92509 | 320.0093 | 1596.270 | 36.76120 | 319.8061 | 1556.810 | 36.85055 |
100 | 641.5452 | 3360.978 | 36.60641 | 641.5665 | 3368.524 | 36.60132 | 641.5513 | 3413.017 | 36.53660 | 641.5693 | 3376.194 | 36.58600 |
200 | 1284.0189 | 6961.620 | 36.28710 | 1284.0321 | 6963.671 | 36.28869 | 1283.9462 | 6997.981 | 36.26228 | 1283.9719 | 6970.786 | 36.27413 |
300 | 1927.5254 | 10,615.351 | 36.38911 | 1927.4890 | 10,618.675 | 36.38393 | 1927.4236 | 10,655.292 | 36.37128 | 1927.4673 | 10,620.715 | 36.38121 |
400 | 2569.1207 | 14,172.382 | 36.23391 | 2569.1669 | 14,174.472 | 36.23756 | 2569.0406 | 14,215.202 | 36.22311 | 2569.0665 | 14,184.858 | 36.22755 |
500 | 3209.4782 | 17,700.128 | 36.04654 | 3209.4911 | 17,707.708 | 36.04666 | 3209.3939 | 17,738.320 | 36.03869 | 3209.4334 | 17,711.490 | 36.04224 |
Lognormal noise | ||||||||||||
50 | 384.4017 | 7150.621 | 168.2466 | 384.3376 | 7240.909 | 167.6750 | 384.7875 | 7356.442 | 167.7671 | 384.4797 | 7263.681 | 167.6115 |
100 | 779.8758 | 16,429.065 | 177.2600 | 779.8747 | 16,433.674 | 177.0760 | 779.9213 | 16,629.556 | 177.0538 | 779.9200 | 16,493.364 | 177.1642 |
200 | 1569.2552 | 32,154.810 | 167.0437 | 1569.1530 | 32,118.925 | 166.9821 | 1569.1911 | 32,333.629 | 166.9545 | 1569.2190 | 32,189.221 | 167.0100 |
300 | 2361.9916 | 49,284.522 | 168.4242 | 2361.9513 | 49,291.997 | 168.3912 | 2361.8930 | 49,442.418 | 168.3490 | 2361.9263 | 49,332.242 | 168.3787 |
400 | 3156.6207 | 66,352.375 | 169.0108 | 3156.6205 | 66,351.810 | 169.0132 | 3156.5415 | 66,481.631 | 168.9717 | 3156.5977 | 66,374.968 | 169.0020 |
500 | 3959.0341 | 83,602.138 | 169.8227 | 3959.0740 | 83,581.961 | 169.8377 | 3958.9157 | 83,805.733 | 169.7727 | 3958.9908 | 83,649.899 | 169.8046 |
Correlated noise | ||||||||||||
50 | 53.8370 | 3.7386 | 0.2050191 | 58.5327 | 4.3383 | 0.2176801 | 56.6344 | 4.0860 | 0.2123397 | 55.3564 | 3.9727 | 0.2073078 |
100 | 125.4313 | 12.7127 | 0.2141385 | 137.6718 | 15.2028 | 0.2382971 | 133.5194 | 14.1658 | 0.2303664 | 141.4205 | 16.2391 | 0.2457095 |
200 | 365.6562 | 57.1011 | 0.3661364 | 371.1149 | 59.1274 | 0.3758966 | 363.7029 | 56.5237 | 0.3625970 | 366.9008 | 58.2453 | 0.3677800 |
300 | 633.7779 | 123.5519 | 0.4840338 | 627.7072 | 119.9756 | 0.4747138 | 620.8280 | 117.9380 | 0.4637159 | 630.5637 | 123.2383 | 0.4785559 |
400 | 848.3205 | 171.9077 | 0.4876447 | 906.9662 | 200.2215 | 0.5644374 | 856.1566 | 175.3607 | 0.4972824 | 891.7622 | 193.9790 | 0.5431788 |
500 | 1107.0419 | 242.2257 | 0.5350661 | 1125.8548 | 253.0260 | 0.5554178 | 1120.0091 | 248.9138 | 0.5490881 | 1125.2149 | 251.4332 | 0.5548444 |
Heteroscedastic noise | ||||||||||||
50 | 10.0792 | 2.6791 | 0.07831279 | 10.1197 | 2.7320 | 0.07785383 | 11.4557 | 2.8614 | 0.07876284 | 10.7735 | 2.7636 | 0.07842460 |
100 | 22.9451 | 6.2389 | 0.07524141 | 22.6376 | 6.2392 | 0.07498336 | 23.2708 | 6.3525 | 0.07530713 | 23.1298 | 6.2918 | 0.07528758 |
200 | 45.6395 | 13.4907 | 0.07406204 | 45.3220 | 13.5069 | 0.07393325 | 45.7957 | 13.5843 | 0.07408459 | 45.8535 | 13.5477 | 0.07411817 |
300 | 65.4210 | 20.5738 | 0.07310522 | 65.0601 | 20.6037 | 0.07301035 | 65.5530 | 20.6647 | 0.07312095 | 65.6481 | 20.6454 | 0.07314678 |
400 | 86.4178 | 27.7676 | 0.07280445 | 86.1340 | 27.7854 | 0.07275172 | 86.5214 | 27.8327 | 0.07281793 | 86.7035 | 27.8201 | 0.07285208 |
500 | 106.7539 | 34.9374 | 0.07265729 | 106.3950 | 34.9547 | 0.07260265 | 106.8468 | 35.0011 | 0.07266690 | 107.0292 | 34.9983 | 0.07269362 |
- Abbreviations: AIC, Akaike information criterion; cr, cubic regression spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.
For exponential noise, the cr model achieves the lowest AIC values for sample sizes of 50 and 100, whereas the tp model shows the lowest AIC values for sample sizes from 200 to 500. The cr model has the lowest deviance values across all sample sizes, and the tp model consistently records the lowest GCV values regardless of sample size.
In the case of lognormal noise, the ps model yields the lowest AIC values for sample sizes of 50, 100, and 200, while the tp model has the lowest AIC values for sample sizes from 300 to 500. For deviance, the cr model has the lowest values for sample sizes of 50, 100, and 300, while the ps model shows the lowest deviance values for sample sizes of 200, 400, and 500. Regarding GCV, the gp model has the lowest value for a sample size of 50, and the tp model consistently shows the lowest GCV values for sample sizes from 100 to 500.
For correlated noise, the cr model demonstrates the lowest AIC, deviance, and GCV values for sample sizes of 50, 100, 400, and 500. However, for sample sizes of 200 and 300, the tp model exhibits the lowest AIC, deviance, and GCV values.
In the presence of heteroscedastic noise, the cr model records the lowest AIC value for a sample size of 50, while the ps model shows the lowest AIC values for sample sizes from 100 to 500. The cr model consistently has the lowest deviance values across all sample sizes. For GCV, the ps model records the lowest values for all sample sizes. The simulations in this study were performed using R; the corresponding code is provided in the appendix for reproducibility.
6.1.2. Performance of the Smoothing Models in the Presence of Outliers
Table 2 presents the average AIC, deviance, and GCV metrics showcasing the performance of the four smoothing models under outliers. For a sample size of 50, the cr model records the lowest AIC value, and for sample sizes of 100–500, the gp model records the lowest AIC values. The cr model records the lowest deviance values for all sample sizes. For GCV, the tp model records the lowest values for sample sizes of 50 and 100, while the gp model records the lowest values for sample sizes of 200–500.
n | cr AIC | cr Dev | cr GCV | ps AIC | ps Dev | ps GCV | tp AIC | tp Dev | tp GCV | gp AIC | gp Dev | gp GCV
---|---|---|---|---|---|---|---|---|---|---|---|---
50 | 221.7973 | 201.8335 | 4.911279 | 221.8378 | 202.9597 | 4.903755 | 222.1760 | 208.6935 | 4.893145 | 221.8933 | 204.3111 | 4.894905 |
100 | 441.4750 | 432.2088 | 4.787199 | 441.5084 | 432.5889 | 4.788514 | 441.5425 | 437.4634 | 4.783731 | 441.4503 | 433.3559 | 4.784178 |
200 | 877.2997 | 884.4995 | 4.674621 | 877.2997 | 884.7719 | 4.674582 | 877.3502 | 888.5398 | 4.674604 | 877.2779 | 885.6246 | 4.673631 |
300 | 1312.5408 | 1335.6427 | 4.630237 | 1312.5522 | 1336.4536 | 4.630282 | 1312.5607 | 1339.6206 | 4.629994 | 1312.5080 | 1336.5940 | 4.629525 |
400 | 1748.9729 | 1793.9210 | 4.623047 | 1748.9572 | 1794.7829 | 4.622803 | 1748.9688 | 1797.8367 | 4.622700 | 1748.9442 | 1795.2525 | 4.622575 |
500 | 2183.8988 | 2246.4415 | 4.604702 | 2183.8796 | 2246.9413 | 4.604503 | 2183.9102 | 2249.8641 | 4.604644 | 2183.8733 | 2247.1608 | 4.604411 |
- Abbreviations: AIC, Akaike information criterion; cr, cubic regression spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.
6.2. Empirical Study
Table 3 presents the performance of the models using the empirical data. Based on the results, the gp model has the lowest AIC and GCV values, while the tp model has the lowest deviance.
cr AIC | cr Dev | cr GCV | ps AIC | ps Dev | ps GCV | tp AIC | tp Dev | tp GCV | gp AIC | gp Dev | gp GCV
---|---|---|---|---|---|---|---|---|---|---
4898.589 | 12,397.56 | 14.17771 | 4897.559 | 12,359.89 | 14.16162 | 4897.047 | 12,170.46 | 14.15661 | 4897.014 | 12,182.06 | 14.15584 |
- Abbreviations: AIC, Akaike information criterion; cr, cubic regression spline; Dev, deviance; GCV, generalized cross validation; gp, Gaussian process; ps, P-spline; tp, thin plate spline.
7. Discussion
In this study, we aimed to evaluate the performance of semiparametric models in the presence of Gaussian, exponential, lognormal, correlated, and heteroscedastic noise and outliers within the framework of GAMs through simulation. Generally, our analysis shows that the cr model performs strongly in terms of deviance across all noise types and most sample sizes. The ps model often performs well with larger sample sizes, especially in terms of AIC and GCV under Gaussian and heteroscedastic noise. The gp model excels with the smallest sample size under Gaussian and lognormal noise in terms of GCV, and the tp model frequently performs best under exponential and lognormal noise for larger samples in terms of AIC and GCV. For data with outliers, the cr and tp models perform well with smaller sample sizes, while the gp model excels with larger sample sizes based on AIC and GCV; in terms of deviance, the cr model performs best across all sample sizes. Our findings reveal that the performance of smoothing models is highly dependent on the type of noise present in the data and on the sample size, and no single model consistently outperforms the others across all noise types and sample sizes, which implies that model selection should be tailored to the specific objective of a study. In the empirical study, the gp model has the lowest AIC and GCV values, while the tp model has the lowest deviance.
While this study provides valuable insights, it has several limitations. It focused on a limited set of smoothing models, and other models not considered here may perform differently under similar conditions; it also considered only two variables. Future research can address these limitations and expand on the current study by including a wider array of smoothing models, to see whether the findings hold or whether other models perform better under certain conditions, and by applying the methodology to more varied datasets to assess the generalizability of the findings.
The methodology employed in this study, which assesses smoothing models across diverse noise types and sample sizes through the metrics of AIC, deviance, and GCV, holds potential for broader application. It can readily be adapted to analyze datasets resembling sine functions, a common occurrence in fields dealing with periodic or cyclical phenomena. Such datasets are prevalent in physics, where oscillatory systems frequently yield data resembling sine functions. Also, engineers encounter sine-like data in various applications, including sound waves, electrical signals, and mechanical vibrations.
8. Conclusion
In conclusion, this study sheds light on the performance of smoothing models under different noise types and sample sizes, employing metrics such as AIC, deviance, and GCV for evaluation. The findings indicate that the effectiveness of smoothing models varies greatly depending on the noise type and sample size. Since no single model consistently excels across all conditions, it is essential to choose models based on the specific goals of research. By showcasing the variability in model performance across various noise types and sample sizes, this research provides valuable insights for practitioners and researchers in fields where smoothing techniques are applied. The study highlights the importance of considering noise characteristics and sample size when selecting the most suitable smoothing model. However, it is important to acknowledge the study’s limitations. Future research should aim at addressing these limitations by exploring a broader range of models and datasets to enhance the generalizability and applicability of the findings.
Nomenclature
- GLM: generalized linear model
- GAM: generalized additive model
- OLS: ordinary least squares
- GCV: generalized cross validation
- REML: restricted maximum likelihood
- RSS: residual sum of squares
- AIC: Akaike information criterion
- Dev: deviance
- cr: cubic regression spline
- ps: P-spline
- tp: thin plate spline
- gp: Gaussian process
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Daniel Edinam Wormenor was responsible for the conceptualization, methodology, numerical simulations, editing, and analysis of results. Sampson Twumasi-Ankrah was responsible for supervision, review, and editing. Accam Burnett Tetteh was responsible for supervision, review, numerical simulations, editing, and analysis of results.
Funding
The authors did not receive support from any organization for the submitted work.
Appendix
The simulations in this study were performed using R Version 4.3.0. The corresponding codes are provided as an electronic appendix for reproducibility (https://github.com/Daniel-Edinam/Codes.git).
Open Research
Data Availability Statement
This study utilized the Triceps dataset from the MultiKink package in R for the empirical analysis. The dataset is publicly available and can be accessed by loading the MultiKink package. Additionally, the R code used to generate the simulated data is provided in the appendix.