Simple outlier labeling based on quantile regression, with application to the steelmaking process
Corresponding Author
Ruggero Bellio
Department of Economics and Statistics, University of Udine, Italy
Correspondence to: Ruggero Bellio, Department of Economics and Statistics, University of Udine, Via Tomadini 30/A, I-33100 Udine, Italy.
E-mail: [email protected]
Mauro Coletto
IMT Institute for Advanced Studies, Lucca, Italy
CNR - ISTI, Pisa, Italy
Abstract
This paper introduces methods for outlier identification in the regression setting, motivated by the analysis of steelmaking process data. The proposed methodology extends the boxplot rule, commonly used for outlier screening with univariate data, to the regression setting. The focus is on bivariate settings with a single covariate, but extensions are possible. The proposal is based on quantile regression and includes an additional transformation parameter for selecting the best scale for linearity of the conditional quantiles. The resulting method labels potential outliers effectively at low computational cost, allowing for simple implementation in statistical software as well as in commonly used spreadsheets. Simulation experiments were carried out to study the swamping and masking properties of the proposal. The methodology is also illustrated by real-life examples in which the response variable is the energy consumed in the melting process. Copyright © 2015 John Wiley & Sons, Ltd.
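To make the idea of a regression boxplot rule concrete, the following Python sketch fits the conditional first and third quartiles as straight lines in the covariate and flags points that fall outside fences built from the conditional interquartile range. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fence multiplier `k = 1.5` (borrowed from the classical univariate boxplot rule), the brute-force fitting routine, and the toy data are all hypothetical, and the transformation parameter mentioned in the abstract is omitted.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Koenker-Bassett check loss for a single residual at level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(x, y, tau):
    """Fit y = a + b * x at quantile level tau by exhaustive search.

    With one covariate, the check loss is minimised by some line through
    two data points with distinct x, so scanning all pairs is exact.
    The cost is O(n^3), acceptable only for small samples.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical pair does not define a line in x
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = sum(check_loss(yk - (a + b * xk), tau) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct covariate values")
    return best[1], best[2]


def label_outliers(x, y, k=1.5):
    """Return indices of points outside the conditional boxplot fences.

    Assumption: fences are Q1(x) - k * IQR(x) and Q3(x) + k * IQR(x),
    mirroring the univariate boxplot rule; the paper's rule may differ.
    """
    a1, b1 = fit_quantile_line(x, y, 0.25)
    a3, b3 = fit_quantile_line(x, y, 0.75)
    flagged = []
    for idx, (xk, yk) in enumerate(zip(x, y)):
        q1 = a1 + b1 * xk
        q3 = a3 + b3 * xk
        # Crude guard in case the two fitted lines cross at this x.
        lo, hi = min(q1, q3), max(q1, q3)
        iqr = hi - lo
        if yk < lo - k * iqr or yk > hi + k * iqr:
            flagged.append(idx)
    return flagged


# Toy data (hypothetical): a linear trend with bounded deterministic noise
# and one gross outlier planted at index 9.
x = [float(i) for i in range(1, 21)]
noise = [((7 * i) % 11 - 5) / 5 for i in range(20)]
y = [2.0 * xi + ei for xi, ei in zip(x, noise)]
y[9] += 40.0

flagged = label_outliers(x, y)
print(flagged)
```

The exhaustive search keeps the sketch free of external dependencies; for realistic sample sizes a linear-programming quantile regression routine from a statistics library would be the natural replacement. Because the two quartile lines are fitted separately they can cross, which is why the fences are computed from the ordered pair of fitted values.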