Volume 37, Issue 5 e12473

SPECIAL ISSUE PAPER

A method for outlier detection based on cluster analysis and visual expert criteria

Corresponding Author

Juan A. Lara

orcid.org/0000-0001-5131-8447

Department of Computer Science, Madrid Open University, UDIMA, Engineering School, Madrid, Spain

Correspondence

Juan A. Lara, Madrid Open University, UDIMA, Engineering School, Carretera A6 km 38,500 – Vía de Servicio, 15-28400, Collado Villalba, Madrid, Spain.

Email: [email protected]

David Lizcano

Department of Computer Science, Madrid Open University, UDIMA, Engineering School, Madrid, Spain

Víctor Rampérez

ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Campus de Montegancedo, Madrid, Spain

Javier Soriano

ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Campus de Montegancedo, Madrid, Spain

First published: 03 November 2019

https://doi.org/10.1111/exsy.12473

Citations: 7

Abstract

Outlier detection is an important problem that arises in a wide range of areas. Outliers may be the outcome of fraudulent behaviour, mechanical faults, human error, or simply natural deviations. Many data mining applications perform outlier detection, often as a preliminary step that filters out outliers so that more representative models can be built. In this paper, we propose an outlier detection method based on a clustering process. The aim of the proposal is to overcome the specificity of many existing outlier detection techniques, which fail to take into account the inherent dispersion of domain objects. The method is based on four criteria designed to represent how human beings (experts in each domain) visually identify outliers within a set of objects after analysing the clusters. This gives it an advantage over other clustering-based outlier detection techniques, which are founded on a purely numerical analysis of clusters. Our proposal has been evaluated, with satisfactory results, on data (particularly time series) from two different domains: stabilometry, a branch of medicine that studies balance-related functions in human beings, and electroencephalography (EEG), a neurological exploration used to diagnose nervous system disorders. To validate the proposed method, we studied its outlier detection performance and its efficiency in terms of runtime. The results of regression analyses confirm that our proposal is useful for detecting outlier data in different domains, with a false positive rate of less than 2% and a reliability greater than 99%.
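
The general idea of flagging outliers from the result of a clustering step can be illustrated with a minimal sketch. It is not the method proposed in the paper: the four visual expert criteria are not specified in the abstract, so the single rule used here (objects that end up in very small clusters are flagged) is an assumption chosen for illustration only. The function names, the Manhattan distance, the threshold values, and the toy data are likewise hypothetical.

```python
from itertools import combinations


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def threshold_clusters(objects, max_dist):
    """Group objects into connected components: two objects share a cluster
    when a chain of neighbours, each within max_dist of the next, links them."""
    parent = list(range(len(objects)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(objects)), 2):
        if manhattan(objects[i], objects[j]) <= max_dist:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(objects)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def small_cluster_outliers(objects, max_dist, min_size):
    """Illustrative rule (an assumption, not one of the paper's criteria):
    flag every object whose cluster has fewer than min_size members."""
    flagged = []
    for members in threshold_clusters(objects, max_dist):
        if len(members) < min_size:
            flagged.extend(members)
    return sorted(flagged)


# Toy feature vectors standing in for objects such as summarised time series.
series = [
    (1.0, 1.1, 0.9),
    (1.1, 1.0, 1.0),
    (0.9, 1.2, 1.1),
    (5.0, 5.2, 4.9),
    (5.1, 5.0, 5.0),
    (9.5, 0.1, 7.3),  # far from both groups
]
outliers = small_cluster_outliers(series, max_dist=1.0, min_size=2)
print(outliers)
```

A rule of this kind is purely numerical, which is the sort of analysis the abstract contrasts with its expert-inspired criteria; the sketch only shows where such criteria would plug in once the clusters are available.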

Special Issue: Advances in visual analytics and mining visual data

October 2020
