A method for outlier detection based on cluster analysis and visual expert criteria
Corresponding Author
Juan A. Lara
Department of Computer Science, Madrid Open University, UDIMA, Engineering School, Madrid, Spain
Correspondence
Juan A. Lara, Madrid Open University, UDIMA, Engineering School, Carretera A6 km 38,500 – Vía de Servicio, 15-28400, Collado Villalba, Madrid, Spain.
Email: [email protected]
David Lizcano
Department of Computer Science, Madrid Open University, UDIMA, Engineering School, Madrid, Spain
Víctor Rampérez
ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Campus de Montegancedo, Madrid, Spain
Javier Soriano
ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Campus de Montegancedo, Madrid, Spain
Abstract
Outlier detection is an important problem that arises in a wide range of areas. Outliers may be the outcome of fraudulent behaviour, mechanical faults, human error, or simply natural deviations. Many data mining applications perform outlier detection, often as a preliminary step to filter out outliers and build more representative models. In this paper, we propose an outlier detection method based on a clustering process. The aim of the proposal is to overcome the specificity of many existing outlier detection techniques, which fail to take into account the inherent dispersion of domain objects. The method is based on four criteria designed to represent how human beings (experts in each domain) visually identify outliers within a set of objects after analysing the clusters. This is an advantage over other clustering-based outlier detection techniques, which are founded on a purely numerical analysis of clusters. Our proposal has been evaluated, with satisfactory results, on data (particularly time series) from two different domains: stabilometry, a branch of medicine that studies balance-related functions in human beings, and electroencephalography (EEG), a neurological exploration used to diagnose nervous system disorders. To validate the proposed method, we studied its outlier detection performance and its efficiency in terms of runtime. The results of regression analyses confirm that our proposal is useful for detecting outlier data in different domains, with a false positive rate of less than 2% and a reliability greater than 99%.
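As a rough illustration of the general idea of clustering-based outlier detection, the sketch below groups objects into clusters and flags those that end up in very small clusters. It is not the method proposed in the paper: the four visual expert criteria are not specified in the abstract, so this uses a generic stand-in rule. The function names `threshold_clusters` and `small_cluster_outliers`, the parameters `max_gap` and `min_size`, and the sample points are all assumptions introduced for the example.

```python
from math import dist


def threshold_clusters(points, max_gap):
    """Group point indices into clusters (hypothetical helper).

    Two points share a cluster when a chain of neighbours, each at most
    max_gap apart, connects them (single-linkage with a distance cut-off).
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect every unvisited point close enough to the current one.
            near = [j for j in unvisited
                    if dist(points[current], points[j]) <= max_gap]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters


def small_cluster_outliers(points, max_gap, min_size):
    """Return sorted indices of points in clusters smaller than min_size.

    Assumption: 'belongs to a tiny cluster' is used here as a simple,
    purely numerical outlier rule; the paper's criteria are different.
    """
    outliers = []
    for cluster in threshold_clusters(points, max_gap):
        if len(cluster) < min_size:
            outliers.extend(cluster)
    return sorted(outliers)


# Made-up 2D data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (5.2, 5.1),
          (20.0, 20.0)]
flagged = small_cluster_outliers(points, max_gap=1.0, min_size=2)
print(flagged)
```

With these made-up points, only the isolated object at index 6 falls in a cluster of size one, so it is the only one flagged. A rule of this kind is exactly the sort of purely numerical cluster analysis that the abstract contrasts with its expert-inspired criteria.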