Abstract
Knowledge discovery in databases (KDD), commonly known as data mining, is concerned with the identification of knowledge from data. (In the KDD community, it is common to consider data mining as a step of KDD.) The discovered knowledge is represented as patterns, defined in a broad sense, and is required to be novel, potentially useful, and ultimately understandable. Many pattern types have been studied, including classifiers, association rules, and clustering. Many new ones will be introduced in the future to capture the diverse range of human knowledge and to meet the diverse needs of different applications.
Identifying novel knowledge patterns from data is a complex process that requires different techniques and multiple steps, including data cleaning and preparation, efficient search for patterns, and evaluation of the usefulness and novelty of patterns. This process may be iterated many times, with later iterations using insights from earlier ones. Proper use of background knowledge can greatly improve the quality of the resulting knowledge patterns. A good data mining system architecture is needed to allow coupling with databases and data warehouses, iterative data selection, and interaction between different data mining algorithms and between patterns of different types.
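As a rough illustration of the multi-step, iterative process described above, the following Python sketch chains data cleaning, pattern search, and pattern evaluation for one pattern type (association rules over item pairs). It is a minimal sketch under stated assumptions, not the method of any particular system: the function names (`clean`, `frequent_itemsets`, `rules`, `discover`), the brute-force search, the thresholds, and the rule of halving the support threshold between iterations are all hypothetical choices made for illustration.

```python
from itertools import combinations


def clean(transactions):
    """Data cleaning: normalise item names and drop blank items and empty records."""
    cleaned = []
    for record in transactions:
        items = frozenset(i.strip().lower() for i in record if i and i.strip())
        if items:
            cleaned.append(items)
    return cleaned


def frequent_itemsets(transactions, min_support, max_size=2):
    """Pattern search: brute-force support counting for itemsets up to max_size."""
    n = len(transactions)
    counts = {}
    for record in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(record), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def rules(freq, min_confidence):
    """Evaluation: keep rules x -> y whose confidence reaches the threshold."""
    found = []
    for itemset, support in freq.items():
        if len(itemset) != 2:
            continue
        for x in itemset:
            (y,) = itemset - {x}
            # A frequent pair implies that each of its items is frequent too,
            # so the single-item support is always present in freq.
            confidence = support / freq[frozenset([x])]
            if confidence >= min_confidence:
                found.append((x, y, support, confidence))
    return sorted(found)


def discover(raw, min_support=0.6, min_confidence=0.8, max_rounds=3):
    """Iterate the process, relaxing the support threshold (an assumed
    heuristic) whenever the previous round produced no rules."""
    data = clean(raw)
    found = []
    for _ in range(max_rounds):
        found = rules(frequent_itemsets(data, min_support), min_confidence)
        if found:
            break
        min_support /= 2
    return found
```

Called on a small list of raw transactions, `discover` returns tuples of the form `(antecedent, consequent, support, confidence)`. Practical systems replace the brute-force search with far more efficient algorithms and use richer interestingness measures than confidence alone.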
The subsequent sections of this article will give an overview of some of the major concepts and techniques of KDD, including a brief introduction and some pointers to representative literature. There are six sections in addition to the introduction, namely, data types and preprocessing, pattern and pattern search space, popular pattern types and search algorithms, understandability and interestingness, concluding remarks, and references.
Bibliography
- R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proceedings of the International Conference on Very Large Data Bases, 1994.
- R. Agrawal and R. Srikant, Mining Sequential Patterns, Proceedings of the International Conference on Data Engineering, 1995.
- R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Databases, Proceedings of ACM SIGMOD International Conference on Management of Data, 1993.
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proceedings of ACM SIGMOD International Conference on Management of Data, 1998.
- Y. Aumann and Y. Lindell, A Statistical Theory for Quantitative Association Rules, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
- R. J. Bayardo, Jr., Efficiently Mining Long Patterns from Databases, Proceedings of ACM SIGMOD International Conference on Management of Data, 1998.
- R. Bayardo, Jr., R. Agrawal, and D. Gunopulos, Constraint-based Rule Mining in Large, Dense Databases, Proceedings of the International Conference on Data Engineering, 1999.
- K. Beyer and R. Ramakrishnan, Bottom-up Computation of Sparse Iceberg Cubes, Proceedings of ACM SIGMOD International Conference on Management of Data, 1999.
- I. Bratko, S. Muggleton, and A. Karalic, Applications of Inductive Logic Programming, in R. S. Michalski, I. Bratko, and M. Kubat, eds., Machine Learning and Data Mining Methods and Applications, Wiley, New York, 1997.
- L. Breiman, Bagging Predictors, Machine Learning 24, 123–140 (1996).
- L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
- S. Brin, R. Motwani, J. Ullman, and S. Tsur, Dynamic Itemset Counting and Implication Rules for Market Basket Data, Proceedings of ACM SIGMOD International Conference on Management of Data, 1997.
- Y. Cai, N. Cercone, and J. Han, Attribute-oriented Induction in Relational Databases, in G. Piatetsky-Shapiro and W. J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
- P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and Results, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA, 1996.
- P. Domingos, The Role of Occam's Razor in Knowledge Discovery, Data Mining and Knowledge Discovery 3, 409–425 (1999).
- G. Dong and J. Li, Interestingness of Discovered Association Rules in Terms of Neighborhood-based Unexpectedness, Proceedings of the Pacific Asia Conference on Knowledge Discovery and Data Mining, 1998.
- G. Dong and J. Li, Efficient Mining of Emerging Patterns: Discovering Trends and Differences, Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
- G. Dong, X. Zhang, L. Wong, and J. Li, CAEP: Classification by Aggregating Emerging Patterns, Proceedings of the International Conference on Discovery Science, Tokyo, 1999.
- R. Duda and P. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1996.
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, Computing Iceberg Queries Efficiently, Proceedings of the International Conference on Very Large Data Bases, 1998.
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA, 1996.
- D. Fisher, Improving Inference through Conceptual Clustering, Proceedings of the AAAI (American Association for Artificial Intelligence) Conference, 1987.
- Y. Freund and R. E. Schapire, A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting, Journal of Computer and System Sciences 55, 119–139 (1997).
- Y. Fu and J. Han, Meta-rule-guided Mining of Association Rules in Relational Databases, Proceedings of the International Workshop on Integration of Knowledge Discovery with Deductive and Object-Oriented Databases, Singapore, 1995.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, Data Mining Using Two-dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization, Proceedings of ACM SIGMOD International Conference on Management of Data, 1996.
- V. Ganti, J. Gehrke, R. Ramakrishnan, and W.-Y. Loh, A Framework for Measuring Changes in Data Characteristics, Proceedings of ACM Symposium on Principles of Database Systems, 1999.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT-Optimistic Decision Tree Construction, Proceedings of ACM SIGMOD International Conference on Management of Data, 1999.
- D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
- J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, Data Cube: A Relational Operator Generalizing Group-by, Cross-tab and Sub-totals, Proceedings of the International Conference on Data Engineering, 1996.
- S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proceedings of ACM SIGMOD International Conference on Management of Data, 1998.
- J. Han and Y. Fu, Discovery of Multiple-level Association Rules from Large Databases, Proceedings of the International Conference on Very Large Data Bases, 1995.
- J. Han and Y. Fu, Exploration of the Power of Attribute-oriented Induction in Data Mining, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA, 1996.
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Mateo, CA, 2000.
- J. Han, G. Dong, and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, Proceedings of the International Conference on Data Engineering, 1999.
- J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation, Proceedings of ACM SIGMOD International Conference on Management of Data, 2000.
- S. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River, NJ, 1999.
- J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, 1991.
- A. Hinneburg and D. A. Keim, An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.
- W. H. Inmon, Building the Data Warehouse, Wiley, New York, 1996.
- M. James, Classification Algorithms, Wiley, New York, 1985.
- L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990. doi:10.1002/9780470316801
- T. Kohonen, Self-organized Formation of Topologically Correct Feature Maps, Biological Cybernetics 43, 59–69 (1982).
- B. Lent, A. Swami, and J. Widom, Clustering Association Rules, Proceedings of the International Conference on Data Engineering, 1997.
- J. Li, G. Dong, and K. Ramamohanarao, Instance-based Classification by Emerging Patterns, Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, Lyon, France, 2000a.
- J. Li, G. Dong, and K. Ramamohanarao, Making Use of the Most Expressive Jumping Emerging Patterns for Classification, Proceedings of the Pacific Asia Conference on Knowledge Discovery and Data Mining, Kyoto, Japan, 2000b.
- H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, Boston, MA, 1998.
- B. Liu, W. Hsu, and Y. Ma, Integrating Classification and Association Rule Mining, Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.
- B. Liu, W. Hsu, and Y. Ma, Pruning and Summarizing the Discovered Associations, Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
- J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the Berkeley Symposium on Mathematical Statistics, 1967.
- H. Mannila, H. Toivonen, and A. I. Verkamo, Efficient Algorithms for Discovering Association Rules, Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, 1994.
- H. Mannila, H. Toivonen, and A. I. Verkamo, Discovering Frequent Episodes in Sequences, Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1995.
- M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proceedings of the International Conference on Extending Database Technology, 1996.
- D. Meretakis and B. Wuthrich, Extending Naive Bayes Classifiers using Long Itemsets, Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
- R. S. Michalski and R. Stepp, Automated Construction of Classifications: Conceptual Clustering versus Numerical Taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 396–410 (1983).
- T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
- R. Ng and J. Han, Efficient and Effective Clustering Method for Spatial Data Mining, Proceedings of the International Conference on Very Large Data Bases, 1994.
- R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang, Exploratory Mining and Pruning Optimizations of Constrained Associations Rules, Proceedings of ACM SIGMOD International Conference on Management of Data, 1998.
- J. S. Park, M. S. Chen, and P. S. Yu, An Effective Hash-based Algorithm for Mining Association Rules, Proceedings of ACM SIGMOD International Conference on Management of Data, 1995.
- Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Boston, MA, 1991. doi:10.1007/978-94-011-3534-4
- G. Piatetsky-Shapiro, Discovery, Analysis, and Presentation of Strong Rules, in G. Piatetsky-Shapiro and W. J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
- G. Piatetsky-Shapiro and C. J. Matheus, The Interestingness of Deviations, Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, 1994.
- J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation, in D. E. Rumelhart and J. L. McClelland, eds., Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986. doi:10.7551/mitpress/5236.001.0001
- A. Savasere, E. Omiecinski, and S. Navathe, An Efficient Algorithm for Mining Association Rules in Large Databases, Proceedings of the International Conference on Very Large Data Bases, 1995.
- A. Silberschatz and A. Tuzhilin, What Makes Patterns Interesting in Knowledge Discovery Systems, IEEE Transactions on Knowledge and Data Engineering 8, 970–974 (1996).
- R. Srikant and R. Agrawal, Mining Generalized Association Rules, Proceedings of the International Conference on Very Large Data Bases, 1995.
- R. Srikant and R. Agrawal, Mining Quantitative Association Rules in Large Relational Tables, Proceedings of ACM SIGMOD International Conference on Management of Data, 1996.
- R. Srikant, Q. Vu, and R. Agrawal, Mining Association Rules with Item Constraints, Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1997.
- H. Toivonen, Sampling Large Databases for Association Rules, Proceedings of the International Conference on Very Large Data Bases, 1996.
- W. Wang, J. Yang, and R. Muntz, STING: A Statistical Information Grid Approach to Spatial Data Mining, Proceedings of the International Conference on Very Large Data Bases, 1997.
- K. Wang, Y. He, and J. Han, Mining Frequent Itemsets using Support Constraints, Proceedings of the International Conference on Very Large Data Bases, 2000.
- L. A. Zadeh, Fuzzy Sets, Information and Control 8, 338–353 (1965).
- T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996.
- X. Zhang, G. Dong, and K. Ramamohanarao, Exploring Constraints to Efficiently Mine Emerging Patterns from Large High-dimensional Datasets, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.