Knowledge Discovery in Databases

Guozhu Dong

Wright State University

First published: 15 January 2002

https://doi.org/10.1002/0471028959.sof076

Abstract

Knowledge discovery in databases (KDD), commonly known as data mining, is concerned with the identification of knowledge from data. (In the KDD community, it is common to consider data mining as a step of KDD.) The discovered knowledge is represented as patterns, defined in a broad sense, and is required to be novel, potentially useful, and ultimately understandable. Many pattern types have been studied, including classifiers, association rules, and clusterings. Many new ones will be introduced in the future to capture the diverse range of human knowledge and to meet the diverse needs of different applications.

Identifying novel knowledge patterns from data is a complex process that requires different techniques and multiple steps, including data cleaning and preparation, efficient search for patterns, and evaluation of the usefulness and novelty of patterns. This process may be iterated many times, with later iterations using the insights gained in earlier ones. Proper use of background knowledge can greatly improve the quality of the resulting knowledge patterns. A good data mining system architecture is needed to support coupling with databases and data warehouses, iterative data selection, and interaction among different data mining algorithms and among patterns of different types.
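
The three steps named above (cleaning, pattern search, and evaluation) can be made concrete with a small sketch. It uses association rules, one of the pattern types mentioned in this abstract, and is only an illustration under stated assumptions: the function names, the thresholds, and the toy transactions are hypothetical, and brute-force enumeration stands in for the efficient search algorithms a real system would use.

```python
from itertools import combinations


def clean(transactions):
    """Data cleaning: normalize item names and drop empty records."""
    cleaned = []
    for t in transactions:
        items = {item.strip().lower() for item in t if item and item.strip()}
        if items:
            cleaned.append(frozenset(items))
    return cleaned


def frequent_itemsets(transactions, min_support, max_size=2):
    """Pattern search: enumerate every itemset up to max_size (brute force)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                result[c] = support
    return result


def rules(itemsets, min_confidence):
    """Evaluation: keep rules {x} -> rest whose confidence meets the threshold."""
    out = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for x in itemset:
            antecedent = frozenset([x])
            consequent = itemset - antecedent
            # Every subset of a frequent itemset is itself frequent,
            # so the antecedent's support is always available here.
            confidence = support / itemsets[antecedent]
            if confidence >= min_confidence:
                out.append((set(antecedent), set(consequent), support, confidence))
    return out


# Hypothetical toy data with the kind of noise that cleaning removes.
raw = [["Bread", "milk "], ["bread", "butter"],
       ["milk", "bread", "butter"], [], ["milk", ""]]
found = rules(frequent_itemsets(clean(raw), min_support=0.5), min_confidence=0.7)
print(found)
```

On this toy input the only rule that survives both thresholds is butter -> bread. Rerunning the pipeline with different thresholds, or on a different selection of the data, mirrors in a very small way the iteration described in the paragraph above.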

The subsequent sections of this article will give an overview of some of the major concepts and techniques of KDD, including a brief introduction and some pointers to representative literature. There are six sections in addition to the introduction, namely, data types and preprocessing, pattern and pattern search space, popular pattern types and search algorithms, understandability and interestingness, concluding remarks, and references.



Encyclopedia of Software Engineering
