A biobjective feature selection algorithm for large omics datasets
Corresponding Author
Luís Cavique
MAS-BioISI, FCUL, Lisboa, Portugal
Universidade Aberta, Lisboa, Portugal
Correspondence
Luís Cavique, DCeT, Universidade Aberta, Lisbon, Portugal.
Email: [email protected]
Armando B. Mendes
Universidade Açores, Ponta Delgada, Portugal
Algoritmi, Universidade do Minho, Portugal
Hugo F.M.C. Martiniano
MAS-BioISI, FCUL, Lisboa, Portugal
Instituto Dr. Ricardo Jorge, Lisboa, Portugal
Abstract
Feature selection is a central task in data mining whenever dimensionality reduction is needed. Feature selection methods are evaluated on both predictive accuracy and the comprehensibility of the result. Consistency-based methods form a significant category of feature selection research; by applying the parsimony principle, they substantially improve the comprehensibility of the result. In this work, the biobjective version of the logical analysis of inconsistent data (LAID) algorithm is applied to large volumes of data. To deal with hundreds of thousands of attributes, a heuristic decomposition uses parallel processing to solve a set covering problem, together with a cross-validation technique. Each biobjective solution comprises the number of features in the reduced set and the accuracy. The algorithm is applied to omics datasets with genome-like characteristics from patients with rare diseases.
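The link between consistency-based feature selection and set covering can be illustrated with a small sketch. It is not the authors' implementation: it assumes discrete features and consistent data (no two identical observations with different classes), it uses the classic greedy set-covering heuristic, and it leaves out the heuristic decomposition, parallel processing, and cross-validation mentioned in the abstract. The function name `greedy_feature_cover` and the toy data are hypothetical.

```python
from itertools import combinations


def greedy_feature_cover(rows, labels):
    """Select features so that every pair of rows with different labels
    differs on at least one selected feature (greedy set covering)."""
    n_features = len(rows[0])

    # One covering constraint per pair of rows with different labels:
    # the set of features on which the two rows disagree.
    constraints = []
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] != labels[j]:
            diff = {f for f in range(n_features) if rows[i][f] != rows[j][f]}
            if not diff:
                # Assumption of this sketch: inconsistent data is rejected
                # rather than repaired.
                raise ValueError("identical rows with different labels")
            constraints.append(diff)

    selected = []
    uncovered = constraints
    while uncovered:
        # Greedy step: take the feature that covers the most uncovered pairs.
        best = max(range(n_features),
                   key=lambda f: sum(f in c for c in uncovered))
        selected.append(best)
        uncovered = [c for c in uncovered if best not in c]
    return sorted(selected)


# Toy data: feature 0 alone separates the two classes.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = [0, 0, 1, 1]
print(greedy_feature_cover(rows, labels))
```

In a biobjective setting, the size of the returned feature set would be the first objective, and the accuracy of a classifier restricted to those features, estimated for example by cross-validation, would be the second.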