Analysis of DNA sequences is a data and computational intensive problem, and therefore, it requires suitable parallel computing resources and algorithms. In this paper, we describe our parallel algorithm for DNA sequence analysis that determines how many times a pattern appears in the DNA sequence. The algorithm is engineered for heterogeneous platforms that comprise a host with multi-core processors and one or more many-core devices. For combinatorial optimization, we use the simulated annealing algorithm. The optimization goal is to determine the number of threads, thread affinities, and DNA sequence fractions for host and device, such that the overall execution time of DNA sequence analysis is minimized. We evaluate our approach experimentally using real-world DNA sequences of various organisms on a heterogeneous platform that comprises two Intel Xeon E5 processors and an Intel Xeon Phi 7120P co-processing device. By running only about 5% of possible experiments, our optimization method finds a near-optimal system configuration for DNA sequence analysis that yields with average speedup of 1.6 × and 2 × compared with the host-only and device-only execution. Copyright © 2016 John Wiley & Sons, Ltd.

References

1 NCBI. National Center for Biotechnology Information U.S. National Library of Medicine. Available from: http://www.ncbi.nlm.nih.gov/genbank Accessed on Dec. 2015; 2015.
Google Scholar
2Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big data: astronomical or genomical?PLoS Biology 2015; 13(7): 1–11,e1002195.
10.1371/journal.pbio.1002195
PubMed Web of Science® Google Scholar
3Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature 2003; 422(6934): 835–847.
10.1038/nature01626
CAS PubMed Web of Science® Google Scholar
4Pizzolante R, Castiglione A, Carpentieri B, Santis AD, Palmieri F, Castiglione A. Format-independent protection of dna microarray images. 2015 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Krakow, Poland; November 2015: 351–357.
Google Scholar
5Mellmann A, Harmsen D, Cummings CA, Zentz EB, Leopold SR, Rico A, Prior K, Szczepanowski R, Ji Y, Zhang W, McLaughlin SF, Henkhaus JK, Leopold B, Bielaszewska M, Prager R, Brzoska PM, Moore RL, Guenther S, Rothberg JM, Karch H. Prospective genomic characterization of the german enterohemorrhagic escherichia coli O104:H4 outbreak by rapid next generation sequencing technology. PLoS ONE 2011; 6(7): 1–9,e22751.
10.1371/journal.pone.0022751
Web of Science® Google Scholar
6Luftig MA, Richey S. DNA and forensic science. New Engineering L Reviews 2001; 35: 609–613.
Google Scholar
7Herath D, Lakmali C, Ragel R. Accelerating string matching for bio-computing applications on multi-core CPUs. 2012 7th IEEE International Conference on Industrial and Information Systems (ICIIS), Chennai, India; August 2012: 1–6.
Google Scholar
8xDrews F, Lichtenberg J, Welch LR. Scalable parallel word search in multicore/multiprocessor systems. The Journal of Supercomputing 2010; 51(1): 58–75.
10.1007/s11227-009-0308-3
Web of Science® Google Scholar
9Luchaup D, Smith R, Estan C, Jha S. Speculative parallel pattern matching. IEEE Transactions on Information Forensics and Security 2011; 6(2): 438–451.
10.1109/TIFS.2011.2112647
Web of Science® Google Scholar
10Mushtaq H, Al-Ars Z. Cluster-based apache spark implementation of the gatk DNA analysis pipeline. 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington D.C, USA; November 2015: 1471–1477.
Google Scholar
11Delmerico JA, Byrnes NA, Bruno AE, Jones MD, Gallo SM, Chaudhary V. Comparing the performance of clusters, hadoop, and active disks on microarray correlation computations. 2009 International Conference on High Performance Computing (HiPC), Kochi (Cochin), India; December 2009: 378–387.
Google Scholar
12Lin CH, Liu CH, Chien LS, Chang SC. Accelerating pattern matching using a novel parallel algorithm on GPUs. IEEE Transactions on Computers 2013; 62(10): 1906–1916.
10.1109/TC.2012.254
Web of Science® Google Scholar
13Bellekens X, Andonovic I, Atkinson RC, Renfrew C, Kirkham T. Investigation of GPU-based pattern matching. The 14th Annual Post Graduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting (PGNet2013) (PGNet2013), Liverpool, UK; 2013: 1–5.
Google Scholar
14Tumeo A, Villa O. Accelerating DNA analysis applications on GPU clusters. 2010 IEEE 8th Symposium on Application Specific Processors (SASP), Anaheim, CA, USA; June 2010: 71–76.
Google Scholar
15Villa O, Chavarra-Miranda DG, Maschhoff KJ. Input-independent, scalable and fast string matching on the Cray XMT. IEEE International Symposium on Parallel Distributed Processing, 2009. IPDPS 2009. IEEE, Rome, Italy; 2009: 1–12.
Google Scholar
16Viebke A, Sabri P. The potential of the intel (R) Xeon phi for supervised deep learning. 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC). IEEE, New York, USA; 2015: 758–765.
Google Scholar
17Kessler CW, Dastgeer U, Thibault S, Namyst R, Richards A, Dolinsky U, Benkner S, Trff JL, Pllana S. Programmability and performance portability aspects of heterogeneous multi-/manycore systems. 2012 Design, Automation Test in Europe Conference Exhibition (DATE). IEEE, Dresden, Germany; March 2012: 1403–1408.
Google Scholar
18Sandrieser M, Benkner S, Pllana S. Using explicit platform descriptions to support programming of heterogeneous many-core systems. Parallel Computing 2012; 38(1-2): 52–56.
10.1016/j.parco.2011.10.008
Web of Science® Google Scholar
19Chrysos G. Intel® Xeon Phi coprocessor-the architecture. Intel Whitepaper 2012. (Available from: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner) .
Google Scholar
20Benkner S, Pllana S, Traff JL, Tsigas P, Dolinsky U, Augonnet C, Bachmayer B, Kessler C, Moloney D, Osipov V. PEPPHEr: efficient and productive usage of hybrid computing systems. Micro, IEEE 2011; 31(5): 28–41.
10.1109/MM.2011.67
Web of Science® Google Scholar
21Memeti S, Pllana S. Analyzing large-scale DNA sequences on multi-core architectures. 18th IEEE International Conference on Computational Science and Engineering (CSE-2015). IEEE, Porto, Portugal; 2015: 208–215.
Google Scholar
22Press WH, Teukolsky SA, Vetterling WT, Brian PF. Numerical Recipes 3rd Edition: The Art of Scientific Computing. 3rd ed. New York: Cambridge University Press; 2007.
Google Scholar
23Roy I, Aluru S. Finding Motifs in biological sequences using the micron automata processor. Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, Phoenix, Arizona USA; May 2014: 415–424.
Google Scholar
24Nordström KJV, et al. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nature Biotechnology 2013; 31(4): 325–330.
10.1038/nbt.2515
CAS PubMed Web of Science® Google Scholar
25Tian X, Saito H, Preis S, Garcia EN, Kozhukhov S, Masten M, Cherkasov AG, Panchenko N. Practical SIMD vectorization techniques for intel Xeon Phi coprocessors. IPDPS Workshops. IEEE, Boston, Massachusetts USA; 2013: 1149–1158.
Google Scholar
26Chacón A, Moure JC, Espinosa A, Hernndez P. n-step FM-index for faster pattern matching.. In: VN Alexandrov, M Lees, VV Krzhizhanovskaya, J Dongarra, PMA Sloot, eds. ICCS, Procedia Computer Science, vol. 18. Amsterdam, Netherlands: Elsevier; 2013: 70–79.
10.1016/j.procs.2013.05.170
Google Scholar
27Albayrak OE, Akturk I, Ozturk O. Improving application behavior on heterogeneous manycore systems through Kernel mapping. Parallel Computing 2013; 39(12): 867–878.
10.1016/j.parco.2013.08.011
Web of Science® Google Scholar
28Memeti S, Sabri P. PaREM: A novel approach for parallel regular expression matching. 17th International Conference on Computational Science and Engineering (CSE-2014), Chengdu, China; December 2014: 690–697.
Google Scholar
29Arudchutha S, Nishanthy T, Ragel RG. String matching with multicore CPUs: Performing better with the Aho-Corasick algorithm. 2013 IEEE 8th International Conference on Industrial and Information Systems; 2013: 231–236.
Google Scholar
30Kouzinopoulos CS, Margaritis KG. String Matching on a Multicore GPU Using CUDA. 13th Panhellenic Conference on Informatics, 2009. PCI Š09, Corfu, Greece; September 2009: 14–18.
Google Scholar
31Li H, Ni B, Wong MH, Leung KS. A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching. 2011 IEEE 9th Symposium on Application Specific Processors (SASP). IEEE Computer Society, San Diego, CA, USA; 2011: 74–77.
Google Scholar
32Li M, Zeng L, Meng S, Tan J, Li Z, Butt AR, Nicholas F. Mronline: Mapreduce online performance tuning. Proceedings of the 23rd international symposium on High-performance parallel and distributed computing. ACM, Vancouver, Canada; 2014: 165–176.
Google Scholar
33Cooper KD, Grosul A, Harvey TJ, Reeves S, Subramanian D, Torczon L, xWaterman T. ACME: adaptive compilation made efficient. ACM SIGPLAN Notices, vol. 40. Chicago, Illinois, USA: ACM; 2005: 69–77.
10.1145/1065910.1065921
Google Scholar
34Kołodziej J, Khan SU. Data scheduling in data grids and data centers: A short taxonomy of problems and intelligent resolution techniques.. In: NT Nguyen, J Kolodziej, T Burczyski, M Biba, eds. Transactions on Computational Collective Intelligence X, Lecture Notes in Computer Science, vol. 7776. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013: 103–119.
10.1007/978-3-642-38496-7_7
Google Scholar
35Khan FA, Han Y, Pllana S, Brezany P. An ant-colony-optimization based approach for determination of parameter significance of scientific workflows. 2010 24th IEEE International Conference on Advanced Information Networking and Applications (AINA), Perth, Australia; April 2010: 1241–1248.
Google Scholar
36Brandic I, Pllana S, Benkner S. An approach for the high-level specification of QoS-aware grid workflows considering location affinity. Scientific Programming 2006; 14(3–4): 231–250.
10.1155/2006/670375
Google Scholar

Citing Literature

Volume29, Issue7

Combined Special Issues on Security and privacy in social networks (NSS2015) and 18th IEEE International Conference on Computational Science and Engineering (CSE2015)

10 April 2017

e4037

Combinatorial optimization of DNA sequence analysis on heterogeneous systems

Summary

References

Citing Literature

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

Combinatorial optimization of DNA sequence analysis on heterogeneous systems

Summary

References

Citing Literature

References

Related

Information