Memory aware load balance strategy on a parallel branch-and-bound application
Juliana M.N. Silva
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Search for more papers by this authorCristina Boeres
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Search for more papers by this authorCorresponding Author
Lúcia M.A. Drummond
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Correspondence to: Lúcia M. A. Drummond, Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil.
E-mail: [email protected]
Search for more papers by this authorArtur A. Pessoa
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Search for more papers by this authorJuliana M.N. Silva
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Search for more papers by this authorCristina Boeres
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Search for more papers by this authorCorresponding Author
Lúcia M.A. Drummond
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Correspondence to: Lúcia M. A. Drummond, Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil.
E-mail: [email protected]
Search for more papers by this authorArtur A. Pessoa
Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil
Search for more papers by this authorAbstract
The latest trends in high performance computing systems show an increasing demand on the use of a large scale multicore system in an efficient way so that high compute-intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by the poor utilization of the memory hierarchy. Actually, the multicore architecture introduces some distinct features that are already observed in shared memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that a careful allocation scheme of an application is carried out on the available cores, based on a scheduling specification that considers not only processors characteristics but also memory contention. This paper proposes a multicore cluster representation that captures relevant performance characteristics in multicores systems such as the influence of memory hierarchy and contention on application performance. Improved performance was achieved by a branch-and-bound application applied to the partitioning sets problem that incorporated a memory aware load balancing strategy based on the proposed multicore cluster representation. An in-depth analysis on this application execution showed its applicability to modern systems. Copyright © 2014 John Wiley & Sons, Ltd.
References
- 1
Savage JE,
Zubair M. A unified model for multicore architectures. In Proceedings of the 1st International Forum on Next-Generation Multicore/Manycore Technologies, IFMT ’08. ACM: New York, NY, USA, 2008; 9:1–9:12.
10.1145/1463768.1463780 Google Scholar
- 2
Tang L,
Mars J,
Soffa ML. Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures. In Proceedings of the 1st International Workshop on Adaptive Self-tuning Computing Systems for the Exaflop Era, EXADAPT ’11. ACM: New York, NY, USA, 2011; 12‒21.
10.1145/2000417.2000419 Google Scholar
- 3 Alam SR, Barrett RF, Kuehn JA, Roth PC, Vetter JS. Characterization of scientific workloads on systems with multi-core processors, IEEE International Symposium on Workload CharacterizationIISWC, IEEE, San Jose, CA, EUA, 2006; 225–236.
- 4 Song F, YarKhan A, Dongarra J. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, International Conference for High Performance Computing, Networking Storage and Analysis, Portland, Oregon, 2009; 19:1–19:11.
- 5
Mars J,
Tang L,
Soffa ML. Directly characterizing cross core interference through contention synthesis. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC ’11. ACM: New York, NY, USA, 2011; 167–176.
10.1145/1944862.1944887 Google Scholar
- 6 Rashid H, Novoa C, Qasem A. An evaluation of parallel knapsack algorithms on multicore architectures, CSC’10, Las Vegas, Nevada, USA, 2010; 230– 235.
- 7 Fortune S, Wyllie J. Parallelism in random access machine, 10th ACM Symposium on Theory of Computation (STOC),New York, USA, 1978; 114–118.
- 8 Jajá J. An Introduction to Parallel Algorithms, Addison Wesley Longman Publishing Co., Inc.: Redwood City, CA, USA, 1992.
- 9
Cole R,
Zajicek O. The APRAM : incorporating asynchrony into the PRAM model. In Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’89. ACM: New York, NY, USA, 1989; 169–178.
10.1145/72935.72954 Google Scholar
- 10 Gibbons PB. A more practical PRAM model. In Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’89. ACM: New York, NY, USA, 1989; 158–168.
- 11 Gibbons PB, Matias Y, Ramachandran V. Can shared-memory model serve as a bridging model for parallel computation? In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’97. ACM: New York, NY, USA, 1997; 72–83.
- 12 Gibbons PB, Matias Y, Ramachandran V. The QRQW PRAM: accounting for contention in parallel algorithms. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’94. Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1994; 638–648.
- 13 Maggs BM, Matheson LR, Tarjan RE. Models of parallel computation: a survey and synthesis, 1995.
- 14 Ramachandran V. QSM: a general purpose shared-memory model for parallel computation, Foundations of Software Technology and Theoretical Computer Science, Kharagpur, India, 1997; 1–5.
- 15
Aggarwal A,
Alpern B,
Chandra A,
Snir M. A model for hierarchical memory. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC ’87. ACM: New York, NY, USA, 1987; 305–314.
10.1145/28395.28428 Google Scholar
- 16
Aggarwal A,
Chandra AK,
Snir M. Hierarchical memory with block transfer. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science, SFCS ’87. IEEE Computer Society: Washington, DC, USA, 1987; 204–216.
10.1109/SFCS.1987.31 Google Scholar
- 17
Juurlink B,
Juurlink BHH,
Wijshoff HAG. The parallel hierarchical memory model. In Proceedings of Scandinavian Workshop on Algorithms Theory, LNCS 824. Springer-Verlag: New York, NY, USA, 1994; 240–251.
10.1007/3-540-58218-5_22 Google Scholar
- 18
Alpern B,
Carter L,
Ferrante J. Modeling parallel computers as memory hierarchies. In Proceedings Programming Models for Massively Parallel Computers, IEEE Computer Society Press: Berlin, Germany, 1993; 116‒123.
10.1109/PMMP.1993.315548 Google Scholar
- 19 Alpern B, Carter L, Feig E, Selker T. The uniform memory hierarchy model of computation. Algorithmica 1994 12: 72– 109.
- 20
Papadimitriou C,
Yannakakis M. Towards an architecture-independent analysis of parallel algorithms. In STOC ’88: Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, ACM: New York, NY, USA, 1988; 510–513.
10.1145/62212.62262 Google Scholar
- 21 Nascimento AP, Sena A, Boeres C, Rebello VEF. On the feasibility of dynamically scheduling dag applications on shared heterogeneous systems. In Euro-Par 2009 Parallel Processing, vol. 5704, H Sips, D Epema, HX Lin (eds)., Lecture Notes in Computer Science. Springer: New York, NY, USA, 2009; 191–202.
- 22 Valiant LG. A bridging model for parallel computation. Communication ACM 1990 33(8): 103–111.
- 23
Williams TL,
Parsons RJ. The heterogeneous bulk synchronous parallel model. In Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing, IPDPS ’00.Springer-Verlag: London, UK, UK, 2000; 102– 108.
10.1007/3-540-45591-4_12 Google Scholar
- 24 Culler DE, Karp RM, Patterson D, Sahay A, Santos EE, Schauser KE, Subramonian R, von Eicken T. LogP: a practical model of parallel computation. Communication ACM 1996 39: 78–85.
- 25 Alexandrov A, Ionescu MF, Schauser KE, Scheiman C. LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation, 1995.
- 26 Ino F, Fujimoto N, Hagihara K. LogGPS: a parallel computational model for synchronization analysis. SIGPLAN Notices 2001 36: 133–142.
- 27 Frank M, Agarwal A, Vernon MK. LoPC: modeling contention in parallel algorithms. PPOPP, Las Vegas, Nevada, USA, 1997; 276–287.
- 28 Cameron KW, Ge R, Sun X-H. log np and log 3p: accurate analytical models of point-to-point communication in distributed systems. IEEE Transactions on Computers 2007 56: 314–327. DOI: 10.1109/TC.2007.38.
- 29 Tam ATC, Wang CL. Realistic communication model for parallel computing on cluster. 1st IEEE Computer Society International Workshop on Cluster Computing, Melbourne, Australia, 1999; 92–101.
- 30 de Amorim Mendes H. HlogP : Um modelo de escalonamento para execução de aplicações MPI em grades computacionais. Master's Thesis, Universidade Federal Fluminense, 2004.
- 31 Badia RM, Perez JM, Ayguade E, Labarta J. Impact of the memory hierarchy on shared memory architectures in multicore programming models. In Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. IEEE Computer Society: Washington, DC, USA, 2009; 437–445.
- 32 Chai L, Gao Q, Panda DK. Understanding the impact of multi-core architecture in cluster computing: A case study with intel dual-core system. In Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, CCGRID ’07.IEEE Computer Society: Washington, DC, USA, 2007; 471–478.
- 33 Savage JE, Zubair M. Evaluating multicore algorithms on the unified memory model. Scientific Programming - Software Development for Multi-core Computing Systems 2009 17: 295–308.
- 34
Tu B,
Fan J,
Zhan J,
Zhao X. Accurate analytical models for message passing on multi-core clusters. In Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing. IEEE Computer Society: Washington, DC, USA, 2009; 133–139.
10.1109/PDP.2009.18 Google Scholar
- 35 Mercier G, Clet-Ortega J. Towards an efficient process placement policy for MPI applications in multicore environments. In EuroPVM/MPI, vol. 5759, Lecture Notes in Computer Science.Springer: Espoo, Finland, September 2009; 104–115.
- 36 Song F, Moore S, Dongarra J. Analytical modeling for affinity-based thread scheduling on multicore platforms. Symposim on Principles and Parctice of Parallel Programming, Heraklion, Crete, Greece, 2009; 1–10.
- 37
Xia Y,
Prasanna VK,
Li J. Hierarchical scheduling of dag structured computations on manycore processors with dynamic thread grouping. In Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing, JSSPP’10.Springer-Verlag: Berlin, Heidelberg, 2010; 154–174.
10.1007/978-3-642-16505-4_9 Google Scholar
- 38 Valiant LG. A bridging model for multi-core computing. Journal of Computer and System Sciences 2011 77(1): 154–166.
- 39 González-Domínguez J, Taboada GL, Fraguela BB, Martín MJ, Touriño J. Servet: a benchmark suite for autotuning on multicore clusters. IPDPS’10, Atlanta, GA, USA, 2010; 1–9.
- 40 Smith AJ, Saavedra RH. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers 1995 44(10): 1223–1235.
- 41 Innovating Computing Laboratory University of Tennessee. Performance application programming interface, 2004. (Available from: http://icl.cs.utk.edu/papi/) [Accessed on 27 March 2014].
- 42 Dongarra J, Moore S, Mucci P, Seymour K, You H. Accurate cache and tlb characterization using hardware counters. International Conference on Computational Science Krakow, Poland, 2004; 432–439.
- 43
Barreto L,
Bauer M. Parallel branch and bound algorithm—a comparison between serial, OpenMP and MPI implementations. Journal of Physics: Conference Series 2010 256(1): 012018.
10.1088/1742-6596/256/1/012018 Google Scholar
- 44 Djerrah A, Cun BL, Cung VD, Roucairol C. Bob++: framework for solving optimization problems with branch-and-bound methods. HPDC, Paris, France, 2006; 369–370.
- 45 Galea F, Cun BL. A parallel exact solver for the three-index quadratic assignment problem. IPDPS Workshops, Anchorage,Alaska, USA, 2011; 1940–1949.
- 46 de A Drummond LM, Uchoa E, Gonçalves AD, Silva JMN, Santos MCP, de Castro MCS. A grid-enabled distributed branch-and-bound algorithm with application on the steiner problem in graphs. Parallel Computing 2006 32(9): 629–642.
- 47 Ralphs TK, Güzelsoy M, Mahajan A. SYMPHONY version 5.3 user's manual. Technical Report, COR@L Laboratory, Lehigh University, 2011. (Available from: http://www.coin-or.org/SYMPHONY/doc/SYMPHONY-5.3.4-Manual.pdf) [Accessed on 27 March 2014].
- 48 Sanjuan-Estrada JF, Casado LG, García I. Adaptive parallel interval branch and bound algorithms based on their performance for multicore architectures. The Journal of Supercomputing 2011 58(3): 376–384.
- 49
Park S,
Kim T,
Park J,
Kim J,
Im H. Parallel skyline computation on multicore architectures. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09.IEEE Computer Society: Washington, DC, USA, 2009; 760–771.
10.1109/ICDE.2009.42 Google Scholar
- 50 Eckstein J, Phillips CA, Hart WE. Pico: an object-oriented framework for parallel branch and bound, 2001.
- 51
Shinano Y,
Higaki M,
Hirabayashi R. A generalized utility for parallel branch and bound algorithms. In Proceedings of the 7th IEEE Symposium on Parallel and Distributeed Processing, SPDP ’95.IEEE Computer Society: Washington, DC, USA, 1995; 392–401.
10.1109/SPDP.1995.530710 Google Scholar
- 52 Budiu M, Delling D, Werneck R. DryadOpt: branch-and-bound on distributed data-parallel execution engines. IEEE International Parallel and Distributed Processing Symposium (IPDPS), Anchorage, Alaska, USA, 2011.
- 53 Vu TT, Derbel B, Asim A, Bendjoudi A, Melab N. Overlay-centric load balancing: applications to UTS and B&B. CLUSTER, Beijing, China, 2012; 382–390.
- 54 Boschetti MA, Mingozzi A, Ricciardelli S. A dual ascent procedure for the set partitioning problem. Discrete Optimization 2008; 5(4): 735–747. DOI: 10.1016/j.disopt.2008.06.001.
- 55 DS Hochbaum (ed.). Approximation Algorithms for NP-Hard Problems. PWS Publishing Co.: Boston, MA, USA, 1997.
- 56 Bendjoudi A, Melab N, Talbi EG. An adaptive hierarchical master-worker (ahmw) framework for grids—application to b&b algorithms. Journal of Parallel and Distributed Computing 2012 72(2): 120–131.
- 57 Bendjoudi A, Melab N, Talbi EG. Hierarchical branch and bound algorithm for computational grids. Future Generation Computer Systems 2012 28(8): 1168–1176.
- 58 Group TO. The open group base specifications issue 6 @ONLINE, 2004. (Available from: http://pubs.opengroup.org/onlinepubs/009696699) [Accessed on 27 March 2014].
- 59 Stevens RW, Rago SA. Advanced programming in the UNIX(R) environment ( 2nd edition). Addison-Wesley Professional, 2005.
- 60 Ma KL. Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures. In Proceedings of the IEEE Symposium on Parallel Rendering, PRS ’95.ACM: New York, NY, USA, 1995; 23–30.