Concurrency and Computation: Practice and Experience

Volume 27, Issue 5 pp. 1122-1144

Research Article

Memory aware load balance strategy on a parallel branch-and-bound application

Juliana M.N. Silva,

Juliana M.N. Silva

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Search for more papers by this author

Cristina Boeres,

Cristina Boeres

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Search for more papers by this author

Lúcia M.A. Drummond,

Corresponding Author

Lúcia M.A. Drummond

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Correspondence to: Lúcia M. A. Drummond, Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil.

E-mail: [email protected]

Search for more papers by this author

Artur A. Pessoa,

Artur A. Pessoa

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Search for more papers by this author

Juliana M.N. Silva,

Juliana M.N. Silva

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Search for more papers by this author

Cristina Boeres,

Cristina Boeres

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Search for more papers by this author

Lúcia M.A. Drummond,

Corresponding Author

Lúcia M.A. Drummond

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Correspondence to: Lúcia M. A. Drummond, Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil.

E-mail: [email protected]

Search for more papers by this author

Artur A. Pessoa,

Artur A. Pessoa

Universidade Federal Fluminense, Rua Passo da Pátria 156, Bloco E. São Domingos, Niterói, Rio de Janeiro, Brazil

Search for more papers by this author

First published: 16 April 2014

https://doi.org/10.1002/cpe.3276

Citations: 8

Share a link

Email
Wechat
Bluesky

Abstract

The latest trends in high performance computing systems show an increasing demand on the use of a large scale multicore system in an efficient way so that high compute-intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by the poor utilization of the memory hierarchy. Actually, the multicore architecture introduces some distinct features that are already observed in shared memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that a careful allocation scheme of an application is carried out on the available cores, based on a scheduling specification that considers not only processors characteristics but also memory contention. This paper proposes a multicore cluster representation that captures relevant performance characteristics in multicores systems such as the influence of memory hierarchy and contention on application performance. Improved performance was achieved by a branch-and-bound application applied to the partitioning sets problem that incorporated a memory aware load balancing strategy based on the proposed multicore cluster representation. An in-depth analysis on this application execution showed its applicability to modern systems. Copyright © 2014 John Wiley & Sons, Ltd.

References

1 Savage JE, Zubair M. A unified model for multicore architectures. In Proceedings of the 1st International Forum on Next-Generation Multicore/Manycore Technologies, IFMT ’08. ACM: New York, NY, USA, 2008; 9:1–9:12.
10.1145/1463768.1463780
Google Scholar
2 Tang L, Mars J, Soffa ML. Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures. In Proceedings of the 1st International Workshop on Adaptive Self-tuning Computing Systems for the Exaflop Era, EXADAPT ’11. ACM: New York, NY, USA, 2011; 12‒21.
10.1145/2000417.2000419
Google Scholar
3 Alam SR, Barrett RF, Kuehn JA, Roth PC, Vetter JS. Characterization of scientific workloads on systems with multi-core processors, IEEE International Symposium on Workload CharacterizationIISWC, IEEE, San Jose, CA, EUA, 2006; 225–236.
Google Scholar
4 Song F, YarKhan A, Dongarra J. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, International Conference for High Performance Computing, Networking Storage and Analysis, Portland, Oregon, 2009; 19:1–19:11.
Google Scholar
5 Mars J, Tang L, Soffa ML. Directly characterizing cross core interference through contention synthesis. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC ’11. ACM: New York, NY, USA, 2011; 167–176.
10.1145/1944862.1944887
Google Scholar
6 Rashid H, Novoa C, Qasem A. An evaluation of parallel knapsack algorithms on multicore architectures, CSC’10, Las Vegas, Nevada, USA, 2010; 230– 235.
Google Scholar
7 Fortune S, Wyllie J. Parallelism in random access machine, 10th ACM Symposium on Theory of Computation (STOC),New York, USA, 1978; 114–118.
Google Scholar
8 Jajá J. An Introduction to Parallel Algorithms, Addison Wesley Longman Publishing Co., Inc.: Redwood City, CA, USA, 1992.
Google Scholar
9 Cole R, Zajicek O. The APRAM : incorporating asynchrony into the PRAM model. In Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’89. ACM: New York, NY, USA, 1989; 169–178.
10.1145/72935.72954
Google Scholar
10 Gibbons PB. A more practical PRAM model. In Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’89. ACM: New York, NY, USA, 1989; 158–168.
10.1145/72935.72953
Web of Science® Google Scholar
11 Gibbons PB, Matias Y, Ramachandran V. Can shared-memory model serve as a bridging model for parallel computation? In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’97. ACM: New York, NY, USA, 1997; 72–83.
10.1145/258492.258500
Web of Science® Google Scholar
12 Gibbons PB, Matias Y, Ramachandran V. The QRQW PRAM: accounting for contention in parallel algorithms. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’94. Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1994; 638–648.
Google Scholar
13 Maggs BM, Matheson LR, Tarjan RE. Models of parallel computation: a survey and synthesis, 1995.
Google Scholar
14 Ramachandran V. QSM: a general purpose shared-memory model for parallel computation, Foundations of Software Technology and Theoretical Computer Science, Kharagpur, India, 1997; 1–5.
Google Scholar
15 Aggarwal A, Alpern B, Chandra A, Snir M. A model for hierarchical memory. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC ’87. ACM: New York, NY, USA, 1987; 305–314.
10.1145/28395.28428
Google Scholar
16 Aggarwal A, Chandra AK, Snir M. Hierarchical memory with block transfer. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science, SFCS ’87. IEEE Computer Society: Washington, DC, USA, 1987; 204–216.
10.1109/SFCS.1987.31
Google Scholar
17 Juurlink B, Juurlink BHH, Wijshoff HAG. The parallel hierarchical memory model. In Proceedings of Scandinavian Workshop on Algorithms Theory, LNCS 824. Springer-Verlag: New York, NY, USA, 1994; 240–251.
10.1007/3-540-58218-5_22
Google Scholar
18 Alpern B, Carter L, Ferrante J. Modeling parallel computers as memory hierarchies. In Proceedings Programming Models for Massively Parallel Computers, IEEE Computer Society Press: Berlin, Germany, 1993; 116‒123.
10.1109/PMMP.1993.315548
Google Scholar
19 Alpern B, Carter L, Feig E, Selker T. The uniform memory hierarchy model of computation. Algorithmica 1994 12: 72– 109.
10.1007/BF01185206
Web of Science® Google Scholar
20 Papadimitriou C, Yannakakis M. Towards an architecture-independent analysis of parallel algorithms. In STOC ’88: Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, ACM: New York, NY, USA, 1988; 510–513.
10.1145/62212.62262
Google Scholar
21 Nascimento AP, Sena A, Boeres C, Rebello VEF. On the feasibility of dynamically scheduling dag applications on shared heterogeneous systems. In Euro-Par 2009 Parallel Processing, vol. 5704, H Sips, D Epema, HX Lin (eds)., Lecture Notes in Computer Science. Springer: New York, NY, USA, 2009; 191–202.
10.1007/978-3-642-03869-3_21
Web of Science® Google Scholar
22 Valiant LG. A bridging model for parallel computation. Communication ACM 1990 33(8): 103–111.
10.1145/79173.79181
Web of Science® Google Scholar
23 Williams TL, Parsons RJ. The heterogeneous bulk synchronous parallel model. In Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing, IPDPS ’00.Springer-Verlag: London, UK, UK, 2000; 102– 108.
10.1007/3-540-45591-4_12
Google Scholar
24 Culler DE, Karp RM, Patterson D, Sahay A, Santos EE, Schauser KE, Subramonian R, von Eicken T. LogP: a practical model of parallel computation. Communication ACM 1996 39: 78–85.
10.1145/240455.240477
Web of Science® Google Scholar
25 Alexandrov A, Ionescu MF, Schauser KE, Scheiman C. LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation, 1995.
Google Scholar
26 Ino F, Fujimoto N, Hagihara K. LogGPS: a parallel computational model for synchronization analysis. SIGPLAN Notices 2001 36: 133–142.
10.1145/568014.379592
Web of Science® Google Scholar
27 Frank M, Agarwal A, Vernon MK. LoPC: modeling contention in parallel algorithms. PPOPP, Las Vegas, Nevada, USA, 1997; 276–287.
Google Scholar
28 Cameron KW, Ge R, Sun X-H. log _np and log ₃p: accurate analytical models of point-to-point communication in distributed systems. IEEE Transactions on Computers 2007 56: 314–327. DOI: 10.1109/TC.2007.38.
10.1109/TC.2007.38
Web of Science® Google Scholar
29 Tam ATC, Wang CL. Realistic communication model for parallel computing on cluster. 1st IEEE Computer Society International Workshop on Cluster Computing, Melbourne, Australia, 1999; 92–101.
Google Scholar
30 de Amorim Mendes H. HlogP : Um modelo de escalonamento para execução de aplicações MPI em grades computacionais. Master's Thesis, Universidade Federal Fluminense, 2004.
Google Scholar
31 Badia RM, Perez JM, Ayguade E, Labarta J. Impact of the memory hierarchy on shared memory architectures in multicore programming models. In Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. IEEE Computer Society: Washington, DC, USA, 2009; 437–445.
10.1109/PDP.2009.56
Web of Science® Google Scholar
32 Chai L, Gao Q, Panda DK. Understanding the impact of multi-core architecture in cluster computing: A case study with intel dual-core system. In Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, CCGRID ’07.IEEE Computer Society: Washington, DC, USA, 2007; 471–478.
10.1109/CCGRID.2007.119
Web of Science® Google Scholar
33 Savage JE, Zubair M. Evaluating multicore algorithms on the unified memory model. Scientific Programming - Software Development for Multi-core Computing Systems 2009 17: 295–308.
Web of Science® Google Scholar
34 Tu B, Fan J, Zhan J, Zhao X. Accurate analytical models for message passing on multi-core clusters. In Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing. IEEE Computer Society: Washington, DC, USA, 2009; 133–139.
10.1109/PDP.2009.18
Google Scholar
35 Mercier G, Clet-Ortega J. Towards an efficient process placement policy for MPI applications in multicore environments. In EuroPVM/MPI, vol. 5759, Lecture Notes in Computer Science.Springer: Espoo, Finland, September 2009; 104–115.
10.1007/978-3-642-03770-2_17
Web of Science® Google Scholar
36 Song F, Moore S, Dongarra J. Analytical modeling for affinity-based thread scheduling on multicore platforms. Symposim on Principles and Parctice of Parallel Programming, Heraklion, Crete, Greece, 2009; 1–10.
Google Scholar
37 Xia Y, Prasanna VK, Li J. Hierarchical scheduling of dag structured computations on manycore processors with dynamic thread grouping. In Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing, JSSPP’10.Springer-Verlag: Berlin, Heidelberg, 2010; 154–174.
10.1007/978-3-642-16505-4_9
Google Scholar
38 Valiant LG. A bridging model for multi-core computing. Journal of Computer and System Sciences 2011 77(1): 154–166.
10.1016/j.jcss.2010.06.012
Web of Science® Google Scholar
39 González-Domínguez J, Taboada GL, Fraguela BB, Martín MJ, Touriño J. Servet: a benchmark suite for autotuning on multicore clusters. IPDPS’10, Atlanta, GA, USA, 2010; 1–9.
Google Scholar
40 Smith AJ, Saavedra RH. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers 1995 44(10): 1223–1235.
10.1109/12.467697
Web of Science® Google Scholar
41 Innovating Computing Laboratory University of Tennessee. Performance application programming interface, 2004. (Available from: http://icl.cs.utk.edu/papi/) [Accessed on 27 March 2014].
Google Scholar
42 Dongarra J, Moore S, Mucci P, Seymour K, You H. Accurate cache and tlb characterization using hardware counters. International Conference on Computational Science Krakow, Poland, 2004; 432–439.
Google Scholar
43 Barreto L, Bauer M. Parallel branch and bound algorithm—a comparison between serial, OpenMP and MPI implementations. Journal of Physics: Conference Series 2010 256(1): 012018.
10.1088/1742-6596/256/1/012018
Google Scholar
44 Djerrah A, Cun BL, Cung VD, Roucairol C. Bob++: framework for solving optimization problems with branch-and-bound methods. HPDC, Paris, France, 2006; 369–370.
Google Scholar
45 Galea F, Cun BL. A parallel exact solver for the three-index quadratic assignment problem. IPDPS Workshops, Anchorage,Alaska, USA, 2011; 1940–1949.
Google Scholar
46 de A Drummond LM, Uchoa E, Gonçalves AD, Silva JMN, Santos MCP, de Castro MCS. A grid-enabled distributed branch-and-bound algorithm with application on the steiner problem in graphs. Parallel Computing 2006 32(9): 629–642.
10.1016/j.parco.2005.09.006
Web of Science® Google Scholar
47 Ralphs TK, Güzelsoy M, Mahajan A. SYMPHONY version 5.3 user's manual. Technical Report, COR@L Laboratory, Lehigh University, 2011. (Available from: http://www.coin-or.org/SYMPHONY/doc/SYMPHONY-5.3.4-Manual.pdf) [Accessed on 27 March 2014].
Google Scholar
48 Sanjuan-Estrada JF, Casado LG, García I. Adaptive parallel interval branch and bound algorithms based on their performance for multicore architectures. The Journal of Supercomputing 2011 58(3): 376–384.
10.1007/s11227-011-0594-4
Web of Science® Google Scholar
49 Park S, Kim T, Park J, Kim J, Im H. Parallel skyline computation on multicore architectures. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09.IEEE Computer Society: Washington, DC, USA, 2009; 760–771.
10.1109/ICDE.2009.42
Google Scholar
50 Eckstein J, Phillips CA, Hart WE. Pico: an object-oriented framework for parallel branch and bound, 2001.
Google Scholar
51 Shinano Y, Higaki M, Hirabayashi R. A generalized utility for parallel branch and bound algorithms. In Proceedings of the 7th IEEE Symposium on Parallel and Distributeed Processing, SPDP ’95.IEEE Computer Society: Washington, DC, USA, 1995; 392–401.
10.1109/SPDP.1995.530710
Google Scholar
52 Budiu M, Delling D, Werneck R. DryadOpt: branch-and-bound on distributed data-parallel execution engines. IEEE International Parallel and Distributed Processing Symposium (IPDPS), Anchorage, Alaska, USA, 2011.
Google Scholar
53 Vu TT, Derbel B, Asim A, Bendjoudi A, Melab N. Overlay-centric load balancing: applications to UTS and B&B. CLUSTER, Beijing, China, 2012; 382–390.
Google Scholar
54 Boschetti MA, Mingozzi A, Ricciardelli S. A dual ascent procedure for the set partitioning problem. Discrete Optimization 2008; 5(4): 735–747. DOI: 10.1016/j.disopt.2008.06.001.
10.1016/j.disopt.2008.06.001
Web of Science® Google Scholar
55 DS Hochbaum (ed.). Approximation Algorithms for NP-Hard Problems. PWS Publishing Co.: Boston, MA, USA, 1997.
Google Scholar
56 Bendjoudi A, Melab N, Talbi EG. An adaptive hierarchical master-worker (ahmw) framework for grids—application to b&b algorithms. Journal of Parallel and Distributed Computing 2012 72(2): 120–131.
10.1016/j.jpdc.2011.10.002
Web of Science® Google Scholar
57 Bendjoudi A, Melab N, Talbi EG. Hierarchical branch and bound algorithm for computational grids. Future Generation Computer Systems 2012 28(8): 1168–1176.
10.1016/j.future.2012.03.001
Web of Science® Google Scholar
58 Group TO. The open group base specifications issue 6 @ONLINE, 2004. (Available from: http://pubs.opengroup.org/onlinepubs/009696699) [Accessed on 27 March 2014].
Google Scholar
59 Stevens RW, Rago SA. Advanced programming in the UNIX(R) environment ( 2nd edition). Addison-Wesley Professional, 2005.
Google Scholar
60 Ma KL. Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures. In Proceedings of the IEEE Symposium on Parallel Rendering, PRS ’95.ACM: New York, NY, USA, 1995; 23–30.
10.1145/218327.218333
Web of Science® Google Scholar

Citing Literature

Volume27, Issue5

10 April 2015

Pages 1122-1144

Memory aware load balance strategy on a parallel branch-and-bound application

Abstract

References

Citing Literature

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

Memory aware load balance strategy on a parallel branch-and-bound application

Abstract

References

Citing Literature

References

Related

Information