Foreword to the Special Issue on the Seventh International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM 2016)
1 INTRODUCTION
Rapid advances in multicore and chip-level multithreading technologies have opened new challenges and made multicore and manycore systems a fixture of the computing landscape. From high-end servers to mobile phones, multicores and manycores are steadily entering every aspect of information technology.
However, most programmers are trained in sequential programming, and most existing parallel programming models are prone to errors such as data races and deadlocks. Therefore, to fully exploit multicore and manycore hardware, we urgently need parallel programming models that allow an easy transition from sequential programs to parallel programs with good performance and that enable the development of error-free code.
This special issue is intended to collate representative research articles that were presented at the Seventh International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM 2016), held in conjunction with the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). The general objective is to provide a discussion forum for people interested in programming environments, models, tools, and applications specifically designed for parallel multicore and manycore hardware environments.
2 THEMES OF THIS SPECIAL ISSUE
This special issue contains research papers addressing state-of-the-art technologies for multicore and manycore systems. The accepted papers can be organized under three key themes: programming models, performance improvements, and applications.
2.1 Programming models
There have been several developments in programming models that allow automated parallelization of code and that eliminate, or at least detect, programming errors such as data races. The paper1 proposes a model in which a function calls other functions through communication channels. This completely eliminates implicitly shared state between callers and callees, making the results deterministic. As a result, the underlying system can automatically parallelize the code by spawning the callee functions as tasks that run concurrently with the parent function as hardware cores become available. In this way, this model allows automatic, data-race-free parallelization of existing applications that scales well on manycore hardware.
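The paper's model is not reproduced here; the following is only a minimal C++ sketch of the general idea, assuming one-shot channels built from std::promise/std::future as the sole way to pass data between caller and callee (the function square_task is purely illustrative).

#include <future>
#include <iostream>
#include <utility>

// Callee: receives its argument over an input channel and sends its
// result over an output channel. It touches no shared mutable state,
// so its result is deterministic.
void square_task(std::future<int> in, std::promise<int> out) {
    int x = in.get();          // receive the argument
    out.set_value(x * x);      // send the result
}

int main() {
    std::promise<int> arg;     // caller-to-callee channel
    std::promise<int> res;     // callee-to-caller channel
    std::future<int> result = res.get_future();

    // The "call" is spawned as a task that may run on any free core.
    auto task = std::async(std::launch::async, square_task,
                           arg.get_future(), std::move(res));

    arg.set_value(7);                   // pass the argument
    std::cout << result.get() << "\n";  // prints 49
    task.wait();
}

Because the callee can only observe what arrives on its channels, the runtime is free to schedule it on any available core without risking a data race.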
As multicore and manycore systems proliferate in the market, it is common to parallelize existing applications with shared-memory models, where accesses to shared variables from different threads are managed by synchronization primitives and/or lock-free data structures. However, it is challenging to use these interfaces correctly. As a result, data races often occur, and they are difficult to detect and reproduce. Race detectors such as Intel Cilkscreen can be used to detect data races, but they often introduce performance penalties and report false positives when they are unaware of the semantics of the underlying lock-free structures. To mitigate this issue, the paper2 extends the race detector ThreadSanitizer with the semantics of two lock-free data structures: the single-producer/single-consumer (SPSC) and the multiple-producer/multiple-consumer (MPMC) queue. Experimental results demonstrate that these improvements eliminate 60% of the false-positive warnings and accurately detect misuse of these data-race-free structures.
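The extended detector itself is not shown here; the sketch below is only a minimal single-producer/single-consumer ring buffer of the kind whose semantics the paper teaches the detector. Its only synchronization is the acquire/release ordering on the two indices, so a detector unaware of these happens-before edges would flag the plain accesses to buf_ as races.

#include <atomic>
#include <cstddef>

// Minimal bounded SPSC queue (illustrative only; one slot is wasted
// to distinguish full from empty, so N must be at least 2).
template <typename T, std::size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // next slot to read  (consumer)
    std::atomic<std::size_t> tail_{0};  // next slot to write (producer)
public:
    bool push(const T& v) {             // called by the producer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;               // queue is full
        buf_[t] = v;                    // plain write, no lock
        tail_.store(next, std::memory_order_release);  // publish
        return true;
    }
    bool pop(T& v) {                    // called by the consumer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;               // queue is empty
        v = buf_[h];                    // plain read, no lock
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};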
To improve programmability on manycore architectures, several high-level programming models have been proposed, such as Kokkos, RAJA, OpenACC, and OpenMP 4.0. The paper3 benchmarks these programming models against the mature low-level programming models CUDA and OpenCL using applications such as TeaLeaf. In most cases, the performance penalty of the high-level models is within 5% to 30%, yet these models allow easy porting of existing applications with minimal risk of introducing programming errors. Therefore, when the problem domain of the application is not performance critical, programmability will be the main criterion in selecting a programming model.
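As an illustration of this programmability gap (the example is not taken from TeaLeaf), a simple vector update can be offloaded with OpenMP 4.0 directives alone, whereas the equivalent CUDA code would require an explicit kernel, device memory management, and a launch configuration.

// Offloads a vector update to an accelerator using only OpenMP 4.0
// directives; the map clauses describe the required data transfers.
void saxpy(float a, const float* x, float* y, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}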
2.2 Performance improvements
As the number of cores increases in a multicore or manycore chip, scalability can easily be compromised by cache misses and bandwidth bottlenecks. The paper4 proposes a software-managed cache coherence system for MPI one-sided communication on a non-cache-coherent manycore CPU. This system decreases communication overhead by a factor of 5, with a corresponding 5-fold performance improvement in benchmarks.
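The paper's coherence protocol is hardware specific and not reproduced here; the sketch below shows only the generic MPI one-sided pattern that such a system accelerates, in which data movement (MPI_Put) is decoupled from synchronization (the fence epochs). On a non-cache-coherent chip, the runtime must flush and invalidate caches in software so that the put becomes visible when the epoch closes.

#include <mpi.h>

// Generic MPI one-sided pattern: rank 0 writes into rank 1's window.
// Run with at least 2 ranks.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank;                 // value this rank exposes
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);            // open the access epoch
    if (rank == 0) {
        int v = 42;
        // One-sided write: no matching receive on rank 1. Without
        // hardware coherence, the runtime must manage caches in
        // software so rank 1 eventually observes this update.
        MPI_Put(&v, 1, MPI_INT, /*target rank*/ 1,
                /*displacement*/ 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);            // close the epoch: data visible

    MPI_Win_free(&win);
    MPI_Finalize();
}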
The paper5 investigates the scheduling of concurrent kernels on GPUs and reports the following findings:
- Small kernels benefit from being run concurrently;
- Workloads that combine small, high-priority kernels having longer runtimes with lower-priority kernels having shorter runtimes benefit from a CPU-side scheduler that dynamically reorders kernels on the Fermi architecture; and
- Due to limitations of current GPU architectures, CPU-side schedulers outperform their GPU-side counterparts.
The paper also evaluates the schedulers on the NVIDIA Fermi and Jetson TX1 system-on-a-chip architectures and develops methods to ensure correct scheduler timing on all architectures, which further improves performance.
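The paper's schedulers are not reproduced here; the sketch below merely illustrates the dynamic-reordering idea behind the second finding, using an assumed Kernel descriptor and one plausible policy (higher priority first, with shorter estimated runtime breaking ties) that is re-evaluated at every dispatch.

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Hypothetical host-side kernel descriptor; all fields are assumptions.
struct Kernel {
    int priority;                 // higher value = more urgent
    double est_runtime_ms;        // estimated runtime
    std::function<void()> launch; // enqueues the kernel on the device
};

// One plausible dispatch order: priority first, shorter runtime second.
struct ByPriorityThenRuntime {
    bool operator()(const Kernel& a, const Kernel& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.est_runtime_ms > b.est_runtime_ms;
    }
};

int main() {
    std::priority_queue<Kernel, std::vector<Kernel>,
                        ByPriorityThenRuntime> queue;
    queue.push({1, 2.0, [] { std::cout << "low priority, long\n"; }});
    queue.push({2, 8.0, [] { std::cout << "high priority, long\n"; }});
    queue.push({2, 1.0, [] { std::cout << "high priority, short\n"; }});
    while (!queue.empty()) {      // dispatch in dynamically chosen order
        queue.top().launch();
        queue.pop();
    }
}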
2.3 Applications
Thanks to advances in manycore architectures, there has been considerable progress in parallelizing existing serial applications and, more importantly, in improving their scalability on manycore hardware. For example, the paper6 successfully adapts BLAS algorithms to heterogeneous multi-GPU, multicore, and multi-MIC architectures. In addition, the paper7 proposes the JParEnt algorithm, which vastly improves the scalability of existing JPEG decompression algorithms on heterogeneous multicore and manycore architectures.
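Neither paper's code is shown here; the sketch below illustrates only a common first step in such heterogeneous ports, assuming the relative throughputs of the devices are known: statically partitioning the rows of a GEMM-like workload across devices in proportion to their speed (the device list and the 1:4:2 weights are invented for illustration).

#include <cstddef>
#include <cstdio>
#include <vector>

// Splits n rows among devices in proportion to measured throughput.
std::vector<int> partition_rows(int n, const std::vector<double>& speed) {
    double total = 0;
    for (double s : speed) total += s;
    std::vector<int> rows(speed.size());
    int assigned = 0;
    for (std::size_t d = 0; d + 1 < speed.size(); ++d) {
        rows[d] = static_cast<int>(n * speed[d] / total);
        assigned += rows[d];
    }
    rows.back() = n - assigned;   // remainder goes to the last device
    return rows;
}

int main() {
    // e.g., one CPU, one GPU, and one MIC with relative speeds 1:4:2
    std::vector<int> rows = partition_rows(7000, {1.0, 4.0, 2.0});
    const char* device[] = {"CPU", "GPU", "MIC"};
    for (int d = 0; d < 3; ++d)
        std::printf("%s computes %d rows\n", device[d], rows[d]);
}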
In addition, to improve performance and decrease power consumption, multicore architectures are increasingly adopted in embedded systems, notably in the automotive industry. The paper8 parallelizes an engine-management algorithm and shows substantial performance improvements. In the future, it will be increasingly important to have intuitive, low-overhead programming models for parallelizing existing embedded applications.
3 CONCLUSION
The articles presented in this special issue provide insights into fields related to multicore and manycore architectures, including parallel programming models, performance evaluation and improvement, and application development. We hope that readers will benefit from the insights of these papers and contribute to these rapidly growing areas.
ACKNOWLEDGMENTS
We would like to thank all of the authors who provided valuable contributions to this special issue. We are also grateful to the Review Committee for the feedback provided to the authors, which was essential in further enhancing the papers. Finally, we would like to express our sincere gratitude to Professor Geoffrey Fox, the Editor-in-Chief, for providing us with this unique opportunity to present these works in the international journal Concurrency and Computation: Practice and Experience.