Foreword to the Special Issue on the Seventh International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM 2016)
1 INTRODUCTION
Rapid advances in multicore and chip-level multithreading technologies have opened new challenges and made multicore and manycore systems a fixture of the computing landscape. From high-end servers to mobile phones, multicores and manycores are steadily entering every aspect of information technology.
However, most programmers are trained in sequential programming, and most existing parallel programming models are prone to errors such as data races and deadlocks. Therefore, to fully exploit multicore and manycore hardware, we urgently need parallel programming models that allow an easy transition from sequential programs to parallel programs with good performance and that enable the development of error-free code.
This special issue is intended to collate representative research articles that were presented at the Seventh International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM 2016), held in conjunction with the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). The general objective is to provide a discussion forum for people interested in programming environments, models, tools, and applications specifically designed for parallel multicore and manycore hardware environments.
2 THEMES OF THIS SPECIAL ISSUE
This special issue contains research papers addressing state-of-the-art technologies for multicore and manycore systems. The accepted papers can be organized under three key themes: programming models, performance improvements, and applications.
2.1 Programming models
There have been several developments in programming models that allow automated parallelization of code and that eliminate, or at least detect, programming errors such as data races. The paper1 proposes a model in which a function calls other functions through communication channels. This completely eliminates implicitly shared state between callers and callees, making the results deterministic. As a result, the underlying system can automatically parallelize the code by spawning the callee functions as tasks that run concurrently with the parent function as hardware cores become available. In this way, this model allows automatic, data-race-free parallelization of existing applications that scales well on manycore hardware.
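The paper's model is not reproduced here; the following is only a minimal C++ sketch of the general idea, assuming one-shot channels built from std::promise/std::future as the sole way to pass data between caller and callee (the function square_task is purely illustrative).

#include <future>
#include <iostream>
#include <utility>

// Callee: receives its argument over an input channel and sends its
// result over an output channel. It touches no shared mutable state,
// so its result is deterministic.
void square_task(std::future<int> in, std::promise<int> out) {
    int x = in.get();          // receive the argument
    out.set_value(x * x);      // send the result
}

int main() {
    std::promise<int> arg;     // caller-to-callee channel
    std::promise<int> res;     // callee-to-caller channel
    std::future<int> result = res.get_future();

    // The "call" is spawned as a task that may run on any free core.
    auto task = std::async(std::launch::async, square_task,
                           arg.get_future(), std::move(res));

    arg.set_value(7);                   // pass the argument
    std::cout << result.get() << "\n";  // prints 49
    task.wait();
}

Because the callee can only observe what arrives on its channels, the runtime is free to schedule it on any available core without risking a data race.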
As multicore and manycore systems proliferate in the market, it is common to parallelize existing applications with shared-memory models, where accesses to shared variables from different threads are managed by synchronization primitives and/or lock-free data structures. However, it is challenging to use these interfaces correctly. As a result, data races often occur, and they are difficult to detect and reproduce. Race detectors such as Intel Cilkscreen can be used to detect data races, but they often introduce performance penalties and report false positives when they are unaware of the semantics of the underlying lock-free structures. To mitigate this issue, the paper2 extends the race detector ThreadSanitizer with the semantics of two lock-free data structures: the single-producer/single-consumer (SPSC) and the multiple-producer/multiple-consumer (MPMC) queue. Experimental results demonstrate that these improvements eliminate 60% of the false-positive warnings and accurately detect misuse of these data-race-free structures.
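The extended detector itself is not shown here; the sketch below is only a minimal single-producer/single-consumer ring buffer of the kind whose semantics the paper teaches the detector. Its only synchronization is the acquire/release ordering on the two indices, so a detector unaware of these happens-before edges would flag the plain accesses to buf_ as races.

#include <atomic>
#include <cstddef>

// Minimal bounded SPSC queue (illustrative only; one slot is wasted
// to distinguish full from empty, so N must be at least 2).
template <typename T, std::size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // next slot to read  (consumer)
    std::atomic<std::size_t> tail_{0};  // next slot to write (producer)
public:
    bool push(const T& v) {             // called by the producer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;               // queue is full
        buf_[t] = v;                    // plain write, no lock
        tail_.store(next, std::memory_order_release);  // publish
        return true;
    }
    bool pop(T& v) {                    // called by the consumer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;               // queue is empty
        v = buf_[h];                    // plain read, no lock
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};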
To improve programmability on manycore architectures, several high-level programming models have been proposed, such as Kokkos, RAJA, OpenACC, and OpenMP 4.0. The paper3 benchmarks these programming models against the mature low-level programming models CUDA and OpenCL using applications such as TeaLeaf. In most cases, the performance penalty of the high-level models is within 5% to 30%, yet these models allow easy porting of existing applications with minimal risk of introducing programming errors. Therefore, when the problem domain of the application is not performance critical, programmability will be the main criterion in selecting a programming model.
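As an illustration of this programmability gap (the example is not taken from TeaLeaf), a simple vector update can be offloaded with OpenMP 4.0 directives alone, whereas the equivalent CUDA code would require an explicit kernel, device memory management, and a launch configuration.

// Offloads a vector update to an accelerator using only OpenMP 4.0
// directives; the map clauses describe the required data transfers.
void saxpy(float a, const float* x, float* y, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}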
2.2 Performance improvements
As the number of cores increases in a multicore or manycore chip, scalability can easily be compromised by cache misses and bandwidth bottlenecks. The paper4 proposes a software-managed cache coherence system for MPI one-sided communication on a non-cache-coherent manycore CPU. This system decreases communication overhead by a factor of 5, with a corresponding 5-fold performance improvement in benchmarks.
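The paper's coherence protocol is hardware specific and not reproduced here; the sketch below shows only the generic MPI one-sided pattern that such a system accelerates, in which data movement (MPI_Put) is decoupled from synchronization (the fence epochs). On a non-cache-coherent chip, the runtime must flush and invalidate caches in software so that the put becomes visible when the epoch closes.

#include <mpi.h>

// Generic MPI one-sided pattern: rank 0 writes into rank 1's window.
// Run with at least 2 ranks.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank;                 // value this rank exposes
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);            // open the access epoch
    if (rank == 0) {
        int v = 42;
        // One-sided write: no matching receive on rank 1. Without
        // hardware coherence, the runtime must manage caches in
        // software so rank 1 eventually observes this update.
        MPI_Put(&v, 1, MPI_INT, /*target rank*/ 1,
                /*displacement*/ 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);            // close the epoch: data visible

    MPI_Win_free(&win);
    MPI_Finalize();
}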
The paper5 investigates the scheduling of concurrent kernels on GPUs and reports the following findings:
- Small kernels benefit from being run concurrently;
- Workloads that combine small, high-priority kernels having longer runtimes with lower-priority kernels having shorter runtimes benefit from a CPU-side scheduler that dynamically reorders kernels on the Fermi architecture; and
- Due to limitations of current GPU architectures, CPU-side schedulers outperform their GPU-side counterparts.
The paper also evaluates the schedulers on the NVIDIA Fermi and Jetson TX1 system-on-a-chip architectures and develops methods to ensure correct scheduler timing on all architectures, which further improves performance.
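The paper's schedulers are not reproduced here; the sketch below merely illustrates the dynamic-reordering idea behind the second finding, using an assumed Kernel descriptor and one plausible policy (higher priority first, with shorter estimated runtime breaking ties) that is re-evaluated at every dispatch.

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Hypothetical host-side kernel descriptor; all fields are assumptions.
struct Kernel {
    int priority;                 // higher value = more urgent
    double est_runtime_ms;        // estimated runtime
    std::function<void()> launch; // enqueues the kernel on the device
};

// One plausible dispatch order: priority first, shorter runtime second.
struct ByPriorityThenRuntime {
    bool operator()(const Kernel& a, const Kernel& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.est_runtime_ms > b.est_runtime_ms;
    }
};

int main() {
    std::priority_queue<Kernel, std::vector<Kernel>,
                        ByPriorityThenRuntime> queue;
    queue.push({1, 2.0, [] { std::cout << "low priority, long\n"; }});
    queue.push({2, 8.0, [] { std::cout << "high priority, long\n"; }});
    queue.push({2, 1.0, [] { std::cout << "high priority, short\n"; }});
    while (!queue.empty()) {      // dispatch in dynamically chosen order
        queue.top().launch();
        queue.pop();
    }
}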
2.3 Applications
Thanks to advances in manycore architectures, there has been considerable progress in parallelizing existing serial applications and, more importantly, in improving their scalability on manycore hardware. For example, the paper6 successfully adapts BLAS algorithms to heterogeneous multi-GPU, multicore, and multi-MIC architectures. In addition, the paper7 proposes the JParEnt algorithm, which vastly improves the scalability of existing JPEG decompression algorithms on heterogeneous multicore and manycore architectures.
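Neither paper's code is shown here; the sketch below illustrates only a common first step in such heterogeneous ports, assuming the relative throughputs of the devices are known: statically partitioning the rows of a GEMM-like workload across devices in proportion to their speed (the device list and the 1:4:2 weights are invented for illustration).

#include <cstddef>
#include <cstdio>
#include <vector>

// Splits n rows among devices in proportion to measured throughput.
std::vector<int> partition_rows(int n, const std::vector<double>& speed) {
    double total = 0;
    for (double s : speed) total += s;
    std::vector<int> rows(speed.size());
    int assigned = 0;
    for (std::size_t d = 0; d + 1 < speed.size(); ++d) {
        rows[d] = static_cast<int>(n * speed[d] / total);
        assigned += rows[d];
    }
    rows.back() = n - assigned;   // remainder goes to the last device
    return rows;
}

int main() {
    // e.g., one CPU, one GPU, and one MIC with relative speeds 1:4:2
    std::vector<int> rows = partition_rows(7000, {1.0, 4.0, 2.0});
    const char* device[] = {"CPU", "GPU", "MIC"};
    for (int d = 0; d < 3; ++d)
        std::printf("%s computes %d rows\n", device[d], rows[d]);
}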
In addition, to improve performance and decrease power consumption, multicore architectures are increasingly adopted in embedded systems, notably in the automotive industry. The paper8 parallelizes an engine-management algorithm and shows substantial performance improvements. In the future, it will be increasingly important to have intuitive, low-overhead programming models for parallelizing existing embedded applications.
3 CONCLUSION
The articles presented in this special issue provide insights into fields related to multicore and manycore architectures, including parallel programming models, performance evaluation and improvement, and application development. We hope that readers will benefit from the insights of these papers and contribute to these rapidly growing areas.
ACKNOWLEDGMENTS
We would like to thank all of the authors who provided valuable contributions to this special issue. We are also grateful to the Review Committee for the feedback provided to the authors, which was essential in further enhancing the papers. Finally, we would like to express our sincere gratitude to Professor Geoffrey Fox, the Editor-in-Chief, for providing us with this unique opportunity to present these works in the international journal Concurrency and Computation: Practice and Experience.