Volume 29, Issue 15 e4120
SPECIAL ISSUE PAPER

Computing on many cores

Bernard Goossens (Corresponding Author)

DALI, UPVD, 52 avenue Paul Alduy, 66860 Perpignan Cedex 9, France

LIRMM, CNRS: UMR 5506 - UM2, 161 rue Ada, 34095 Montpellier Cedex 5, France

Correspondence

Bernard Goossens, DALI, UPVD, 52 avenue Paul Alduy, 66860 Perpignan Cedex 9, France. LIRMM, CNRS: UMR 5506 - UM2, 161 rue Ada, 34095 Montpellier Cedex 5, France.

Email: [email protected]
David Parello

DALI, UPVD, 52 avenue Paul Alduy, 66860 Perpignan Cedex 9, France

LIRMM, CNRS: UMR 5506 - UM2, 161 rue Ada, 34095 Montpellier Cedex 5, France
Katarzyna Porada

DALI, UPVD, 52 avenue Paul Alduy, 66860 Perpignan Cedex 9, France

LIRMM, CNRS: UMR 5506 - UM2, 161 rue Ada, 34095 Montpellier Cedex 5, France
Djallal Rahmoune

DALI, UPVD, 52 avenue Paul Alduy, 66860 Perpignan Cedex 9, France

LIRMM, CNRS: UMR 5506 - UM2, 161 rue Ada, 34095 Montpellier Cedex 5, France
First published: 28 March 2017

Summary

This paper presents an alternative method to parallelize programs, better suited to many-core processors than current operating system- and API-based approaches such as OpenMP and MPI. The method relies on parallelizing hardware and an adapted programming style, which together expose and capture the instruction-level parallelism (ILP). A many-core design is presented in which cores are multithreaded and able to fork new threads. The programming style is based on functions: the hardware creates a concurrent thread at each function call. The programming style and the hardware expose the ILP by eliminating the architectural dependences between a call and its continuation after the return. We illustrate the method on a sum reduction, a matrix multiplication, and a sort. We measure the ILP of the parallel runs and show that it increases with data size and is high enough to feed thousands of cores. We compare our method with pthread parallelization and show that (1) our parallel execution is deterministic, (2) our thread management is cheap, (3) our parallelism is implicit, and (4) our method parallelizes both functions and loops. Implicit parallelism makes parallel code easy to write and read, and deterministic parallel execution makes it easy to debug.
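To make the function-based style concrete, here is a minimal sketch of a sum reduction written only with function calls, in plain C. It is not taken from the paper: the splitting strategy, the function names, and the comments about forking are our illustrative assumptions. On a conventional processor the code runs sequentially; on the hardware described above, each call would be forked as a concurrent thread, and the caller's continuation would only synchronize on the value it actually adds.

/*
 * Hypothetical sketch (not the paper's code): a sum reduction written in
 * the function-only style described in the summary.  On a conventional
 * CPU this runs sequentially; on a call-forking machine, each function
 * call would become a concurrent thread.
 */
#include <stdio.h>

/* Sum t[lo..hi-1] by splitting the range into two recursive calls.
   The two calls are independent, so a call-forking machine can run
   them in parallel; the final addition is the only synchronization. */
static long sum(const long *t, size_t lo, size_t hi)
{
    if (hi - lo == 1)
        return t[lo];
    size_t mid = lo + (hi - lo) / 2;
    long left  = sum(t, lo, mid);   /* would be forked as a thread */
    long right = sum(t, mid, hi);   /* would be forked as a thread */
    return left + right;            /* continuation: joins both    */
}

int main(void)
{
    enum { N = 1024 };
    long t[N];
    for (size_t i = 0; i < N; i++)
        t[i] = (long)i;
    printf("sum = %ld\n", sum(t, 0, N));  /* 0+1+...+1023 = 523776 */
    return 0;
}

Because the two recursive calls of each level are independent, the number of calls that can run concurrently grows with the data size, which is the intuition behind the claim that the exposed ILP scales to thousands of cores.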
