Volume 33, Issue 13 e6206
RESEARCH ARTICLE

Elastic scheduler: Heterogeneous and dynamic deep learning in the cloud

Lujia Yin (Corresponding Author)

National University of Defense Technology, Changsha, China

Correspondence: Lujia Yin, National University of Defense Technology, Changsha, Hunan, 410005, China. Email: [email protected]

Yiming Zhang

National University of Defense Technology, Changsha, China

Yuxing Peng

National University of Defense Technology, Changsha, China

Dongsheng Li

National University of Defense Technology, Changsha, China

First published: 08 May 2021

Abstract

GPUs and CPUs have been widely used for deep learning (DL) model training in the cloud, where both DL workloads and resource usage may change heavily over time. Traditional training methods require the type (either GPUs or CPUs) and number of computing devices to be specified beforehand, and thus cannot elastically schedule dynamic DL workloads onto whatever GPUs/CPUs are available. In this paper, we propose Elastic Scheduler (ES), a novel approach that efficiently supports both heterogeneous training (with different device types) and dynamic training (with varying device numbers). ES (i) accumulates local gradients and simulates multiple virtual workers on one GPU, alleviating the performance gap between GPUs and CPUs so that heterogeneous GPU-CPU-hybrid training achieves accuracy similar to homogeneous training, and (ii) uses local gradients to keep batch sizes stable, maintaining high accuracy without lengthy compensation. Experiments show that ES achieves significantly higher performance than existing methods for heterogeneous and dynamic training as well as inference.
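To illustrate the idea behind point (i), the following is a minimal PyTorch sketch of gradient accumulation that simulates several "virtual workers" on a single device; the model, batch sizes, and worker count are illustrative assumptions, not the authors' implementation.

    # Sketch only: one GPU/CPU stands in for several virtual workers by
    # accumulating local gradients before a single optimizer step, so the
    # effective batch size stays stable regardless of available devices.
    import torch
    import torch.nn as nn

    NUM_VIRTUAL_WORKERS = 4      # assumed number of simulated workers
    PER_WORKER_BATCH = 32        # assumed per-worker micro-batch size

    model = nn.Linear(128, 10)   # toy model standing in for a real DL network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    def synthetic_batch(size):
        # Synthetic data for illustration only.
        return torch.randn(size, 128), torch.randint(0, 10, (size,))

    for step in range(10):
        optimizer.zero_grad()
        # Each virtual worker processes its own micro-batch; backward() adds
        # its gradient into .grad, accumulating across workers locally.
        for _ in range(NUM_VIRTUAL_WORKERS):
            x, y = synthetic_batch(PER_WORKER_BATCH)
            loss = loss_fn(model(x), y) / NUM_VIRTUAL_WORKERS  # average over workers
            loss.backward()
        optimizer.step()  # one update with the accumulated (stable-size) gradient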

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the corresponding author upon reasonable request.
