An efficient iterative graph data processing framework based on bulk synchronous parallel model
Chao Liu
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorCorresponding Author
Deze Zeng
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Deze Zeng, School of Computer Science, China University of Geosciences, No.388, Lumo Road, Wuhan 430074, China.
Email: [email protected]
Search for more papers by this authorHong Yao
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorXuesong Yan
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorLinchen Yu
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorZhangjie Fu
School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
Search for more papers by this authorChao Liu
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorCorresponding Author
Deze Zeng
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Deze Zeng, School of Computer Science, China University of Geosciences, No.388, Lumo Road, Wuhan 430074, China.
Email: [email protected]
Search for more papers by this authorHong Yao
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorXuesong Yan
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorLinchen Yu
School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
Search for more papers by this authorZhangjie Fu
School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
Search for more papers by this authorSummary
Graph data processing has been widely applied in a variety of domains such as industry, science, social network, and so on. It therefore has stimulated many efforts devoted to this area. To embrace the fast development trend of big graph data, graph data processing based on Pregel-like systems has been regarded as one of the most promising ways and has widely attracted the attention of researchers. However, it still remains in its early stage and there still exist many challenges. In Pregel, the superstep synchronization is time consuming as the graph data iteration operation requires multiple synchronizations. Furthermore, the graph data partition strategy adopted by Pregel fails to support load balancing, therefore causing the increase of network I/O overhead as the scale of graph data grows. To address these issues, this paper presents an efficient computational framework for graph data processing based on the bulk synchronous parallel model. The global synchronization control mechanism is improved by determining the start time of the next round of superstep through counting the number of global message files. Furthermore, an improved graph data partition mechanism based on a balanced hash method is proposed to reduce the communication overhead between different partitions of sub-graph computational tasks. We also re-design the PageRank algorithm to verify the effectiveness of the proposed framework. Experimental results on different real-world datasets verify the efficiency of our proposed framework as it outperforms Giraph (an open source Pregel-like system) by 58%−69%, and achieves 10×−17× performance improvement over Hadoop.
References
- 1Wang K, Shao Y, Shu L, Zhu C, Zhang Y. Mobile big data fault-tolerant processing for ehealth networks. IEEE Netw. 2016; 30(1): 36-42.
- 2Wang K, Zhuo L, Shao Y, Yue D, Tsang KF. Toward distributed data processing on intelligent leak-points prediction in petrochemical industries. IEEE Trans Ind Informatics. 2016; 12(6): 2091-2102.
- 3Wang K, Yu Y. A query–matching mechanism over out–of–order event stream in IoT. Int J Ad Hoc Ubiquit Comput. 2013; 13(3-4): 197-208.
- 4Li P, Guo S, Miyazaki T, et al. Traffic-aware geo-distributed big data analytics with predictable job completion time. IEEE Trans Parallel Distrib Syst. 2017; 28(6): 1785-1796.
- 5Zhong J, He B. Towards GPU-accelerated large-scale graph processing in the cloud. Paper presented at: IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom); 2013; Bristol, UK.
- 6Nabti C, Seba H. Querying massive graph data: a compress and search approach. Futur Gener Comput Syst. 2017; 74: 63-75.
- 7Bello-Orgaz G, Jung JJ, Camacho D. Social big data: recent achievements and new challenges. Inf Fusion. 2016; 28: 45-59.
- 8Zhou W, Han J, Gao Y, Xu Z. An efficient graph data processing system for large-scale social network service applications. Concurr Comput Pract Exp. 2016; 28(3): 729-747.
- 9Dayarathna M, Suzumura T. High-performance graph data management and mining in cloud environments with x10. In: L Gillam, ed. Cloud Computing: Principles, Systems and Applications, 1st ed. London, UK: Springer; 2017: 173-210.
10.1007/978-3-319-54645-2_7 Google Scholar
- 10Meysman P, Saeys Y, Sabaghian E, et al. Mining the enriched subgraphs for specific vertices in a biological graph. IEEE/ACM Trans Comput Biol Bioinf. 2016: 1-12.
- 11Han M, Daudjee K, Ammar K, Özsu MT, Wang X, Jin T. An experimental comparison of Pregel-like graph processing systems. Proc VLDB Endowment. 2014; 7(12): 1047-1058.
10.14778/2732977.2732980 Google Scholar
- 12Sakr S. Processing large-scale graph data: a guide to current technology. IBM DeveloperWorks. https://www.ibm.com/developerworks/library/os-giraph/
- 13Kambatla K, Kollias G, Kumar V, Grama A. Trends in big data analytics. J Parallel Distrib Comput. 2014; 74(7): 2561-2573.
- 14Zhang Y, Gao Q, Gao L, Wang C. Maiter: an asynchronous graph processing framework for delta-based accumulative iterative computation. IEEE Trans Parallel Distrib Syst. 2014; 25(8): 2091-2100.
- 15Sakr S. Large-scale graph processing systems. Big Data 2.0 Processing Systems: a Survey. Cham, Switzerland: Springer; 2016: 53-73.
10.1007/978-3-319-38776-5_4 Google Scholar
- 16Zhang Y, Gao Q, Gao L, Wang C. iMapReduce: a distributed computing framework for iterative computation. J Grid Comput. 2012; 10(1): 47-68.
- 17Elnikety E, Elsayed T, Ramadan HE. iHadoop: Asynchronous iterations for MapReduce. Paper presented at: IEEE 3rd International Conference on Cloud Computing Technology and Science; 2011; Athens, Greece.
- 18Li F, Ooi BC, Wu S. Distributed data management using MapReduce. ACM Comput Surv. 2014; 46(3): 1-42.
- 19Liu C, Zeng D, Yao H, Hu C, Yan X, Fan Y. MR-COF: a genetic MapReduce configuration optimization framework. Paper presented at: International Conference on Algorithms and Architectures for Parallel Processing; 2015; Zhangjiajie, China.
- 20Malewicz G, Austern MH, Bik AJC, et al. Pregel: a system for large-scale graph processing. Paper presented at: ACM SIGMOD International Conference on Management of Data; 2010; Indianapolis, IN.
- 21Khan A, Elnikety S. Systems for big-graphs. Proc VLDB Endowment. 2014; 7(13): 1709-1710.
10.14778/2733004.2733067 Google Scholar
- 22Rong H, Ma T, Tang M, Cao J. A novel subgraph K+-isomorphism method in social network based on graph similarity detection. Soft Comput. 2017: 1-19.
- 23Ma T, Wang Y, Tang M, et al. LED: a fast overlapping communities detection algorithm based on structural clustering. Neurocomputing. 2016; 207: 488-500.
- 24Ma T, Zhang Y, Cao J, et al. KDVEM: a k-degree anonymity with vertex and edge modification algorithm. Computing. 2015; 97(12): 1165-1184.
- 25Fu Z-J, Shu J-G, Wang J, Liu Y-L, Lee S-Y. Privacy-preserving smart similarity search based on SimHash over encrypted data in cloud computing. J Internet Technol. 2015; 16(3): 453-460.
- 26Fu Z, Huang F, Ren K, Weng J, Wang C. Privacy-preserving smart semantic search based on conceptual graphs over encrypted outsourced data. IEEE Trans Inf Forensics Secur. 2017; 12(8): 1874-1884.
- 27Ekanayake J, Li H, Zhang B, et al. Twister: a runtime for iterative MapReduce. Paper presented at: ACM International Symposium on High-Performance Distributed Computing; 2010; Chicago, IL.
- 28Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment. 2010; 3(1-2): 285-296.
10.14778/1920841.1920881 Google Scholar
- 29Siddique K, Akhtar Z, Kim Y, Jeong YS, Yoon EJ. Investigating Apache Hama: a bulk synchronous parallel computing framework. J Supercomput. 2017: 1-16.
- 30Han M, Daudjee K. Giraph unchained: barrierless asynchronous parallel execution in Pregel-like graph processing systems. Proc VLDB Endowment. 2015; 8(9): 950-961.
- 31Salihoglu S, Widom J. GPS: a graph processing system. Paper presented at: International Conference on Scientific and Statistical Database Management; 2013; Baltimore, MD.
- 32Liu C, Yao H, Zeng D, Liang Q, Hu C, Yan X. MyBSP: an iterative processing framework based on the cloud platform for graph data. Paper presented at: 2014 2nd International Conference on Advanced Cloud and Big Data; 2014; Huangshan, China.
- 33Sagharichian M, Naderi H, Haghjoo M. ExPregel: a new computational model for large-scale graph processing. Concurr Comput Pract Exp. 2015; 27(17): 4954-4969.
- 34Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams D, Kalnis P. Mizan: a system for dynamic load balancing in large-scale graph processing. Paper presented at: Proceedings of the 8th ACM European Conference on Computer Systems; 2013; Prague, Czech Republic.
- 35Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endowment. 2012; 5(8): 716-727.
10.14778/2212351.2212354 Google Scholar
- 36Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I. GraphX: graph processing in a distributed dataflow framework. Paper presented at: USENIX Conference on Operating Systems Design and Implementation; 2014; Broomfield, CO.
- 37Kim M, Candan KS. SBV-Cut: vertex-cut based graph partitioning using structural balance vertices. Data Knowl Eng. 2012; 72(1): 285-303.
- 38Bansal N, Feige U, Krauthgamer R, Makarychev K. Min-max graph partitioning and small set expansion. SIAM J Comput. 2014; 43(2): 872-904.
- 39Chen R, Shi J, Chen Y, Chen H. PowerLyra: differentiated graph computation and partitioning on skewed graphs. Paper presented at: Proceedings of the 10th European Conference on Computer Systems; 2015; Bordeaux, France.
- 40Sun R, Zhang L, Chen Z, Hao Z. A balanced vertex cut partition method in distributed graph computing. Paper presented at: International Conference on Intelligent Science and Big Data Engineering; 2015; Suzhou, China.
- 41Martella C, Logothetis D, Loukas A, Siganos G. Spinner: scalable graph partitioning in the cloud. Paper presented at: IEEE 33rd International Conference on Data Engineering (ICDE); 2017; San Diego, CA.
- 42Verma S, Leslie LM, Shin Y, Gupta I. An experimental comparison of partitioning strategies in distributed graph processing. Proc VLDB Endowment. 2017; 10(5): 493-504.
- 43Buluç A, Meyerhenke H, Safro I, Sanders P, Schulz C. Recent advances in graph partitioning. In: L Kliemann, P Sanders, eds. Algorithm Engineering. Cham, Switzerland: Springer; 2016: 117-158.
10.1007/978-3-319-49487-6_4 Google Scholar
- 44Dong F, Zhang J, Luo J, Shen D, Jin J. Enabling application-aware flexible graph partition mechanism for parallel graph processing systems. Concurr Comput Pract Exp. 2017; 29(6): 1-17.
- 45Brenner S, Wulf C, Kapitza R. Running Zookeeper coordination services in untrusted clouds. Paper presented at: Proceedings of the 10th USENIX conference on Hot Topics in System Dependability; 2014; Broomfield, CO.
- 46Lee B, Jeong YH, Song HY, Lee Y. A scalable and highly available network management architecture on consistent hashing. Paper presented at: Global Telecommunications Conference; 2011; Miami, FL.
- 47Fan C. A brief survey of PageRank algorithms. IEEE Trans Netw Sci Eng. 2014; 1(1): 38-42.
10.1109/TNSE.2014.2380315 Google Scholar
- 48Leskovec J, Krevl A. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. Accessed on December 2016.
- 49Haselgrove H. Using the Wikipedia page-to-page link database. http://haselgrove.id.au/wikipedia/. Accessed on December 2016.
- 50Alan Mislove, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B. Measurement and analysis of online social networks. Paper presented at: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement; 2007; San Diego, CA.