Concurrency and Computation: Practice and Experience

SPECIAL ISSUE PAPER

An efficient iterative graph data processing framework based on bulk synchronous parallel model

Chao Liu

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Deze Zeng,

Corresponding Author

Deze Zeng

[email protected]

orcid.org/0000-0003-3276-1202

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Deze Zeng, School of Computer Science, China University of Geosciences, No.388, Lumo Road, Wuhan 430074, China.

Email: [email protected]

Search for more papers by this author

Hong Yao,

Hong Yao

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Xuesong Yan,

Xuesong Yan

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Linchen Yu,

Linchen Yu

orcid.org/0000-0002-0620-101X

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Zhangjie Fu,

Zhangjie Fu

School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China

Search for more papers by this author

Chao Liu,

Chao Liu

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Deze Zeng,

Corresponding Author

Deze Zeng

[email protected]

orcid.org/0000-0003-3276-1202

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Deze Zeng, School of Computer Science, China University of Geosciences, No.388, Lumo Road, Wuhan 430074, China.

Email: [email protected]

Search for more papers by this author

Hong Yao,

Hong Yao

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Xuesong Yan,

Xuesong Yan

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Linchen Yu,

Linchen Yu

orcid.org/0000-0002-0620-101X

School of Computer Science, Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China

Search for more papers by this author

Zhangjie Fu,

Zhangjie Fu

School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China

Search for more papers by this author

First published: 22 January 2018

https://doi.org/10.1002/cpe.4432

Citations: 5

Share a link

Email
Wechat
Bluesky

Summary

Graph data processing has been widely applied in a variety of domains such as industry, science, social network, and so on. It therefore has stimulated many efforts devoted to this area. To embrace the fast development trend of big graph data, graph data processing based on Pregel-like systems has been regarded as one of the most promising ways and has widely attracted the attention of researchers. However, it still remains in its early stage and there still exist many challenges. In Pregel, the superstep synchronization is time consuming as the graph data iteration operation requires multiple synchronizations. Furthermore, the graph data partition strategy adopted by Pregel fails to support load balancing, therefore causing the increase of network I/O overhead as the scale of graph data grows. To address these issues, this paper presents an efficient computational framework for graph data processing based on the bulk synchronous parallel model. The global synchronization control mechanism is improved by determining the start time of the next round of superstep through counting the number of global message files. Furthermore, an improved graph data partition mechanism based on a balanced hash method is proposed to reduce the communication overhead between different partitions of sub-graph computational tasks. We also re-design the PageRank algorithm to verify the effectiveness of the proposed framework. Experimental results on different real-world datasets verify the efficiency of our proposed framework as it outperforms Giraph (an open source Pregel-like system) by 58%−69%, and achieves 10×−17× performance improvement over Hadoop.

References

1Wang K, Shao Y, Shu L, Zhu C, Zhang Y. Mobile big data fault-tolerant processing for ehealth networks. IEEE Netw. 2016; 30(1): 36-42.
10.1109/MNET.2016.7389829
Web of Science® Google Scholar
2Wang K, Zhuo L, Shao Y, Yue D, Tsang KF. Toward distributed data processing on intelligent leak-points prediction in petrochemical industries. IEEE Trans Ind Informatics. 2016; 12(6): 2091-2102.
10.1109/TII.2016.2537788
Web of Science® Google Scholar
3Wang K, Yu Y. A query–matching mechanism over out–of–order event stream in IoT. Int J Ad Hoc Ubiquit Comput. 2013; 13(3-4): 197-208.
10.1504/IJAHUC.2013.055453
CAS Web of Science® Google Scholar
4Li P, Guo S, Miyazaki T, et al. Traffic-aware geo-distributed big data analytics with predictable job completion time. IEEE Trans Parallel Distrib Syst. 2017; 28(6): 1785-1796.
10.1109/TPDS.2016.2626285
Web of Science® Google Scholar
5Zhong J, He B. Towards GPU-accelerated large-scale graph processing in the cloud. Paper presented at: IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom); 2013; Bristol, UK.
Google Scholar
6Nabti C, Seba H. Querying massive graph data: a compress and search approach. Futur Gener Comput Syst. 2017; 74: 63-75.
10.1016/j.future.2017.04.005
Web of Science® Google Scholar
7Bello-Orgaz G, Jung JJ, Camacho D. Social big data: recent achievements and new challenges. Inf Fusion. 2016; 28: 45-59.
10.1016/j.inffus.2015.08.005
PubMed Web of Science® Google Scholar
8Zhou W, Han J, Gao Y, Xu Z. An efficient graph data processing system for large-scale social network service applications. Concurr Comput Pract Exp. 2016; 28(3): 729-747.
10.1002/cpe.3393
Web of Science® Google Scholar
9Dayarathna M, Suzumura T. High-performance graph data management and mining in cloud environments with x10. In: L Gillam, ed. Cloud Computing: Principles, Systems and Applications, 1st ed. London, UK: Springer; 2017: 173-210.
10.1007/978-3-319-54645-2_7
Google Scholar
10Meysman P, Saeys Y, Sabaghian E, et al. Mining the enriched subgraphs for specific vertices in a biological graph. IEEE/ACM Trans Comput Biol Bioinf. 2016: 1-12.
Google Scholar
11Han M, Daudjee K, Ammar K, Özsu MT, Wang X, Jin T. An experimental comparison of Pregel-like graph processing systems. Proc VLDB Endowment. 2014; 7(12): 1047-1058.
10.14778/2732977.2732980
Google Scholar
12Sakr S. Processing large-scale graph data: a guide to current technology. IBM DeveloperWorks. https://www.ibm.com/developerworks/library/os-giraph/
Google Scholar
13Kambatla K, Kollias G, Kumar V, Grama A. Trends in big data analytics. J Parallel Distrib Comput. 2014; 74(7): 2561-2573.
10.1016/j.jpdc.2014.01.003
Web of Science® Google Scholar
14Zhang Y, Gao Q, Gao L, Wang C. Maiter: an asynchronous graph processing framework for delta-based accumulative iterative computation. IEEE Trans Parallel Distrib Syst. 2014; 25(8): 2091-2100.
10.1109/TPDS.2013.235
Web of Science® Google Scholar
15Sakr S. Large-scale graph processing systems. Big Data 2.0 Processing Systems: a Survey. Cham, Switzerland: Springer; 2016: 53-73.
10.1007/978-3-319-38776-5_4
Google Scholar
16Zhang Y, Gao Q, Gao L, Wang C. iMapReduce: a distributed computing framework for iterative computation. J Grid Comput. 2012; 10(1): 47-68.
10.1007/s10723-012-9204-9
Web of Science® Google Scholar
17Elnikety E, Elsayed T, Ramadan HE. iHadoop: Asynchronous iterations for MapReduce. Paper presented at: IEEE 3rd International Conference on Cloud Computing Technology and Science; 2011; Athens, Greece.
Google Scholar
18Li F, Ooi BC, Wu S. Distributed data management using MapReduce. ACM Comput Surv. 2014; 46(3): 1-42.
10.1145/2512962
Web of Science® Google Scholar
19Liu C, Zeng D, Yao H, Hu C, Yan X, Fan Y. MR-COF: a genetic MapReduce configuration optimization framework. Paper presented at: International Conference on Algorithms and Architectures for Parallel Processing; 2015; Zhangjiajie, China.
Google Scholar
20Malewicz G, Austern MH, Bik AJC, et al. Pregel: a system for large-scale graph processing. Paper presented at: ACM SIGMOD International Conference on Management of Data; 2010; Indianapolis, IN.
Google Scholar
21Khan A, Elnikety S. Systems for big-graphs. Proc VLDB Endowment. 2014; 7(13): 1709-1710.
10.14778/2733004.2733067
Google Scholar
22Rong H, Ma T, Tang M, Cao J. A novel subgraph K⁺-isomorphism method in social network based on graph similarity detection. Soft Comput. 2017: 1-19.
Web of Science® Google Scholar
23Ma T, Wang Y, Tang M, et al. LED: a fast overlapping communities detection algorithm based on structural clustering. Neurocomputing. 2016; 207: 488-500.
10.1016/j.neucom.2016.05.020
Web of Science® Google Scholar
24Ma T, Zhang Y, Cao J, et al. KDVEM: a k-degree anonymity with vertex and edge modification algorithm. Computing. 2015; 97(12): 1165-1184.
10.1007/s00607-015-0453-x
Web of Science® Google Scholar
25Fu Z-J, Shu J-G, Wang J, Liu Y-L, Lee S-Y. Privacy-preserving smart similarity search based on SimHash over encrypted data in cloud computing. J Internet Technol. 2015; 16(3): 453-460.
Web of Science® Google Scholar
26Fu Z, Huang F, Ren K, Weng J, Wang C. Privacy-preserving smart semantic search based on conceptual graphs over encrypted outsourced data. IEEE Trans Inf Forensics Secur. 2017; 12(8): 1874-1884.
10.1109/TIFS.2017.2692728
Web of Science® Google Scholar
27Ekanayake J, Li H, Zhang B, et al. Twister: a runtime for iterative MapReduce. Paper presented at: ACM International Symposium on High-Performance Distributed Computing; 2010; Chicago, IL.
Google Scholar
28Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment. 2010; 3(1-2): 285-296.
10.14778/1920841.1920881
Google Scholar
29Siddique K, Akhtar Z, Kim Y, Jeong YS, Yoon EJ. Investigating Apache Hama: a bulk synchronous parallel computing framework. J Supercomput. 2017: 1-16.
Web of Science® Google Scholar
30Han M, Daudjee K. Giraph unchained: barrierless asynchronous parallel execution in Pregel-like graph processing systems. Proc VLDB Endowment. 2015; 8(9): 950-961.
10.14778/2777598.2777604
Web of Science® Google Scholar
31Salihoglu S, Widom J. GPS: a graph processing system. Paper presented at: International Conference on Scientific and Statistical Database Management; 2013; Baltimore, MD.
Google Scholar
32Liu C, Yao H, Zeng D, Liang Q, Hu C, Yan X. MyBSP: an iterative processing framework based on the cloud platform for graph data. Paper presented at: 2014 2nd International Conference on Advanced Cloud and Big Data; 2014; Huangshan, China.
Google Scholar
33Sagharichian M, Naderi H, Haghjoo M. ExPregel: a new computational model for large-scale graph processing. Concurr Comput Pract Exp. 2015; 27(17): 4954-4969.
10.1002/cpe.3482
Web of Science® Google Scholar
34Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams D, Kalnis P. Mizan: a system for dynamic load balancing in large-scale graph processing. Paper presented at: Proceedings of the 8th ACM European Conference on Computer Systems; 2013; Prague, Czech Republic.
Google Scholar
35Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endowment. 2012; 5(8): 716-727.
10.14778/2212351.2212354
Google Scholar
36Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I. GraphX: graph processing in a distributed dataflow framework. Paper presented at: USENIX Conference on Operating Systems Design and Implementation; 2014; Broomfield, CO.
Google Scholar
37Kim M, Candan KS. SBV-Cut: vertex-cut based graph partitioning using structural balance vertices. Data Knowl Eng. 2012; 72(1): 285-303.
10.1016/j.datak.2011.11.004
Web of Science® Google Scholar
38Bansal N, Feige U, Krauthgamer R, Makarychev K. Min-max graph partitioning and small set expansion. SIAM J Comput. 2014; 43(2): 872-904.
10.1137/120873996
Web of Science® Google Scholar
39Chen R, Shi J, Chen Y, Chen H. PowerLyra: differentiated graph computation and partitioning on skewed graphs. Paper presented at: Proceedings of the 10th European Conference on Computer Systems; 2015; Bordeaux, France.
Google Scholar
40Sun R, Zhang L, Chen Z, Hao Z. A balanced vertex cut partition method in distributed graph computing. Paper presented at: International Conference on Intelligent Science and Big Data Engineering; 2015; Suzhou, China.
Google Scholar
41Martella C, Logothetis D, Loukas A, Siganos G. Spinner: scalable graph partitioning in the cloud. Paper presented at: IEEE 33rd International Conference on Data Engineering (ICDE); 2017; San Diego, CA.
Google Scholar
42Verma S, Leslie LM, Shin Y, Gupta I. An experimental comparison of partitioning strategies in distributed graph processing. Proc VLDB Endowment. 2017; 10(5): 493-504.
10.14778/3055540.3055543
Web of Science® Google Scholar
43Buluç A, Meyerhenke H, Safro I, Sanders P, Schulz C. Recent advances in graph partitioning. In: L Kliemann, P Sanders, eds. Algorithm Engineering. Cham, Switzerland: Springer; 2016: 117-158.
10.1007/978-3-319-49487-6_4
Google Scholar
44Dong F, Zhang J, Luo J, Shen D, Jin J. Enabling application-aware flexible graph partition mechanism for parallel graph processing systems. Concurr Comput Pract Exp. 2017; 29(6): 1-17.
10.1002/cpe.3849
Web of Science® Google Scholar
45Brenner S, Wulf C, Kapitza R. Running Zookeeper coordination services in untrusted clouds. Paper presented at: Proceedings of the 10th USENIX conference on Hot Topics in System Dependability; 2014; Broomfield, CO.
Google Scholar
46Lee B, Jeong YH, Song HY, Lee Y. A scalable and highly available network management architecture on consistent hashing. Paper presented at: Global Telecommunications Conference; 2011; Miami, FL.
Google Scholar
47Fan C. A brief survey of PageRank algorithms. IEEE Trans Netw Sci Eng. 2014; 1(1): 38-42.
10.1109/TNSE.2014.2380315
Google Scholar
48Leskovec J, Krevl A. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. Accessed on December 2016.
Google Scholar
49Haselgrove H. Using the Wikipedia page-to-page link database. http://haselgrove.id.au/wikipedia/. Accessed on December 2016.
Google Scholar
50Alan Mislove, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B. Measurement and analysis of online social networks. Paper presented at: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement; 2007; San Diego, CA.
Google Scholar

Citing Literature

Volume32, Issue3

Special Issue on Graph Computing (GRAPH 2017) and Special Issue of the Workshop on Exascale MPI (EXAMPI2017)

10 February 2020

e4432

An efficient iterative graph data processing framework based on bulk synchronous parallel model

Summary

References

Citing Literature

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

An efficient iterative graph data processing framework based on bulk synchronous parallel model

Summary

References

Citing Literature

References

Related

Information