Cloud computing in a distributed e-infrastructure using the web processing service standard
Corresponding Author
Gianpaolo Coro
Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo”—CNR, Pisa, Italy
Correspondence
Gianpaolo Coro, Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (ISTI)— CNR Via G. Moruzzi, 1—56124, Pisa, Italy,
Email: [email protected]
Giancarlo Panichi
Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo”—CNR, Pisa, Italy
Paolo Scarponi
Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo”—CNR, Pisa, Italy
Pasquale Pagano
Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo”—CNR, Pisa, Italy
Summary
New Science paradigms have recently evolved to promote open publication of scientific findings as well as multi-disciplinary collaborative approaches to scientific experimentation. These approaches can address modern scientific challenges but must deal with the large quantities of data produced by industrial and scientific experiments. These data, so-called Big Data, require new computer science systems that help scientists cooperate, extract information, and possibly produce new knowledge from the data. E-Infrastructures are distributed computer systems that foster collaboration between users and can embed distributed and parallel processing systems to manage Big Data. However, to meet the requirements of modern Science, e-Infrastructures in turn impose several requirements on computational systems, e.g., being economically sustainable; managing community-provided processes; using standard representations for processes and data; managing Big Data size and heterogeneous representations; supporting reproducible Science, collaborative experimentation, and cooperative online environments; and managing security and privacy for data and services. In this paper, we present a cloud computing system (gCube DataMiner) that meets these requirements and operates in an e-Infrastructure, while sharing characteristics with state-of-the-art cloud computing systems. To this aim, DataMiner uses the Web Processing Service (WPS) standard of the Open Geospatial Consortium (OGC) and introduces features such as collaborative experimental spaces and automatic installation of processes and services, on top of a flexible and sustainable cloud computing architecture. We compare DataMiner with another mature cloud computing system and highlight the benefits our system brings, the requirements of the new paradigms it satisfies, and the applications that can be developed on top of it.
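As a rough illustration of what relying on the Web Processing Service standard means for a client, the sketch below builds WPS 1.0.0 key-value-pair request URLs for the three standard operations (GetCapabilities, DescribeProcess, Execute). The endpoint, the process identifier, and the input names are hypothetical placeholders, not values taken from DataMiner; the helper `wps_url` is likewise an assumption made for this example and not part of any gCube library.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, used only for illustration (not a DataMiner address).
WPS_ENDPOINT = "https://example.org/wps/WebProcessingService"


def wps_url(request, identifier=None, inputs=None, endpoint=WPS_ENDPOINT):
    """Build a WPS 1.0.0 key-value-pair (KVP) GET URL for the given operation."""
    params = {"service": "WPS", "request": request}
    if request != "GetCapabilities":
        # DescribeProcess and Execute need an explicit version and a process identifier.
        if identifier is None:
            raise ValueError(f"{request} requires a process identifier")
        params["version"] = "1.0.0"
        params["identifier"] = identifier
    if inputs:
        # Execute inputs are "name=value" pairs joined by semicolons.
        params["DataInputs"] = ";".join(f"{k}={v}" for k, v in inputs.items())
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical process identifier and inputs.
    print(wps_url("GetCapabilities"))
    print(wps_url("DescribeProcess", "org.example.DemoProcess"))
    print(wps_url("Execute", "org.example.DemoProcess", {"threshold": 0.5, "species": "demo"}))
```

Because the three operations are defined by the standard rather than by a specific server, any WPS-compliant client could in principle issue the same requests against a different WPS service by changing only the endpoint and the process identifier.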