Abstract
The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, so its I/O performance is bounded by the performance of that storage. In this work, we propose using the Hercules system within DMCF as an ad hoc storage system for temporary data produced inside workflow-based applications. Hercules is a highly scalable, easy-to-deploy distributed in-memory storage system. The proposed solution exploits the scalability of Hercules to avoid the bandwidth limits of the default storage. To demonstrate the viability of the proposed solution, we first compared the performance of Hercules with that of Microsoft Azure Storage using synthetic benchmarks. We then evaluated the integration of Hercules and DMCF on a real application consisting of a workflow that accesses temporary data using either Azure storage or Hercules. In this real-life scenario, Hercules reduced the I/O overhead by 36% with respect to Azure storage, leading to a 13% reduction in total execution time. This confirms that our in-memory approach is effective in improving the performance of data-intensive workflow executions on cloud-based platforms.










Acknowledgments
This work is partially supported by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS), and by grant TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems, from the Spanish Ministry of Economy and Competitiveness.
Cite this article
Rodrigo Duro, F., Marozzo, F., Garcia Blas, J. et al. Exploiting in-memory storage for improving workflow executions in cloud platforms. J Supercomput 72, 4069–4088 (2016). https://doi.org/10.1007/s11227-016-1678-y
DOI: https://doi.org/10.1007/s11227-016-1678-y