Survey On Emerging Technologies For Secure Computations Of Big Data

N. Madhusudhana Reddy *    C. Naga Raju **  
* Associate Professor, Department of Computer Science and Engineering, Syamala Devi Institute of Technology for Women, Andhra Pradesh, India.
** Associate Professor and Head, Department of Computer Science and Engineering, Yogi Vemana University, Andhra Pradesh, India.

Abstract

Big data refers to vast amounts of data characterized by volume, velocity, and variety. Mining such data can provide comprehensive business intelligence. However, conventional environments are not sufficient to handle and mine such data; distributed programming frameworks like Hadoop, which use a programming paradigm known as MapReduce, are employed instead. A challenging issue in big data mining is its security implications. This paper explores the merits and demerits of distributed data mining frameworks such as Hadoop, HaLoop, Sailfish, and AROM, and describes how distributed frameworks for managing and mining big data can help enterprises make expert decisions. Moreover, there is a need for secure computations in distributed programming frameworks. This paper provides useful insights into big data, big data mining, and the need for secure computations when processing big data.

Keywords:

Introduction

Big data is characterized by volume, velocity, and variety. Volume refers to the huge amount of data, velocity refers to the streaming nature of the data, and variety refers to the different types of data. Processing such huge data is not possible in traditional environments [12]. Of late there has been increased attention towards big data mining. Distributed programming frameworks are used for processing big data. These frameworks, such as Hadoop, use a different programming paradigm known as MapReduce. Hadoop is a distributed programming framework which is widely used in the real world [1]. It supports distributed parallel processing, which helps in processing huge amounts of data in a distributed environment. However, there are certain challenges in handling big data, including stream handling, parallel processing, and data correlation [2]. Big data processing has the potential to affect the economies of governments and businesses positively, and it has its influence on society [3], [10], [16]. When big data is not analyzed comprehensively, it can result in biased conclusions, where individuals or organizations misjudge a situation by looking at a limited view of the data. The right kind of tools can serve the purpose of big data mining [4]. Big data mining is becoming part of the Information and Communication Technology strategy of organizations seeking to achieve success [5]. There are several phases in processing big data: data acquisition and recording; information extraction and cleaning; data representation, aggregation, and integration; data modeling, analysis, and query processing; and interpretation. The real challenges in big data mining include the scale of data, human collaboration, timeliness, incompleteness, and heterogeneity [6]. The financial performance of enterprises can be improved through effective use of big data, and prioritizing business activities with business intelligence extracted from big data becomes possible [7]. Big enterprises encounter big data problems [8]. The "NoSQL" principle is followed by many frameworks while processing big data [9]. Different people interpret big data differently; IBM's survey found that, for different respondents, it means a greater scope of information, real-time information, the latest buzzword, a huge amount of data, or non-traditional data [10].

As discussed earlier, three V's are involved in big data processing: Volume, Velocity, and Variety [31]. Big data is measured in terabytes, petabytes, and even exabytes. It can be transformed into big value so as to help organizations gain business intelligence and grow faster [13], [15]. MapReduce is the programming paradigm used for processing big data [14]. Processing big data is also possible with frameworks like Twister, Boom, Twitter Storm, Apache Hama, Spark, and GraphLab [17]. Solid State Drives (SSDs) based on NAND flash memory are used for storing big data to achieve high performance and ensure ROI [18]. With cutting-edge technologies, big data mining can save money for governments and companies [19]. Different sectors of government and all departments of a company can use the results of big data analysis for making expert decisions [20]. Much research has emerged on big data mining. In [21], the MapReduce programming model is used for big data mining and is optimized further. In [22], large networks are used to study big data; the authors processed big data using various applications and benchmark datasets. In [23], a cloud computing environment is used for processing big data; the authors experimented with distributed file systems, non-structural storage, and semi-structured cloud storage, and also identified certain challenges in mining big data, analysis, and security. A variant of Hadoop named HaLoop, which also uses the MapReduce paradigm, was explored in [24]; the experiments revealed that HaLoop could reduce query time. In [25], big data analysis was explored with different technologies, while in [26], infrastructure requirements for big data processing were given importance, including the need for GPUs (Graphics Processing Units; see Note 1). Various techniques for processing big data with SAS include macros, the data step, indexes, and PROC SQL [27]. In [28], the Sailfish framework, which has features such as auto-tuning, was explored. AROM, another such framework, is discussed in [29]; it makes use of data flow graphs and functional programming. In [30], erasure codes for reliable big data storage were given importance. The remainder of the paper discusses emerging technologies for managing big data, distributed programming frameworks for processing it, the top 10 big data security and privacy challenges, and the need for secure computations in distributed programming frameworks.

1. Emerging Technologies for Managing Big Data

Shared-nothing architecture and distributed processing frameworks such as Hadoop, HaLoop, and Sailfish are emerging technologies for managing big data. A shared-nothing architecture can scale to huge amounts of data processed by thousands of stateless nodes, spanning the underlying hardware architecture, data architecture, and application architecture. This architecture is used by companies such as Google and Amazon.

2. Distributed Programming Frameworks for Processing Big Data

2.1 Processing Big Data with Apache Hadoop

For big data analytics, Hadoop has evolved as one of the best approaches. It is a distributed programming framework that offers many advantages, such as storing data in its native format, scaling for big data, delivering new insights, reducing costs, higher availability, and lower risk.

The Hadoop software stack supports massively scalable storage and retrieval of data. It follows the concept of master and slave nodes together with the MapReduce programming paradigm in a distributed environment. Interactive business intelligence can be obtained by processing big data with Apache Hadoop.

2.2 MapReduce: The New Programming Paradigm

MapReduce is the programming approach supported by distributed programming frameworks like Hadoop. There are two phases in the processing of data: the Map phase and the Reduce phase. The Map phase takes a set of key/value pairs and processes them to generate intermediate key/value pairs. The Reduce phase combines the values pertaining to each distinct key and returns the result. The whole process is carried out in a distributed environment such as Hadoop [31].
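To make the two phases concrete, the following is a minimal sketch of the MapReduce idea in plain Python (a word-count job run on a single machine rather than on a Hadoop cluster); the function names and the sample input are illustrative only, not Hadoop's API.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit one (word, 1) pair for every word in the input record.
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: combine all values that share the same key.
    return (word, sum(counts))

def run_job(documents):
    # Shuffle: group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    docs = {1: "big data needs big tools", 2: "big data is big"}
    print(run_job(docs))  # {'big': 4, 'data': 2, 'needs': 1, 'tools': 1, 'is': 1}
```

In a real Hadoop deployment, the shuffle step shown here is performed by the framework across the cluster, and the map and reduce functions run in parallel on the slave nodes.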

2.3 Processing Big Data with HaLoop

A modified version of Hadoop, named HaLoop, was proposed by Bu, Howe, Balazinska, and Ernst [24] for processing big data. It extends MapReduce with additional capabilities such as a loop-aware task scheduler and caching mechanisms. The MapReduce programming paradigm is used by many companies in the real world, such as Yahoo, Facebook, and Google. The components of the HaLoop architecture are organized into three layers: file system, framework, and application. There are two file systems, local and distributed. As the names imply, the local file system stores data locally, while the distributed file system stores data across multiple machines in a distributed environment. The framework layer has a task scheduler and task trackers. The task scheduler supports loop control, while the task tracker communicates with the local file system using features like caching and indexing. A task queue is also maintained for improving efficiency [24]. Caching reduces hits to the file system. Task tracking is done by slave nodes, while task scheduling is done by the master node; jobs are managed by the master node and tasks are managed by the slave nodes. The master invokes slaves either in parallel or in sequential fashion and communicates with the framework to get jobs done.
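The benefit of HaLoop's loop-aware scheduling and caching can be sketched as follows. This is not HaLoop's API, only a single-machine analogy in which loop-invariant data is read once and cached across iterations instead of being re-read from the file system on every pass of an iterative job.

```python
import time

def load_invariant_data():
    # Stand-in for an expensive read of loop-invariant data (e.g. a link graph).
    time.sleep(0.1)
    return {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def iterate(ranks, graph):
    # One iteration of a simple PageRank-like update over the cached graph.
    new_ranks = {node: 0.15 for node in graph}
    for node, outlinks in graph.items():
        share = 0.85 * ranks[node] / max(len(outlinks), 1)
        for target in outlinks:
            new_ranks[target] += share
    return new_ranks

def run_iterative_job(iterations=10):
    graph = load_invariant_data()           # read once, then reused (HaLoop-style caching)
    ranks = {node: 1.0 for node in graph}   # loop-variant state updated each iteration
    for _ in range(iterations):
        ranks = iterate(ranks, graph)       # no re-read of the invariant input
    return ranks

if __name__ == "__main__":
    print(run_iterative_job())
```

Plain Hadoop would launch a fresh MapReduce job per iteration and re-read the invariant input each time, which is exactly the overhead HaLoop's caching and loop-aware scheduling aim to avoid [24].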

2.4 Large Scale Data Processing with Sailfish

Another distributed programming framework, named Sailfish, was proposed by Rao, Ramakrishnan, and Silberstein [28] for big data processing. It also follows the MapReduce paradigm, but with improved forms of the Map and Reduce phases. The researchers studied other frameworks like Hive, Hadoop, Dryad, and MapReduce and improved on their functionality in Sailfish [28]. In the Sailfish framework, a number of map and reduce tasks are involved. With its auto-tuning facility, Sailfish improves performance by about 20% compared with Hadoop; this was confirmed by testing it on benchmark datasets at Yahoo. With auto-tuning in place, it exploits parallel processing efficiently, and it can also handle bursts and skew in the data. It processes intermediate data well; the size of the intermediate data influences the tuning, and even when the data is well beyond 16 TB, Sailfish can handle it with ease. It also batches data to reduce the overhead of disk I/O [28], handling intermediate data through I-files, which Sailfish supports [28]. Certain components coordinate the data flow in Sailfish: the iappender, chunksorter, chunkserver, imerger, and workbuilder. Map output is processed by the iappender and then handed over to the chunkserver for storage. The chunksorter then takes the data in the form of I-files and sorts it. The outputs are given to the imerger, which performs the merging, and the result is given to the reduce task. The workbuilder coordinates the further tasks [28].
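As a rough, single-process analogy of Sailfish's idea of aggregating intermediate data into I-files, the sketch below batches the output of many map tasks into a small number of partitioned buffers, sorts each buffer once, and only then hands it to the reduce side. The partition count, function names, and data are illustrative assumptions, not Sailfish's actual components.

```python
from collections import defaultdict

NUM_PARTITIONS = 4   # assumed here; Sailfish tunes reduce-side parallelism automatically

def partition(key):
    return hash(key) % NUM_PARTITIONS

def map_task(records):
    for key, value in records:
        yield key, value

def run(map_inputs):
    # iappender analogy: append intermediate pairs into shared per-partition batches
    batches = defaultdict(list)
    for records in map_inputs:
        for key, value in map_task(records):
            batches[partition(key)].append((key, value))

    # chunksorter analogy: sort each batch once so the reduce side reads sequentially
    results = {}
    for pid, batch in batches.items():
        batch.sort(key=lambda kv: kv[0])
        # imerger / reduce analogy: aggregate values for each key in the sorted batch
        for key, value in batch:
            results[key] = results.get(key, 0) + value
    return results

if __name__ == "__main__":
    inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 1)], [("b", 1)]]
    print(run(inputs))  # {'a': 4, 'b': 3, 'c': 1}
```

The point of the batching is to replace many small per-mapper intermediate files (and the disk seeks they cause) with a few large, sorted, append-only aggregates, which is where Sailfish gains its I/O efficiency [28].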

2.5 Functional Programming and Data Flow Graphs for Big Data Processing

A new framework named AROM was proposed by Tran, Skhiri, Lesuisse, and Zimányi [29]. This framework is mainly characterized by its use of data flow graphs and functional programming for processing big data efficiently. Google's MapReduce has known limitations that are overcome by using data flow graphs, and AROM improves performance and scalability through its use of functional programming; pipelined tasks can be carried out with ease. There are thus two paradigms for processing big data: Data Flow Graphs (DFGs) and MapReduce. The latter was introduced by Google, while the former underlies AROM [29]. The MapReduce model has several phases, the most important being Map and Reduce: the big data is split by the DFS (Distributed File System) component, the splits are processed in the Map phase, the Shuffle phase groups the map outputs, and the final output is generated by the Reduce phase. The shuffle phase and joins are not efficient in the MapReduce framework. Microsoft's framework Dryad makes use of DFGs, where pipelining is better than in MapReduce [29]: the MapReduce framework completes its phases sequentially, while Dryad can run them simultaneously with the help of DFGs. The PageRank version of AROM shows improvements over MapReduce, and AROM's architecture has more flexibility for parallel processing [29].
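The contrast between a strict phase barrier (MapReduce) and pipelined data flow (DFG-style, as in Dryad or AROM) can be illustrated with Python generators: each stage starts consuming records as soon as the previous stage emits them, rather than waiting for the entire preceding phase to finish. The stage functions below are illustrative, not AROM's API.

```python
def read_records(lines):
    # Source vertex of the data flow graph.
    for line in lines:
        yield line.strip()

def tokenize(records):
    # Second vertex: starts as soon as the first record arrives (pipelining).
    for record in records:
        for token in record.split():
            yield token

def count(tokens):
    # Sink vertex: folds the stream into a result, functional-programming style.
    totals = {}
    for token in tokens:
        totals[token] = totals.get(token, 0) + 1
    return totals

if __name__ == "__main__":
    lines = ["big data flows", "data flows fast"]
    # The three stages form a small data flow graph executed as a pipeline.
    print(count(tokenize(read_records(lines))))
```

In a barrier-style MapReduce execution, tokenize would only begin after every record had been read and materialized; in the pipelined composition shown here, all stages are active concurrently on the stream, which is the behavior DFG-based engines generalize across a cluster [29].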

3. Top 10 Big Data Challenges

The Cloud Security Alliance (CSA) Big Data Working Group [32] disclosed the top 10 big data security and privacy challenges in 2013. Security and privacy issues are magnified by the characteristics of big data (see Note 2). The focus is on infrastructure security, data privacy, data management, and integrity and reactive security, and the CSA provides modeling, analysis, and implementation guidelines for addressing these challenges. In distributed programming frameworks like MapReduce, the computation can be subjected to attacks in the presence of an untrusted mapper; the security threats come from malfunctioning compute worker nodes, infrastructure attacks, and rogue data nodes, and the suggested solutions include trust establishment and mandatory access control (MAC) [33]. With respect to non-relational data stores, NoSQL databases can be used with security infrastructures [34]. Privacy-preserving data mining and analytics need to address problems such as invasive marketing, increased state and corporate control, and invasions of privacy [35]. Cryptographically enforced data-centric security should deal with covert side-channel attacks [36], [37]. Granular access control is required to enforce security with more precision [32]. Secure data storage and transaction logs are to be given paramount importance, as auto-tiering solutions do not keep track of the whereabouts of data; here the threats span seven scenarios: confidentiality and integrity, provenance, availability, consistency, collusion attacks, roll-back attacks, and disputes [32]. Granular audits are required alongside real-time security monitoring to discover even missed attacks [32]. Data provenance practices are required in order to secure big data processing from outsider attacks and malfunctioning infrastructure [32]. End-to-end validation and filtering are to be enabled, as adversaries may tamper with data collection devices, perform ID cloning attacks, manipulate input sources, and so on [32]. Real-time security monitoring is required for monitoring the big data infrastructure and to detect fraudulent claims despite evasion attacks [38] and poisoning attacks [39].
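As one concrete illustration of the granular access control challenge, the sketch below filters individual fields out of each record according to the caller's clearance before the record is handed to an analytics job. The field labels, clearance levels, and policy are hypothetical and are not taken from [32]; the sketch only shows the shape of field-level enforcement.

```python
# Hypothetical field-level policy: the clearance a caller needs to see each field.
FIELD_POLICY = {"name": "public", "zip": "internal", "diagnosis": "restricted"}
CLEARANCE_ORDER = ["public", "internal", "restricted"]

def visible(field, clearance):
    required = FIELD_POLICY.get(field, "restricted")
    return CLEARANCE_ORDER.index(clearance) >= CLEARANCE_ORDER.index(required)

def redact(record, clearance):
    # Keep only the fields the caller's clearance permits (granular access control).
    return {k: v for k, v in record.items() if visible(k, clearance)}

if __name__ == "__main__":
    record = {"name": "Alice", "zip": "94040", "diagnosis": "flu"}
    print(redact(record, "internal"))  # {'name': 'Alice', 'zip': '94040'}
```

Enforcing access decisions at the granularity of fields or records, rather than whole files, is what allows data with mixed sensitivity to be shared with analytics jobs without over-restricting or over-exposing it [32].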

4. Need for Secure Computations in Distributed Programming Frameworks

As explored by the CSA [32], distributed programming frameworks like MapReduce make use of two phases for processing big data. In the first phase, a Mapper reads the data for each chunk, performs its computation, and outputs a list of key/value pairs. In the second phase, a Reducer combines the values pertaining to each distinct key and returns the result. With respect to mappers, there are several threat scenarios: malfunctioning compute worker nodes that can return incorrect results, besides leaking users' confidential data; infrastructure attacks, where compromised worker nodes may tap the communication among other workers and the master with the objective of mounting replay, man-in-the-middle, and DoS attacks against the MapReduce computation; and rogue data nodes that can be added to a cluster and subsequently receive replicated data or deliver altered MapReduce code. The ability to create snapshots of legitimate nodes and re-introduce altered copies is a straightforward attack in cloud and virtual environments and is difficult to detect. Security and privacy guarantees therefore have to be ensured in MapReduce-based systems, because the users of the data do not have expertise in securing data and in distributed processing of data in the presence of an untrusted mapper [7].
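One commonly discussed mitigation for untrusted mappers, consistent with the trust-establishment recommendation above though not prescribed verbatim by [32], is to re-execute a random sample of map tasks on a trusted node and compare digests of the outputs. The sketch below illustrates the idea; the task representation, digest scheme, and sampling rate are assumptions made for the example.

```python
import hashlib
import random

def digest(pairs):
    # Canonical digest of a map task's output, comparable across nodes.
    canon = repr(sorted(pairs)).encode()
    return hashlib.sha256(canon).hexdigest()

def spot_check(tasks, untrusted_run, trusted_run, sample_rate=0.2):
    # Re-run a random sample of map tasks on a trusted node and compare digests.
    suspects = []
    for task in random.sample(tasks, max(1, int(len(tasks) * sample_rate))):
        if digest(untrusted_run(task)) != digest(trusted_run(task)):
            suspects.append(task)
    return suspects

if __name__ == "__main__":
    tasks = list(range(10))
    honest = lambda t: [(t, t * 2)]
    cheating = lambda t: [(t, 0)] if t == 7 else [(t, t * 2)]  # a node returning bad results
    print(spot_check(tasks, cheating, honest, sample_rate=1.0))  # [7]
```

Sampling keeps the verification overhead low while giving a probabilistic guarantee that a node returning incorrect results will eventually be flagged; stronger guarantees require the mandatory access control and trusted-platform approaches proposed in systems such as Airavat [33].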

Conclusions and Future Work

In this paper, a study has been made of big data and of managing big data with distributed programming frameworks. It provides insights into MapReduce and frameworks such as Hadoop, HaLoop, Sailfish, Dryad, and AROM, which are widely used for processing big data in a distributed environment. Big data mining can provide business intelligence to enterprises for making well-informed decisions, besides avoiding biased conclusions. There are many advantages to using distributed programming frameworks for processing big data, but there are also many security issues in big data mining, as explored by the CSA [32]. This paper discusses the need for secure computations in distributed programming frameworks. This research can be extended further with an empirical study to analyze security issues in big data processing and to propose a framework with mechanisms to address them.

Notes

1. A GPU is a highly parallel computing device designed for the task of graphics rendering.
2. Big data is characterized by volume, velocity, and variety.

References

[1]. Chieko Takahashi, Naohiko Sera, Kenji Tsukumoto, Hirotatsu Osaki (2012). "OSS Hadoop Use in Big Data Processing", NEC Technical Journal, pp. 1-5.
[2]. Dibyendu Bhattacharya (2013). "Analytics on Big Fast Data Using Real Time Stream Data Processing Architecture", EMC Proven Professional Knowledge Sharing, pp. 1-34.
[3]. Liran Einav and Jonathan Levin (2013). "The Data Revolution and Economic Analysis", Prepared for the NBER Innovation Policy and the Economy Conference, pp. 1-29.
[4]. An Oracle White Paper (2013). Oracle: Big Data for the Enterprise. USA: Oracle , pp.1-16.
[5]. Australian Government. (2013). Big Data Strategy – Issues Paper. Department of Finance and Deregulation, pp.1-12.
[6]. Leading researchers across the United States. (n.d). Challenges and Opportunities with Big Data. Leading Researchers. pp.1-17.
[7]. SAS. (2012). Big data Lessons from the leaders. Economist Intelligence Unit Limited. pp. 1-30.
[8]. Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu (2010). “Epic: An Extensible and Scalable System for Processing Big Data”, pp.1-12.
[9]. Michael Cooper & Peter Mell (2013). “Tackling Big Data”, National Institute of Standards and Technology, pp.1-40.
[10]. Michael Schroeck, Rebecca Shockley, Dr. Janet Smart, Professor Dolores Romero-Morales and Professor Peter Tufano (2012). “Analytics: The real-world use of big data”, IBM Global Business Services, pp.1-20.
[11]. Intel. (2013). Planning Guide Getting Started With Big Data. Intel IT Center, pp.1-24.
[12]. Mike Ferguson (2012). “Architecting A Big Data Platform for Analytics”, Intelligent Business Strategies, pp.1-36.
[13]. Intel. (n.d). Transforming Big Data into Big Value. Intel Distribution for Apache Hadoop, pp. 1-10.
[14]. Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”, pp. 1-13.
[15]. McKinsey (2011). “Big data: The next frontier for innovation, competition, and productivity”, MGI, pp. 1- 156.
[16]. Economics Intelligent Unit. (2011). Big data Harnessing a game-changing asset. Economics Intelligent Unit, pp.1-32.
[17]. IDF13. (n.d). Beyond Hadoop Map Reduce: Processing Big Data. Intel, pp.1-56.
[18]. Micron. (n.d). SSDs for Big Data – Fast Processing Requires High-Performance Storage. Micron Technology Inc, pp. 1-4.
[19]. Chris Yiu. (2012). “The Big Data Opportunity”, Policy Exchange, pp.1-36.
[20]. Nathan Eagle (2010). "Big Data, Big Impact: New Possibilities for International Development", The World Economic Forum, pp. 1-10.
[21]. Bogdan Ghit, Alexandru Iosup and Dick Epema (2005). "Towards an Optimized Big Data Processing System", IEEE, pp. 1-4.
[22]. Toyotaro Suzumura (2012). "Big Data Processing in Large-Scale Network Analysis and Billion-Scale Social Simulation", IBM Research, pp. 1-2.
[23]. Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li (2012). "Big Data Processing in Cloud Computing Environments", International Symposium on Pervasive Systems, pp. 1-7.
[24]. Yingyi Bu, Bill Howe, Magdalena Balazinska and Michael D. Ernst (2010). “HaLoop: Efficient Iterative Data Processing on Large Clusters”, IEEE, pp.1-12.
[25]. Xiao Dawei, Ao Lei (2013). "Exploration on Big Data Oriented Data Analyzing and Processing Technology", IJCSI International Journal of Computer Science, pp. 1-6.
[26]. Ling Liu (2012). "Computing Infrastructure for Big Data Processing", USA: IEEE, pp. 1-9.
[27]. Kevin McGowan (2013). "Big Data: The Next Frontier for Innovation, Competition, and Productivity", USA: SAS Solutions on Demand, pp. 1-16.
[28]. Sriram Rao, Raghu Ramakrishnan and Adam Silberstein (2012). "Sailfish: A Framework For Large Scale Data Processing", USA: Microsoft, pp. 1-14.
[29]. Nam-Luc Tran, Sabri Skhiri, Arthur Lesuisse and Esteban Zimányi (2012). "AROM: Processing Big Data With Data Flow Graphs and Functional Programming", Belgium, pp. 1-8.
[30]. Maheswaran Sathiamoorthy, Megasthenis Asteris and Dimitris Papailiopoulos (2013). "XORing Elephants: Novel Erasure Codes for Big Data", Proceedings of the VLDB Endowment, pp. 1-12.
[31]. Steps IT Managers Can Take to Move Forward with Apache Hadoop Software. (2013). Planning Guide Getting Started With Big Data, pp. 1-24.
[32]. Cloud Security Alliance Big Data Working Group (2013). "Expanded Top Ten Big Data Security and Privacy Challenges", Cloud Security Alliance.
[33]. I. Roy, S. T. V. Setty, A. Kilzer, V. Shmatikov and E. Witchel, (2010). “Airavat: security and privacy for MapReduce” in USENIX Conference on Networked systems design and implementation, pp 20.
[34]. B. Sullivan, (2011). “NoSQL, But Even Less Security”, http://blogs.adobe.com/asset/files/2011/04/NoSQL-But- Even-Less-Security.pdf.
[35]. D. Boyd and K. Crawford (2012). "Critical Questions for Big Data", Information, Communication & Society, pp. 662-675.
[36]. Onur Acıiçmez, Çetin Koç and Jean-Pierre Seifert (2006). "Predicting Secret Keys via Branch Prediction", Topics in Cryptology–CT-RSA, pp. 225-242.
[37]. C. Percival, (2005). “Cache missing for fun and profit”, BSDCan.
[38]. T. Ptacek and T. Newsham (1998). “Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection”, Tech. Report.
[39]. M. Barreno, B. Nelson, R. Sears, A. Joseph, J.D. Tygar, (2006). “Can Machine Learning be Secure?”. Proc. of the 2006 ACM Symposium on Information, Computer, and Communications Security, ASIACCS 2006, pp. 16-25.