This review consolidates insights into data science, data quality, data processing and pre-processing, big data and its processing and analysis, analytics, the Big Data (BD) and analytics lifecycle, file storage, supported platforms and technologies, Hadoop concepts, ecosystem components, and design principles. The principle and philosophy behind the computations are also explained through a flow diagram. Analytics types based on the solutions they deliver, cluster-computing platforms such as Apache Spark (its architecture: core, other components, and utilities), and the MLlib package, with its Machine Learning (ML) methods/tasks and the algorithms it supports, are elucidated in detail. The content provides the core knowledge needed for large-scale ML-dependent algorithm processing and for building the relevant application solutions (predictions, classifications, segmentations, or recommendations) in the Apache Spark environment.
Big Data will be the next front runner for innovation, since data is now embedded in the modern human lifestyle. According to the National Association of Software and Services Companies (NASSCOM), the Big Data analytics sector in India is expected to witness eight-fold growth, reaching $16 billion by 2025 from the current level of $2 billion; the sector has huge growth potential, and by 2025 India is projected to hold a 32 percent share of the global market. The core methodology of data analytics is ML, a highly important component for delivering the right model in BD applications. It is therefore essential to understand every aspect of data processing, through to large-scale ML processing knowledge on BD technologies/platforms with Hadoop/Spark tools. The contents are detailed step by step so as to reach the right model development by handling high-growth data, which is rapidly arriving from various sources as shown in Table 1.
Table 1. Rapid Growth of Data (Big Data) Generation with its Velocity
Data Analysis (DA) is a science (Nair and Ganesh, 2016) used to examine raw data and draw conclusions about the information it contains; it is a real-world use of Big Data. Data Science is about two aspects: extracting deep meaning from the data and creating “data products” (software systems). It combines traditional scientific methods with the ability to crunch, explore, learn from, and gain deep insight into Big Data. There will be an increasing demand for Big Data Analytics (BDA) in the Data Science age due to the evolution of streaming data.
The Data Science process is explained below as a flow, so as to convey the meaning of raw data, data quality, data preprocessing, data processing, and data analysis, and how these are carried out through BD platforms, technologies, or tools to model the specified data analytics.
Data → Kinds of data (structured, semi-structured, unstructured) → Data quality (understanding/examining the data) → Data preprocessing [cleaning, integration, transformation, aggregation, and reduction; handling imperfect data: missing values, noise, etc.; reducing dimensionality] → Data processing [explore, represent, link, learn from data, deliver insights] → Data analysis [by Machine Learning: classification, clustering, regression, decision trees, association, visual analytics, etc.] → Data analytics / type of analytics, via BD platforms, technologies, or tools.
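For illustration, a minimal PySpark sketch of the preprocessing stage of this flow is given below (cleaning, transformation, aggregation, and reduction). The file path and column names are hypothetical and chosen only to make the steps concrete.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical input file and columns, used only to illustrate the flow.
spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

# Load raw (possibly imperfect) data.
raw = spark.read.csv("patients_raw.csv", header=True, inferSchema=True)

# Cleaning: drop rows missing the target, fill missing numeric values.
clean = raw.dropna(subset=["outcome"]).fillna({"age": 0, "bmi": 0.0})

# Transformation and aggregation: derive a normalised column, summarise by group.
clean = clean.withColumn("age_norm", (F.col("age") - F.lit(40)) / F.lit(10))
summary = clean.groupBy("outcome").agg(F.avg("bmi").alias("avg_bmi"))

# Reduction: keep only the attributes needed downstream.
reduced = clean.select("age_norm", "bmi", "outcome")
summary.show()
```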
According to the Harvard Business Review, a Data Scientist is a high-ranking professional with the training and curiosity to make discoveries in the world of Big Data. A Data Scientist needs deep knowledge of a domain area, as well as technical knowledge of computer science, statistical techniques, data analytics, and modeling methods such as Data Mining and Machine Learning, along with soft skills.
Big Data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing enabling enhanced insight, decision making, and process automation. Big Data, also referred to as “Data Intensive Technologies”, now targets processing high-volume, high-velocity, high-variety data sets/assets to extract the intended “data value” while ensuring high veracity of the original data. Health Big Data Analytics (HBDA) can become “healthier analytics” (Saravanakumar and Hanifa, 2017) for decision making, improving the quality of patients' lives, predicting epidemics, and reducing preventable deaths.
With Big Data, the ability to collect data will only increase, and the ability to analyze the data will only improve.
The following are the different definitions in connection with big data.
Accessing, analyzing, securing, and storing big data are done using what is collectively called “Big Data Technology”. Related platforms are listed below.
Apache Hadoop (can perform data-intensive applications), Apache Spark (fast cluster computing platform), Apache Flink (faster than Apache Spark), Apache Hive, Apache Flume, Apache Sentry (Hadoop security), IBM InfoSphere, Apache S4, RHadoop (ML packages), Hortonworks Hadoop, Twitter's Storm, Dremel (behind Google's BigQuery service), Lambda Architecture, and epiC. The potency of Hadoop's core and ecosystem can be considered a “dispensing pipeline”.
Four aspects of classification for large-scale data are categorized (Verma et al., 2015) as given below.
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. It can handle large datasets (terabytes/petabytes of data) and large clusters (hundreds or thousands of nodes). It is based on a simple data model into which any data will fit. Hadoop does not support updating data once a file is closed, but it does support appending to the file. In 2005, Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.
Hadoop tools can help and solve real problems. Figure 1 presents the detailed understanding of each tool.
The tools used for various purposes are listed below.
Hadoop promises to adhere to design principles so as to deliver the right applications to the end user. The following are the important principles to be adopted when executing the design.
The Hadoop architecture (master/slave model) consists of Hadoop Common plus three core components: the Hadoop Distributed File System (HDFS), MapReduce (MR), and YARN (Yet Another Resource Negotiator). Hadoop comes in three main flavours: Apache Hadoop, RHadoop, and Hortonworks Hadoop. The MapReduce (MR) framework allows massive scalability and the processing of extremely large files. MR is the heart of Hadoop, where processing is carried out by assigning tasks to various clusters. All of these components can be combined into a Big Data Ecosystem (BDE) as shown in Figure 1. How it deals with the evolving data, models, and supporting infrastructure during the whole “Big Data life cycle” is given in Table 2.
Table 2. Big Data Life Cycle
Storage is a critical component in data centres, which continue to grow in size to accommodate the volume of data. Data storage is the place where data is held in electromagnetic or optical form for access by a computer processor. In modern enterprises, data storage has become an essential business technology, especially for large-scale data such as big data, as detailed below.
It is a terminology that deals with storage technologies, especially for large-scale data; it represents long-term persistent storage and retrieval of both structured and unstructured data across geographically dispersed locations.
HDFS is a distributed file system designed to be fault tolerant and to be deployed on low-cost hardware. It provides high-throughput access to data, is suitable for applications with large-scale datasets, and also supports streaming access to file system data. Hadoop implements Google's MapReduce using HDFS.
MapReduce (Dean and Ghemawat, 2008) is a batch query processor and data-parallel programming model (consisting of the steps of splitting the input files, mapping (distribution), shuffling and sorting, reducing (parallelizing), and aggregating the output files) that hides the complexity of distribution and fault tolerance (where Map = distribute, aggregate, collaborate, and Reduce = parallel processing). MapReduce divides applications (input data) into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. Resilience to hardware failure, moving the computation closer to the data, and portability across all platforms (being developed in Java) are the main goals of Hadoop.
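The phases named above can be sketched in miniature in plain Python using the usual word-count example. This is not Hadoop code; in Hadoop the splitting, shuffling, and aggregation are handled by the framework across the cluster, whereas here each phase is shown explicitly on a toy input.

```python
from collections import defaultdict

# Toy input standing in for the input splits.
documents = ["big data needs big storage", "spark processes big data"]

# 1. Split + Map: emit (key, value) pairs from each input record.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# 2. Shuffle + Sort: group all values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# 3. Reduce: aggregate the values for each key (done in parallel on a cluster).
counts = {key: sum(values) for key, values in groups.items()}

# 4. Aggregate output: here simply print the final result.
print(counts)  # e.g. {'big': 3, 'data': 2, ...}
```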
Python (minimal code), Scala, Java, Ruby, R (better visualization), and Clojure are BD-supported languages with Application Programming Interfaces (APIs); using them, wonders can be done in model development. Moreover, one must be familiar with Unix shell commands along with SQL commands to deliver better applications.
The overall life cycle behind big data is illustrated in Table 2.
Analytics technology refers to the systematic computational analysis that transforms data into information; it is described as “data-driven decision-making” (Saravanakumar and Hanifa, 2017).
The main goal of analytics technology is to capture data collected from different sources and analyze it to reach an optimal decision. Analytics (“Data Analytics”) refers to the organized computational analysis that transforms
Data → Information → Knowledge → Intelligence, which is called “data-driven decision-making”.
Bring the computation to the data rather than bringing the data to the computation.
That is, understanding how data can be preprocessed is important for improving data quality and the analysis results. Prior to data analysis, data must be well constructed.
Divide and Conquer (i.e., Data Partitioning) for large datasets.
The above constitutes the philosophy of computation.
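A minimal PySpark sketch of this philosophy is given below: the data is partitioned (divide and conquer), each partition is reduced where it resides (computation goes to the data), and only small partial results travel back to be combined. The data and partition count are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
sc = spark.sparkContext

# Divide: distribute one million numbers across 8 partitions.
numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)

# Conquer: each partition is summed locally by its executor, then the
# small per-partition sums are combined into the final answer.
total = numbers.reduce(lambda a, b: a + b)
print(total)  # 500000500000
```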
The purpose of using Big Data or BDA is to gain:
Hindsight - metadata patterns emerging from historical data
Insight - deep understanding of issues or problems
Foresight - accurate predictions about the near future
Based on the solutions they provide, analytics can be classified into four types (refer to Table 3).
There is a sequence of phases, called the “Project Life Cycle”, for developing better big data analytics and visualizing their performance to the stakeholders' satisfaction; this “BD Analytics Project Life Cycle” is detailed in Table 4.
The dimensions of BD have been elevated from 3 V's to 6 V's (coined by various vendors) and onward to 10 V's (Figure 2), and the BD value chain, from data acquisition to data usage, is depicted in Figure 3.
Spark, a cluster-computing architectural model for Big Data solutions, works both in-memory and on-disk, holding intermediate results in memory instead of writing them to disk wherever possible. It is the most active open source project in big data and has the calibre to replace MapReduce (MR) (there is no need to write MR code). It offers flexible in-memory data processing for Hadoop, with easy development, a flexible and extensible API, and fast batch and stream processing. Its support for lazy evaluation of big data queries helps in optimizing data processing workflows. The Resilient Distributed Dataset (RDD) is one of the key programming abstractions of Spark's parallel programming model. It allows a failed task to be completed by backtracking through its lineage and recomputing only what is needed, instead of starting everything from scratch.
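The sketch below shows these two ideas, lazy evaluation and in-memory persistence, with the RDD API; the log file name is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: nothing is read or computed at this point.
lines = sc.textFile("server.log")            # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l)

# Keep the intermediate result in memory so later actions reuse it
# instead of re-reading and re-filtering the input.
errors.cache()

# Actions trigger evaluation of the whole lineage; lost partitions can be
# recovered by replaying that lineage rather than restarting the job.
print(errors.count())
print(errors.take(5))
```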
Spark consists of seven major components: the Spark Core data engine, the Spark cluster manager (Hadoop YARN, Apache Mesos, or the built-in standalone cluster manager), Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and Spark programming tools.
To analyze massive data sets, suitable hardware and software are essential; that is, effective computing platforms, development tools, and algorithms must be available. Huge datasets are analyzed with respect to data volume, the bandwidth of the network used, and various computational constraints such as end-user locations and the algorithms/techniques used. Four computational platforms are used to perform large data set analysis: supercomputing, grid computing, cloud computing, and heterogeneous computing (which uses computational processors such as General Purpose Processors (GPP) and Graphics Processing Units (GPU)).
There are two levels of scaling:
Machine Learning (ML) is the science of making machines work without explicitly programmed, predefined rules, by learning from data. It involves training or fitting a model on historical data and then using the trained model to make predictions for new (unseen) data in a testing file. The core methodology in Data Analytics (DA) is Machine Learning (ML); it is an area of computer science that aims to build systems and algorithms that learn from large datasets, and carrying out analytic tasks through the required algorithms is also an essential part of BDA. The essence of ML is the automatic recognition of patterns by a learning machine. ML uses data to make inferences or predictions for the specified tasks. Historically, many terms have been used to describe essentially the same idea as ML, such as “learning from data”, “pattern recognition”, “data science”, “Data Mining (DM)”, “text mining”, or even “business intelligence”. One of the major techniques of ML is Data Mining (DM), in which the following patterns of data are looked into.
Spark is a general-purpose big data platform, and MLlib is a standard component of Spark that provides machine learning primitives on top of it. MLlib is built on Apache Spark, a fast and general engine for large-scale data processing, and it is specialized for large-scale machine learning. Programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, and applications can be written quickly in Java, Scala, or Python.
These are comprehensively explained in Table 5 so that application developers can understand the spark.ml package. MLlib consists of ML algorithms designed for fast execution of programs.
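As a concrete illustration of the spark.ml workflow (fit a model on historical, labelled data and score new data with it), a minimal pipeline sketch is given below using a vector assembler and logistic regression. The column names and tiny in-memory dataset are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical historical (labelled) data: two features and a binary label.
train = spark.createDataFrame(
    [(1.0, 25.0, 0.0), (2.5, 31.0, 1.0), (0.5, 22.0, 0.0), (3.0, 40.0, 1.0)],
    ["f1", "f2", "label"])

# Assemble the raw columns into a feature vector, then fit the classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

# Score previously unseen records with the trained model.
new_data = spark.createDataFrame([(1.8, 28.0)], ["f1", "f2"])
model.transform(new_data).select("prediction", "probability").show()
```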
Apache Mahout is an ML library built on top of Hadoop, now integrated with Apache Spark to provide “in-memory” performance and avoid scalability issues (Table 6).
Table 6. Apache Mahout in ML Library
Table 7 describes the complete details of the Spark components and the core benefits of in-memory computing with Apache Spark for building any BD analytical application.
Table 7. Spark Components
ML-based Recommendation Systems (RS) have become very popular in recent years and are necessary for a variety of applications. Recommendation is the key to discovering and retrieving content in the era of huge data and is a proven use case of BD, which allows recommendations at a scale not seen before. RS form an active research area in the health care domain and are a classical example of Machine Learning (ML) applications providing quality recommendations (Katarzyna, 2006). Applications of RS in the health domain are still emerging; they are used in the health care industry to facilitate better health services for patients, such as food recommendations for diabetes mellitus and immune system disorders, and to assist doctors and hospital personnel in making decisions.
The key role played by collective-intelligence RS in health care is growing rapidly, and they are used to detect behavioural changes across a wide range of health conditions (Wiesner and Pfeifer, 2014; Sadasivam et al., 2016). There are two broad categories of recommender engine algorithms (Verma et al., 2015): user-based and item-based recommenders. These recommendations rely on two filtering techniques, namely collaborative filtering and content-based filtering. Collaborative filtering is used to understand the future behaviour of a patient with respect to a disease and has tremendous future applications.
The taxonomy of Recommender System techniques converges into three levels: Collaborative Filtering (CF), comprising memory-based CF and model-based CF, and content-based filtering. The taxonomy of recommender agents is also worth mentioning: Information Filtering Techniques [demographic (interest) filtering, content-based filtering, and collaborative filtering] and Matching Techniques [user profile-item matching (standard keyword matching, cosine similarity, nearest neighbour, classification) and user-profile matching (find similar users, create a neighbourhood, compute a prediction based on the selected neighbours)]. The Alternating Least Squares (ALS) algorithm is well suited and is leveraged to find appropriate recommendations for users; as implemented in Apache Spark, it is substantially faster than Mahout.
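A minimal sketch of ALS-based collaborative filtering in spark.ml follows; the ratings data and the parameter values (rank, iterations, regularisation) are hypothetical and shown only to illustrate the API.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# Hypothetical explicit-feedback ratings: (userId, itemId, rating).
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 4.0)],
    ["userId", "itemId", "rating"])

# Alternating Least Squares factorises the user-item rating matrix.
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 3 items for every user.
model.recommendForAllUsers(3).show(truncate=False)
```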
Cloud computing (Table 8) plays a critical role in the BDA process, as it offers subscription-oriented access to computing infrastructure, data, and application services (Buyya et al., 2013). The original objective of BDA was to leverage commodity hardware to build computing clusters and scale out the computing capacity for web crawling and indexing workloads.
Table 8. Cloud Computing – An Outlook
The implementation of cloud computing is underpinned by three service models, four deployment models, and five characteristics (Caesar and Buyya, 2015), the so-called 3S-4D-5C definition, which is given below.
This review assists novice researchers in understanding the A to Z of data, data quality, data preprocessing, and data processing through large-scale processing engines, and in selecting, applying, and evaluating the right ML-supported algorithms (available in MLlib on the Apache Spark platform). Along with that, the design principles of Hadoop are explored comprehensively so as to attain better solutions (building the best data analytics) in a BD environment.