Nowadays, companies are concentrating on data to make informed decisions. Companies that are able to use data effectively are the world leaders in terms of wealth, development and growth. Even to survive, operate and compete in this age, organizations need to be able to use their data effectively. Large investments are made in storing and processing large amounts of data to make better decisions. A Data Lake is a massive, easily accessible data store/repository that allows for collecting large volumes of structured and unstructured data in its native format from disparate data sources. This paper describes the Data Lake concept, Schema-on-Write, Schema-on-Read, and the characteristics and implementation of a Data Lake.
Long gone are the days when businesses could be run on the intuition or personal experience of a few people. Data-driven organizations are the order of the day, since that is the only way to survive and thrive in these days of cutthroat competition. Data never lies, and decisions based on evidence and conclusions derived from the analysis of data are more likely to succeed and less likely to fail. Many studies have shown that data-driven organizations perform better than their competitors.
Until now, organizations have depended, and still depend, on Enterprise Data Warehouses (EDWs) for data-driven decision making. The EDW caters to the operational reporting needs and the analytical requirements of the organization. Operational reporting provides the data required to support the day-to-day activities of the organization. Analytical dashboards help in deriving Business Intelligence and insights out of the data collected over a period of time [1-6]. Building and maintaining an EDW more or less consists of the following steps:
This approach works fine within the following constraints.
With the great increase in the volume, velocity and variety of data, new solutions and concepts are emerging to serve the needs of data-driven organizations. The EDW is simply not a cost-effective solution for dealing with this kind of Big Data.
The Data Lake is one such concept and solution that is gaining traction these days. A Data Lake is a massive, easily accessible data store/repository that allows for collecting large volumes of structured and unstructured data in its native format from disparate data sources [12-15]. The idea is to get the data into the Data Lake from the sources with a minimal amount of processing (compared to the ETL processes of an EDW) and into a less rigid structure (compared to that of a canonical data model).
At this point, it is appropriate to delve into the concepts of Schema-on-Write and Schema-on-Read before proceeding further with the Data Lake concept.
Schema-on-Write refers to the traditional EDW approach, in which a schema must be defined before data can be loaded. In relational parlance, you need a table in advance to be able to load data into it. If new fields appear in the incoming data, the table must be altered and columns added before that data can be loaded. This has the advantages of data consistency and data retrieval at interactive speeds, because the organization of the data is tightly controlled by well-designed ETL processes and it is clearly known where the data in question resides. This model is good at answering known business questions. Its disadvantage is the loss of agility and the high cost and time involved in responding to even small changes.
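A minimal sketch of Schema-on-Write, using Python's built-in sqlite3 module, illustrates the point; the table and column names are purely illustrative and not taken from any particular EDW.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # The schema must be defined before any data can be loaded.
    cur.execute("""
        CREATE TABLE sales (
            order_id   INTEGER PRIMARY KEY,
            customer   TEXT NOT NULL,
            amount     REAL NOT NULL
        )
    """)
    cur.execute("INSERT INTO sales (order_id, customer, amount) VALUES (?, ?, ?)",
                (1, "Acme Corp", 250.0))

    # If the incoming data gains a new field (e.g. a region code), the table
    # must first be altered before that data can be loaded.
    cur.execute("ALTER TABLE sales ADD COLUMN region TEXT")
    cur.execute("INSERT INTO sales (order_id, customer, amount, region) VALUES (?, ?, ?, ?)",
                (2, "Globex", 99.5, "EMEA"))

    conn.commit()
    print(cur.execute("SELECT * FROM sales").fetchall())

Every change to the incoming data forces a schema change (and usually an ETL change) before loading can resume, which is the source of the rigidity described above.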
Schema-on-Read refers to the concept of ingesting data as-is into the data store in its native format, with a minimal amount of preparation and processing. The Data Lake uses this Schema-on-Read model. Some metadata may be associated with the data while storing it, so that it can be retrieved and organized in a more useful format at a later date. This eliminates heavy processing at the time of data loading and allows for a faster response to changes in the incoming data. When the actual use of the data becomes known, in the context of the business questions it needs to answer, and the data needs to be read and used, the schema of the data can then be defined. This is known as Late Binding. Thus the time and cost associated with data processing are deferred until the time of its actual usage. This is appropriate in the world of Big Data, where the value of data may not be known from the very start.
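The following is a minimal sketch of Schema-on-Read and Late Binding in Python; the file name, records and field choices are assumptions for illustration only.

    import json

    # Ingest: write incoming records in their native (JSON) format with no
    # upfront transformation.
    raw_records = [
        '{"title": "Jaws", "year": 1975, "director": "Steven Spielberg"}',
        '{"title": "Vertigo", "year": 1958, "cast": ["James Stewart", "Kim Novak"]}',
    ]
    with open("landing_zone_movies.json", "w") as f:
        f.write("\n".join(raw_records))

    # Read: only now, when a business question is known (e.g. titles and years),
    # is a schema projected onto the raw data.
    def read_with_schema(path, fields):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                # Fields missing from a record surface as None instead of
                # breaking the ingestion step.
                yield {field: record.get(field) for field in fields}

    for row in read_with_schema("landing_zone_movies.json", ["title", "year"]):
        print(row)

Ingestion stays cheap and tolerant of change; the interpretation of the data is bound to it only at read time.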
The term Data Lake was coined by James Dixon, the CTO and co-founder of Pentaho [7-9]. He first introduced the term in a blog entry in October 2010 (https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/).
In that blog entry, he gave the analogy of bottled water for the structured and cleansed data in a data mart, whereas a Data Lake is compared to a large body of water in a more natural state.
As with any concept or technology, the Data Lake has evolved since James Dixon first described it five years ago. The following characteristics of a Data Lake [10] can be seen when we study the various contexts in which the term is being used today.
The most popular option for implementing a Data Lake is Apache Hadoop. Data in Data Lakes is stored on the Hadoop Distributed File System (HDFS).
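As a rough sketch, landing a raw file in HDFS can be done with the third-party "hdfs" Python package (a WebHDFS client); the NameNode URL, user, paths and file name below are assumptions for illustration only, and the file is copied in its native format without any transformation.

    from hdfs import InsecureClient

    # Assumed WebHDFS endpoint and user for this sketch.
    client = InsecureClient("http://namenode:9870", user="datalake")

    # Create a landing area and copy a local raw file into the lake as-is.
    client.makedirs("/data_lake/raw/movies")
    client.upload("/data_lake/raw/movies/movies_2015.json", "movies_2015.json")

    print(client.list("/data_lake/raw/movies"))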
Booz Allen Hamilton Inc. [11] explained an innovative approach using key/value to organize data in the Data Lake. The metadata is stored along with the Data in the form of tags. These tags allow for storing, managing and retrieval of Data. The example in Table 1 shows the movie data organized using tags. In actual implementation scenario, further types of tags like Tag Group, Time Stamp and Visibility may be associated with the data. The table shows how flexibly the data can be organized. As newer type of data get added in future, the same can be simply handled by adding a new tag.
Table 1. Movie Information Organized Using Tags
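To make the tag-based organization concrete, the following is a minimal sketch using plain Python dictionaries; the movie attributes, tag names and values are illustrative and not taken from the Booz Allen Hamilton example.

    from datetime import datetime, timezone

    def make_cell(row_key, tag, value, tag_group="movie", visibility="public"):
        """Store a single attribute as a tagged key/value cell."""
        return {
            "row_key": row_key,          # identifies the record (here, the movie)
            "tag": tag,                  # names the attribute
            "tag_group": tag_group,      # optional grouping of related tags
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "visibility": visibility,    # access label for the cell
            "value": value,
        }

    data_lake = [
        make_cell("movie:1", "title", "Jaws"),
        make_cell("movie:1", "year", 1975),
        make_cell("movie:1", "director", "Steven Spielberg"),
        # A new kind of information later simply becomes a new tag; no schema
        # change is required.
        make_cell("movie:1", "box_office_usd", 470_000_000),
    ]

    # Retrieval: select cells by tag.
    titles = [cell["value"] for cell in data_lake if cell["tag"] == "title"]
    print(titles)

Because each attribute is its own tagged cell, adding a new kind of information never requires restructuring the existing data.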
The inability of existing analytical systems to respond at the speed of business has given rise to the Data Lake concept and its adoption in businesses. This is not to say that all existing analytical systems are obsolete, but there are many use cases for which a Data Lake may be the appropriate and cost-effective solution. A Data Lake is one solution that can serve purposes and requirements that are not even known at this moment, and thus enables organizations to be future-ready. Organizations planning to adopt a Data Lake strategy should also be aware of the risk of their data stores turning into data graveyards or data sewers when data is dumped into the Data Lake without proper thought. Using proper technologies and tested, industry-standard methods and processes will prevent this problem.