Data Quality Evaluation Framework for Big Data

Grace Amina Onyeabor*, Azman Ta’a**
* Lecturer, Department of Information Science, University of Ibadan, Nigeria.
** Senior Lecturer, Department of Information Science, University of Ibadan, Nigeria.
Periodicity: July - December 2018
DOI: https://doi.org/10.26634/jcc.5.2.15692

Abstract

Data is an important asset in every business organization today, so poor data quality can have serious consequences, chiefly erroneous insights. Data Quality (DQ) therefore needs to be evaluated before any Big Data (BD) is analyzed. Evaluating DQ in BD is challenging: the datasets are enormous, come in varied formats, and are generated at high speed, which makes traditional DQ evaluation methods impractical. Strategies and tools are needed that can assess and evaluate DQ in BD quickly and efficiently. Assessing the quality of an entire BD set, however, can be very expensive, and the data transformation activities applied to BD also need improvement. This paper proposes a framework for DQ evaluation that applies a data sampling technique to BD sets from different data sources, reducing the data to samples that represent the population of each BD set. The Bag of Little Bootstraps (BLB) sampling technique will be used. The target Data Quality Dimensions (DQDs) in this paper are completeness, consistency, and accuracy, each measured with metric functions relevant to that dimension. The measurements will be taken before and after an improved data transformation technique is applied, to check whether DQ in the BD has improved.
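
As an illustration only, the sampling-based evaluation described in the abstract can be sketched for one dimension, completeness. The abstract does not give the framework's metric functions or parameters, so everything here is an assumption: the function names weighted_completeness and blb_completeness, the treatment of None and empty strings as missing, and the defaults s, r, and gamma are all hypothetical. The sketch follows the general Bag of Little Bootstraps idea: draw s small subsets of size b = n^gamma without replacement, resample each one r times up to the nominal size n using multinomial counts, and average the results.

```python
import random
import statistics
from collections import Counter


def weighted_completeness(records, weights, fields):
    """Weighted share of field values that are present.

    Assumption: a value counts as missing if it is None or an empty string.
    """
    total = sum(weights) * len(fields)
    if total == 0:
        return 0.0
    present = sum(
        w
        for rec, w in zip(records, weights)
        for f in fields
        if rec.get(f) not in (None, "")
    )
    return present / total


def blb_completeness(records, fields, s=5, r=20, gamma=0.7, seed=0):
    """Estimate completeness with a Bag of Little Bootstraps style procedure.

    s     : number of subsets drawn without replacement (hypothetical default)
    r     : bootstrap resamples per subset, must be at least 2
    gamma : subset size exponent, b = n ** gamma
    Returns (point_estimate, standard_error), each averaged over the subsets.
    """
    rng = random.Random(seed)
    n = len(records)
    b = max(1, int(n ** gamma))
    means, errors = [], []
    for _ in range(s):
        subset = rng.sample(records, b)
        estimates = []
        for _ in range(r):
            # Multinomial counts: n draws spread uniformly over the b records,
            # so each resample has nominal size n but touches only b records.
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts.get(i, 0) for i in range(b)]
            estimates.append(weighted_completeness(subset, weights, fields))
        means.append(statistics.mean(estimates))
        errors.append(statistics.stdev(estimates))
    return statistics.mean(means), statistics.mean(errors)


if __name__ == "__main__":
    # Toy data: every tenth record lacks an email address.
    data = [
        {"id": i, "email": None if i % 10 == 0 else f"user{i}@example.org"}
        for i in range(1000)
    ]
    estimate, error = blb_completeness(data, ["id", "email"])
    print(f"completeness ~ {estimate:.3f} (standard error ~ {error:.3f})")
```

Consistency and accuracy could be plugged into the same loop by substituting another weighted metric function, for example the weighted share of records that satisfy a rule or that match a reference value. Running the same estimate before and after a transformation step would give the before/after comparison the abstract describes.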

Keywords

Big Data, Data Sampling, Data Transformation, Data Quality Evaluation.

How to Cite this Article?

Onyeabor, G. A., & Ta’a, A. (2018). Data Quality Evaluation Framework for Big Data. i-manager's Journal on Cloud Computing, 5(2), 27-35. https://doi.org/10.26634/jcc.5.2.15692
