Improving the Performance of KNN Classification Algorithms by Using Apache Spark

B. Rajesh*, Asadi Srinivasulu**
* M.Tech Scholar, Department of Software Engineering, Jawaharlal Nehru Technological University Ananthapur, Andhra Pradesh, India.
**Associate Professor, Department of Information Technology, Sree Vidyanikethan Engineering College, Andhra Pradesh, India.
Periodicity: July - December 2017


Data mining and machine learning are active research areas that extract meaningful information from large amounts of available data and convert it into an understandable form for further use. Diabetes is a growing disease all over the world. Healthcare professionals want a reliable prediction system to diagnose this polygenic disease. Available tools and techniques are used to find appropriate approaches and methods for classifying diabetes and extracting valuable patterns. Apache Spark was employed as the mining tool for diagnosing diabetes. Thus, using Spark, the performance of KNN classification can be improved.


Big Data Analytics, Machine Learning, Healthcare, kNN, K-Means, Spark, Classifiers

How to Cite this Article?

Rajesh, B., and Srinivasulu, A. (2017). Improving the Performance of KNN Classification Algorithms by Using Apache Spark. i-manager's Journal on Cloud Computing, 4(2), 23-32.


