Genetic Algorithm Using MapReduce - A Critical Review

Palak Sachar *  Vikas Khullar **
* PG Scholar, Department of Computer Science and Engineering, CT Group of Institutions, Jalandhar, Punjab, India.
** Assistant Professor, Department of Computer Science and Engineering, CT Group of Institutions, Jalandhar, Punjab, India.

Abstract

Nowadays, achieving an optimized solution for hard problems is a big challenge, and researchers are putting great effort into algorithms that optimize such problems effectively. The Genetic Algorithm is one of the stepping stones in this challenge: an evolutionary algorithm inspired by Darwin's theory of evolution. Using this algorithm with MapReduce makes it efficient and user friendly, since MapReduce provides a better abstraction of the genetic algorithm and lets users build more scalable applications in less time. MapReduce plays a vital role in parallelizing a project on the Hadoop platform, and the platform may vary from Hadoop to the cloud, which affects performance significantly. Parallelizing a genetic algorithm is convenient with the help of MapReduce. The major objective of this study is to understand the behavior of the Genetic Algorithm under the Hadoop MapReduce paradigm. Various applications show different trends influenced by this platform, and the literature strongly indicates the advantages of the Hadoop MapReduce platform over other platforms. Moreover, the differences between various parallelization paradigms are given in the paper to support decisions regarding the implementation of future work.

Keywords :

Introduction

The Genetic Algorithm is a metaheuristic technique used for finding optimal solutions to hard problems [16, 42]. A genetic algorithm mimics natural evolution [1,2,3]. Genetic algorithms are applied to many large-scale problems, but traditional Message Passing Interface (MPI) based parallel genetic algorithm approaches require detailed knowledge of the underlying machine architecture [30]. To provide abstraction over the working of GAs, a platform known as MapReduce can be used [40].

MapReduce provides a powerful platform for parallel GAs: it abstracts the internal working of the GA so that the user can build scalable applications. MapReduce was first introduced by Google [14, 41, 45].

This paper contains a literature survey on genetic algorithms using MapReduce and some related applications, along with a comparison of the technologies used with GAs. The last section comprises future trends followed by the conclusion.

1. MapReduce

MapReduce is a programming model for processing large-scale datasets on computer clusters. The Map step runs on the input dataset, which is divided into separate splits, and transforms each record into intermediate key/value pairs. The Reduce step then aggregates the values that share the same key into the final output [6,7,18].

Map and Reduce act as two functions, map() and reduce(). The major task of map() is to take a record from the input dataset and produce a key/value pair, generally called the intermediate key/value pair. A partitioning step then groups these key/value pairs by their intermediate key. The grouped intermediate sets are sent to the reduce function, and reduce() merges the set of values associated with each key to generate a smaller set of values [8].

Both map() and reduce() are written by the user, and the intermediate stage is handled by the Hadoop MapReduce framework [20, 49]. This workflow is shown in Figure 1.

Figure 1. Working of MapReduce
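To make the map/partition/reduce flow concrete, the following minimal Python sketch simulates the three stages on a small in-memory dataset (a word count, the canonical MapReduce example). It is a conceptual illustration of the programming model rather than actual Hadoop code, and the function and variable names (map_fn, reduce_fn, run_mapreduce) are chosen for this example only.

from collections import defaultdict

def map_fn(record):
    # Emit intermediate key/value pairs: one (word, 1) pair per word.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # Merge the list of values for one intermediate key into a smaller result.
    return key, sum(values)

def run_mapreduce(records):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for rec in records for pair in map_fn(rec)]
    # Shuffle/partition phase: group values by their intermediate key
    # (the Hadoop framework performs this step between map and reduce).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: merge each group into the final output.
    return dict(reduce_fn(k, v) for k, v in groups.items())

data = ["map reduce on hadoop", "map large data", "reduce large data"]
print(run_mapreduce(data))   # e.g. {'map': 2, 'reduce': 2, 'on': 1, ...}

On a real cluster, the framework distributes the map calls across input splits and performs the grouping during the shuffle between the map and reduce phases.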

2. MapReduce Genetic Algorithm

A genetic algorithm is a series of five steps: initialization, selection, crossover, mutation, and a termination criterion.

During initialization (a), the individuals of the population are generated. For the selection process (c), the fitness function (b) is evaluated for each individual, and two individuals are selected for further computation. Information is exchanged between the two individuals in the crossover process (d), and finally a new individual is created by recombining the selected individuals (e) [33,34,35]. The functioning of a GA is shown in Figure 2.

Figure 2. Working of GA [10]

The role of map() in a genetic algorithm implemented on Hadoop is to calculate the fitness function and keep a record of each individual's fitness in decreasing order. The map function also writes the new individuals to HDFS, and it is the task of map() to write the final individuals to a single file after all iterations. The reduce function implements the methods through which the selection and crossover processes are carried out [8,9,10].
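As a rough illustration of this division of labour, the sketch below runs MapReduce-style GA generations in plain Python: the map phase evaluates each individual's fitness and emits it under a reducer key, and the reduce phase performs selection, crossover, and mutation. It assumes a simple bit-string encoding with a OneMax-style fitness; it is not the implementation of [8,9,10], and all names and parameters are illustrative.

import random
from collections import defaultdict

GENOME_LEN, POP_SIZE, MUTATION_RATE, NUM_REDUCERS = 20, 8, 0.05, 2

def fitness(individual):
    # Illustrative fitness: number of ones in the bit string (OneMax).
    return sum(individual)

def map_phase(population):
    # map(): evaluate each individual's fitness and emit
    # (reducer_key, (fitness, individual)) pairs.
    for ind in population:
        yield random.randrange(NUM_REDUCERS), (fitness(ind), ind)

def reduce_phase(groups):
    # reduce(): selection, crossover and mutation build the next generation.
    next_gen = []
    for scored in groups.values():
        scored.sort(reverse=True)                     # best fitness first
        parents = [ind for _, ind in scored[:2]]
        if len(parents) < 2:
            next_gen.extend(parents)                  # too few to recombine
            continue
        for _ in range(len(scored)):                  # keep the population size
            cut = random.randrange(1, GENOME_LEN)     # one-point crossover
            child = parents[0][:cut] + parents[1][cut:]
            child = [bit ^ (random.random() < MUTATION_RATE) for bit in child]
            next_gen.append(child)
    return next_gen

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for _ in range(10):                                   # generations
    groups = defaultdict(list)
    for key, value in map_phase(population):          # shuffle by key
        groups[key].append(value)
    population = reduce_phase(groups)
print(max(fitness(ind) for ind in population))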

3. Literature Review

This literature review surveys various applications of genetic algorithms implemented on the MapReduce framework.

In 2009, Verma et al. introduced a genetic algorithm using MapReduce for scaling. In this framework, the genetic algorithm is distributed into various phases, and to satisfy Amdahl's law a separate MapReduce job is used [4, 6]. The default partitioner is replaced by a random partitioner that sends records to reducers at random so that no reducer is overloaded. Verma et al. critiqued this method in their subsequent publications [17,50] and improved it in [43,44]. In the improved version, access to HDFS is minimized: the reducer collects the output from the map function directly and extracts the best individuals from the population. Due to these improvements, the algorithm runs faster.
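The idea of swapping the default hash partitioner for a random one can be illustrated with the following small Python sketch. It is not Verma et al.'s code (in Hadoop this would be a custom Partitioner class); it simply contrasts how a skewed key distribution loads reducers under hash-based versus random assignment, and the names and key distribution are invented for the example.

import random
from collections import Counter

NUM_REDUCERS = 4

def default_partition(key):
    # Hash-based behaviour: the key is hashed modulo the number of reducers,
    # so skewed keys can overload a single reducer.
    return hash(key) % NUM_REDUCERS

def random_partition(_key):
    # Random partitioning: ignore the key and pick a reducer uniformly at
    # random, spreading records evenly across reducers.
    return random.randrange(NUM_REDUCERS)

# A skewed key distribution: most records share the same key.
keys = ["best"] * 90 + ["other%d" % i for i in range(10)]
print("hash-based:", Counter(default_partition(k) for k in keys))
print("random    :", Counter(random_partition(k) for k in keys))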

Milos Ivanovic et al. [24] presented a framework for optimization using a parallel genetic algorithm, implemented over grid and HPC resources. This work removes various shortcomings of the static pilot-job framework, provides complete abstraction of the grid's complexities, and allows users to concentrate on the optimization problem. The results obtained with the framework show how the approach speeds up projects whose computations are long and expensive.

Chengyu Hu et al. developed one notable application of the genetic algorithm called MR-PNGA, a MapReduce-based Parallel Niche Genetic Algorithm. This framework helps identify containment sources in large water distribution networks, a problem that requires extremely demanding, real-time computation. The framework produces accurate results, and its performance can be further improved by deploying it on cloud services [25]. In addition to the usual five steps, this work adds one more step to the series, called a niche. The purpose of the niche is to screen every individual so that the individual with the best fitness function emerges. MR-PNGA is able to find multiple possible containment sources quickly and accurately.

Masha A. Alshammari proposed a framework through which it is easy to find the minimum reduct in minimum time. The proposed genetic algorithm framework is inspired by the work of Keco and Subasi [12], who reduced the I/O footprint to minimize overhead. In the framework, each instance is compared to the other instances, which balances the load, and the GA population is divided among the set of nodes. The optimization goal is to find the chromosome with the highest fitness level in the least time [22]. Alshammari observed that doubling the number of mappers degrades performance on large clusters due to context switching [26, 27].

Fei Teng and Doga Tuncay [48] exploited the parallel nature of the genetic algorithm on the Hadoop MapReduce framework and the Twister iterative MapReduce framework, applying them to a max problem for performance analysis.

Filomena Ferrucci et al. described three main models of grained parallelism implemented using the MapReduce paradigm [10]. These models are:

 

Global Parallelization Model:

This model is also known as the 'fitness evaluation model'. It is based entirely on the master-slave methodology, in which the master computes the GA's series of steps while the slaves compute the fitness function for each individual of the population [38,39]. Figure 3 shows the global parallelization model.

Figure 3. Global Parallelization Model
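A minimal sketch of the master-slave idea, using Python's multiprocessing pool to stand in for the slaves, is given below. The master performs selection and crossover while the pool evaluates fitness in parallel; the OneMax fitness and all parameters are illustrative assumptions, not part of the surveyed frameworks.

import random
from multiprocessing import Pool

GENOME_LEN, POP_SIZE = 32, 16

def fitness(individual):
    # Slave-side work: evaluate one individual's fitness (OneMax here).
    return sum(individual)

def master_step(population, pool):
    # Master-side work: ranking, selection and crossover, while the
    # fitness of every individual is computed in parallel by the slaves.
    scores = pool.map(fitness, population)
    ranked = [ind for _, ind in sorted(zip(scores, population), reverse=True)]
    parents = ranked[: POP_SIZE // 2]
    children = []
    for _ in range(POP_SIZE):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENOME_LEN)          # one-point crossover
        children.append(a[:cut] + b[cut:])
    return children

if __name__ == "__main__":
    population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
                  for _ in range(POP_SIZE)]
    with Pool(processes=4) as pool:
        for _ in range(20):
            population = master_step(population, pool)
    print(max(sum(ind) for ind in population))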

Coarse Grain Parallelization Model:

It is well known as the island model: the population is located in different places, each considered an island, and from time to time individuals migrate from one island to another. In this model, the number of reducers is greater than in the global parallelization model. Each reducer has partitioners, which help implement the genetic algorithm in parallel [31,32,37]. Figure 4 depicts the coarse-grained parallelization model; the white ovals represent the islands.

Figure 4. Island Model
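The island model can be sketched as follows: several sub-populations evolve independently and periodically exchange their best individuals over a ring topology. This is a conceptual, single-process Python sketch with an illustrative OneMax fitness; in the MapReduce setting each island would correspond to a reducer.

import random

GENOME_LEN, ISLAND_SIZE, NUM_ISLANDS = 24, 10, 4

def fitness(ind):
    return sum(ind)                                   # OneMax, for illustration

def evolve(island):
    # One generation on a single island.
    island.sort(key=fitness, reverse=True)
    parents = island[: ISLAND_SIZE // 2]
    children = []
    for _ in range(ISLAND_SIZE):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENOME_LEN)         # one-point crossover
        children.append(a[:cut] + b[cut:])
    return children

def migrate(islands):
    # Periodically send each island's best individual to the next island,
    # replacing that island's worst individual (ring topology).
    bests = [max(isl, key=fitness) for isl in islands]
    for i, isl in enumerate(islands):
        isl.sort(key=fitness)
        isl[0] = bests[(i - 1) % NUM_ISLANDS]
    return islands

islands = [[[random.randint(0, 1) for _ in range(GENOME_LEN)]
            for _ in range(ISLAND_SIZE)] for _ in range(NUM_ISLANDS)]
for gen in range(30):
    islands = [evolve(isl) for isl in islands]
    if gen % 5 == 4:                                  # migration interval
        islands = migrate(islands)
print(max(fitness(ind) for isl in islands for ind in isl))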

Fine-Grain Parallelization Model:

This model is inspired by a grid model. The population is placed on a grid, and the GA is computed by evaluating the fitness function over a nearby neighborhood. Partitioners are used as a pseudo function [29]. The fine-grained parallelization model is shown in Figure 5.

Figure 5. Fine-grain Parallelization Model
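A compact sketch of the grid idea is shown below: each individual occupies a cell of a toroidal grid and recombines only with the fittest of its four immediate neighbours. The grid size, fitness function, and replacement rule are illustrative assumptions for this sketch.

import random

ROWS, COLS, GENOME_LEN = 6, 6, 16

def fitness(ind):
    return sum(ind)                                    # OneMax, for illustration

def neighbors(r, c):
    # The four immediate grid neighbours (toroidal wrap-around).
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]

grid = [[[random.randint(0, 1) for _ in range(GENOME_LEN)]
         for _ in range(COLS)] for _ in range(ROWS)]

for _ in range(40):
    new_grid = [[None] * COLS for _ in range(ROWS)]
    for r in range(ROWS):
        for c in range(COLS):
            # Each cell recombines with the fittest individual in its neighbourhood.
            mate = max((grid[nr][nc] for nr, nc in neighbors(r, c)), key=fitness)
            cut = random.randrange(1, GENOME_LEN)
            child = grid[r][c][:cut] + mate[cut:]
            # Keep the child only if it improves on the current occupant.
            new_grid[r][c] = max(grid[r][c], child, key=fitness)
    grid = new_grid
print(max(fitness(ind) for row in grid for ind in row))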

Saurabh Arora et al. surveyed clustering techniques, and the genetic algorithm is one such technique used in Big Data analytics [19, 23, 28].

A comparison of these three grained-parallelization models is shown in Table 1.

Table 1. Comparison of Main Grain Parallelization Models

Chao Jin and Rajkumar Buyya introduced MRPGA, an extension of MapReduce for parallelizing the genetic algorithm framework [5]. In this project, a map function is used to map the population, and a single reduce function reduces the population locally, with the selection algorithm producing sets of sub-optimal individuals. A second reduce function then performs global selection; this second reduction phase requires a coordinator, which collects, merges, and sorts all the sets of sub-optimal individuals. MRPGA deploys a different partitioning policy in which individuals are collected, on the basis of their keys, according to their locations [5].

Di-Wei Huang and Jimmy Lin implemented a framework in which the authors scale up the population for solving the Job Shop Scheduling Problem (JSSP). The objective of this model is to schedule J jobs on N machines so that the overall time to complete all jobs is minimized. Each job contains an ordered list of operations, each of which must be processed by a certain machine for a particular period of time. In this project, the map phase evaluates the fitness function by building the schedule described by each chromosome, and a local search is performed while calculating the makespan. The reduce phase performs selection and crossover to generate the next population as output, and a new MapReduce job is then created. The whole process is repeated until a JSSP solution is found [15,21].
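The fitness evaluation at the heart of this approach, computing the makespan of the schedule encoded by a chromosome, can be sketched as follows. The job data and the permutation-with-repetition encoding are illustrative assumptions; this is not Huang and Lin's code, only a minimal decoder that a map task could apply to each chromosome.

# Each job is an ordered list of (machine, duration) operations.
jobs = [
    [(0, 3), (1, 2), (2, 2)],
    [(0, 2), (2, 1), (1, 4)],
    [(1, 4), (2, 3), (0, 1)],
]

def makespan(jobs, order):
    # Simulate dispatching operations in the given job order and return the
    # time at which the last operation finishes (the makespan).
    machine_free = {}          # machine id -> time the machine becomes free
    job_free = {}              # job id     -> time the job's last op finished
    next_op = {j: 0 for j in range(len(jobs))}
    for j in order:            # 'order' lists each job id once per operation
        m, d = jobs[j][next_op[j]]
        start = max(machine_free.get(m, 0), job_free.get(j, 0))
        machine_free[m] = job_free[j] = start + d
        next_op[j] += 1
    return max(job_free.values())

# A chromosome encoded as a permutation with repetition of job ids; the GA's
# fitness (to be minimised) is the resulting makespan.
chromosome = [0, 1, 1, 2, 0, 2, 1, 0, 2]
print(makespan(jobs, chromosome))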

Iza'in Nurfateha Ruzan and Suriyati Chuprat helped data scientists by proposing a hybrid algorithm that combines a genetic algorithm with Hadoop MapReduce for optimizing energy efficiency on a cloud computing platform [11].

The issue addressed in this paper is that the rapidly growing demand for data centers increases electricity consumption. DVFS (Dynamic Voltage and Frequency Scaling) is combined with DPM (Dynamic Power Management) to reduce the supply voltage and clock frequency. This technique lowers the CPU performance ratio when the CPU is not fully utilized: at run time, power consumption is controlled by switching between frequency and voltage levels, and a hybrid metric is used to manage energy consumption. The GA's involvement in this technique is:

 

Periasamy Vivekanandan and Raju Nedunchezhian took up another application: mining data streams with concept drift using a genetic algorithm. In this paper, the genetic algorithm is described as a strong rule-based classification algorithm that is used for mining small, static data sets [36,44]. A scalable and adaptive online algorithm is proposed to mine classification rules for data streams with concept drift. The authors conclude that, instead of using a fixed data set for fitness calculation, the proposed algorithm uses a snapshot of the training data extracted from the current portion of the data stream for fitness calculation. It builds the rules of each class separately, which makes the rules evolve quickly [14].
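The key idea, evaluating rule fitness on a sliding snapshot of the stream rather than on a fixed data set, can be sketched as below. The window size, the two hand-written rules, and the simulated stream are illustrative assumptions, not the algorithm of [14].

import random
from collections import deque

WINDOW = 200                                    # size of the stream snapshot
window = deque(maxlen=WINDOW)                   # keeps only the latest records

def rule_fitness(rule, snapshot):
    # Fitness of a classification rule = its accuracy on the current snapshot
    # of the stream, so rules can adapt when the concept drifts.
    hits = sum(1 for x, label in snapshot if rule(x) == label)
    return hits / len(snapshot) if snapshot else 0.0

# Two illustrative rules over a single numeric feature.
rules = {"x > 0.5": lambda x: int(x > 0.5), "x > 0.9": lambda x: int(x > 0.9)}

for _ in range(500):                            # simulated data stream
    x = random.random()
    window.append((x, int(x > 0.6)))            # true concept: x > 0.6
    scores = {name: rule_fitness(r, window) for name, r in rules.items()}
print(scores)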

Atanas Radenski [13] came up with the new idea of a distributed simulated annealing application using MapReduce, where MapReduce is an emerging distributed computing framework for large-scale data processing on clusters of commodity servers [46,47]. The GA+SA hybrid works as a two-job MapReduce pipeline: the first job runs a basic multi-population GA, and the second job runs a general-purpose simulated annealing (SA).
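The simulated annealing stage of such a pipeline revolves around the Metropolis acceptance rule, sketched here on a simple one-dimensional objective. This is a generic SA loop under assumed parameters (starting temperature, cooling factor), not Radenski's distributed implementation.

import math
import random

def anneal(objective, x, t_start=10.0, t_end=0.01, alpha=0.95):
    # Minimise `objective` starting from solution x using simulated annealing.
    best, temperature = x, t_start
    while temperature > t_end:
        candidate = x + random.uniform(-1.0, 1.0)        # random neighbour
        delta = objective(candidate) - objective(x)
        # Metropolis rule: always accept improvements; accept worse
        # solutions with probability exp(-delta / temperature).
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            x = candidate
            if objective(x) < objective(best):
                best = x
        temperature *= alpha                             # cooling schedule
    return best

print(anneal(lambda v: (v - 3.0) ** 2, x=0.0))           # should approach 3.0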

Dino Keco and Abdulhamit Subasi suggested two models for parallelizing genetic algorithms using MapReduce [12]. In the first model, one MapReduce job is used per generation, so every new generation needs a new MapReduce phase, and HDFS is responsible for transferring data between the generations of the GA. This model has a large I/O footprint, because each generation's full population is saved to HDFS, and this is the main reason for performance degradation. In the second model, a single MapReduce job is used to process the generations of the population. Most of the processing is moved from the reduce phase to the map phase, which reduces the I/O footprint because processing is kept in memory instead of going through HDFS [12].

Conclusion and Future Trends

It is evident that the genetic algorithm can provide strong optimization in less time and has the potential to play a key role in decision making: with its optimization capability, decision making in the business world could improve significantly. Moreover, the genetic algorithm is making inroads into predicting the effects of disease on brain anatomy, and this framework may enable better and faster search of genomic data. In the coming years of data science, this framework will play an important role in predicting health outcomes and better DNA combinations. Genetic algorithms and their applications could be used for bandwidth minimization, latency minimization, fitness optimization, energy optimization, software specialization, memory optimization, software transplantation, bug fixing, and multi-objective SE optimization.

References

[1]. K. Gallagher and M. Sambridge, (1994). “Genetic algorithms: a powerful tool for large-scale nonlinear optimization problems”. Comput. Geosci., Vol. 20, No. 7-8, pp. 1229–1236.
[2]. P. Fränti, J. Kivijärvi, T. Kaukoranta, and O. Nevalainen, (1997). “Genetic algorithms for large scale clustering problems”. Comput. J., Vol. 40, pp. 547–554.
[3]. K. Sastry, D. E. Goldberg, and X. Llora, (2007). “Towards Billion-bit Optimization via a Parallel Estimation of Distribution Algorithm”. In GECCO '07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp. 577–584, New York, NY, USA, ACM.
[4]. Verma, A., Llorà, X., Goldberg, D. E., and Campbell, R. H., (2009). “Scaling Genetic Algorithms Using MapReduce”. In: 9th International Conference on Intelligent Systems Design and Applications, pp. 13-18. IEEE, New York.
[5]. Chao Jin and Rajkumar Buyya, (2008). “MRPGA: An Extension of Map Reduce for Parallelizing Genetic Algorithm”. IEEE international Conference on Escience, pp. 214-221.
[6]. Zhou C., (2010). “Fast Parallelization of Differential Evolution Algorithm Using MapReduce”. In: 12th Annual Conference on Genetic and Evolutionary Computation, pp. 1113-1114, ACM, New York.
[7]. Filomena Ferrucci, (2015). "A Parallel Genetic Algorithm Framework Based on Map Reduce.” ACM New York, NY, USA, pp. 13-17.
[8]. Nivranshu Hans, Sana Mahajan, and SN Omkar, (2015). “Big Data Clustering Using Genetic Algorithm On Hadoop Mapreduce”. International Journal of Scientific & Technology Research, Vol. 4, No. 4.
[9]. Doina Logofatu, and Daniel Stamate, (2014). “Scalable Distributed Genetic Algorithm for Data Ordering Problem with Inversion Using MapReduce”. AIAI 2014, pp. 325-334.
[10]. Filomena Ferrucci, and M-Tahar Kechadi, (2013). “A Framework for Genetic Algorithms Based on Hadoop”. arXiv, Cornell University, USA, Vol. 30.
[11]. Iza'in Nurfateha Ruzan, and Suriayati Chuprat, (2012). “A Hybrid Algorithm Using Genetic Algorithm-Hadoop MapReduce Optimization for Energy Efficiency in Cloud Computing Platform”. International Journal of Science and Research (IJSR).
[12]. Dino Keco, and Abdulhamit Subasi, (2012). “Parallelization of Genetic Algorithm using Hadoop Map/reduce”, Southeast Europe Journal of Soft Computing, Vol.1, No. 2, pp. 56-59.
[13]. Atanas Radenski, (2012). “Distributed Simulated Annealing with MapReduce”. European Conference on Applications of Evolutionary Computation, pp. 466-476, Springer, Berlin.
[14]. Periasamy Vivekanandan, and Raju Nedunchezhian, (2011). “Mining Data Streams with Concept Drifts using Genetic Algorithm”. Artificial Intelligence Review, Vol. 36, No. 3, pp. 163-178.
[15]. Di-Wei Huang and Jimmy Lin, (2010). “Scaling population of a genetic algorithm for Job Shop scheduling Problem using map reduce”. 2nd IEEE International Conference on Cloud Computing Technology and Science, pp. 780-785,USA.
[16]. Bioplanet.com. What is Bioinformatics? Retrieved from http://www.bioplanet.com/what-is-bioinformatics/
[17]. Abhishek Verma and Roy H. Campbell, (2010). “Scaling ECGA Model Building via Data-Intensive Computing”. IEEE Congress on Evolutionary Computation (CEC), Barcelona, Spain.
[18]. Youssef M. Essa, Gamal Attiya and Ayman El-Sayed, (2013). “Mobile Agent based New Framework for Improving Big Data Analysis”. 2013 International Conference on Cloud Computing and Big Data.
[19]. Saurabh Arora and Inderveer Chana, (2014). “A Survey of Clustering Techniques for Big Data Analysis”. 5th International Conference- Confluence The Next Generation Information Technology Summit (Confluence).
[20]. Shankar Ganesh Manikandan and Siddarth Ravi, (2014). “Big Data Analysis using Apache Hadoop”. 2014 International Conference on IT Convergence and Security (ICITCS 2014), Beijing, China, pp. 345-348.
[21]. Xiafai Huang, Hui Zhou, Wei Wu, (2015). “Hadoop Job Scheduling Based on Mixed Ant-genetic Algorithm”. Cyber-Enabled Distributed Computing and knowledge Discovery, September 2015, pp. 226-229, IEEE.
[22]. Yildirim, Hallac I.R., Aydin G., and Tatar Y., (2015). “Running Genetic Algorithms on Hadoop for Solving High Dimensional Optimization Problems”. 9th International Conference (AICT), pp. 12-16, IEEE.
[23]. Ren-Hao Pan, (2015). "Design of an NGS MicroRNA Predictor using Multilayer Hierarchical MapReduce Framework”. Data Science and Advanced Analytics (DSAA), 2015 IEEE International Conference, pp. 1-8, France.
[24]. Milos Ivanovic, Visnja Simic, Boban Stojanovic, Ana Kaplarevic and Branko Marovic, (2014). “Elastic Grid Resources Provisioning with WoBinGo: A parallel framework for genetic algorithm based optimization”. Future Generation Computer Systems, Vol. 42, No. 2015, pp. 44-54.
[25]. Chengyu Hu, Jing Zhao, Xuesong Yan, Deze Zeng and Song Guo, (2015). “A MapReduce based Parallel Niche Genetic Algorithm for Containment Source Identification in Water Distributed Network”. Ad Hoc Networks, pp. 1-11.
[26]. Mashan A. Alshammari, and El-Sayed M. El-Alfy, (2015). “MapReduce Implementation for Minimum Reduct using Parallel Genetic Algorithm”. 6th International Conference on Information and Communication Systems, IEEE.
[27]. Andrea L. Halweg-Edwards, William C. Grau, James D. Winkler, Andrew D. Garst, and Ryan T. Gill, (2015). “The Emergence of Commodity-Scale Genetic Manipulation”. Chemical Biology, pp. 150-155.
[28]. David Camacho, (2015). “Bio-inspired Clustering: Basic Features and Future Trends in the Era of Big Data”. 2nd international Conference on Cybernetics (CYBCONF 2015), IEEE, Poland.
[29]. Yue-Jiao Gong, Wei-Neng Chen, Zhi-Hui Zhan, Jun Zhang, Yun Li, and Qingfu Zhang, (2015). “Distributed Evolutionary Algorithms and their Models: A Survey of the State-of-the-art”. Applied Soft Computing, Vol. 34, pp. 286-300.
[30]. D.L. Gonzalez, and F.F. de Vega, (2007). “On the Intrinsic Fault-Tolerance Nature of Parallel Genetic Programming”. In : EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing, pp. 450–458.
[31]. F. Herrera, and M. Lozano, (2000). “Gradual Distributed Real-Coded Genetic Algorithms”. IEEE Trans. Evol. Comput. Vol. 4, No. 1, pp. 43–63.
[32]. N. Melab, and E.-G. Talbi, (2010). “GPU-based Island Model for Evolutionary Algorithms”. In: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 1089–1096.
[33]. J.J. Merelo-Guervos, A. Mora, J.A. Cruz, A.I. Esparcia-Alcazar, C. Cotta, (2012). “Scaling in Distributed Evolutionary Algorithms with Persistent Population”. In: IEEE Congress on Evolutionary Computation (CEC), pp. 1–8.
[34]. L.d.P. Veronese, and R.A. Krohling, (2010). “Differential Evolution Algorithm on the GPU with C-CUDA”. In: IEEE Congress on Evolutionary Computation (CEC), pp. 1–7.
[35]. S.M. Said, and M. Nakamura, (2012). “Parallel enhanced hybrid evolutionary algorithm for continuous function optimization”. In: International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 125–131.
[36]. B. Wu, G. Wu, and M. Yang, (2012). “A MapReduce Based Ant Colony Optimization Approach to Combinatorial Optimization Problems”. In: International Conference on Natural Computation (ICNC), pp. 728–732.
[37]. B. Yu, Z. Yang, X. Sun, B. Yao, Q. Zeng, and E. Jeppesen, (2011). “Parallel Genetic Algorithm in Bus Route Headway Optimization”. Appl.Soft Comput. Vol. 11, No. 8, pp. 5081–5091.
[38]. W. Yu,W. Zhang, (2006). “Study on function Optimization Based on Master-Slave Structure Genetic Algorithm”. In: International Conference on Signal Processing, Vol. 3, pp. 1–4.
[39]. J. Zhao, W. Wang, W. Pedrycz, and X. Tian, (2012). “Online Parameter Optimization-based Prediction for Converter Gas System by Parallel Strategies”. IEEE Trans. Control Syst. Technol. Vol. 20, No. 3, pp. 835–845.
[40]. C. Zhou, (2010). “Fast Parallelization of Differential Evolution Algorithm using MapReduce”. In: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 1113–1114.
[41]. W. Zhao, S. Alam, and H.A. Abbass, (2014). “MOCCA-II: A Multi-objective Co-operative Evolutionary Algorithm”. Appl. Soft Comput. Vol. 23, pp. 407–416.
[42]. W. Zhu, (2011). “Nonlinear Optimization with a Massively Parallel Evolution Strategy Pattern Search Algorithm on Graphics Hardware”. Appl. Soft Comput. Vol. 11, No. 2, pp. 770–1781.
[43]. X. Llora, A. Verma, R.H. Campbell, and D.E. Goldberg, (2010). “When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms Via Data Intensive Computing”. In: Parallel and Distributed Computational Intelligence, Springer, Berlin, Heidelberg, pp. 11–41.
[44]. J.J. Durillo, and A.J. Nebro, (2011). “JMetal: a Java Framework for Multi-objective Optimization”. Adv. Eng. Softw. Vol. 42, No. 10, pp. 760–771.
[45]. J. Ekanayake, S. Pallickara, and G. Fox, (2008). “MapReduce for Data Intensive Scientific Analyses”. eScience, 2008. eScience '08. IEEE Fourth International Conference, pp. 277–284.
[46]. T. Xia, (2008). “Large-scale sms messages mining based on map-reduce”. Computational Intelligence and Design, 2008, ISCID '08 International Symposium, Vol. 1, pp. 7–12.
[47]. H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, (2007). “Map-reduce-merge: simplified relational data processing on large clusters”. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040, New York, NY, USA, ACM.
[48]. Fei Teng, and Doga Tuncay, "Genetic Algorithm using MapReduce runtimes". Proposal Document on salsahpc.indiana.edu.
[49]. Sukhmani Goraya, and Vikas Khullar, (2015). “Map- Reduce Synchronized and Comparative Queue Capacity Scheduler in Hadoop for Extensive Data”. IOSR Journals (IOSR Journal of Computer Engineering), pp. 64- 75.
[50]. Vikas Khullar and Sukhmani Goraya, (2015). “Enhancing Dynamic Capacity Scheduler for Data Intensive Jobs”. International Journal of Computer Applications, Vol. 121, No. 12, pp. 21-24.