Multi-document summarization differs from single-document summarization: issues of compression, speed, redundancy, and passage selection are critical to producing useful summaries. Given a collection of different documents, a variety of summarization methods based on different strategies extract the most important sentences from the original documents. The LDA (Latent Dirichlet Allocation) topic modeling technique is used to divide the documents topic-wise for summarizing large text collections over the MapReduce framework. Compression ratio, retention ratio, ROUGE, and Pyramid score are the summarization parameters used to measure the performance of document summarization. Semantic similarity and clustering methods are used efficiently for generating summaries of large text collections from multiple documents. Summarizing multiple documents is time-consuming, yet the summary is a basic tool for understanding them. The presented method is compared with a MapReduce-based k-means clustering algorithm applied to four multi-document summarization methods. Support for multilingual text summarization is provided over the MapReduce framework in order to generate summaries from text document collections available in different languages.
Multi-document summarization is a difficult task because multiple documents must be analyzed to generate a meaningful summary. Inputs for multi-document summarization, such as news archives, research papers, tweets, and technical reports, are available on the internet. The summarization techniques presented generate a meaningful summary by extracting the important sentences from the source collection of documents. Summarization evaluation is performed based on ROUGE and Pyramid scores [7]. There are two summarization approaches, extractive and abstractive. Extractive summarization extracts sentences or paragraphs up to a certain limit to provide a coherent summary; it is the main focus of automatic summarization methods. Extractive methods play an essential role in both single- and multi-document summarization, where they are used with statistical measures. Features for extracting sentences include word-frequency scores [42,43], identifying key phrases, and the position of the text. Single-document summarization provides a summary for one document, whereas multi-document summarization takes a collection of related documents and produces a summary as a single document. Abstractive summarization aims to understand, to a certain degree, the content expressed in the original documents and creates summaries based on that information [1]; it resembles human summarization and is more difficult to implement. Query-based summarization retrieves the sentences from documents that match a user query.
MapReduce is a programming model for processing large data through parallel algorithms, and it is essential for summarizing multiple documents [2]. There are three requirements for summarizing multiple documents. Clustering: the ability to cluster similar documents and to find the related information. Anti-redundancy: minimizing redundancy between passages in the summary. Coverage: finding the main points and extracting the important points across the documents [4]. Text summarization is the process of summarizing a single document or a set of related documents into their important sentences. Redundancy elimination is one of the main differences between single- and multi-document summarization.
For eliminating redundant sentences in multi-document summarization, three kinds of models are used: semantic, syntactic, and statistical. Statistical models were used to describe the selection or elimination of sentences in the summarization process, and these are also subject to sentence compression [45]. ROUGE correlates with human evaluations of content match in text summarization and machine translation [8]. It is used to evaluate extractive and abstractive summaries, but correlates better for extractive ones [46,47]. The Pyramid method is used to evaluate performance with a precision measure. In text mining, text summarization is a challenging problem, and a number of real-life applications that benefit the user can be developed based on it. There are four types of clustering: partitioning clustering, hierarchical clustering, agglomerative clustering, and divisive clustering. We are using the semantic and clustering method for summarizing large data. There are three techniques in semantic and syntactic methods: graph representation, coreference chains, and lexical chains. Agglomerative clustering is a bottom-up approach that reduces large data [9]; opposite to agglomerative clustering is divisive clustering, a top-down approach. A lexical chain is a sequence of semantically related words in a text. A coreference chain consists of two or more expressions that refer to the same entity. The statistical method is used to compress sentences and compute relevance scores. Information extraction is used to find the relevant sentences and extract them in multi-document summarization. The MapReduce technique is efficient for summarizing large collections of data in either single-document or multi-document settings. WordNet is an example of a resource used for automatic text summarization; it is similar to a dictionary containing descriptions of given words [5].
For summarizing the documents, different methods based on different strategies are presented to extract the most important sentences from the original documents. Four types of methods are used in multi-document summarization systems (i.e., the centroid-based method, the graph-based method, LSA, and NMF), along with a weighted consensus method with various combination schemes (e.g., average score, average rank, Borda count, median aggregation, round-robin scheme, correlation-based weighting method, and graph-based combination).
Weighted consensus multi-document summarization is described in [49]. It extends text extraction for multi-document summarization from single-document summarization methods by using the additional information available from the document set, such as the relationships between the documents. Issues inherited from single-document summarization are speed, redundancy, passage selection, and compression.
Multi-document summarization by sentence extraction is shown in [44]. Big data are decomposed into many parts for parallel clustering. The ant colony clustering algorithm is a useful method for semantic clustering of the data based on attribute character vectors.
A MapReduce-based method for big data semantic clustering is presented in [12,13]. Information extraction (IE) extracts structured data from text. A benchmark has been presented to systematically retrieve large text collections through IE, along with the performance of the documents [50].
A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics is given in [15].
Social content matching in MapReduce [3] distributes information from suppliers to consumers. There are two matching algorithms for the MapReduce paradigm, viz. Greedy MapReduce and Stack MapReduce. The MapReduce framework is used in a number of works for analyzing large data and mining big data [10]; some of the work presented in this direction is web log analysis [2].
Summarizing a large text collection is an interesting and challenging problem in text analytics [11]. A number of approaches have been suggested for handling large text collections in automatic text summarization [21].
A MapReduce based distributed and parallel framework for summarizing large text is also presented by Hu and Zou [23].
A technique for meeting summarization using prosodic features to augment lexical features is proposed by Lai and Renals [24]. Features related to dialogue acts are discovered and utilized for meeting summarization [14]. An unsupervised method for the automatic summarization of source code text is proposed by Fowkes et al. [33]; the proposed technique is utilized for code folding, which allows one to selectively hide blocks of code. A multi-sentence compression technique is proposed by Tzouridis et al. [34].
Initially, the authors collected multiple documents related to each other, such as newspapers, research papers, or conference papers, through the internet. There are four stages in summarizing a multi-document collection into a single-document summary.
Document clustering is the initial stage, which initializes cluster centers from the collected documents; these are passed to the next stage, LDA (Latent Dirichlet Allocation), which decomposes the cluster centers topic-wise through a topic modeling tool. In the frequent and semantic terms stage, the topics are divided into terms and the term frequencies are counted. The sentence filtering stage is applied to each individual document, removing the unimportant sentences to finally obtain a meaningful summary. Summarizing the multi-document collection is based on the semantic similarity and clustering method [6].
MapReduce is a programming model for processing large text collections through parallel algorithms. There are two types of functions, the Map() and Reduce() functions. The Map() function assigns the jobs; in the presented methodology, multiple documents are assigned as the jobs and these documents are decomposed into terms. Figure 1 shows the multi-document summarization process [16-22].
The frequencies of these topic terms are calculated and semantically similar terms are selected; these selected terms are computed using the WordNet Java API. In Figure 2, the documents DOC1, DOC2, …, DOCN are assigned to the Map function. They are divided into the terms T1, T2, …, TK for each document, and these terms are shuffled and exchanged to the Reduce function. The Reduce() function reduces each term to its relevant documents [25-32].
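As an illustration of this map/shuffle/reduce flow, a minimal Hadoop sketch in Java is given below; it emits (term, documentId) pairs in the mapper and groups each term with the documents containing it in the reducer. The class names and the use of the input split as a document id are illustrative assumptions, not the presented system (each class would live in its own file).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each input value is one line of a document; emit (term, docId).
class TermDocMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume the document id can be derived from the input split.
        String docId = context.getInputSplit().toString();
        for (String term : value.toString().toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                context.write(new Text(term), new Text(docId));
            }
        }
    }
}

// Reducer: after the shuffle, each term arrives with all documents containing it.
class TermDocReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
        StringBuilder docList = new StringBuilder();
        for (Text doc : docs) {
            if (docList.length() > 0) docList.append(",");
            docList.append(doc.toString());
        }
        context.write(term, new Text(docList.toString()));
    }
}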
Clustering is the process of creating collections of similar objects. K-means is a classical unsupervised learning algorithm used for clustering; it is simple, of low complexity, and very popular. The k-means algorithm is a partition-based clustering algorithm. It takes an input parameter k, the number of clusters to be produced, and partitions a set of n objects to generate the k clusters. The performance of k-means is measured using the square-error function defined in the equation
E = Σi=1..k Σp∈Ci |p − mi|²
where E is the sum of the square error, p is a point in space representing a given object, and mi is the mean of cluster Ci. This criterion tries to make the resulting k clusters as compact and as separate as possible.
Algorithm: k-means.
Input:
k: the number of clusters,
D: A data set containing n objects
Output:
A set of k clusters.
Method:
(1) Randomly choose k objects from D as the initial cluster centers.
(2) Repeat.
(3) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) Update the cluster means, i.e. calculate the mean value of the objects for each cluster;
(5) Until no change.
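A compact single-machine sketch of steps (1)-(5) in Java follows; it uses one-dimensional points for brevity, and the inputs points, k, and maxIter are illustrative assumptions rather than part of the presented system.

import java.util.Random;

public class KMeans {
    // One-dimensional k-means following steps (1)-(5) above.
    public static double[] cluster(double[] points, int k, int maxIter) {
        Random rnd = new Random(42);
        double[] means = new double[k];
        for (int i = 0; i < k; i++) {                 // (1) random initial centers
            means[i] = points[rnd.nextInt(points.length)];
        }
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {  // (2) repeat
            boolean changed = false;
            for (int p = 0; p < points.length; p++) { // (3) assign to nearest mean
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[p] - means[c]) < Math.abs(points[p] - means[best])) {
                        best = c;
                    }
                }
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (int p = 0; p < points.length; p++) { // (4) update cluster means
                sum[assign[p]] += points[p];
                cnt[assign[p]]++;
            }
            for (int c = 0; c < k; c++) {
                if (cnt[c] > 0) means[c] = sum[c] / cnt[c];
            }
            if (!changed) break;                      // (5) until no change
        }
        return means;
    }
}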
The algorithm is applied at each stage using a mapper and a reducer. The initial stage is document clustering with k-means. In this stage, the mapper maps the documents against the cluster centers, finds the nearest center for each, and assigns the document to it as a point. After the text document clusters are created, the documents belonging to each cluster are retrieved and the text information present in each document is collected as an aggregate.
The topic modeling technique is then applied to each aggregated collection to generate the topics from each text document cluster. The LDA (Latent Dirichlet Allocation) technique is used for generating topics from each document cluster. The WordNet Java API is used to generate the list of semantically similar terms. The semantically similar terms are generated over the MapReduce framework and the generated terms are added to a vector. Finding semantically similar terms is a computationally intensive operation.
Mapper:
1. Initialize the cluster centers randomly and read them into memory from a sequence file:
{CMean1, CMean2, …, CMeank} ← Random
2. Iterate over each cluster center for every input key-value pair (K, V) and compute the nearest center for all pairs.
3. Calculate the distances and assign the point to the nearest center, i.e., the one with the lowest distance:
Ci ← Min(d(key1, Ci), …, d(keyN, Ci))
4. Update the cluster center with its vector to the file system:
{(Ki1, Vi1), …, (KiM, ViM)} ∈ Ci
Reducer:
1. Iterate over each value vector and calculate the mean.
2. Update the new center from the calculated mean:
Ci ← Avg(di1, …, dij)
3. Check whether the old cluster center is the same as the new center: if CiOLD = CiNEW, then stop; else go to the next step.
4. If they are not equal, increment an update counter. Run the mapper and reducer until the clusters converge.
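A minimal Hadoop sketch of this mapper/reducer pair is given below, assuming one-dimensional feature values for brevity; in practice the centers would be read from a sequence file or the distributed cache, which is elided here (the CENTERS array is a hypothetical stand-in).

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: assign each point to the nearest cluster center (mapper steps 2-3).
class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    // Hypothetical stand-in for centers loaded from a sequence file (mapper step 1).
    private static final double[] CENTERS = {1.0, 5.0, 9.0};

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        double point = Double.parseDouble(value.toString().trim());
        int nearest = 0;
        for (int c = 1; c < CENTERS.length; c++) {
            if (Math.abs(point - CENTERS[c]) < Math.abs(point - CENTERS[nearest])) {
                nearest = c;
            }
        }
        context.write(new IntWritable(nearest), new DoubleWritable(point));
    }
}

// Reducer: recompute each cluster center as the mean of its points (reducer steps 1-2).
class KMeansReducer extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable cluster, Iterable<DoubleWritable> points, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable p : points) {
            sum += p.get();
            count++;
        }
        // The driver compares old and new centers and re-runs the job
        // until convergence (reducer steps 3-4).
        context.write(cluster, new DoubleWritable(sum / count));
    }
}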
In the LDA stage, the cluster centers are C1, C2, …, Cn. The mapper divides the documents topic-wise based on the topic modeling technique. The reducer integrates the topics of all the clusters and extracts all the topics discovered by LDA in the documents.
Mapper:
1. For each cluster, get the documents it contains and extract the text collection from these documents.
2. For each cluster Ci ∈ {C1, C2, …, CN}:
3. Extract the documents in Ci as {Di1, Di2, …, DiM}.
4. For each document, extract and merge the text from the text collection.
5. Apply LDA topic modeling to this collection and get the list of topics for the cluster Ci as Ti = {Ti1, Ti2, …, TiK}.
Reducer:
1. Integrate the topics of all the clusters.
2. For each cluster Ci ∈ {C1, C2, …, CN}:
3. Extract the topics discovered by LDA in the documents in Ci as Ti = {Ti1, Ti2, …, TiK}.
4. For each document, extract the text and compute the text collection.
5. Topics = Topics ∪ {Tij ∈ Ti}
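As an illustrative sketch of the per-cluster topic extraction, the snippet below uses the open-source MALLET library to run LDA over one cluster's merged texts and collect its top topic terms; the paper does not name a specific LDA implementation, so the choice of MALLET and all parameter values here are assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class ClusterTopics {
    // Run LDA over the documents of one cluster Ci and return its topic terms Ti.
    public static List<String> topicsForCluster(List<String> clusterDocs,
                                                int numTopics) throws Exception {
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequence2FeatureSequence());
        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        for (String doc : clusterDocs) {
            instances.addThruPipe(new Instance(doc, null, "doc", null));
        }
        ParallelTopicModel lda = new ParallelTopicModel(numTopics, 1.0, 0.01);
        lda.addInstances(instances);
        lda.setNumIterations(500);      // assumed iteration count
        lda.estimate();
        List<String> topicTerms = new ArrayList<>();
        for (Object[] topic : lda.getTopWords(10)) { // top 10 words per topic
            for (Object word : topic) {
                topicTerms.add(word.toString());
            }
        }
        return topicTerms;
    }
}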
The mapper computes the semantically similar terms for each topic term generated from the document clusters, and the reducer aggregates these terms and counts the frequencies of these terms (the topic terms and their semantically similar terms) in aggregate.
Mapper:
1. For each topic term in the topic list {T1, T2, …, TN}:
2. Get the semantically similar terms.
3. TSi = ComputeSemanticSimilar(Ti).
// Pass the term Ti to the WordNet API and extract the semantically similar terms into the set TSi.
4. For all terms t, t ∈ TSi, present in the document D do
5. Emit(term t; count 1)
Reducer:
1. For each term t with counts [c1, c2, …]:
2. Initialize the sum of the term frequency as 0.
3. For all counts c ∈ [c1, c2, …] do
4. Update the sum by adding the count, i.e., sum += c
5. Emit(term t; count sum)
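A hedged Hadoop sketch of this term-frequency stage follows. The semanticSimilar helper is a hypothetical stand-in for the WordNet Java API lookup, since the exact API calls are not given in the paper; the reducer is a standard frequency sum.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (term, 1) for every term and its semantically similar expansions.
class SemanticTermMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    // Hypothetical stand-in for the WordNet Java API lookup of similar terms.
    private Set<String> semanticSimilar(String term) {
        Set<String> similar = new HashSet<>();
        similar.add(term); // a real lookup would add WordNet synonyms here
        return similar;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            for (String term : semanticSimilar(token)) {
                context.write(new Text(term), ONE);
            }
        }
    }
}

// Reducer: sum the counts for each term (reducer steps 1-5 above).
class SemanticTermReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(term, new IntWritable(sum));
    }
}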
In the document filtering algorithm, each individual document is selected from the document collection and its sentences are extracted; the unimportant or repeated sentences are removed from the original document. The reducer integrates all the filtered sentences, produces a single document, and presents a summary that is easy for the user to understand.
1. Select one document at a time from the document collection.
2. For each document D ∈ {D1, D2, …}:
3. Extract the sentences from document D as {Si1, Si2, …} using parsing.
4. If Sik contains the terms present in TSi,
5. filter the sentence Sik containing the terms and add it to the vector:
6. Vector = Vector ∪ {Sik}
Integrate all the filtered sentences and produce a single document presenting the summary: Summary = Summary ∪ {Sik}
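A small plain-Java sketch of this filtering step is shown below; the regex-based sentence split is a simplification standing in for the parsing the paper mentions. Running this per document and concatenating the returned sentences yields the single summary document described above.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SentenceFilter {
    // Keep only sentences that contain at least one topic or semantically similar term.
    public static List<String> filter(String document, Set<String> topicTerms) {
        List<String> summarySentences = new ArrayList<>();
        for (String sentence : document.split("(?<=[.!?])\\s+")) { // naive sentence split
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (topicTerms.contains(token)) {
                    summarySentences.add(sentence.trim());
                    break; // avoid adding the same sentence twice
                }
            }
        }
        return summarySentences;
    }
}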
The implementation is carried out using Java-based technologies; in particular, the MapReduce implementation is performed using the Hadoop environment. Initially, Twitter data are collected for summarizing the large collection of text data. The experiments are performed on an Intel Core i3 processor with 4 GB RAM (Random Access Memory) and the Windows 7 (32-bit) operating system. A VMware virtual machine is a major part of the experimental setup.
Large collections of text gathered from Twitter are used as the data for summarization.
In MapReduce, the job is initially assigned to the Map function, i.e., the Twitter data files are fed in as bytes. There are three progress conditions:
1. Map 0% and reduce 0%.
2. Map 100% and reduce 0%.
3. Map 100% and reduce 100%.
If both the map() and reduce() functions reach 100%, the task has completed successfully, and all the summarized data produced by the constituent algorithms are shown.
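For illustration, a standard Hadoop driver of the kind that produces this "map X% reduce Y%" progress output is sketched below; the class names and input/output paths are placeholders rather than the authors' code, and the mapper and reducer referenced are those from the term-frequency sketch above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SummarizerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "semantic term frequency");
        job.setJarByClass(SummarizerDriver.class);
        job.setMapperClass(SemanticTermMapper.class);   // mapper from the sketch above
        job.setReducerClass(SemanticTermReducer.class); // reducer from the sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., the Twitter data files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion(true) prints the "map X% reduce Y%" progress lines.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}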
In this work, the authors have presented a multi-document text summarizer based on the MapReduce framework. Experiments are carried out using four nodes in the MapReduce framework for a large text collection, and the summarization results are evaluated in the form of line, bar, and pie graphs. It is also shown experimentally that the MapReduce framework provides improved scalability and summarization time when considering a large number of text documents for summarization. The result of data semantic clustering showed good accuracy and computational efficiency. The authors have studied four of the most widely used multi-document summarization systems and presented a weighted consensus summarization method to combine the results from the single summarization systems. Three possible cases of summarizing multiple documents of large data are also studied comparatively. When considering a large number of text documents for summarization, MapReduce is essential: the MapReduce framework gives reliable output, better results and scalability, and reduced time complexity. Future work is to support multilingual text summarization using the MapReduce concept in order to provide summaries from large collections of text documents in different languages.