i-manager Publications

Clustering of Summarizing Multi-Documents (Large Data) by Using MapReduce Framework

K. Thirumalesh*, Srinivasulu Asadi**

* Research Scholar, Department of Information Technology, Sree Vidyanikethan Engineering College, Tirupathi, India.

** Associate Professor, Department of Information Technology, Sree Vidyanikethan Engineering College, Tirupathi, India.

Periodicity: November - January 2016

Abstract

Multi-document summarization differs from single-document summarization: compression, speed, redundancy, and passage selection are all critical to producing useful summaries. A collection of documents is passed to several summarization methods, each based on a different strategy, to extract the most important sentences from the original documents. The Latent Dirichlet Allocation (LDA) topic modeling technique is used to divide the documents by topic so that the large text collection can be summarized over the MapReduce framework. Compression ratio, retention ratio, ROUGE, and Pyramid score are the parameters used to measure summarization performance. Semantic similarity and clustering methods are used to generate summaries of large text collections from multiple documents efficiently. Multi-document summarization is time-consuming, but it is a basic tool for understanding large document collections. The presented method is compared with a MapReduce-based k-means clustering algorithm applied to four multi-document summarization methods. Multilingual text summarization is also supported over the MapReduce framework, so that summaries can be generated from document collections available in different languages.
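To make the MapReduce-based k-means baseline mentioned above more concrete, the following is a minimal sketch of a single k-means iteration written as map, shuffle, and reduce steps. It is not the authors' implementation: plain Python stands in for a Hadoop job, the function names (map_assign, reduce_mean, kmeans_iteration) are hypothetical, and the two-dimensional "document vectors" are toy values chosen only for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points assigned to one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One k-means iteration expressed as map, shuffle, reduce."""
    groups = defaultdict(list)  # shuffle: group mapped pairs by cluster key
    for key, value in (map_assign(p, centroids) for p in points):
        groups[key].append(value)
    # A centroid with no assigned points is kept unchanged (an assumption
    # of this sketch; other strategies exist).
    return [
        reduce_mean(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


# Toy 2-D "document vectors" (illustrative values only).
docs = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
new_centroids = kmeans_iteration(docs, [(0.0, 0.0), (10.0, 10.0)])
```

In a real deployment the map and reduce functions would run on separate workers and the iteration would repeat until the centroids stop moving; the sketch only shows how the work divides between the two phases.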

Keywords

Summarizing Large Text, Semantic Similarity, LDA (Latent Dirichlet Allocation), K-means, Clustering Based Summarization, Big Text Data Analysis.

How to Cite this Article?

Thirumalesh, K., and Asadi, S. (2016). Clustering of Summarizing Multi-Documents (Large Data) by Using MapReduce Framework. i-manager’s Journal on Cloud Computing, 3(1), 1-12.

