i-manager Publications

Design and Development of Feature Based Similarity Measure Crawling Algorithm: An Approach to Text Mining

Prashant Dahiwale*, Paul M.**, M.M. Raghuwanshi***

* Lecturer, Department of Computer Engineering, Government Polytechnic Daman, UT of Daman and Diu, India.

** Lecturer, Department of Information Technology, Government Polytechnic Daman, UT of Daman and Diu, India.

*** Professor, Department of Computer Technology, Yashwantrao Chauhan College of Engineering, Nagpur, Maharashtra, India.

Periodicity: January - March 2018
DOI: https://doi.org/10.26634/jse.12.3.14554

Abstract

The rapid growth of the World Wide Web (WWW), from a small number of web pages to an enormous volume of web information, steadily increases the complexity of web crawling in a search engine. A search engine handles queries from all parts of the world, and its ability to satisfy them depends entirely on the knowledge it collects by crawling. Sharing information is one of society's most common practices, and it is done by publishing structured, semi-structured, and unstructured resources on the web. This social practice leads to exponential growth in web resources, so it has become necessary to crawl continuously to keep web knowledge up to date and to track changes in existing sources. This paper proposes a feature-based crawling algorithm for lightweight and efficient crawling. A scaling technique is used to evaluate the performance of the proposed method against a standard crawler. High speed performance is observed after scaling, and the extraction of relevant web resources at that speed is examined.
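
The abstract does not spell out the algorithm itself. As a rough illustration of how a term-frequency feature vector and a similarity measure might drive a focused crawler's relevance decision, the sketch below is one possibility. It assumes cosine similarity, which the paper may not use, and the function names and the `threshold` value are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter


def term_frequency_vector(text):
    """Map each term in the text to its relative frequency (assumed feature vector)."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors; an assumed stand-in for the paper's measure."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text, topic_text, threshold=0.3):
    """Hypothetical relevance test: keep a page whose similarity to the topic meets the threshold."""
    page_vec = term_frequency_vector(page_text)
    topic_vec = term_frequency_vector(topic_text)
    return cosine_similarity(page_vec, topic_vec) >= threshold
```

A crawler built this way would call `is_relevant` on each fetched page and follow links only from pages that pass. The threshold trades crawl coverage against topical precision.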

Keywords

Features Vector, Similarity Measure, Equivalence Measure, Term Frequency, Data Mining, Information Extraction, Focused Crawler, Crawler Analysis.

How to Cite this Article?

Dahiwale, P., Mate, S., and Raghuwanshi, M. M. (2018). Design and Development of Feature Based Similarity Measure Crawling Algorithm: An Approach to Text Mining. i-manager's Journal on Software Engineering, 12(3), 1-7. https://doi.org/10.26634/jse.12.3.14554
