Design and Development of Feature Based Similarity Measure Crawling Algorithm: An Approach to Text Mining

Prashant Dahiwale*, Sanjay Mate**, M.M. Raghuwanshi***
* Lecturer, Department of Computer Engineering, Government Polytechnic Daman, UT of Daman and Diu, India.
** Lecturer, Department of Information Technology, Government Polytechnic Daman, UT of Daman and Diu, India.
*** Professor, Department of Computer Technology, Yashwantrao Chauhan College of Engineering, Nagpur, Maharashtra, India.
Periodicity: January - March 2018


The speed at which the World Wide Web (WWW) has grown from a small number of web pages to an enormous volume of web information steadily increases the difficulty of web crawling in a search engine. A search engine handles queries from all parts of the world, and how well it satisfies them depends only on the knowledge it collects by crawling. Information sharing is a widespread social habit, carried out by publishing structured, semi-structured, and unstructured resources on the web (Nandy et al., 2012). This practice leads to an exponential expansion of web resources, so it has become necessary to crawl continuously in order to keep web knowledge up to date and to track changes in existing sources. This paper proposes a feature-based crawling algorithm for lightweight and efficient crawling. A scaling technique is used to evaluate the performance of the proposed method against a standard crawler. A large gain in speed is observed after scaling, and the extraction of relevant web sources at that speed is examined.


Features Vector, Similarity Measure, Equivalence Measure, Term Frequency, Data Mining, Information Extraction, Focused Crawler, Crawler Analysis.

How to Cite this Article?

Dahiwale, P., Mate, S., and Raghuwanshi, M. M. (2018). Design and Development of Feature Based Similarity Measure Crawling Algorithm: An Approach to Text Mining. i-manager's Journal on Software Engineering, 12(3), 1-7.


[1]. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.
[2]. Cecchini, R. L., Lorenzetti, C. M., Maguitman, A. G., & Brignole, N. B. (2007). Genetic algorithms for topical web search: A study of different mutation rates. In XIII Congreso Argentino de Ciencias de la Computación, 1585-1595.
[3]. Chen, X., & Zhang, X. (2008, October). Hawk: A focused crawler with content and link analysis. In e-Business Engineering, 2008. ICEBE'08. IEEE International Conference on (pp. 677-680). IEEE.
[4]. Hati, D., & Kumar, A. (2010, June). Improved focused crawling approach for retrieving relevant pages based on block partitioning. In Education Technology and Computer (ICETC), 2010 2nd International Conference on (Vol. 3, pp. V3-269). IEEE.
[5]. Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), 604-632.
[6]. Kumar, S., & Chauhan, N. (2012). A context model for focused web search. International Journal of Computers & Technology, 2(3c), 155-162.
[7]. Lin, Y. S., Jiang, J. Y., & Lee, S. J. (2014). A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575-1590.
[8]. Mukhopadhyay, D. M., Balitanas, M. O., Farkhod, A., Jeon, S. H., & Bhattacharyya, D. (2009). Genetic algorithm: A tutorial review. International Journal of Grid and Distributed Computing, 2(3), 25-32.
[9]. Nandy, S., Sarkar, P. P., & Das, A. (2012). Analysis of a Statistical Hypothesis based Learning Mechanism for Faster crawling. International Journal of Artificial Intelligence and Applications, 3(4), 117-130.
[10]. Reddy, G. S., & Krishnaiah, D. R. (2012). Clustering algorithm with a novel similarity measure. IOSR Journal of Computer Engineering (IOSRJCE), 4(6), 37-42.
[11]. Shehata, S., Karray, F., & Kamel, M. (2010). An efficient concept-based mining model for enhancing text clustering. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1360-1371.
[12]. Sivanandam, S. N., & Deepa, S. N. (2007). Introduction to Genetic Algorithms. Springer Science & Business Media.
