Design and Development of Feature Based Similarity Measure Crawling Algorithm: An Approach to Text Mining

Prashant Dahiwale*, Sanjay Mate**, M.M. Raghuwanshi***
* Lecturer, Department of Computer Engineering, Government Polytechnic Daman, UT of Daman and Diu, India.
** Lecturer, Department of Information Technology, Government Polytechnic Daman, UT of Daman and Diu, India.
*** Professor, Department of Computer Technology, Yashwantrao Chauhan College of Engineering, Nagpur, Maharashtra, India.
Periodicity: January - March 2018


The speed at which the World Wide Web (WWW) has grown from a small number of web pages into an
enormous centre of web information steadily increases the difficulty of web crawling for a search engine. A search
engine handles queries from all over the world, and how well it satisfies them depends only on the knowledge
that it collects by crawling. Sharing information is one of society's most common habits, and it is done by
publishing structured, semi-structured, and unstructured resources on the web (Nandy, Sarkar, and Das, 2012).
This social practice leads to an exponential expansion of web resources, and hence it has become necessary to crawl
continuously, under any conditions, to keep web knowledge up to date and to capture changes to existing sources. This paper proposes
a feature based crawling algorithm for lightweight and efficient crawling. A scaling technique is used to evaluate the
performance of the proposed method against a standard crawler. A high crawling speed is observed after scaling,
and the extraction of relevant web sources at that speed is examined.
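
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a feature based similarity measure could rank candidate pages, assuming term-frequency feature vectors and cosine similarity. The function names (`term_frequency`, `cosine_similarity`, `rank_pages`) and the choice of cosine similarity are illustrative assumptions, not the authors' method.

```python
import math
import re
from collections import Counter


def term_frequency(text):
    """Return a term-frequency feature vector (word -> count) for a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_pages(topic, pages):
    """Order candidate pages (url -> text) by similarity to the topic.

    Hypothetical helper: a focused crawler could visit the best-scoring
    pages first and skip those that fall below some threshold.
    """
    topic_vec = term_frequency(topic)
    scored = [(cosine_similarity(topic_vec, term_frequency(text)), url)
              for url, text in pages.items()]
    return sorted(scored, reverse=True)
```

In a sketch like this, the crawl frontier would be ordered by the returned scores, so pages whose text shares more terms with the topic description are fetched earlier.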


Features Vector, Similarity Measure, Equivalence Measure, Term Frequency, Data Mining, Information Extraction, Focused Crawler, Crawler Analysis.

How to Cite this Article?

Dahiwale, P., Mate, S., and Raghuwanshi, M. M. (2018). Design and Development of Feature Based Similarity Measure Crawling Algorithm: An Approach to Text Mining. i-manager's Journal on Software Engineering, 12(3), 1-7.


[1]. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.
[2]. Cecchini, R. L., Lorenzetti, C. M., Maguitman, A. G., & Brignole, N. B. (2007). Genetic algorithms for topical web search: A study of different mutation rates. In XIII Congreso Argentino de Ciencias de la Computación, 1585-1595.
[3]. Chen, X., & Zhang, X. (2008, October). Hawk: A focused crawler with content and link analysis. In e-Business Engineering, 2008. ICEBE'08. IEEE International Conference on (pp. 677-680). IEEE.
[4]. Hati, D., & Kumar, A. (2010, June). Improved focused crawling approach for retrieving relevant pages based on block partitioning. In Education Technology and Computer (ICETC), 2010 2nd International Conference on (Vol. 3, pp. V3-269). IEEE.
[5]. Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), 604-632.
[6]. Kumar, S., & Chauhan, N. (2012). A context model for focused web search. International Journal of Computers & Technology, 2(3c), 155-162.
[7]. Lin, Y. S., Jiang, J. Y., & Lee, S. J. (2014). A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575-1590.
[8]. Mukhopadhyay, D. M., Balitanas, M. O., Farkhod, A., Jeon, S. H., & Bhattacharyya, D. (2009). Genetic algorithm: A tutorial review. International Journal of Grid and Distributed Computing, 2(3), 25-32.
[9]. Nandy, S., Sarkar, P. P., & Das, A. (2012). Analysis of a Statistical Hypothesis based Learning Mechanism for Faster crawling. International Journal of Artificial Intelligence and Applications, 3(4), 117-130.
[10]. Reddy, G. S., & Krishnaiah, D. R. (2012). Clustering algorithm with a novel similarity measure. IOSR Journal of Computer Engineering (IOSRJCE), 4(6), 37-42.
[11]. Shehata, S., Karray, F., & Kamel, M. (2010). An efficient concept-based mining model for enhancing text clustering. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1360-1371.
[12]. Sivanandam, S. N., & Deepa, S. N. (2007). Introduction to Genetic Algorithms. Springer Science & Business Media.
