Internet use has grown alongside advances in technology. People rely on the internet for business, education, online marketing, social communication, and more, making it integral to their day-to-day operations. At the same time, the volume of web content has increased for various reasons: any user, from any place, can easily upload any type of file, and nearly all content today is available as digital data. Browsing this huge repository for the correct web page is a genuinely challenging task, and retrieving correct and relevant content is not easy; a number of research works have therefore been carried out in the field of web extraction. This paper reviews some of these web extraction techniques and methods.
In recent decades, the amount of web-based information available has increased dramatically. Gathering useful information from the web has become a challenging task for users. Current web information gathering systems attempt to satisfy user requirements by capturing their information needs. For this purpose, user profiles are created to describe user background knowledge. User profiles represent the concept models possessed by users when gathering web information. A concept model is implicitly possessed by a user and is generated from that user's background knowledge. While this concept model cannot be proven in laboratories, many web ontologists have observed it in user behavior [6]. When users read through a document, they can easily determine whether it is of interest or relevance to them, a judgment that arises from their implicit concept models. If a user's concept model can be simulated, then a superior representation of user profiles can be built.
To simulate user concept models, ontologies (a model for knowledge description and formalization) are used in personalized web information gathering. Such ontologies are called ontological user profiles or personalized ontologies. To represent user profiles, many researchers have attempted to discover user background knowledge through global or local analysis.
Global analysis uses existing global knowledge bases to represent user background knowledge. Commonly used knowledge bases include generic ontologies (e.g., WordNet [7]), thesauri (e.g., in digital libraries), and online knowledge bases (e.g., online categorizations and Wikipedia) [8]. Global analysis techniques perform well for user background knowledge extraction. However, global analysis is limited by the quality of the underlying knowledge base. For example, WordNet was reported as helpful in capturing user interests in some areas but useless in others.
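As a concrete illustration of global analysis, the Python sketch below expands a user-interest term with synonyms and direct hypernyms drawn from WordNet. It assumes NLTK with the WordNet corpus installed; the seed term and the expansion strategy are illustrative, not the procedure of any particular surveyed system.

```python
# Sketch: expanding a user-interest term with WordNet synonyms and
# hypernyms, as a global-analysis knowledge base might do.
# Assumes NLTK with the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def expand_term(term, max_senses=3):
    """Collect synonyms and direct hypernyms for the first few senses."""
    expansion = set()
    for synset in wn.synsets(term)[:max_senses]:
        expansion.update(lemma.name() for lemma in synset.lemmas())
        for hyper in synset.hypernyms():
            expansion.update(lemma.name() for lemma in hyper.lemmas())
    expansion.discard(term)
    return sorted(expansion)

print(expand_term("photography"))  # hypothetical seed term
```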
Information retrieval has a well-established tradition of performing laboratory experiments on test collections to compare the relative effectiveness of different retrieval approaches. The experimental design specifies the evaluation criterion used to determine whether one approach is better than another. Because retrieval behavior is complex enough to resist summary in a single number, many different effectiveness measures have been proposed.
The test collection must have a reasonable number of requests. Sparck Jones and Van Rijsbergen suggested a minimum of 75 requests, while the TREC program committee has used 25 requests as a minimum and 50 as the norm; five or ten requests are too few. The experiment must also use a reasonable evaluation measure. Average Precision, R-Precision, and Precision at 20 (or 10 or 30) documents retrieved are the most commonly used measures. Measures such as Precision at one document retrieved (i.e., is the first retrieved document relevant?) or the rank of the first relevant document are not usually reasonable evaluation measures.
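For reference, the sketch below gives minimal single-topic implementations of the three measures just named. The document ids and relevance judgments are hypothetical; real TREC evaluations average these values over all requests.

```python
# Sketch: the three evaluation measures named above, computed from a
# ranked list of document ids and the set of relevant ids.

def average_precision(ranking, relevant):
    """Mean of precision values at each rank where a relevant doc occurs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision after retrieving R documents, R = number of relevant docs."""
    r = len(relevant)
    return sum(doc in relevant for doc in ranking[:r]) / r if r else 0.0

def precision_at(ranking, relevant, k=20):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k

ranking = ["d3", "d1", "d7", "d2", "d9"]   # hypothetical run
relevant = {"d1", "d2", "d5"}
print(average_precision(ranking, relevant))              # 0.333...
print(r_precision(ranking, relevant))                    # 0.333...
print(precision_at(ranking, relevant, k=5))              # 0.4
```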
This paper examines these three rules-of-thumb and shows how they interact with each other. A novel approach is presented for experimentally quantifying the likely error associated with the conclusion "method A is better than method B" given a number of requests, an evaluation measure, and a notion of difference. As expected, the error rate increases as the number of requests decreases. More surprisingly, a striking difference in error rates across evaluation measures is demonstrated: for example, Precision at 30 documents retrieved has almost twice the error rate of Average Precision. These results do not imply that measures with higher error rates should not be used; different evaluation measures assess different aspects of retrieval behavior, and the measure must be chosen to match the goals of the test.
In short, this paper presents a method for quantifying what counts as a reasonable number of requests, a reasonable evaluation measure, and a reasonable notion of difference, and shows how these rules-of-thumb support the retrieval evaluation process. It examines whether one method consistently dominates another, taking into account the request factor, the evaluation measure, and the notion of difference.
The goal of this paper is to use data from the Query Track to quantify the error rate associated with deciding that one retrieval method is better than another, given that the decision is based on an experiment with a particular number of topics, a specific evaluation measure, and a particular value used to decide whether two scores are different. The approach is as follows. First, the authors choose an evaluation measure and a "fuzziness" value. The fuzziness value is the percentage difference between scores such that, if the difference is smaller than the fuzziness value, the two scores are deemed equivalent. For example, with a fuzziness value of 5%, any two scores within 5% of one another are counted as equal.
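The fuzziness rule can be stated compactly in code. The sketch below is an illustrative reading of the comparison just described, treating two scores as tied when their relative difference falls within the fuzziness percentage; the function name and example scores are hypothetical.

```python
# Sketch of the fuzziness rule: scores within the fuzziness percentage
# of one another are counted as equal.

def compare(score_a, score_b, fuzziness=0.05):
    """Return 'A', 'B', or 'tie' under a relative fuzziness threshold."""
    if max(score_a, score_b) == 0:
        return "tie"
    if abs(score_a - score_b) / max(score_a, score_b) <= fuzziness:
        return "tie"
    return "A" if score_a > score_b else "B"

print(compare(0.41, 0.40))   # 'tie': within 5% of each other
print(compare(0.50, 0.40))   # 'A' : difference exceeds 5%
```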
The ontologies used for browsing content at a website generally differ from site to site. Even if two sites have similarly named concepts in their ontologies, those concepts may contain different types of pages. Frequently, the same concepts appear under different names and/or in different areas of the ontology. Not only are there differences between sites, but between users as well: one user may consider a certain topic to be an "Arts" topic, while a different user might consider the same topic to be a "Recreation" topic.
One increasingly popular way to structure information is through the use of ontologies, or graphs of concepts. One such system is OntoSeek [Guarino 99], which is designed for content-based information retrieval from online yellow pages and product catalogs. OntoSeek uses simple conceptual graphs to represent queries and resource descriptions, drawing on the Sensus ontology [Knight 99], which comprises a simple taxonomic structure of approximately 70,000 nodes. The system presented in [Labrou 99] uses Yahoo [YHO 02] as its ontology: it semantically annotates web pages using Yahoo categories as descriptors of their content, and it uses Telltale [Chower 96a, Chower 96b, Pearce 97] as its classifier. Telltale computes the similarity between documents using n-grams as index terms. The ontologies in the above examples use simple structured links between concepts.
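To illustrate the flavor of Telltale's n-gram approach (without reproducing its actual weighting scheme), the sketch below computes the cosine similarity of two strings over raw character trigram counts.

```python
# Sketch: cosine similarity over character n-gram counts, in the spirit
# of Telltale's n-gram index terms. Pure Python; the real system's
# weighting scheme is not reproduced here.
from collections import Counter
import math

def ngrams(text, n=3):
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = ngrams("ontology-based information retrieval")
doc2 = ngrams("ontological information gathering")
print(round(cosine(doc1, doc2), 3))
```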
This paper thus introduces a retrieval method based on an evaluation measure and a fuzziness value. Retrieval depends on the available data set and the user's input query, and content on the web contains a huge amount of duplication, which makes retrieval a challenging task. The method compares results using the fuzziness value, identifying the percentage difference between scores and treating scores within the fuzziness threshold as equivalent.
Computationally determining the semantic similarity between textual units (words, sentences, chapters, etc.) has become essential in a variety of applications, including web search and question answering systems. One specific example is AutoTutor, an intelligent tutoring system in which the meaning of a student answer is compared with the meaning of an expert answer. In another application, called Coh-Metrix, semantic similarity is used to calculate the cohesion of a text by determining the extent of overlap between sentences and paragraphs.
This paper focuses on vector space models. The specific goal is to compare Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) to an alternative algorithm called Non-Latent Similarity (NLS). The NLS algorithm makes use of a Second-Order similarity Matrix (SOM). Essentially, a SOM is created by taking the cosines of the vectors from a first-order (non-latent) matrix. This First-Order Matrix (FOM) could be generated in any number of ways; here, the authors used a method modified from Lin (1998).
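The SOM construction can be made concrete with a few lines of linear algebra. In the sketch below, a toy FOM (rows are words, columns are co-occurrence features; the values are illustrative, not Lin's actual weights) is row-normalized so that its product with its own transpose yields the matrix of pairwise cosines, i.e., the SOM.

```python
# Sketch: deriving a Second-Order Matrix (SOM) from a first-order
# word-by-feature matrix (FOM) by taking cosines between word vectors.
import numpy as np

fom = np.array([[2.0, 0.0, 1.0],    # rows: words, cols: features
                [1.0, 1.0, 0.0],    # (toy values, not Lin's weights)
                [0.0, 2.0, 2.0]])

norms = np.linalg.norm(fom, axis=1, keepdims=True)
unit = fom / norms                  # unit-length word vectors
som = unit @ unit.T                 # som[i, j] = cosine(word_i, word_j)
print(np.round(som, 2))
```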
Latent Semantic Analysis (LSA):
LSA is a type of vector space model used to represent world knowledge (Landauer & Dumais, 1997). LSA extracts quantitative information about the co-occurrences of words in documents (paragraphs and sentences) and translates this into an N-dimensional space. The input to LSA is a large co-occurrence matrix that specifies the frequency of each word in each document. Using Singular Value Decomposition (SVD), LSA maps each document and word into a lower-dimensional space.
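The core LSA step can be sketched directly with a truncated SVD. In the toy example below, X is a hypothetical term-by-document frequency matrix; keeping k singular values projects documents into a k-dimensional latent space where similarity is measured by the cosine.

```python
# Sketch: LSA's core step with NumPy. X is a toy term-by-document count
# matrix; truncated SVD projects documents into a k-dimensional latent
# space. Values are illustrative.
import numpy as np

X = np.array([[3, 0, 1, 0],        # rows: terms, cols: documents
              [2, 0, 0, 1],
              [0, 4, 1, 0],
              [0, 3, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                        # latent dimensions kept
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # documents in latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_vectors[0], doc_vectors[1]))   # latent similarity, docs 1 & 2
```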
Text extraction is based on the phrases, verbs, or words available in a text. To extract relevant documents from the web effectively, this paper proposes a vector space model using LSA and NLS, in which the information extracted by LSA is represented in an N-dimensional space. Experiments indicate that the proposed method yields effective extraction.
Most pseudo-relevance feedback methods assume that a set of top-retrieved documents is relevant and then learn from these pseudo-relevant documents to expand terms or to assign better weights to the original query. This is similar to the process used in relevance feedback, where actual relevant documents are used. In practice, however, some of the top-retrieved documents are not relevant; such noise is common and even expected in all retrieval models, and it can cause the query representation to "drift" away from the original query.
This paper describes a resampling method that uses clusters to select better documents for pseudo-relevance feedback. Document clusters in the initial retrieval set can represent different aspects of a query, especially on large-scale web collections, since the initial retrieval results for such collections may cover diverse subtopics. Because it is difficult to find one optimal cluster, several relevant groups are used for feedback.
The motivation for using clusters and resampling is as follows:
The top-retrieved documents form a query-oriented ordering that does not consider relationships between documents. The pseudo-relevance feedback problem of learning expansion terms relevant to a query is viewed as similar to the classification problem of learning an accurate decision boundary from training examples. This problem is approached by repeatedly selecting dominant documents, so that expansion terms are drawn toward the dominant documents of the initial retrieval set, much as boosting trains a weak learner by repeatedly selecting hard examples to move the decision boundary toward them.
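A rough sketch of this idea (not the paper's exact algorithm) is given below: the top-retrieved documents are clustered, the largest clusters are treated as dominant, and high-weight terms pooled from those clusters are returned as expansion candidates. It assumes scikit-learn; the function name and parameters are illustrative.

```python
# Rough sketch of cluster-based selection for pseudo-relevance feedback
# (illustrative, not the paper's exact algorithm). Assumes scikit-learn.
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_expansion_terms(top_docs, n_clusters=3, n_terms=10):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(top_docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    # Dominant clusters: those holding the most top-retrieved documents.
    dominant = [c for c, _ in Counter(labels).most_common(2)]
    terms = np.asarray(vec.get_feature_names_out())
    scores = np.zeros(len(terms))
    for c in dominant:
        scores += np.asarray(X[labels == c].sum(axis=0)).ravel()
    return terms[np.argsort(scores)[::-1][:n_terms]].tolist()

# Usage (hypothetical): expansion = cluster_expansion_terms(top_50_texts)
```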
Resampling the top-ranked documents using clusters is effective for pseudo-relevance feedback: the improvements obtained were consistent across nearly all collections, including large web collections. Clustering, performed as an initial step, groups candidate documents and reduces the time spent searching for relevant ones, and the search process is then repeated over the resampled documents. The main drawback of the proposed method is that it is very difficult to identify the optimal cluster.
An ontology is an explicit specification of a conceptualization, comprising a formal description of concepts, relations between concepts, and axioms about a target domain. Considered the backbone of the Semantic Web, domain ontologies enable software agents to interact and carry out sophisticated tasks for users.
To reduce the effort of building ontologies, ontology learning systems [9] have been developed to learn ontologies from domain-relevant materials. However, most existing ontology learning systems focus on extracting concepts and taxonomic (IS-A) relations. For example, SymOntos, a symbolic ontology management system developed at IASI CNR, made use of shallow NLP tools, including a morphologic analyzer, a Part-Of-Speech (POS) tagger, and a chunk parser, to process documents, and employed text mining techniques to produce large ontologies from document collections.
In this paper, a novel system known as Concept-Relation-Concept Tuple based Ontology Learning (CRCTOL) is presented for mining rich semantic knowledge, in the form of an ontology, from domain-specific documents. By using a full-text parsing technique and incorporating statistical and lexico-syntactic methods, the knowledge extracted by the system is more concise and semantically richer than that produced by alternative systems.
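As a minimal illustration of this kind of shallow pipeline (not CRCTOL itself), the sketch below POS-tags a sentence, chunks noun phrases as concept candidates, and matches a single Hearst-style "such as" pattern to propose an IS-A relation. It assumes NLTK with its default tokenizer and tagger models; the sentence, grammar, and pattern are illustrative.

```python
# Minimal illustration (not CRCTOL itself) of a shallow concept/relation
# pipeline: POS tagging, noun-phrase chunking, one Hearst-style pattern.
# Assumes NLTK with punkt and the default POS tagger models downloaded.
import re
import nltk

sentence = "Domain ontologies such as gene ontologies support software agents."

tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Simple NP grammar: optional determiner, adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
tree = nltk.RegexpParser(grammar).parse(tagged)
concepts = [" ".join(w for w, t in st.leaves())
            for st in tree.subtrees() if st.label() == "NP"]
print(concepts)  # concept candidates

# One Hearst pattern over raw text: "X such as Y" suggests Y IS-A X.
# Captures at most two words after "such as"; a fuller system would
# reuse the chunked NPs instead of a regex.
match = re.search(r"(\w[\w ]*?) such as (\w+(?: \w+)?)", sentence)
if match:
    print(f"IS-A: {match.group(2)} -> {match.group(1)}")
```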
Constructing ontology-based systems is inherently complex. This paper addresses the problem by learning ontologies from domain-relevant materials. Whereas most existing methods extract only IS-A relations, this paper introduces a new technique, CRCTOL, for mining semantic knowledge ontologies from domain-specific documents. The results show that the proposed method produces more effective results than existing techniques.
Extracting domain knowledge from huge volumes of content is a genuinely challenging task for the browsing community; even the best text mining processes can fail to retrieve the needed content. Here, a comparison of various web extraction techniques has been conducted, based on key issues arising in the texts. The results lead to the conclusion that a significant gap remains in the knowledge extraction process.