Abstract

Data mining from databases has advanced considerably over the past few decades, driven by the growth of huge data repositories that accompanied the worldwide internet revolution. As a result, search algorithms for mining data have gained prominence. Many programming languages have been used to retrieve relevant information from databases. In this paper, the author presents the E-Literature database, created in MySQL with a comprehensive set of entries such as ISSN, publisher name, and publication type. The database can also be searched for authors from different geographical regions. The search code implemented in this work supports varied options, such as 'Abstract', 'keywords', 'affiliation', 'country', and 'ISSN'. Each search option and its code were written in PHP. A binary search algorithm was implemented to perform the search routine. In addition to the general search option, a robust method called 'combination search', which combines several search criteria, can be used to mine data efficiently.

The rapid growth of the web over the last decade has made it the largest publicly accessible data source in the world. The amount of information on the web is huge and still growing. Data of all types exist, such as structured tables, semi-structured web pages, unstructured texts, heterogeneous data, and hyperlinked texts [6]. The World Wide Web is valued for its capacity to store huge amounts of data, with millions of such repositories available online. Data mining, also called Knowledge Discovery in Databases (KDD), is commonly defined as the process of discovering useful patterns or knowledge from data sources such as databases, texts, images, and the Web [2]. Data retrieval aims to retrieve all objects that satisfy defined conditions and consists mainly of determining which documents in a database contain the keywords in the user query [3].

Databases are not confined to storing and retrieving data; they can also be used to analyze data from huge repositories. Database technology therefore plays a role in the data mining process, supporting research that analyzes data with data mining algorithms [1]. Much research has been devoted to text mining over the past few decades, with the main aim of exploring and extracting knowledge from huge data repositories. The internet has become an excellent medium for disseminating the wealth of information on a topic. Consequently, search algorithms that efficiently return either exact or related data to the user have gained prominence [5]. Although the information is obtained from online sources, many programming languages have been used to retrieve the relevant data from huge databases.

Here, the author reports the e-literature database, created in MySQL, and implements binary search concepts to mine abstract-related information, including journal, year, keywords, and abstract. The rationale for the work is that journals publish a huge volume of information yet offer few or limited search options. Hence, specific, broad, and intuitive search options are implemented in this work to demonstrate a robust approach to searching.

To develop a local database, abstract-related data was extracted from the PubMed database [7] and Google Scholar [4]. PubMed is a free resource of the National Center for Biotechnology Information (NCBI), maintained at the National Library of Medicine (NLM). PubMed makes it easy to search for topics through generic mechanisms such as MeSH terms, publisher's name, title, patterns, phrases, and names of publications. Google Scholar was used to retrieve subject-specific data, and the extracted information was stored in the e-literature database.

1.1 Database Architecture in MySQL

INSERT INTO `eliterature_table` (`SNo`, `Journal_ID`, `Structure`, `Journal_Name`, `ISSN`, `Publisher`, `Pub_type`, `Article_title`, `Authors`, `Affiliation`, `Country`, `Volume`, `Issue`, `Page_nos`, `Abstract`, `Keywords`, `Impact_factor`, `Year`) VALUES
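The statement above lists the columns of the table, but the row values are not reproduced. As a minimal sketch, one record could be turned into a parameterized INSERT along the following lines. The paper's own code was written in PHP; JavaScript is used here purely for illustration. The helper name `buildInsert` and the sample values are assumptions, not part of the original work.

```javascript
// Column names copied from the INSERT statement above.
const COLUMNS = [
  "SNo", "Journal_ID", "Structure", "Journal_Name", "ISSN", "Publisher",
  "Pub_type", "Article_title", "Authors", "Affiliation", "Country",
  "Volume", "Issue", "Page_nos", "Abstract", "Keywords", "Impact_factor", "Year",
];

// Hypothetical helper: build a parameterized INSERT for one record.
// Fields missing from the record are passed as null so that the
// parameter list always lines up with the column list.
function buildInsert(record) {
  const columnList = COLUMNS.map((name) => "`" + name + "`").join(", ");
  const placeholders = COLUMNS.map(() => "?").join(", ");
  const sql =
    "INSERT INTO `eliterature_table` (" + columnList + ") VALUES (" + placeholders + ")";
  const params = COLUMNS.map((name) => (name in record ? record[name] : null));
  return { sql, params };
}

// Placeholder record: illustrative values only, not rows from the paper.
const insertStatement = buildInsert({ SNo: 1, Country: "India", Year: 2012 });
console.log(insertStatement.sql);
console.log(insertStatement.params.length); // 18, one parameter per column
```

Passing values as parameters rather than concatenating them into the SQL text keeps user-supplied strings, such as titles containing quotes, from breaking the statement.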

1.2 Extracting Data from Online Databases

To populate the local database, PubMed, Google Scholar, and similar sources were searched for the keyword 'data mining'. The most relevant data was obtained from Google Scholar rather than from PubMed. PubMed holds mostly biology-related data, so the keyword 'data mining' returned mainly biological papers, which are not significant to this study.

Moreover, it should be noted that the PubMed database is very large and holds a wealth of information useful to scientists and researchers in all fields of science. As the study is restricted to data mining, only the biological aspects of mining were considered for extraction from the PubMed database.

In the next step, Google Scholar was searched for the keyword 'data mining' without any limits on the search query. Limits such as year, date, and relevance can be supplied to the query. This search returned more hits than expected and far more than PubMed, which suggests that Google Scholar covers considerably more material on this topic than PubMed.

Mining data from a database requires a search algorithm that operates on user-supplied keywords. Literature data is generally enormous, owing to the research and review papers published in various scientific journals. The data is also diverse and informative. Journals that publish such information have therefore diversified from general to more specific, based on the field of study. To mine this information, robust and user-friendly search functions are necessary. In most cases, search scripts are written in Java. Available online databases show that PHP and JavaScript are used.

Therefore, in this study, a robust search mechanism built on JavaScript and PHP features was used to mine textual information from the local database.
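The paper states that a binary search algorithm was implemented but does not reproduce the code. The following is a minimal sketch of the general technique, assuming the records have been loaded into memory and sorted in ascending order by the field being searched. The function names and sample rows are hypothetical and are not taken from the author's PHP implementation.

```javascript
// Return the index of the first record whose key is >= target (lower bound).
// Assumes `records` is sorted in ascending order by `key`.
function lowerBound(records, key, target) {
  let lo = 0;
  let hi = records.length; // the search window is [lo, hi)
  while (lo < hi) {
    const mid = lo + Math.floor((hi - lo) / 2);
    if (records[mid][key] < target) {
      lo = mid + 1; // target can only be to the right of mid
    } else {
      hi = mid; // mid may be the first match, keep it in the window
    }
  }
  return lo;
}

// Collect every record whose key equals target, starting at the lower bound.
function findAll(records, key, target) {
  const hits = [];
  let i = lowerBound(records, key, target);
  while (i < records.length && records[i][key] === target) {
    hits.push(records[i]);
    i += 1;
  }
  return hits;
}

// Placeholder rows, pre-sorted by Year; not data from the paper.
const byYear = [
  { SNo: 3, Year: 2008 },
  { SNo: 1, Year: 2010 },
  { SNo: 4, Year: 2010 },
  { SNo: 2, Year: 2013 },
];
console.log(findAll(byYear, "Year", 2010).map((r) => r.SNo)); // [ 1, 4 ]
```

Binary search of this kind suits exact matches on a sorted field such as 'Year' or 'ISSN'. Substring matches inside free text such as 'Abstract' cannot rely on sort order and would need a different approach.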

The e-literature table was created in a MySQL database with a comprehensive set of fields, as shown in Figures 1 and 2. Fields such as ISSN, publisher name, and publication type (research/review/case study) are included.

The database can also be searched for authors from different geographical regions. For example, to find work on 'IP networking' by authors from a particular country, the database can retrieve the entries from that country. When the results are displayed, the user can choose the number of hits and download the data as a single zip file.

Figure 4 shows that combination search, the most robust method implemented in this project, can be used to mine data with ease, as the method is simple and user friendly. For example, a user can mine data on particular keywords, published in specified years and originating from a specific country; the result can reveal the importance of, and the data available on, the specified keyword.
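The combination search described above can be illustrated by assembling one query from whichever criteria the user supplies. This is a sketch under stated assumptions: the column names follow the `eliterature_table` INSERT statement, while the function `buildCombinationQuery`, the criteria property names, and the sample values are hypothetical. The author's actual PHP code is not shown in the paper.

```javascript
// Hypothetical helper: build a parameterized SELECT from the criteria
// the user filled in. Criteria that are absent are simply skipped.
function buildCombinationQuery(criteria) {
  const clauses = [];
  const params = [];
  if (criteria.keyword) {
    clauses.push("`Keywords` LIKE ?");
    params.push("%" + criteria.keyword + "%"); // substring match
  }
  if (criteria.country) {
    clauses.push("`Country` = ?");
    params.push(criteria.country);
  }
  if (criteria.yearFrom !== undefined) {
    clauses.push("`Year` >= ?");
    params.push(criteria.yearFrom);
  }
  if (criteria.yearTo !== undefined) {
    clauses.push("`Year` <= ?");
    params.push(criteria.yearTo);
  }
  let sql = "SELECT * FROM `eliterature_table`";
  if (clauses.length > 0) {
    sql += " WHERE " + clauses.join(" AND "); // every criterion must hold
  }
  return { sql, params };
}

// Illustrative criteria only; not a query reported in the paper.
const query = buildCombinationQuery({
  keyword: "data mining",
  country: "India",
  yearFrom: 2005,
  yearTo: 2010,
});
console.log(query.sql);
console.log(query.params);
```

Joining the clauses with AND narrows the results as more criteria are added, which matches the stated purpose of combination search. The sketch does not escape `%` or `_` characters inside the keyword, so a real implementation would need to handle those.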

Retrieving information from databases, whether online or standalone, has seen improved search techniques implemented in various programming languages. The e-literature database presented here is one of its kind in that a binary search algorithm was implemented to create a robust and efficient search option for retrieving data. Compared with PubMed or Google Scholar, the database has more search options, and a user can apply combination search to narrow down the results. Further work is in progress to compare the efficiency of the algorithms implemented in e-literature, PubMed, and Google Scholar.

Implementation of Robust Search Algorithm to Mine Data from E-Literature Databases
