Data Retrieval in Cancer Documents using Various Weighting Schemes

A. Nicholas Daniel*, Jayanthila Devi**
*,** Srinivas University, Karnataka, India.
Periodicity: October - December 2023
DOI: https://doi.org/10.26634/jit.12.4.20365

Abstract

In data retrieval, documents and queries are commonly represented as sparse vectors in which each element corresponds to a word or phrase from a predefined lexicon. This study examines several scoring schemes for estimating how significant a term is within a document drawn from a large text collection. The most widely used of these are based on inverse document frequency (IDF), notably Term Frequency-Inverse Document Frequency (TF-IDF), which gives more weight to terms that are distinctive to a given context. BM25 complements TF-IDF and remains in widespread use. A notable limitation of these approaches is that they retrieve a document only when its terms match the query terms almost exactly. To address this, researchers have devised latent semantic analysis (LSA), which represents documents as dense, low-dimensional vectors. In tests conducted in a simulated environment, the findings indicate higher accuracy than the preceding methods.
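
The term-weighting schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses common textbook formulations of TF-IDF and BM25, and the toy corpus, the function names, the smoothed IDF variant, and the parameter values k1 = 1.5 and b = 0.75 are all assumptions chosen for illustration. LSA is not covered here.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (assumption; real systems also stem and remove stop words).
    return text.lower().split()

def idf(term, docs):
    # Smoothed inverse document frequency, as commonly used with BM25; always positive.
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))

def tfidf_score(query, doc, docs):
    # Sum of raw term frequency times IDF over the query terms.
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in query)

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    # BM25: term frequency saturates via k1 and is normalized by document length via b.
    tf = Counter(doc)
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for t in query:
        f = tf[t]
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(t, docs) * f * (k1 + 1) / norm
    return score

# Hypothetical toy corpus, not taken from the article's dataset.
corpus = [
    "breast cancer screening with mammography",
    "lung cancer risk and smoking",
    "mammography screening guidelines for women",
]
docs = [tokenize(d) for d in corpus]
query = tokenize("mammography screening")

# Rank document indices from highest to lowest score.
tfidf_ranking = sorted(range(len(docs)),
                       key=lambda i: tfidf_score(query, docs[i], docs),
                       reverse=True)
bm25_ranking = sorted(range(len(docs)),
                      key=lambda i: bm25_score(query, docs[i], docs),
                      reverse=True)

# Exact-match limitation: "tumor" never appears in the corpus, so the
# cancer documents score zero even though they are topically related.
mismatch_score = bm25_score(tokenize("tumor"), docs[1], docs)
```

Both rankings place the document about lung cancer last because it shares no term with the query, and `mismatch_score` is zero for the same reason. This is the near-exact-match dependence that the abstract identifies as the motivation for dense representations such as LSA.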

Keywords

Retrieval, Weighting Scheme, TF-IDF, BM25, Latent Semantic Analysis, Cancer Research, Information Retrieval, Data Mining, Document Analysis, Weighted Retrieval.

How to Cite this Article?

Daniel, A. N., and Devi, J. (2023). Data Retrieval in Cancer Documents using Various Weighting Schemes. i-manager’s Journal on Information Technology, 12(4), 28-32. https://doi.org/10.26634/jit.12.4.20365
