i-manager Publications

Data Retrieval in Cancer Documents using Various Weighting Schemes

A. Nicholas Daniel*, Jayanthila Devi**

*,** Srinivas University, Karnataka, India.

Periodicity: October - December 2023
DOI: https://doi.org/10.26634/jit.12.4.20365

Abstract

In data retrieval, documents and queries are commonly represented as sparse vectors in which each element corresponds to a word or phrase from a predefined lexicon. This study introduces several scoring mechanisms for determining the significance of specific terms within a document drawn from a large textual dataset. The most widely used of these are inverse document frequency (IDF) and Term Frequency-Inverse Document Frequency (TF-IDF), which emphasize terms that are distinctive to a given context. BM25 complements TF-IDF and remains in widespread use. A notable limitation of these approaches is their reliance on near-exact term matches between query and document. To address this issue, researchers have devised latent semantic analysis (LSA), in which documents are represented as dense, low-dimensional vectors. Testing in a simulated environment indicates higher accuracy than the preceding methods.
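As an illustration of the term-weighting ideas summarized above, the following is a minimal sketch of TF-IDF and BM25 scoring over a toy corpus. It is not the authors' implementation: the corpus, the function names, the whitespace tokenizer, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration; not data from the study.
DOCS = [
    "tumor growth in lung cancer patients",
    "lung cancer screening and early detection",
    "chemotherapy response in breast cancer",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def idf(term, tokenized_docs):
    """Smoothed, non-negative inverse document frequency."""
    n = len(tokenized_docs)
    df = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def tfidf_score(query, doc, tokenized_docs):
    """Sum of raw term frequency times IDF over the query terms."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, tokenized_docs) for t in tokenize(query))


def bm25_score(query, doc, tokenized_docs, k1=1.5, b=0.75):
    """BM25: saturating term frequency with document-length normalization."""
    tf = Counter(doc)
    avgdl = sum(len(d) for d in tokenized_docs) / len(tokenized_docs)
    score = 0.0
    for t in tokenize(query):
        f = tf[t]
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(t, tokenized_docs) * f * (k1 + 1) / norm
    return score


tokenized = [tokenize(d) for d in DOCS]
tfidf_ranking = [tfidf_score("lung tumor", d, tokenized) for d in tokenized]
bm25_ranking = [bm25_score("lung tumor", d, tokenized) for d in tokenized]
# A synonym that never appears in the corpus matches nothing.
synonym_scores = [bm25_score("neoplasm", d, tokenized) for d in tokenized]
```

With the query "lung tumor", both schemes rank the first document highest because it contains both terms, while the third document scores zero. The query "neoplasm" scores zero against every document even though it is related in meaning to "tumor", which is the exact-match limitation that dense representations such as LSA are intended to address.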

Keywords

Retrieval, Weighting Scheme, TF-IDF, BM25, Latent Semantic Analysis, Cancer Research, Information Retrieval, Data Mining, Document Analysis, Weighted Retrieval.

How to Cite this Article?

Daniel, A. N., and Devi, J. (2023). Data Retrieval in Cancer Documents using Various Weighting Schemes. i-manager’s Journal on Information Technology, 12(4), 28-32. https://doi.org/10.26634/jit.12.4.20365

