A Research on Development of an Image Caption Generator using AI and Image Processing

Yogesh Katre*, Sanika Meshram**, Harsh Nerkar***, Divya Pathrabe****, Shruti Ughade*****, Pranjali Jibhkate******
*-****** Department of Computer Science and Engineering, S. B. Jain Institute of Technology, Management and Research, Nagpur, Maharashtra, India.
Periodicity: October - December 2025
DOI: https://doi.org/10.26634/jcom.13.3.22429

Abstract

Image caption generation produces a textual description of an image by combining visual and textual information. This paper discusses a deep learning pipeline with an encoder–decoder architecture: a convolutional neural network (for instance, ResNet50) extracts feature representations from the image, and a Long Short-Term Memory (LSTM) sequence model generates the description. Spatial attention is incorporated into the decoder so that it can focus on important image regions, which helps produce more relevant and detailed captions. The pipeline is evaluated with standard metrics such as BLEU, METEOR, and CIDEr, which score how similar the generated captions are to human-written reference captions. Demonstrations on the standard Flickr8k dataset show that the approach produces fluent, accurate, and informative descriptions. Future applications, including accessibility, automated tagging, and human–computer interaction, are also discussed.
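
To make the evaluation idea concrete, the following is a minimal sketch of a BLEU-style score: clipped n-gram precision combined with a brevity penalty. It is not the authors' evaluation code. The function names (ngrams, clipped_precision, bleu), the choice of max_n=2, the absence of smoothing, and the example captions are all assumptions made for illustration. Published results are normally computed with an established toolkit over the whole test set.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    by the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=2):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty: compare against the reference length closest to the
    # candidate length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return bp * math.exp(log_mean)


# Hypothetical captions for illustration only (not from the paper or Flickr8k).
references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on the grass".split(),
]
candidate = "a dog runs on the grass".split()

score = bleu(candidate, references, max_n=2)
print(round(score, 4))  # 0.8944
```

In this example every candidate word appears in some reference, so the unigram precision is 1.0, but the bigram "runs on" appears in neither reference, so the bigram precision is 4/5. The score is the geometric mean of the two. METEOR and CIDEr weigh matches differently and are not covered by this sketch.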

Keywords

Artificial Intelligence, Computer Vision, Deep Learning, Transformer Models, Image Processing, Multimodal Learning.

How to Cite this Article?

Katre, Y., Meshram, S., Nerkar, H., Pathrabe, D., Ughade, S., and Jibhkate, P. (2025). A Research on Development of an Image Caption Generator using AI and Image Processing. i-manager’s Journal on Computer Science, 13(3), 49-59. https://doi.org/10.26634/jcom.13.3.22429
