Image Captioning using Deep Learning and Python

Manish Nishad*, Lokeshwari Sahu**, Gunjan Kumar***, Rahul Verma****, Shankar Sharan Tripathi*****, Siddhartha Choubey******, Madhu Yadav*******
*-******* Department of Computer Science Engineering, Shri Shankaracharya Technical Campus, Chhattisgarh, India.
Periodicity: January–March 2024
DOI: https://doi.org/10.26634/jse.18.3.20582

Abstract

In recent years, the confluence of computer vision and natural language processing, propelled by advances in deep learning, has attracted significant interest. Image captioning is among its most notable applications: given an image, a system must describe the visual content in one or more sentences. This requires not only identifying objects and scenes but also analyzing their attributes, states, and interrelations, and then generating a meaningful description that captures the high-level semantics of the image. Although inherently complex, the task has seen remarkable progress through the efforts of numerous researchers. This paper reviews three prominent image captioning methodologies built on deep neural networks: CNN-RNN, CNN-CNN, and reinforcement-based frameworks. Each approach is illustrated with representative works and an analysis of their contributions. The evaluation metrics applicable to these methods are then discussed, followed by a synthesis of their advantages and principal challenges. The review aims to provide insight into the evolving landscape of image captioning and to highlight avenues for further exploration and innovation.
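To make the CNN-RNN framework concrete, the sketch below pairs a pretrained convolutional encoder with an LSTM decoder, in the spirit of encoder-decoder captioners such as Show and Tell. It is a minimal illustration rather than the authors' implementation; the ResNet-50 backbone, the embedding and hidden sizes, and the teacher-forced training setup are illustrative assumptions (PyTorch with torchvision 0.13 or later is assumed).

    # Minimal CNN-RNN captioner sketch (illustrative only, not the paper's code).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        """Pretrained ResNet-50 that maps an image to a fixed-size feature vector."""
        def __init__(self, embed_size):
            super().__init__()
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
            self.fc = nn.Linear(resnet.fc.in_features, embed_size)

        def forward(self, images):
            with torch.no_grad():                      # keep the pretrained CNN frozen
                feats = self.backbone(images).flatten(1)
            return self.fc(feats)                      # (batch, embed_size)

    class DecoderRNN(nn.Module):
        """LSTM language model conditioned on the image feature."""
        def __init__(self, embed_size, hidden_size, vocab_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, features, captions):
            # Prepend the image feature to the embedded tokens (teacher forcing).
            inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
            hidden, _ = self.lstm(inputs)
            return self.fc(hidden)                     # per-step vocabulary logits

At inference time the decoder is instead unrolled one step at a time, greedily or with beam search, feeding each predicted token back as the next input until an end-of-sentence token is produced.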
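The n-gram overlap metrics discussed later in the paper (BLEU, METEOR, ROUGE, CIDEr) can be computed with off-the-shelf tooling. As one hedged example, the snippet below scores a single caption with sentence-level BLEU via NLTK; the example sentences are invented, and smoothing is applied because short captions often lack higher-order n-gram matches.

    # Sentence-level BLEU for one caption (NLTK; the example data is made up).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["a", "dog", "runs", "along", "the", "beach"]]     # ground-truth caption
    candidate = ["a", "dog", "is", "running", "on", "the", "beach"]  # model output
    score = sentence_bleu(references, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")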

Keywords

Image Captioning, Deep Learning, Computer Vision, Natural Language Processing, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), CNN-RNN, CNN-CNN, Reinforcement Learning, Evaluation Metrics, Challenges, Advancements

How to Cite this Article?

Nishad, M., Sahu, L., Kumar, G., Verma, R., Tripathi, S. S., Choubey, S., and Yadav, M. (2024). Image Captioning using Deep Learning and Python. i-manager’s Journal on Software Engineering, 18(3), 59-70. https://doi.org/10.26634/jse.18.3.20582
