Image Captioning using Deep Learning and Python

Manish Nishad*, Lokeshwari Sahu**, Gunjan Kumar***, Rahul Verma****, Shankar Sharan Tripathi*****, Siddhartha Choubey******, Madhu Yadav*******
*-******* Department of Computer Science Engineering, Shri Shankaracharya Technical Campus, Chhattisgarh, India.
Periodicity: January–March 2024
DOI: https://doi.org/10.26634/jse.18.3.20582

Abstract

In recent years, the confluence of computer vision and natural language processing, propelled by advances in deep learning, has attracted significant interest. Image captioning is among its most notable applications: given an image, a system must describe the visual content in one or more sentences. This requires not only identifying objects and scenes but also analyzing their attributes, states, and interrelations, and then generating a meaningful description that captures the high-level semantics of the image. Although inherently complex, the task has seen remarkable progress through the efforts of numerous researchers. This paper reviews three prominent image captioning methodologies built on deep neural networks: CNN-RNN, CNN-CNN, and reinforcement-based frameworks. Each approach is illustrated with representative works and an analysis of their contributions. The evaluation metrics applicable to these methods are then discussed, followed by a synthesis of their advantages and principal challenges. The review aims to provide insight into the evolving landscape of image captioning and to highlight avenues for further exploration and innovation.
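To make the CNN-RNN framework concrete, the sketch below pairs a pretrained convolutional encoder with an LSTM decoder, in the spirit of encoder-decoder captioners such as Show and Tell. It is a minimal illustration rather than the authors' implementation; the ResNet-50 backbone, the embedding and hidden sizes, and the teacher-forced training setup are illustrative assumptions (PyTorch with torchvision 0.13 or later is assumed).

    # Minimal CNN-RNN captioner sketch (illustrative only, not the paper's code).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        """Pretrained ResNet-50 that maps an image to a fixed-size feature vector."""
        def __init__(self, embed_size):
            super().__init__()
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
            self.fc = nn.Linear(resnet.fc.in_features, embed_size)

        def forward(self, images):
            with torch.no_grad():                      # keep the pretrained CNN frozen
                feats = self.backbone(images).flatten(1)
            return self.fc(feats)                      # (batch, embed_size)

    class DecoderRNN(nn.Module):
        """LSTM language model conditioned on the image feature."""
        def __init__(self, embed_size, hidden_size, vocab_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, features, captions):
            # Prepend the image feature to the embedded tokens (teacher forcing).
            inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
            hidden, _ = self.lstm(inputs)
            return self.fc(hidden)                     # per-step vocabulary logits

At inference time the decoder is instead unrolled one step at a time, greedily or with beam search, feeding each predicted token back as the next input until an end-of-sentence token is produced.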
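The n-gram overlap metrics discussed later in the paper (BLEU, METEOR, ROUGE, CIDEr) can be computed with off-the-shelf tooling. As one hedged example, the snippet below scores a single caption with sentence-level BLEU via NLTK; the example sentences are invented, and smoothing is applied because short captions often lack higher-order n-gram matches.

    # Sentence-level BLEU for one caption (NLTK; the example data is made up).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["a", "dog", "runs", "along", "the", "beach"]]     # ground-truth caption
    candidate = ["a", "dog", "is", "running", "on", "the", "beach"]  # model output
    score = sentence_bleu(references, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")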

Keywords

Image Captioning, Deep Learning, Computer Vision, Natural Language Processing, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), CNN-RNN, CNN-CNN, Reinforcement Learning, Evaluation Metrics, Challenges, Advancements

How to Cite this Article?

Nishad, M., Sahu, L., Kumar, G., Verma, R., Tripathi, S. S., Choubey, S., and Yadav, M. (2024). Image Captioning using Deep Learning and Python. i-manager’s Journal on Software Engineering, 18(3), 59-70. https://doi.org/10.26634/jse.18.3.20582
