A Review on Image Captioning System from Artificial Intelligence, Machine Learning and Deep Learning Techniques

Revathi B. S.*, A. Meena Kowshalya**
*-** Government College of Technology, Coimbatore, Tamil Nadu, India.
Periodicity: July - September 2022
DOI : https://doi.org/10.26634/jip.9.3.19054

Abstract

Image captioning is the task of generating textual descriptions of an image that are both syntactically and semantically correct. This paper surveys the literature from the advent of Artificial Intelligence, through the Machine Learning era and early Deep Learning, to the current Deep Learning methodologies for image captioning. The survey also aims at a system that predicts captions for given images with higher accuracy by combining the results of different Deep Learning techniques. The model is a neural network consisting of a vision CNN encoder followed by an RNN language generator, and it produces complete natural-language sentences from an input image. State-of-the-art results are assessed by comparing three encoder-decoder models: a CNN-LSTM model on the Flickr8k dataset achieves a BLEU score of 0.44, a CNN-LSTM model with word embeddings on Flickr8k achieves 0.68, and a CNN-GRU model with visual attention on the MSCOCO dataset achieves 0.86.
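To make the encoder-decoder idea concrete, the following is a minimal sketch of a CNN-LSTM captioning model of the kind described above, written with TensorFlow/Keras. The layer sizes, vocabulary size, maximum caption length, and the merge-style decoder are illustrative assumptions, not the authors' exact configuration; image features are assumed to be pre-extracted 2048-dimensional vectors from a pretrained CNN such as InceptionV3.

```python
# Illustrative CNN-LSTM captioning sketch (assumed configuration, not the paper's exact model).
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 8000   # assumed vocabulary size (Flickr8k-scale)
max_len = 34        # assumed maximum caption length
embed_dim = 256
units = 256

# Image branch: pre-extracted CNN features projected into the decoder space
image_input = layers.Input(shape=(2048,), name="cnn_features")
img_proj = layers.Dropout(0.5)(image_input)
img_proj = layers.Dense(units, activation="relu")(img_proj)

# Text branch: partial caption -> word embedding -> LSTM state
caption_input = layers.Input(shape=(max_len,), name="caption_tokens")
seq = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
seq = layers.Dropout(0.5)(seq)
seq = layers.LSTM(units)(seq)

# Merge both branches and predict the next word of the caption
decoder = layers.add([img_proj, seq])
decoder = layers.Dense(units, activation="relu")(decoder)
output = layers.Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

Generated captions are then scored against the reference captions; the BLEU values quoted above are typically computed with a standard implementation such as corpus_bleu from nltk.translate.bleu_score.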

Keywords

Image Captioning, Deep Learning, Automatic Caption Generation, Evaluation Metrics.

How to Cite this Article?

Revathi, B. S., and Kowshalya, A. M. (2022). A Review on Image Captioning System from Artificial Intelligence, Machine Learning and Deep Learning Techniques. i-manager’s Journal on Image Processing, 9(3), 17-33. https://doi.org/10.26634/jip.9.3.19054
