ARPIT: Ambiguity Resolver for POS Tagging of Telugu, an Indian Language

Suneetha Eluri*, Sumalatha Lingamgunta**
* Research Scholar and Assistant Professor, Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Kakinada, Andhra Pradesh, India.
** Professor, Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Kakinada, Andhra Pradesh, India.
Periodicity: March - May 2019
DOI: https://doi.org/10.26634/jcom.7.1.15372

Abstract

Part-of-speech (POS) tagging is an essential preliminary task in Natural Language Processing (NLP). Its aim is to assign a part-of-speech tag to each word in a corpus. The basic POS tags include noun, pronoun, verb, adjective, and adverb. POS tags are needed for speech analysis and recognition, machine translation, lexical analysis such as word sense disambiguation, named entity recognition, and information retrieval, and they also help uncover the sentiment of a given text in opinion mining. However, many Indian languages lack POS taggers because research on building basic resources such as corpora and morphological analyzers is still in its infancy. This paper therefore proposes a POS tagger for Telugu, a South Indian language. In this model, lexemes are tagged with POS tags using a pre-tagged corpus; however, a word may be assigned multiple candidate tags. This ambiguity in tag assignment is resolved with a stochastic machine learning technique, a Hidden Markov Model (HMM) bigram tagger, which uses probabilistic information built from contextual information or word-tag sequences. The authors have developed a pre-tagged corpus of 11,000 words with standard common tag sets for Telugu, and the same corpus is used for training and testing the model. The model was tested on input text whose words carry differing numbers of candidate POS tags and achieved an average accuracy of 91.27% in resolving the ambiguity.
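To illustrate how a bigram HMM tagger can resolve tag ambiguity of the kind described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes add-one smoothing and Viterbi decoding, neither of which is stated in the abstract, and it uses a small hypothetical English corpus in place of the Telugu corpus. All names (`train`, `viterbi`, `log_prob`, `corpus`) and the tag labels are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

def train(tagged_sentences):
    """Count tag bigrams and word/tag emissions from a pre-tagged corpus."""
    transitions = defaultdict(Counter)  # previous tag -> next tag -> count
    emissions = defaultdict(Counter)    # tag -> word -> count
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    vocab = {w for counts in emissions.values() for w in counts}
    return transitions, emissions, sorted(emissions), vocab

def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    transitions, emissions, tags, vocab = model
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unknown words
    # best[i][t] = (log score of best path ending in tag t at word i, previous tag)
    best = [{t: (log_prob(transitions[START], t, n_tags)
                 + log_prob(emissions[t], words[0], n_words), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
        best.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy pre-tagged corpus (hypothetical); "book" is ambiguous between NOUN and VERB.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("book", "NOUN"), ("fell", "VERB")],
    [("they", "PRON"), ("book", "VERB"), ("the", "DET"), ("room", "NOUN")],
    [("we", "PRON"), ("book", "VERB"), ("the", "DET"), ("flight", "NOUN")],
]
model = train(corpus)
print(viterbi(["the", "book", "fell"], model))            # ['DET', 'NOUN', 'VERB']
print(viterbi(["they", "book", "the", "flight"], model))  # ['PRON', 'VERB', 'DET', 'NOUN']
```

In this toy corpus "book" occurs more often as a verb than as a noun, so a word-level lookup alone would prefer VERB. The tag-bigram probabilities let the preceding determiner tip the decision to NOUN in the first sentence, which is the kind of contextual disambiguation the abstract attributes to the bigram tagger.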

Keywords

NLP, Pre-Tagged Corpus, POS Tagging, Hidden Markov Model and Bigram Tagger, Lexicography, Morphological Analyzer.

How to Cite this Article?

Eluri, S., & Lingamgunta, S. (2019). ARPIT: Ambiguity Resolver for POS Tagging of Telugu, an Indian Language. i-manager's Journal on Computer Science, 7(1), 25-35. https://doi.org/10.26634/jcom.7.1.15372
