Kannada Named Entity Recognition and Classification using Support Vector Machine

  • Amarappa S, Jawaharlal Nehru National College of Engineering
  • S V Sathyanarayana, Department of E & C, Jawaharlal Nehru National College of Engineering, Shimoga - 577 204, India
Keywords: Natural Language Processing, Hyperplane, Support vectors, Named Entity Recognition, Classification, Support vector machine, Training Corpus, Test Corpus.

Abstract

Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying them into predefined categories such as person name, location, organization, date and time. Kannada NERC is an essential and challenging task, and this work aims to develop a novel model for it based on the Support Vector Machine (SVM). In this paper, tf-idf and POS features are extracted from a manually created training corpus. The model is trained and tested with four kernels: polynomial, RBF, sigmoid and linear. The details of implementation and performance evaluation are discussed. The experiments are conducted on a training corpus of 1,51,440 tokens and test corpora of 7,000, 11,000, 15,000, 20,000, 30,000, 40,000 and 50,000 tokens. With a linear-kernel SVM, the model achieves an average precision, recall and F1-measure of 87%, 88% and 87.5% respectively on the 7,000-token test corpus.
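As a quick consistency check on the reported figures, the short sketch below computes the F1-measure as the harmonic mean of precision and recall. The function name `f1_score` is a hypothetical helper introduced here for illustration; the paper may instead average per-class F1 values, so this is only an approximate check, not the authors' evaluation procedure.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Averages reported in the abstract for the linear kernel
# on the 7,000-token test corpus.
precision, recall = 0.87, 0.88
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")
```

Under this assumption the result rounds to 87.5%, which agrees with the F1-measure stated in the abstract.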

Author Biography

Amarappa S, Jawaharlal Nehru National College of Engineering

Department of Telecommunication Engineering

Associate Professor and HOD

Published
2017-03-11