An automatic identification of function words in TDIL tagged Bengali corpus

Subrata Pan, Diganta Saha

Section: Research Paper, Product Type: Journal Paper
Volume-07, Issue-01, Page no. 20-27, Jan-2019

Online published on Jan 20, 2019

Copyright © Subrata Pan, Diganta Saha. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

IEEE Style Citation: Subrata Pan, Diganta Saha, “An automatic identification of function words in TDIL tagged Bengali corpus,” International Journal of Computer Sciences and Engineering, Vol.07, Issue.01, pp.20-27, 2019.

MLA Style Citation: Subrata Pan, Diganta Saha. "An automatic identification of function words in TDIL tagged Bengali corpus." International Journal of Computer Sciences and Engineering 07.01 (2019): 20-27.

APA Style Citation: Subrata Pan, Diganta Saha (2019). An automatic identification of function words in TDIL tagged Bengali corpus. International Journal of Computer Sciences and Engineering, 07(01), 20-27.

BibTex Style Citation:
@article{Pan_2019,
author = {Subrata Pan and Diganta Saha},
title = {An automatic identification of function words in TDIL tagged Bengali corpus},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {1 2019},
volume = {07},
number = {01},
month = {1},
year = {2019},
issn = {2347-2693},
pages = {20--27},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=586},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
UR  - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=586
TI  - An automatic identification of function words in TDIL tagged Bengali corpus
T2  - International Journal of Computer Sciences and Engineering
AU  - Pan, Subrata
AU  - Saha, Diganta
PY  - 2019
DA  - 2019/01/20
PB  - IJCSE, Indore, INDIA
SP  - 20
EP  - 27
IS  - 01
VL  - 07
SN  - 2347-2693
ER  - 

Abstract

Function words occur far more frequently in text than content words, which inflates dimensionality, a critical challenge in text processing. Their presence degrades the performance of text processing tasks, so eliminating them is an important step for reducing computational complexity and improving system accuracy. Considerable research has addressed the identification of standard function words for English, Arabic, Chinese, Punjabi, Hindi, and other languages. For Bengali language processing, however, only a limited number of standard function words are available. To address this limitation, we propose an automatic system for identifying high-scoring function words from the TDIL tagged Bengali corpus (Govt. of India). The corpus consists of 670,831 words in total, of which 134,884 are distinct. The proposed system identifies 8 sets of function words, totalling 33,985 function words, in the Literature domain of the monolingual tagged corpus. At the end of our experiment, we obtained 290 standard function words, selected according to their computed rank.
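
As a rough illustration of what ranking function-word candidates by a computed score can look like, the sketch below scores each word by its relative frequency multiplied by the fraction of documents it appears in. This scoring rule, the function name `rank_function_word_candidates`, and the toy tokenized sentences are all assumptions made for illustration; the abstract does not specify the scoring method actually used, and this is not a reproduction of it.

```python
from collections import Counter


def rank_function_word_candidates(docs, top_k=10):
    """Rank words as function-word candidates (hypothetical scoring).

    docs: list of documents, each a list of already-tokenized words.
    Score = (share of all tokens) * (share of documents containing the word),
    an assumed heuristic, not the paper's method.
    """
    term_freq = Counter()  # total occurrences of each word
    doc_freq = Counter()   # number of documents containing each word
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    total_tokens = sum(term_freq.values())
    if total_tokens == 0:
        return []

    n_docs = len(docs)
    scores = {
        word: (term_freq[word] / total_tokens) * (doc_freq[word] / n_docs)
        for word in term_freq
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy corpus (invented for this example): three short tokenized sentences.
toy_docs = [
    ["আমি", "এবং", "তুমি", "বই", "পড়ি"],
    ["সে", "এবং", "আমি", "গান", "শুনি"],
    ["তুমি", "এবং", "সে", "মাঠে", "খেলি"],
]

for word, score in rank_function_word_candidates(toy_docs, top_k=3):
    print(word, round(score, 3))
```

In this toy corpus the conjunction "এবং" ranks first because it is both the most frequent token and present in every sentence. A real system would likely apply such a ranking to a much larger corpus and then cut the list at a chosen rank.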

Keywords / Index Terms

Bengali Text Processing, Function Words, Bag of words, NLP
