Algorithm for Removal of Semantically Insignificant Content Words

Abhijit Barman, Diganta Saha

Section: Research Paper, Product Type: Journal Paper
Volume-07, Issue-01, Page no. 53-56, Jan-2019

Published online on Jan 20, 2019

Copyright © Abhijit Barman, Diganta Saha. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Abhijit Barman, Diganta Saha, “Algorithm for Removal of Semantically Insignificant Content Words,” International Journal of Computer Sciences and Engineering, Vol.07, Issue.01, pp.53-56, 2019.

MLA Style Citation: Abhijit Barman, Diganta Saha "Algorithm for Removal of Semantically Insignificant Content Words." International Journal of Computer Sciences and Engineering 07.01 (2019): 53-56.

APA Style Citation: Abhijit Barman, Diganta Saha, (2019). Algorithm for Removal of Semantically Insignificant Content Words. International Journal of Computer Sciences and Engineering, 07(01), 53-56.

BibTex Style Citation:
@article{Barman_2019,
author = {Abhijit Barman and Diganta Saha},
title = {Algorithm for Removal of Semantically Insignificant Content Words},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {1 2019},
volume = {07},
Issue = {01},
month = {1},
year = {2019},
issn = {2347-2693},
pages = {53-56},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=592},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
UR - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=592
TI - Algorithm for Removal of Semantically Insignificant Content Words
T2 - International Journal of Computer Sciences and Engineering
AU - Abhijit Barman
AU - Diganta Saha
PY - 2019
DA - 2019/01/20
PB - IJCSE, Indore, INDIA
SP - 53-56
IS - 01
VL - 07
SN - 2347-2693
ER -

           

Abstract

This paper describes how context-specific, semantically insignificant content words are extracted using the Inverse Document Frequency (IDF) and Inverse Class Frequency (ICF) measures. We were able to remove around 42% of the total corpus volume as irrelevant information, which includes textual noise, function words, and context-specific, semantically insignificant content words. We ran several Machine Learning (ML) algorithms used for text classification on a corpus, before and after removal of the textual noise. We found no significant change in the accuracy of those ML algorithms after the noise was removed.
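The idea of scoring terms by IDF and ICF can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the exact formulas or cut-offs, so the sketch assumes the common definitions IDF = log(N / df) and ICF = log(C / cf), and the function names, toy corpus, and thresholds (`idf_max`, `icf_max`) are hypothetical. A term that occurs in nearly every document and nearly every class scores close to zero on both measures and is flagged as a removal candidate.

```python
import math
from collections import defaultdict


def idf_icf_scores(docs, labels):
    """Return {term: (idf, icf)} for tokenized docs, one class label per doc.

    Assumed definitions (not taken from the paper):
      idf = log(N / df), N = number of documents, df = documents containing the term
      icf = log(C / cf), C = number of classes,   cf = classes containing the term
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    doc_freq = defaultdict(int)      # term -> number of documents containing it
    class_sets = defaultdict(set)    # term -> set of classes containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):     # count each term once per document
            doc_freq[term] += 1
            class_sets[term].add(label)
    return {
        term: (
            math.log(n_docs / doc_freq[term]),
            math.log(n_classes / len(class_sets[term])),
        )
        for term in doc_freq
    }


def insignificant_terms(scores, idf_max, icf_max):
    """Flag terms whose IDF and ICF are both at or below the given thresholds."""
    return {
        term
        for term, (idf, icf) in scores.items()
        if idf <= idf_max and icf <= icf_max
    }


# Toy corpus (hypothetical): "report" appears in every document and both classes.
docs = [
    ["match", "team", "report", "goal"],
    ["team", "report", "score"],
    ["election", "report", "vote"],
    ["vote", "party", "report"],
]
labels = ["sports", "sports", "politics", "politics"]

scores = idf_icf_scores(docs, labels)
noise = insignificant_terms(scores, idf_max=0.1, icf_max=0.1)
print(sorted(noise))
```

In this toy corpus only "report" is flagged, while class-specific words such as "team" and "vote" keep non-zero scores on both measures. Real thresholds would have to be tuned on the corpus at hand.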

Key-Words / Index Terms

Machine Learning (ML), Natural Language Processing (NLP), Information Retrieval (IR), Term Document Matrix, Inverse Document Frequency (IDF), Inverse Class Frequency (ICF), Stop Words, Content Words
