Open Access Article

Document Categorization for Probabilistic Redundant Documents

S. Singh, K. Jain

Section: Research Paper, Product Type: Journal Paper
Volume-7, Issue-1, Page no. 51-55, Jan-2019


Published online on Jan 31, 2019

Copyright © S. Singh, K. Jain. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



IEEE Style Citation: S. Singh, K. Jain, “Document Categorization for Probabilistic Redundant Documents”, International Journal of Computer Sciences and Engineering, Vol.7, Issue.1, pp.51-55, 2019.

MLA Style Citation: S. Singh, K. Jain "Document Categorization for Probabilistic Redundant Documents." International Journal of Computer Sciences and Engineering 7.1 (2019): 51-55.

APA Style Citation: S. Singh, K. Jain (2019). Document Categorization for Probabilistic Redundant Documents. International Journal of Computer Sciences and Engineering, 7(1), 51-55.



Text categorization is an active research area in information retrieval and machine learning. A major issue in preprocessing documents for categorization is redundancy. Redundant documents slow down the learning steps of classification and reduce its efficiency and scalability. To resolve this issue, it is preferable to first identify the duplicates and then perform the classification. This paper proposes applying a similarity measure for duplicate detection and Random Forest for classification. The results are evaluated on the ‘20 newsgroups’ data set with generated duplicate documents. The proposed method achieves better accuracy and running time than the existing text categorization model.
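The duplicate-detection step described above can be illustrated with a minimal sketch. The abstract does not say which similarity measure the paper uses, so this example assumes cosine similarity over raw term-frequency vectors. The function names and the 0.9 threshold are hypothetical choices made only for illustration, and the classification step (Random Forest in the paper) is not shown.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on non-alphanumerics, count term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def remove_near_duplicates(documents, threshold=0.9):
    """Return the indices of documents kept after dropping near-duplicates.

    A document is dropped when its similarity to an already kept document
    is at or above `threshold` (an assumed value, not taken from the paper).
    The pairwise scan is O(n^2) and is meant only as an illustration.
    """
    kept_indices = []
    kept_vectors = []
    for index, text in enumerate(documents):
        vector = term_vector(text)
        if any(cosine_similarity(vector, kv) >= threshold for kv in kept_vectors):
            continue
        kept_indices.append(index)
        kept_vectors.append(vector)
    return kept_indices


docs = [
    "The graphics card driver crashes on startup.",
    "The graphics card driver crashes on startup!",  # duplicate up to punctuation
    "The team won the hockey game last night.",
]
kept = remove_near_duplicates(docs)
```

In this sketch only the documents at the indices in `kept` would be passed on to the classifier, so the training set no longer contains the redundant copies.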

Keywords / Index Terms

Duplicate detection, text categorization, information retrieval, similarity measure

