Degraded Bangla Character Recognition by k-NN Classifier

Jayati Mukherjee, S. K. Parui, Utpal Roy

Section: Research Paper, Product Type: Journal Paper
Volume-07, Issue-01, Page no. 42-47, Jan-2019

Online published on Jan 20, 2019

Copyright © Jayati Mukherjee, S. K. Parui, Utpal Roy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Jayati Mukherjee, S. K. Parui, Utpal Roy, “Degraded Bangla Character Recognition by k-NN Classifier,” International Journal of Computer Sciences and Engineering, Vol.07, Issue.01, pp.42-47, 2019.

MLA Style Citation: Jayati Mukherjee, S. K. Parui, Utpal Roy. "Degraded Bangla Character Recognition by k-NN Classifier." International Journal of Computer Sciences and Engineering 07.01 (2019): 42-47.

APA Style Citation: Jayati Mukherjee, S. K. Parui, Utpal Roy (2019). Degraded Bangla Character Recognition by k-NN Classifier. International Journal of Computer Sciences and Engineering, 07(01), 42-47.

BibTex Style Citation:
@article{Mukherjee_2019,
author = {Jayati Mukherjee and S. K. Parui and Utpal Roy},
title = {Degraded Bangla Character Recognition by k-NN Classifier},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {1 2019},
volume = {07},
Issue = {01},
month = {1},
year = {2019},
issn = {2347-2693},
pages = {42-47},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=590},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
UR  - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=590
TI  - Degraded Bangla Character Recognition by k-NN Classifier
T2  - International Journal of Computer Sciences and Engineering
AU  - Jayati Mukherjee
AU  - S. K. Parui
AU  - Utpal Roy
PY  - 2019
DA  - 2019/01/20
PB  - IJCSE, Indore, INDIA
SP  - 42
EP  - 47
IS  - 01
VL  - 07
SN  - 2347-2693
ER  -


Abstract

Digitization of degraded Bangla documents by Optical Character Recognition (OCR) is an active research area. Many historical documents, particularly those from the 1960s and 1970s, are deteriorating day by day for lack of preservation, and their content needs to be retrieved. In this article, we present our recent study on the recognition of degraded printed document images in Bangla, the 7th most popular language in the world. The input to the proposed approach is a low-quality, degraded document image and the output is the sequence of recognized characters. First, preprocessing is applied to the document image to improve the quality of the scan. The proposed approach is analytic: segmentation is carried out line by line, then word by word, and finally character by character. The database used is the ISIDDI database, which contains 535 historical pages in TIF and JPG formats with different fonts, sizes, formats and, most importantly, different levels of degradation. After segmentation, we manually identified 320 classes among the segmented symbols and divided the character dataset into a training set (70%) and a test set (30%). For the training samples of the 320 classes, we computed the Histogram of Oriented Gradients (HOG) feature. Clusters for the 320 classes were then generated with the K-means clustering algorithm and labeled according to their classes. For each test character, the HOG feature is computed again and the k-nearest neighbour algorithm assigns the character to the class at minimum distance. We achieved 82.80% character (symbol) level accuracy on the 320 classes, computed from the confusion matrix.
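
The classification stage described above (per-class K-means clusters labeled by class, followed by nearest-neighbour assignment and a confusion matrix) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the number of clusters per class, the distance measure, or the value of k, so the choices below (two clusters per class, squared Euclidean distance, k = 1) are assumptions, and the short 2-D vectors stand in for real HOG descriptors. All function names are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, min(k, len(points)))]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def build_prototypes(train, k_per_class=2):
    """Cluster each class separately and tag every centroid with its label.

    train is a list of (feature_vector, label) pairs; k_per_class is an
    assumed value, not one taken from the paper.
    """
    by_class = defaultdict(list)
    for vec, label in train:
        by_class[label].append(vec)
    prototypes = []
    for label in sorted(by_class):
        for centroid in kmeans(by_class[label], k_per_class):
            prototypes.append((centroid, label))
    return prototypes


def classify(vec, prototypes, k=1):
    """k-NN over labeled prototypes: majority vote, ties go to the closest."""
    nearest = sorted(prototypes, key=lambda p: sq_dist(vec, p[0]))[:k]
    votes = defaultdict(int)
    for _, label in nearest:
        votes[label] += 1
    best = max(votes.values())
    for _, label in nearest:  # nearest is ordered by increasing distance
        if votes[label] == best:
            return label


def confusion_and_accuracy(test, prototypes, k=1):
    """Return the confusion counts {(true, predicted): n} and the accuracy."""
    confusion = defaultdict(int)
    for vec, true_label in test:
        confusion[(true_label, classify(vec, prototypes, k))] += 1
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return confusion, correct / len(test)


# Toy stand-ins for HOG vectors of two symbol classes.
train = [([0.0, 0.1], "ka"), ([0.2, 0.0], "ka"), ([0.1, 0.3], "ka"),
         ([10.0, 10.1], "kha"), ([9.8, 10.0], "kha"), ([10.2, 9.9], "kha")]
test = [([0.1, 0.1], "ka"), ([9.9, 10.0], "kha"), ([0.3, 0.2], "ka")]

prototypes = build_prototypes(train)
confusion, accuracy = confusion_and_accuracy(test, prototypes)
print(dict(confusion), accuracy)
```

Classifying against labeled cluster centroids rather than against every training sample keeps the number of distance computations per test character proportional to the number of prototypes, which may be one reason to cluster before the nearest-neighbour step.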

Keywords / Index Terms

Degraded document recognition, Bangla document analysis, K-Means, k-nearest neighbour
