Classification of Text and Images from PDF Using Graph Based Technique

D. Selvanayagi

Section: Research Paper, Product Type: Journal Paper
Volume 7, Issue 3, Page no. 1141-1146, Mar-2019

CrossRef-DOI: https://doi.org/10.26438/ijcse/v7i3.11411146

Online published on Mar 31, 2019

Copyright © D. Selvanayagi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: D. Selvanayagi, “Classification of Text and Images from PDF Using Graph Based Technique,” International Journal of Computer Sciences and Engineering, Vol.7, Issue.3, pp.1141-1146, 2019.

MLA Style Citation: Selvanayagi, D. "Classification of Text and Images from PDF Using Graph Based Technique." International Journal of Computer Sciences and Engineering 7.3 (2019): 1141-1146.

APA Style Citation: Selvanayagi, D. (2019). Classification of Text and Images from PDF Using Graph Based Technique. International Journal of Computer Sciences and Engineering, 7(3), 1141-1146.

BibTex Style Citation:
@article{Selvanayagi_2019,
author = {D. Selvanayagi},
title = {Classification of Text and Images from PDF Using Graph Based Technique},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {3 2019},
volume = {7},
number = {3},
month = {3},
year = {2019},
issn = {2347-2693},
pages = {1141--1146},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=3980},
doi = {10.26438/ijcse/v7i3.11411146},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v7i3.11411146
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=3980
TI  - Classification of Text and Images from PDF Using Graph Based Technique
T2  - International Journal of Computer Sciences and Engineering
AU  - Selvanayagi, D.
PY  - 2019
DA  - 2019/03/31
PB  - IJCSE, Indore, INDIA
SP  - 1141
EP  - 1146
IS  - 3
VL  - 7
SN  - 2347-2693
ER  - 

Abstract

E-books now play an important role in learning across all fields, whether they are read on a personal computer, laptop, or mobile phone. Of the various e-book formats available, PDF is the most widely used because it retains the original format of the document. Segmentation allows document content to be reused, but existing systems segment only the text content and do not consider non-text elements such as graphs, tables, and images. In this research, layout analysis is performed by extracting both text objects and non-text objects from the PDF document and segmenting them separately using Support Vector Machine (SVM) classifiers, so that the final output consists of text objects and non-text objects as separate groups. The method combines a bottom-up approach for text line extraction with a top-down approach that divides the graph tree created by Kruskal’s algorithm into subgraphs, using the Euclidean distance between adjacent vertices. For each segmented text and non-text object, features of different dimensions are extracted for labeling. Several e-book PDF documents were tested, and sample input and output PDF documents are shown in the experimental results.
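
The graph step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes each page object is reduced to a single 2D point (for example, the centre of its bounding box), builds a minimum spanning tree over those points with Kruskal's algorithm using Euclidean distance as the edge weight, and then cuts tree edges longer than a threshold to obtain subgraphs. The function names and the `max_gap` threshold are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def _find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def kruskal_mst(points):
    """Return MST edges (weight, i, j) over the complete Euclidean graph."""
    parent = list(range(len(points)))
    # Every pair of points is a candidate edge, weighted by Euclidean distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    mst = []
    for weight, i, j in edges:
        root_i, root_j = _find(parent, i), _find(parent, j)
        if root_i != root_j:  # the edge joins two different components
            parent[root_i] = root_j
            mst.append((weight, i, j))
    return mst


def split_into_segments(points, max_gap):
    """Cut MST edges longer than max_gap and return groups of point indices."""
    parent = list(range(len(points)))
    for weight, i, j in kruskal_mst(points):
        if weight <= max_gap:  # keep short edges, drop long ones
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(_find(parent, idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical object centres: three close together, two far away.
points = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10)]
print(split_into_segments(points, max_gap=2.0))  # prints [[0, 1, 2], [3, 4]]
```

Each resulting group of indices would correspond to one candidate segment. In a pipeline like the one the abstract describes, features extracted from such segments would then be passed to the SVM classifier to label them as text or non-text.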

Keywords / Index Terms

E-book, PDF, Kruskal’s algorithm, Euclidean distance, SVM
