A Clustering Framework for Large Document Datasets

K.K. Mohbey, G.S. Thakur

Section: Research Paper, Product Type: Journal Paper
Volume-1, Issue-1, Page no. 26-30, Sep-2013

Online published on Sep 30, 2013

Copyright © K.K. Mohbey, G.S. Thakur. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: K.K. Mohbey, G.S. Thakur, “A Clustering Framework for Large Document Datasets,” International Journal of Computer Sciences and Engineering, Vol.1, Issue.1, pp.26-30, 2013.

MLA Style Citation: K.K. Mohbey, G.S. Thakur. "A Clustering Framework for Large Document Datasets." International Journal of Computer Sciences and Engineering 1.1 (2013): 26-30.

APA Style Citation: K.K. Mohbey, G.S. Thakur (2013). A Clustering Framework for Large Document Datasets. International Journal of Computer Sciences and Engineering, 1(1), 26-30.

BibTex Style Citation:
@article{Mohbey_2013,
author = {K.K. Mohbey and G.S. Thakur},
title = {A Clustering Framework for Large Document Datasets},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {9 2013},
volume = {1},
Issue = {1},
month = {9},
year = {2013},
issn = {2347-2693},
pages = {26-30},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=11},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=11
TI - A Clustering Framework for Large Document Datasets
T2 - International Journal of Computer Sciences and Engineering
AU - K.K. Mohbey
AU - G.S. Thakur
PY - 2013
DA - 2013/09/30
PB - IJCSE, Indore, INDIA
SP - 26
EP - 30
IS - 1
VL - 1
SN - 2347-2693
ER -

Abstract

A document set is a collection of documents of different types. Each document carries information that may be useful to its readers, which creates a need to cluster documents by similarity. A document may contain blog data, website access patterns, transaction records, or plain text. Clustering similar documents can reveal emerging trends in people's behavior, which is also useful from a business point of view. In this paper, we propose a clustering approach for large document sets. The proposed approach immediately assigns each document to an appropriate cluster. Experiments are conducted on the 20 Newsgroups dataset using Java and MATLAB, and the results are compared with existing methods. The experimental results show the effectiveness of the proposed approach for large document sets.

Keywords / Index Terms

Large Document Set, Similarity Measurement, Term Extraction, Dendrogram
