Open Access Article

A Review on Duplicate and Near Duplicate Documents Detection Technique

Patil Deepali E., Ghatage Trupti B., Takmare Sachin B., Patil Sushama A.

Section: Review Paper, Product Type: Journal Paper
Volume-4, Issue-3, Page no. 59-62, Mar-2016

Online published on Mar 30, 2016

Copyright © Patil Deepali E., Ghatage Trupti B., Takmare Sachin B., Patil Sushama A. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

IEEE Style Citation: Patil Deepali E., Ghatage Trupti B., Takmare Sachin B., Patil Sushama A., “A Review on Duplicate and Near Duplicate Documents Detection Technique,” International Journal of Computer Sciences and Engineering, Vol.4, Issue.3, pp.59-62, 2016.

MLA Style Citation: Patil Deepali E., Ghatage Trupti B., Takmare Sachin B., Patil Sushama A. "A Review on Duplicate and Near Duplicate Documents Detection Technique." International Journal of Computer Sciences and Engineering 4.3 (2016): 59-62.

APA Style Citation: Patil Deepali E., Ghatage Trupti B., Takmare Sachin B., Patil Sushama A., (2016). A Review on Duplicate and Near Duplicate Documents Detection Technique. International Journal of Computer Sciences and Engineering, 4(3), 59-62.

BibTex Style Citation:
@article{E._2016,
author = {Patil Deepali E. and Ghatage Trupti B. and Takmare Sachin B. and Patil Sushama A.},
title = {A Review on Duplicate and Near Duplicate Documents Detection Technique},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {3 2016},
volume = {4},
number = {3},
month = {3},
year = {2016},
issn = {2347-2693},
pages = {59-62},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=828},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=828
TI  - A Review on Duplicate and Near Duplicate Documents Detection Technique
T2  - International Journal of Computer Sciences and Engineering
AU  - Patil Deepali E.
AU  - Ghatage Trupti B.
AU  - Takmare Sachin B.
AU  - Patil Sushama A.
PY  - 2016
DA  - 2016/03/30
PB  - IJCSE, Indore, INDIA
SP  - 59
EP  - 62
IS  - 3
VL  - 4
SN  - 2347-2693
ER  -

Abstract

Duplicated web pages that have identical structure but different data can be regarded as clones. The identification of similar and near-duplicate pairs in a large collection is a significant problem with widespread applications. The problem has been studied for diverse data types in diverse settings. A contemporary form of the problem is the efficient identification of near-duplicate web pages. This is challenging at web scale because of the voluminous data and the high dimensionality of the documents. The fundamental intention of this review is to present an up-to-date survey of the existing literature on duplicate and near-duplicate detection of general documents and of web documents in web crawling. A classification of the existing duplicate and near-duplicate detection techniques and a detailed description of each are presented so as to make the review more comprehensible.
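
As a rough illustration of what near-duplicate detection involves, the sketch below compares two documents by the overlap of their word shingles. It is not taken from the review itself: it assumes word-level shingling with Jaccard similarity, one widely used approach, and the function names, the shingle size `k` and the `threshold` value are illustrative choices only.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.5):
    """Flag a pair as near-duplicate when shingle overlap reaches threshold."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog"
    page_b = "the quick brown fox jumps over the lazy cat"
    print(jaccard(shingles(page_a), shingles(page_b)))
    print(is_near_duplicate(page_a, page_b))
```

Comparing every pair of pages this way is quadratic in the collection size, which is one reason the problem is hard at web scale; practical systems typically replace the full shingle sets with compact sketches or fingerprints.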

Keywords / Index Terms

Web crawling, web pages, web mining, web content mining, duplicate document, near duplicate detection
