Deduplicates In Big Data: A Technical Survey

A. Sahaya Jenitha, V. Sinthu Janita Prakash

Section: Survey Paper, Product Type: Journal Paper
Volume 06, Issue 02, Pages 59-65, Mar 2018

Online published on Mar 31, 2018

Copyright © A. Sahaya Jenitha, V. Sinthu Janita Prakash. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: A.Sahaya Jenitha, V.Sinthu Janita Prakash, “Deduplicates In Big Data: A Technical Survey,” International Journal of Computer Sciences and Engineering, Vol.06, Issue.02, pp.59-65, 2018.

MLA Style Citation: A.Sahaya Jenitha, V.Sinthu Janita Prakash "Deduplicates In Big Data: A Technical Survey." International Journal of Computer Sciences and Engineering 06.02 (2018): 59-65.

APA Style Citation: A.Sahaya Jenitha, V.Sinthu Janita Prakash, (2018). Deduplicates In Big Data: A Technical Survey. International Journal of Computer Sciences and Engineering, 06(02), 59-65.

BibTex Style Citation:
@article{Jenitha_2018,
author = {A. Sahaya Jenitha and V. Sinthu Janita Prakash},
title = {Deduplicates In Big Data: A Technical Survey},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {3 2018},
volume = {06},
Issue = {02},
month = {3},
year = {2018},
issn = {2347-2693},
pages = {59-65},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=206},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
UR - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=206
TI - Deduplicates In Big Data: A Technical Survey
T2 - International Journal of Computer Sciences and Engineering
AU - A. Sahaya Jenitha
AU - V. Sinthu Janita Prakash
PY - 2018
DA - 2018/03/31
PB - IJCSE, Indore, INDIA
SP - 59-65
IS - 02
VL - 06
SN - 2347-2693
ER -

           

Abstract

Deduplication is the task of identifying records in a repository that represent the same object or entity. The difficulty is that the same data may be represented differently in each database. When databases are merged, the same entity can appear more than once under different schemas, writing styles, or misspellings; such duplicates are called replicas. Removing replicas from repositories improves information quality and saves processing time. With the growth of cloud computing through virtualization technology, the number of VMs being created is increasing rapidly, which in turn increases the number of data centres. Backup in virtualized environments takes a snapshot of a VM, called a VM image, and moves it to a backup device. VMs duplicate data for many purposes, such as backup, fault tolerance, consistency, disaster recovery, and high availability, which results in unnecessary consumption of resources such as network bandwidth and storage space. Data deduplication is the process of detecting and removing duplicate data, thereby reducing the amount of data stored, the energy consumed, and the network bandwidth used. This paper describes deduplication methods for large-scale databases (Big Data) and several deduplication techniques, such as Extreme Binning, MAD2, and Multi-Level Deduplication, in which deduplication is performed in backup services. It also describes the Cloud spider and Liquid deduplication techniques for VM images in Big Data drawn from cloud environments, and compares them on several factors.
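To make the general idea of detecting and removing duplicate data concrete, the following is a minimal sketch of hash-based, fixed-size chunk deduplication. It is not Extreme Binning, MAD2, Liquid, or any other technique surveyed in the paper; the function names, the 4096-byte chunk size, the choice of SHA-256 as the fingerprint, and the two synthetic "images" are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Returns (store, recipe): store maps a fingerprint to the chunk bytes,
    recipe is the ordered list of fingerprints needed to rebuild the data.
    Names and chunk size are illustrative assumptions.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # A chunk whose fingerprint is already known is not stored again.
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the chunk store and recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Two hypothetical VM images that share most of their blocks.
base_block = b"A" * 4096
image_a = base_block * 3 + b"B" * 4096
image_b = base_block * 3 + b"C" * 4096
original = image_a + image_b

store, recipe = deduplicate(original)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
print(len(original), "bytes before,", stored_bytes, "bytes after")
```

In this toy input, the shared blocks are stored once and referenced several times from the recipe, which is where the storage and bandwidth savings mentioned above would come from. Real systems typically add content-defined chunking and an on-disk fingerprint index, which this sketch omits.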

Keywords / Index Terms

Deduplication, Big data, Cloud, Live Virtual Machine Migration, Cloud spider, MAD2, Extreme Binning, Liquid, SAFE, Multi-Level Deduplication.
