Data Deduplication

What is “data deduplication”? Why is it done? How is it done? What are the primary methods, and what are the pros and cons of each? These are the basic questions to answer in order to understand data deduplication.
Data deduplication is a specialized form of data compression. It is generally employed in situations where very large quantities of data need to be stored. The process analyzes the initial data set, and each subsequent new data set, to identify byte patterns. As data is updated, added to, stored and backed up, all new data is compared against the previously identified byte patterns. Segments that match an existing pattern are not stored again; instead, they are replaced with a reference to the original copy of that pattern.
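
To make the idea of "store each byte pattern once and reference it afterwards" concrete, here is a minimal sketch in Python. It is not how any particular product works: it assumes fixed-size segments, SHA-256 digests as the pattern identifiers, and an in-memory dictionary as the store. The names DedupStore, CHUNK_SIZE, put, get and stored_bytes are made up for this illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


class DedupStore:
    """Toy store that keeps one copy of each unique segment (hypothetical)."""

    def __init__(self):
        self.chunks = {}  # digest -> segment bytes, each unique pattern stored once
        self.files = {}   # file name -> ordered list of digests (the references)

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the segment only if this pattern has not been seen before.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        # Rebuild the file by following its references in order.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")

print(store.get("b.txt"))    # the original bytes of b.txt
print(store.stored_bytes())  # 16 bytes stored for 24 bytes of input
```

The two files share the segments AAAA and BBBB, so only four unique segments are kept. Real systems often use variable-size segments and must also account for the space taken by the references themselves, which this sketch ignores.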

Data deduplication is done to reduce redundancy of data within a data storage system. Reducing that redundancy has a significant positive “domino effect”. When the number of bytes being stored is significantly reduced, the total amount of storage space required is reduced as well. In turn, the hardware, bandwidth and infrastructure requirements can all be reduced. Ultimately, both the costs and the carbon footprint associated with data storage can be significantly reduced.

A cursory look at the backup of a typical e-mail system will illustrate the point. A typical e-mail system might contain 100 instances of the exact same 1MB attached file. Without data deduplication, each time the system is backed up, all 100 copies of that 1MB attachment would be saved, which would require 100MB of storage space instead of just 1MB. Over time, this creates a very negative and counter-productive “domino effect”.
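
The arithmetic of the e-mail example can be checked with a short Python sketch. It assumes 1MB means 1024 × 1024 bytes, uses a block of zero bytes as a stand-in for the attachment, treats the whole attachment as a single pattern, and ignores the small amount of space the references would need.

```python
import hashlib

MB = 1024 * 1024                 # assumption: binary megabyte
attachment = bytes(MB)           # stand-in for the 1MB attached file
mailbox = [attachment] * 100     # 100 messages carrying the same attachment

# Without deduplication every copy is written to the backup.
raw_bytes = sum(len(a) for a in mailbox)

# With deduplication each distinct attachment is written once.
unique = {}
for a in mailbox:
    unique.setdefault(hashlib.sha256(a).digest(), a)
dedup_bytes = sum(len(a) for a in unique.values())

print(raw_bytes // MB, "MB without deduplication")
print(dedup_bytes // MB, "MB with deduplication")
```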

There are two basic approaches to data deduplication. The first is “in-line” data deduplication and the second is “post-process” data deduplication. The in-line approach deduplicates the data before it goes to its ultimate storage point. This means the data is deduplicated at its creation point, before it is ever stored. Conversely, the post-process approach first stores the new data and then analyzes it later, at the storage point.

There are basic pros and cons for each approach. In-line data deduplication requires less upfront storage space and infrastructure than the post-process approach, because the data is deduplicated before it is ever written to disk. On the downside, storing the data can take somewhat longer, since the required deduplication calculations have to be done before the data is stored. There is also some concern that doing the calculations at the creation point can slightly degrade the performance of the system at that point. Post-process deduplication, on the other hand, can require significantly more upfront storage space but does not degrade performance of the system at the data creation point. The choice of method depends to some extent on the level of data integrity and the disaster recovery time requirements involved.
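
The storage side of this trade-off can be sketched in Python as follows. This is a toy model only: the function names inline_write and post_process_write are hypothetical, whole blocks are identified by their SHA-256 digest, and "peak" simply counts the bytes held at the busiest moment. It does not model the extra processing time of the in-line approach.

```python
import hashlib


def digest(block):
    return hashlib.sha256(block).hexdigest()


def inline_write(blocks):
    """Deduplicate each block before it reaches storage."""
    store = {}
    peak = 0
    for block in blocks:
        # The hash is computed on the write path, before anything is stored.
        store.setdefault(digest(block), block)
        peak = max(peak, sum(len(b) for b in store.values()))
    return store, peak


def post_process_write(blocks):
    """Store everything first, then deduplicate in a later pass."""
    landing = list(blocks)                # raw data written as-is
    peak = sum(len(b) for b in landing)   # space needed before the pass runs
    store = {}
    for block in landing:                 # later, at the storage point
        store.setdefault(digest(block), block)
    return store, peak


blocks = [b"alpha", b"beta", b"alpha", b"alpha", b"beta"]
inline_store, inline_peak = inline_write(blocks)
post_store, post_peak = post_process_write(blocks)

print(inline_peak)  # 9 bytes: only the two unique blocks are ever held
print(post_peak)    # 23 bytes: all five blocks are held until the later pass
```

Both functions end up with the same deduplicated store; they differ in how much space is needed along the way and in when the hashing work is done.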

Companies with deduplication technology:

www.backblaze.com
www.carbonite.com
www.iozeta.com
www.dropbox.com
www.livedrive.com
