What is deduplication technology?
Deduplication (sometimes called Single-Instance Storage, Capacity
Optimization or Factoring) is a data reduction technology intended to
eliminate redundant (duplicate) data on a storage system by saving
only one instance of each data item, reducing the disk space and network
bandwidth consumed. Deduplication technologies rely on an index that tracks the
data in the repository and allows for the identification of data redundancy.
The management software will look at the new data, compare it to data that
already exists on the system, and then store only data that doesn't match
existing data.
For example, suppose that a company has 100 employees and each
employee's mailbox holds around 1 GB. However, most of the emails are the
same: messages distributed among staff members or sent to several staff
members from outside. That's 100 GB of disk space consumed to store
basically the same information. Data deduplication ensures that only the
unique data is saved to disk. Subsequent copies of the data are saved
only as references that point to the stored copy, so end users still see
their own files in place.
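The mailbox example can be sketched as a toy single-instance store. This
is only an illustration, assuming a SHA-256 digest as the index key; the
class name SingleInstanceStore and its methods are hypothetical and do not
describe any particular product.

```python
import hashlib


class SingleInstanceStore:
    """Toy content-addressed store: each unique payload is kept once."""

    def __init__(self):
        self.index = {}       # digest -> payload (the single stored copy)
        self.references = {}  # (owner, name) -> digest

    def put(self, owner, name, payload):
        """Record payload for owner; return True only if new data was written."""
        digest = hashlib.sha256(payload).hexdigest()
        is_new = digest not in self.index
        if is_new:
            self.index[digest] = payload
        # Every owner gets a reference, whether or not the data was new.
        self.references[(owner, name)] = digest
        return is_new

    def get(self, owner, name):
        """Follow the owner's reference back to the stored copy."""
        return self.index[self.references[(owner, name)]]

    def stored_bytes(self):
        return sum(len(p) for p in self.index.values())


# The same message delivered to 100 mailboxes (hypothetical data).
store = SingleInstanceStore()
memo = b"Quarterly results attached." * 1000
for user in range(100):
    store.put(f"user{user}", "memo.eml", memo)
```

After the loop the store holds one copy of the message and 100 references,
and each user can still read the message through their own reference.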
There are three types of deduplication technologies:
- File deduplication. Only one copy of each identical file is stored.
This technology is also known as Single File Instance technology.
- Block level deduplication. The information is divided into blocks and
only one copy of each identical block is stored.
- Byte level deduplication. The content of the information to be
deduplicated is analyzed at byte level and only the unique data is stored.
Of the three, this is the only one that guarantees full elimination of
redundancy.
This means that different deduplication technologies provide
different levels of granularity, removing redundant portions of files
potentially down to the block level or even to the byte level.
When evaluating a deduplication product, it's important to understand
the granularity its platform offers.
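To make the effect of granularity concrete, here is a minimal sketch of
block level deduplication. It assumes fixed-size 4096-byte blocks and
SHA-256 digests as the index key; the block size and the function names
dedup_blocks and rebuild are illustrative choices, not taken from any product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


def dedup_blocks(data, index):
    """Split data into fixed-size blocks, store unseen blocks in index,
    and return the list of digests needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        index.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return recipe


def rebuild(recipe, index):
    """Reassemble the original data from its block digests."""
    return b"".join(index[d] for d in recipe)


# Version 1: four distinct blocks. Version 2: one byte changed in block 1.
v1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
v2 = v1[:5000] + b"X" + v1[5001:]

index = {}
recipe1 = dedup_blocks(v1, index)
recipe2 = dedup_blocks(v2, index)
```

File deduplication would treat v1 and v2 as two different files and store
both in full (eight blocks). At block level only the one modified block is
added, so the index ends up with five blocks.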
Benefits of deduplication technology.
Not storing duplicate data can yield huge savings in disk space.
For instance, byte level deduplication
technologies can reduce the total amount of stored data by a ratio
of 50:1 or more, depending on the environment. In other words, if
you keep a terabyte (1,000 GB) of disk backups on your VTL (virtual
tape library) today, at 50:1 that figure drops to 20 GB. The 980 GB of
storage that is freed up means you can defer additional VTL storage
purchases for years before you need to add more spindles to your VTL's
storage capacity.
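The arithmetic behind these figures is simple enough to check in a few
lines. The helper name stored_after_dedup is hypothetical, and 1 TB is
taken as 1,000 GB.

```python
def stored_after_dedup(original_gb, ratio):
    """Space still needed after deduplicating at the given ratio (e.g. 50 for 50:1)."""
    return original_gb / ratio


original_gb = 1000                            # 1 TB of disk backups
stored_gb = stored_after_dedup(original_gb, 50)
saved_gb = original_gb - stored_gb
saved_percent = 100 * saved_gb / original_gb
```

At 50:1 this gives 20 GB stored and 980 GB saved, a 98% reduction. The
actual ratio depends on how much redundancy the environment contains.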
The freed capacity also lets you keep more data online, and because
the deduplicated data is much smaller it can be sent over a secure
WAN to remote sites for disaster recovery or replication.
How does deduplication differ from other similar technologies?
Data deduplication differs from compression in that compression
looks only for repeating patterns within the data it is given and
encodes them more compactly. For example, a file that is already
compressed can hardly be compressed any further after it is modified,
because compressed data has very high entropy. Data deduplication
reduces the data to its unique content regardless of its internal format:
it simply compares the content of the file with previous versions and
extracts the new unique data. This can provide a much greater data
reduction than compression alone. In fact, most products apply
compression algorithms after deduplicating the data to achieve an even
higher data reduction.
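The difference can be illustrated with a small experiment. This sketch
assumes fixed 4096-byte blocks, SHA-256 digests and zlib as the
compressor, and uses pseudo-random bytes as a stand-in for high-entropy
data such as an already compressed file. It ignores the small cost of
storing the block references.

```python
import hashlib
import random
import zlib

BLOCK = 4096  # assumed block size for this illustration

# Two versions of a high-entropy file; version 2 has one block replaced.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))
v2 = v1[:BLOCK] + b"\x00" * BLOCK + v1[2 * BLOCK:]

# Compression alone: each version is compressed independently,
# so the data shared between the two versions is stored twice.
compressed_only = len(zlib.compress(v1)) + len(zlib.compress(v2))

# Deduplication first: keep each unique block once across both versions,
# then compress the unique blocks.
unique = {}
for version in (v1, v2):
    for offset in range(0, len(version), BLOCK):
        block = version[offset:offset + BLOCK]
        unique.setdefault(hashlib.sha256(block).digest(), block)
dedup_then_compress = sum(len(zlib.compress(b)) for b in unique.values())
```

Compression gains almost nothing on the random blocks, so compressing the
two versions separately costs nearly the size of both. Deduplicating first
leaves 17 unique blocks instead of 32, and compression then shrinks the one
low-entropy block that was added.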
Deduplication also differs from incremental or differential
backups in that, with deduplication, only the byte-level changes are
backed up. Incremental backups scan selected files for changes. If there
is a change in a file, even of a single bit, the whole file is saved in
the newest backup. If that file is a 500 MB file, all 500 MB are written
to the new backup. Data deduplication technology stores only the
pieces of data that have changed, not the entire file.
Kondar deduplication technology.
In line with our philosophy of developing software components to be
integrated into third party companies’ software or hardware products, Kondar
is not a final product or a closed component. Instead, Kondar is a
technology that can provide the API that best suits our clients'
products. Moreover, we can fine-tune our technology to get the
best performance and easiest integration with your products.
At its core, Kondar deduplication technology compares two
blocks of data and finds the differences between them at byte level.
The main feature of Kondar is that it can do this on very large
blocks of data, at byte level, and very fast.
Kondar is data-format independent and can work with any kind of
data: files, memory buffers, disk images or data in streaming mode.
Here are just some examples to demonstrate the power and
flexibility of Kondar deduplication technology:
- Kondar can receive a stream with new data, compare it with other
previously stored data and create an output stream containing the data
which is unique in the new stream. All of this is done at byte-level.
This approach is also known as delta-based caching deduplication.
- Kondar can also work in a client/server architecture where it
compares a new version of a file or piece of data to be deduplicated
(on the client) with the initial version of that file (on the server),
transferring only a minimal amount of information to create a patch
file on the server with the differences. This approach is very useful
for data replication products.
- Another variant of this approach is to store on the client a small
file representing the initial version of a file or piece of data to be
deduplicated, and do the comparison without having the initial file cached
locally. The patch file with the differences can then be sent to a remote
location.
- Kondar can provide a Lortu proprietary file system API to create a
data vault. Your software can send any kind of file to the Kondar component
and Kondar will apply single-file instance and byte-level deduplication
technology comparing all new files with all information stored so far in
the vault. This approach guarantees that the vault holds only unique data.
Thanks to its byte-level deduplication algorithms, Kondar can achieve
data reduction ratios of between 10:1 and 100:1.
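The delta-based examples above can be illustrated with a byte-level patch
between two versions of the same data. This is a minimal sketch built on
Python's standard difflib module; it is not Kondar's algorithm or API, and
the names make_patch and apply_patch are hypothetical.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Describe `new` as ranges copied from `old` plus literal new bytes."""
    patch = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            patch.append(("copy", i1, i2))       # bytes already on the server
        elif tag in ("replace", "insert"):
            patch.append(("data", new[j1:j2]))   # bytes unique to the new version
        # "delete": nothing to emit, the old range is simply not copied
    return patch


def apply_patch(old, patch):
    """Rebuild the new version from the old version and the patch."""
    out = bytearray()
    for op in patch:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical data: the new version changes one word of the old one.
old = b"The quick brown fox jumps over the lazy dog. " * 20
new = old.replace(b"lazy", b"sleepy", 1)

patch = make_patch(old, new)
literal_bytes = sum(len(op[1]) for op in patch if op[0] == "data")
```

Only the literal bytes and the copy ranges need to travel to the side that
holds the old version, which can then rebuild the new version with
apply_patch. SequenceMatcher is convenient for a demonstration but is far
too slow for very large blocks of data.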
For companies interested in implementing deduplication
technology in hardware, Kondar algorithms have been specifically
designed to be implemented in FPGA, ASIC or proprietary multi-processor
hardware, where they can take advantage of the hardware's parallelism
and multithreading capabilities to deliver high throughput and
scalability.
There are many other possibilities and we will be glad to
discuss the API that best fits your product.