What is deduplication technology?
Deduplication (sometimes called Single-Instance Storage, Capacity
Optimization or Factoring) is a data reduction technology intended to
eliminate redundant (duplicate) data on a storage system by saving only
one instance of each data item, in order to reduce disk space usage and
network bandwidth consumption. Deduplication technologies rely on an index which
tracks the data in the repository and allows for the identification of
data redundancy. The management software will look at the new data,
compare it to the data that already exists on the system, and then
store only the data which doesn't match existing data.
For example, suppose that a company has 100 employees and each
employee's mailbox holds around 1 GB. Most of the emails are the same,
however: emails distributed among company staff members or emails sent
to several staff members from outside. That's 100 GB of disk space
consumed to store basically the same information. Data deduplication
ensures that only the unique data is saved to disk. Subsequent copies of
the data are saved only as references which point to the stored copy, so
that end users still see their own files in place.
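The mailbox example can be sketched in a few lines of Python. This is
only a minimal illustration, not how any particular product works: the
SingleInstanceStore class, its method names and the choice of SHA-256 as
the index key are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.index = {}       # content digest -> stored bytes (one copy each)
        self.references = {}  # (owner, name) -> content digest

    def save(self, owner, name, content):
        digest = hashlib.sha256(content).hexdigest()
        # Store the bytes only if the index has never seen this content.
        if digest not in self.index:
            self.index[digest] = content
        # Every owner still gets an entry that points at the shared copy.
        self.references[(owner, name)] = digest

    def load(self, owner, name):
        return self.index[self.references[(owner, name)]]

    def stored_bytes(self):
        return sum(len(content) for content in self.index.values())


store = SingleInstanceStore()
memo = b"Quarterly results attached." * 1000
private = b"A message only user0 received."

for user in range(100):
    store.save(f"user{user}", "memo.eml", memo)   # same email, 100 mailboxes
store.save("user0", "private.eml", private)

print(len(store.references), "references,", len(store.index), "stored items")
```

All 100 users can still load their own copy of the memo, but its bytes
are kept only once; the references are what make each mailbox look
complete.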
There are three types of deduplication technologies:
- File deduplication. Only one copy of each identical file is stored.
This technology is also known as Single File Instance technology.
- Block-level deduplication. The information is divided into blocks and
only one copy of each identical block is stored.
- Byte-level deduplication. The content of the information to be
deduplicated is analyzed at byte level and only the unique data is
stored. This is the only technology which guarantees full elimination
of redundancy.
This means that different deduplication technologies offer different
levels of granularity, removing redundant portions of files down to
the block level or even to the byte level.
When evaluating a deduplication product, it's important to understand
the granularity offered by the platform.
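To make the idea of granularity concrete, here is a rough sketch of
block-level deduplication. It assumes fixed-size 4096-byte blocks and
SHA-256 digests as block identifiers; the function names and the
"recipe" structure are hypothetical, and real products choose their own
block sizes and indexing schemes.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def dedup_blocks(data, block_store):
    """Split data into fixed-size blocks and store each unique block once.

    Returns the "recipe": the ordered list of block digests needed to
    rebuild the data later.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # keep the first copy only
        recipe.append(digest)
    return recipe


def restore(recipe, block_store):
    """Rebuild the original data by concatenating the referenced blocks."""
    return b"".join(block_store[digest] for digest in recipe)


block_store = {}
version1 = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
version2 = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE

recipe1 = dedup_blocks(version1, block_store)
recipe2 = dedup_blocks(version2, block_store)

print(len(block_store), "unique blocks stored for 6 blocks written")
```

File-level deduplication would have stored both versions in full,
because the two files are not identical. At block level only the one
changed block is added.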
Benefits of deduplication technology.
Not storing duplicate pieces of data can result in huge savings in
disk space. For instance, byte-level deduplication technologies
can reduce the total amount of stored data by a ratio of 50:1 or more,
depending on the environment. In other words, if you are keeping a
terabyte of disk backups today, a 50:1 ratio reduces that to 20 GB.
The 980 GB of storage that is freed up means you can defer
additional storage purchases, possibly for years, before you need to
add more disks to your storage capacity.
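The arithmetic behind that figure is easy to check. The helper below is
just an illustrative function (its name is made up for this example),
and it takes one terabyte as 1000 GB.

```python
def size_after_dedup(original_gb, ratio):
    """Return (stored_gb, freed_gb) for a given N:1 deduplication ratio."""
    stored = original_gb / ratio
    return stored, original_gb - stored


stored_gb, freed_gb = size_after_dedup(1000, 50)  # 1 TB taken as 1000 GB
print(f"stored: {stored_gb:.0f} GB, freed: {freed_gb:.0f} GB")
# prints "stored: 20 GB, freed: 980 GB"
```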
Freeing up storage capacity also means that you can choose to keep
more data online. And because the deduplicated data is much smaller,
it can be sent via secure WAN to remote sites for replication or
disaster recovery purposes.
How does deduplication differ from other
similar technologies?
Data deduplication differs from compression in that compression
looks only for repeating patterns of information and reduces them.
For example, a file that is already compressed cannot be compressed
much further, even after it is modified, because its content has very
high entropy. Data deduplication works regardless of the data's
internal format. It just compares the content of the file with
previous versions and extracts the new, unique data. This can provide
a much greater data reduction capability than compression. In fact,
most products apply compression algorithms after deduplicating the
data to get an even higher data reduction.
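The entropy point can be checked with Python's standard zlib module.
The sample text below is generated only for this demonstration (the
word list and the fixed random seed are arbitrary choices), and the
exact sizes will vary with the data and the compressor.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = [b"backup", b"restore", b"block", b"index", b"vault",
         b"file", b"byte", b"data", b"copy", b"disk"]
text = b" ".join(rng.choice(words) for _ in range(20000))

once = zlib.compress(text)    # repeating patterns are found and reduced
twice = zlib.compress(once)   # the compressed output has few patterns left

print("original:        ", len(text), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
```

The first pass shrinks the text considerably. The second pass gains
essentially nothing, because the output of the first pass already looks
like random data to the compressor.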
Deduplication also differs from incremental or differential backups
in that, with deduplication, only the byte-level changes are backed up.
Incremental backups scan selected files for changes. If there is a
change in the file, even of a single bit, the whole file is saved in
the newest backup. If that file is 500 MB, all 500 MB go into the new
backup. Data deduplication technology stores only the pieces of data
that have changed, not the entire file.
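The difference can be put into numbers with a small comparison. This is
a simplified sketch: it uses a 512,000-byte stand-in instead of a 500 MB
file, and it deduplicates at an assumed 1024-byte block granularity
rather than at byte level. Both function names are hypothetical.

```python
import hashlib

BLOCK = 1024  # assumed block size for this sketch


def incremental_cost(old, new):
    """A file-based incremental backup re-saves the whole file on any change."""
    return len(new) if old != new else 0


def dedup_cost(old, new):
    """Bytes in blocks of `new` whose content is not already present in `old`."""
    known = {hashlib.sha256(old[i:i + BLOCK]).digest()
             for i in range(0, len(old), BLOCK)}
    cost = 0
    for i in range(0, len(new), BLOCK):
        block = new[i:i + BLOCK]
        if hashlib.sha256(block).digest() not in known:
            cost += len(block)
    return cost


old = bytes(range(256)) * 2000   # 512,000-byte stand-in for a large file
changed = bytearray(old)
changed[300_000] ^= 0x01         # flip a single bit
new = bytes(changed)

print(incremental_cost(old, new))  # 512000
print(dedup_cost(old, new))        # 1024
```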
How does Lortu deduplication technology
differ from other deduplication technologies?
There are several approaches to implement deduplication, and
even though each approach has its own advantages and drawbacks,
some are much better than others.
Let us explain the main differences between each approach:
Post-process deduplication vs. in-line deduplication:
The main advantage of post-process deduplication as opposed
to in-line deduplication is a higher backup throughput and a shorter
backup window. This is because the information is first
stored in the appliance and then deduplicated later, without
interfering with the backup process.
Lortu provides post-process deduplication.
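The separation between the two phases can be sketched as follows. This
is a generic illustration of the post-process idea and does not describe
Lortu's actual implementation; the PostProcessAppliance class and its
attributes are hypothetical names invented for this example.

```python
import hashlib


class PostProcessAppliance:
    """Hypothetical appliance: ingest first, deduplicate later."""

    def __init__(self):
        self.landing = []   # raw backups, written as they arrive
        self.unique = {}    # digest -> data, filled in by the later pass
        self.catalog = []   # digests of processed backups, in arrival order

    def ingest(self, backup):
        # Backup window: just append. No hashing or index lookups happen
        # on this path, so the backup is not slowed down.
        self.landing.append(backup)

    def deduplicate(self):
        # Runs after the backup window, outside the ingest path.
        while self.landing:
            backup = self.landing.pop(0)
            digest = hashlib.sha256(backup).hexdigest()
            self.unique.setdefault(digest, backup)
            self.catalog.append(digest)


appliance = PostProcessAppliance()
for backup in (b"monday full backup", b"monday full backup", b"tuesday data"):
    appliance.ingest(backup)

pending_before = len(appliance.landing)   # everything is stored raw at first
appliance.deduplicate()
print(pending_before, "raw backups ->", len(appliance.unique), "unique items")
```

An in-line design would do the hashing and lookup inside ingest()
instead, which saves the temporary landing space but puts that work on
the backup path.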
Byte-level differencing vs. pattern matching (storing a hash
for each pattern or block):
Pattern matching is less scalable than differencing as the data
to be deduplicated grows, because the table of hashes uses more
memory and CPU as it has to manage more data. However, its greatest
drawback is the restore time.
If backup time is critical, restore time is even more
critical. Since the patterns are spread over the full disk in very
small blocks of information, the system requires reads of one or two
clusters for each small pattern. This can mean that a restore
is more than 10 times slower than copying the non-deduplicated
information. With byte-level differencing, the information is stored
in much larger blocks, and the restore time is usually very close
to that of copying the non-deduplicated information.
Also, pattern matching technology requires several weeks before
the deduplication process becomes effective. With byte-level
differencing the deduplication is very effective from the second
backup, and effectiveness improves as new files are included in
the vault.
Lortu provides byte-level differencing deduplication.
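As a rough illustration of what byte-level differencing means in
general, the sketch below describes a second backup as a list of
"copy from the previous version" and "new data" operations. It uses
Python's standard difflib.SequenceMatcher as a stand-in differencing
engine; this is an assumption made for the example and not a
description of Lortu's algorithm, and the function names are
hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal new bytes."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reuse bytes already stored
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))   # only the new bytes
        # "delete" needs no entry: those old bytes are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def new_bytes(delta):
    """Number of literal bytes the delta has to store."""
    return sum(len(op[1]) for op in delta if op[0] == "data")


first_backup = b"The quick brown fox jumps over the lazy dog. " * 20
second_backup = first_backup.replace(b"lazy", b"sleepy", 1)

delta = make_delta(first_backup, second_backup)
print(len(second_backup), "bytes in second backup,",
      new_bytes(delta), "new bytes stored")
```

The unchanged data is referenced as a few long copy ranges rather than
many small scattered blocks, which is the property the restore-time
argument above relies on.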
Data agnostic vs. content-aware approach:
Data agnostic technologies work with any kind of information
or file format. The drawback of the content-aware approach is
that the technology needs to understand the format of the files.
If the file format is different than expected (a new version of
the application for instance), or if the application isn't
supported by the technology, the deduplication process is
not possible.
Lortu deduplication technology is data agnostic. It can
deduplicate any kind of data, regardless of file format or
type of file.