Beyond that, de-dupe could eventually end up in many other EMC products including even its high-end arrays, said EMC senior vice president Mark Sorenson.

"Likely places are our Celerra NAS products and I think even our other arrays," Sorenson said. "Perhaps one day de-duplication will be as ubiquitous as RAID," he added. Only last week NetApp began shipping software-based de-dupe for its combined file-level and block-level disk arrays.

Sorenson stressed that EMC is taking a measured approach to the technology. The company bought itself into de-duplication last November when it acquired start-up Avamar Technologies, whose software and staff are now the foundation for all of EMC’s de-dupe plans.

One of those plans is to build de-duplication into EMC's virtual tape library, which uses virtualization software licensed from FalconStor Software. Yesterday the company said it would ship this de-dupe extension early in 2008.

FalconStor is itself working on a de-dupe extension to that software, but there are questions about its performance and reliability, and EMC has already declared its intention to rely instead on its Avamar-inspired technology.

Currently the only EMC product to feature de-dupe is the Avamar backup system, which was tailored to provide remote backup of branch and small offices. It uses de-dupe to reduce backup volumes so that they can be sent over existing low-cost network links to backup targets at data centers.

EMC also plans to integrate some form of de-dupe directly into its NetWorker backup management tool, but is not giving any dates for that integration. Currently the only integration between the Avamar system and NetWorker is limited to NetWorker's ability to create backup copies of data that has been de-duplicated by Avamar; NetWorker itself cannot read that data.

De-duplication is achieved by breaking up data into chunks, inspecting the chunks and then replacing the duplicates with pointers to a chosen single instance. That means that data cannot be read until it has been reassembled. This and the de-duping process itself are the overheads that have so far restricted de-dupe mostly to backup applications – where it also happens to deliver the best results in terms of reducing data volumes.
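To make the mechanics concrete, here is a minimal Python sketch of chunk-level de-duplication along the lines described above. It is illustrative only, not EMC's or Avamar's actual implementation; the fixed 4 KB chunk size, the SHA-256 fingerprints and the function names are assumptions made for the example.

import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size for this sketch


def dedupe(data: bytes):
    """Split data into chunks, keep one copy of each unique chunk,
    and record an ordered list of fingerprints (pointers)."""
    store = {}        # fingerprint -> single stored instance of the chunk
    pointers = []     # one fingerprint per original chunk, in order
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:      # duplicates become pointers to the stored copy
            store[fp] = chunk
        pointers.append(fp)
    return pointers, store


def reassemble(pointers, store) -> bytes:
    """The data cannot be read until its chunks are reassembled in order."""
    return b"".join(store[fp] for fp in pointers)


if __name__ == "__main__":
    original = b"ABCD" * 2048 + b"WXYZ" * 1024   # highly repetitive sample data
    ptrs, chunks = dedupe(original)
    assert reassemble(ptrs, chunks) == original
    unique_bytes = sum(len(c) for c in chunks.values())
    print(f"{len(original)} bytes reduced to {unique_bytes} bytes of unique chunks "
          f"plus {len(ptrs)} pointers")

The sketch also shows where the overheads come from: fingerprinting every chunk costs processor cycles on the way in, and reads must wait for reassembly on the way out.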

But the overheads will be reduced, Sorenson said. "CPU cycles are cheap, and you'll see us make improvements to the process so that we can do more and more de-duping using fewer processor cycles," he said.

Most of the focus is on secondary storage, where the reassembly doesn't have to be done in milliseconds. "That's not the case for primary storage, but primary storage isn't where we're aiming de-dupe today," he said.