Information classification and management (ICM) is an emerging technology that proponents say could be the bridge that carries storage users to file-level information lifecycle management (ILM) in the real world.
By and large, companies practicing ILM today tend to focus more on tiered storage, using broad classification, such as by application, to move data to the type of storage most appropriate for its value. ILM based on file-level classification remains something of a dream for storage users, but a number of startups hope to change that.
Automated lifecycle management of unstructured data in the file system won’t happen until the tools are there to address the problem. These tools — for indexing, classifying, and policy-based management of files — are beginning to arrive, and it is these tools that currently carve out the relatively new territory of ICM.
Defining ICM and ILM
The terms ICM and ILM are both somewhat murky. It will help to have some clear definitions before we discuss how the two relate. Analysts from the Taneja Group define ICM like this:
A class of application-independent software that utilizes advanced indexing, classification, policy, and data access capabilities to automate data management activities above the storage layer.
From this perspective, ICM is a product category, not an abstract description of technology. Products in this category can help to automatically determine the “class” to which each file belongs.
Once classes are established, they can be used as the basis for assigning a storage tier, in the case of storage optimization, or as the basis for other decisions, such as determining an appropriate retention period for regulatory compliance, or for assigning appropriate access control.
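To make the idea concrete, here is a minimal sketch of how a class, once assigned, might drive tiering, retention, and access decisions. The class names, tier labels, retention periods, and group names are all hypothetical, chosen only for illustration; no vendor's product is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tier: str             # storage tier the class maps to
    retention_years: int  # how long files in the class are kept
    readers: frozenset    # groups allowed to read files in the class


# Hypothetical classes and values, for illustration only.
POLICIES = {
    "financial-record": Policy("tier-1", 7, frozenset({"finance", "audit"})),
    "project-document": Policy("tier-2", 3, frozenset({"engineering"})),
    "scratch": Policy("tier-3", 0, frozenset({"everyone"})),
}

# Assumed fallback for files whose class has no explicit policy.
DEFAULT_POLICY = Policy("tier-2", 1, frozenset({"owner"}))


def policy_for(file_class: str) -> Policy:
    """Return the policy for a class, falling back to the default."""
    return POLICIES.get(file_class, DEFAULT_POLICY)
```

The point of the sketch is that the class, not the individual file, is the unit to which tiering, retention, and access decisions attach.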
And here is SNIA’s description of ILM:
A standards-based business-driven management practice that uses the value of information as the basis for setting policies and service requirements for information, data, and security services.
Two things stand out. First, in the words of Michael Peterson, program director for SNIA’s Data Management Forum and president and senior analyst at Strategic Research Corporation: “ILM is really a practice. It isn’t just a tool.” That means that a discussion of ICM’s role as a bridge to ILM is really a discussion of how well ICM helps to automate the process of lifecycle management.
Second, the “value of information” is key to the whole process. Again, according to Peterson, “the principle of this practice is putting information as the central actor, as the key element around which we make decisions.” What automated approaches to ILM have lacked are tools that can determine the value of unstructured enterprise data and assign storage policies based on that value. That’s where ICM comes in.
Enabling Technology
Kazeon is one of the players in the ICM space. Troy Toman, Kazeon’s marketing vice president, says that in his experience, most companies have little knowledge of the nature of their information, let alone its value. “They can tell you they might have 10 or 20 or 100 terabytes of data,” says Toman. “They probably can’t tell you much more than that. The first step in implementing ILM is classifying and getting visibility into your data.”
Brad O’Neill, senior analyst and consultant at the Taneja Group, says ICM is what makes true ILM possible. He says, “You could go so far as to say it is the enabling technology for ILM. Without indexing and classification of the content, you are still in the realm of data management and HSM, not true information management.”
The Players
There are a number of vendors in the ICM space, and each has its own approach to the problems of indexing, classifying and managing unstructured data. No single tool yet automates the entire process of classifying data and applying storage policy, because users still need to be in the loop to peg the relative value of data.
“Different types of data are important to different customers, to different groups within a customer organization,” says Buzz Walker, vice president of marketing and business development at Arkivio. “The input has to come from the business, rather than something you can intrinsically find out from the environment.”
Current players include:
Abrevity: Abrevity markets a distributed software classification solution that runs on individual servers and other Windows systems across the enterprise, an approach that the company claims scales more effectively than other architectures.
Abrevity’s software classifies files based on content, and on metadata extracted from the file system. “Our secret sauce is that we’re context-aware, not just content-aware,” says Eric Madison, Abrevity’s director of marketing. “We can give you, in context, where and why and how the terms are used, and not just classify based on raw words.”
Arkivio: Arkivio provides an integrated, tiered-storage management solution that uses file metadata to classify files against user-defined groups. Arkivio’s product then applies migration, retention, or other policies to files in those groups.
“First you discover, then you classify, build your policies,” says Walker. “When you start thinking about classification, people know what’s the most important data, people know what’s the least important data … everything else must be in between. You work at the two ends.”
Index Engines: Index Engines builds an appliance that sits in the backup stream, rapidly indexing files on their way to backup storage. According to CEO Tim Williams, “All the important data in the enterprise is being backed up. If we’re integrated with the backup process, you have, for the first time, an enterprise-wide index of all the data.”
Though indexing is not strictly part of classification, it is an important part of information management because it allows users to query managed data across storage tiers. Says CTO Gordon Harris, “Since we have a comprehensive index, we can answer the questions.”
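As a rough illustration of why a single index makes cross-tier queries possible, the sketch below builds a tiny inverted index that records which tier each file lives on. It is a generic textbook structure, not a description of Index Engines’ appliance; the class name, tier labels, and paths are hypothetical.

```python
import re
from collections import defaultdict


class FileIndex:
    """Toy inverted index: term -> set of (tier, path) pairs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, tier: str, path: str, text: str) -> None:
        # Split the text into lowercase alphanumeric terms.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[term].add((tier, path))

    def search(self, *terms: str) -> set:
        """Return (tier, path) pairs whose text contains every term."""
        if not terms:
            return set()
        matches = [self._postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*matches)
```

Because each entry carries its tier, one query returns matching files wherever they are stored.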
Kazeon: Kazeon’s appliance plugs into the network and scans unstructured files on CIFS and NFS volumes, indexing and classifying based on file content and file system metadata. There is also an active policy engine that can copy, move, and delete files.
Toman believes that metadata classification may be sufficient for storage optimization, the first part of ILM, but that content classification is required for domains such as regulatory compliance, privacy, and litigation support. Says Toman, “Ultimately, we believe this is kind of an enabling technology around which content-aware management applications can be built.”
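The distinction Toman draws can be sketched in a few lines. The first function below looks only at file system metadata, which is enough to choose a tier; the second has to read the file’s content to flag privacy-sensitive material. The thresholds, class labels, and the Social Security number pattern are assumptions made for illustration and do not reflect Kazeon’s rules.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for U.S. Social Security numbers; real
# compliance rules are far broader than a single regular expression.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def classify_by_metadata(extension: str, last_access: datetime, now: datetime) -> str:
    """Metadata-only classification: enough to pick a storage tier."""
    if now - last_access > timedelta(days=365):
        return "archive"
    if extension.lower() in {".tmp", ".bak"}:
        return "low-value"
    return "active"


def classify_by_content(text: str) -> str:
    """Content classification: needed to flag privacy-sensitive files."""
    if SSN_PATTERN.search(text):
        return "contains-personal-data"
    return "no-personal-data"
```

A file can look like an ordinary “active” document by its metadata and still carry personal data, which is why compliance work needs the second kind of check.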
Njini: Njini provides an in-band classification and policy engine that classifies files as they are stored. Being in-band currently allows Njini to control how many duplicates of a file are stored, based on user policy. Eliminating redundant copies can save significant storage, and modules planned for later release will also allow files to be stored on the most appropriate tier.
“Our game is not just to provide metadata,” says Phil Tee, chairman and CTO of Njini. “We believe that it is critical to take action on the information that you find.”
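One common way to act on duplicates at write time is to hash each file’s content and consult a policy limit before admitting another copy. The sketch below assumes that approach; the article does not say how Njini detects duplicates, and the class and method names are hypothetical.

```python
import hashlib
from collections import defaultdict


class DuplicateLimiter:
    """Admit at most max_copies files with identical content."""

    def __init__(self, max_copies: int):
        self.max_copies = max_copies
        self._paths_by_digest = defaultdict(list)

    def store(self, path: str, data: bytes) -> bool:
        """Return True if the write is admitted, False if policy rejects it."""
        digest = hashlib.sha256(data).hexdigest()
        paths = self._paths_by_digest[digest]
        if len(paths) >= self.max_copies:
            return False  # policy limit reached for this content
        paths.append(path)
        return True
```

A real in-band engine would presumably replace a rejected copy with a reference to an existing one rather than simply refusing the write; that step is left out here.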
StoredIQ: StoredIQ delivers an ICM appliance with a strong focus on content-based classification. Although StoredIQ competes with Kazeon in this arena, StoredIQ’s solution doesn’t include indexing, relying instead on integrating with third-party indexers such as Google Enterprise Search and Index Engines.
In fact, although users can move files from tier to tier using StoredIQ’s products, the company prefers to focus on classification and to work with other storage systems to build end-to-end tiered-storage solutions. Says StoredIQ CEO Bob Fernander, “We’re an intelligence layer. We’re not a storage management solution.”
Others: Other ICM players include Scentric, a startup still operating in stealth mode, and Trusted Edge. Trusted Edge’s product suite classifies files stored on user desktops and captures them, if specified by policy, to network repositories.
What of the major storage players, and their approach to ICM? O’Neill expects big names such as EMC, Network Appliance, HDS, IBM and HP to enter the ring shortly. “Expect to see all the major vendors including indexing and classification within their offerings by the middle of 2006,” he says.
O’Neill says that in many cases, these offerings will be brought about through partnerships or by acquiring technologies from smaller companies. He says ICM “is not an area that has received significant internal investment in most of these large players to date.” NetApp’s recent OEM deal with Kazeon may be on the leading edge of this wave.
ILM Beyond 1.0
The long-term vision of ILM, as outlined in the SNIA statement above, is obviously not here yet. But pieces of the technology that will help to support that vision are beginning to arrive. ICM tools are a critical piece because they automate classification, arguably the last completely manual part of the process.
StoredIQ’s Fernander says the first part of a first-generation ILM solution is heterogeneous HSM, a maturing technology. “The other piece,” he says, “is the classification piece, which is the last brick in the wall.”
O’Neill puts it in terms of different ILM generations. “You can argue that we have file-level ILM 1.0 today, it’s just not totally automatic,” he says. “ILM 2.0 is what most of the ICM players are all about right now: automating file-level classification.”
Further down the road, there will be a convergence of structured and unstructured data. O’Neill says, “ILM 3.0 would then address the automation and integration of database, e-mail and files into unified classification platforms, across all levels of the enterprise and across heterogeneous storage platforms. This is a project that will take place over the next few years, but it is inevitable.”
Peterson also sees the convergence, but says it isn’t clear which side, structured or unstructured, is going to drive it. “The database companies and the ECM companies are making tools as well,” says Peterson. “So there’s a fight going on to determine who’s going to be able to classify that unstructured data, and to bring it into a managed domain.”