There is No Magic Pixie Dust for Metadata


Everyone has seen the eye-popping charts that show the rate of data creation growing almost exponentially. As an example, last year I heard a talk at the IEEE Conference on Massive Data Storage that explained that filming the television program Deadliest Catch creates almost 1PB per episode!

In addition, an article I recently wrote pointed out that switching to higher-resolution video monitoring cameras can increase the data storage by at least a factor of 6 and more likely a factor of 13.5. Multiply that by the number of public and private video cameras, and you can easily see that storing this data is going to be a grand challenge.

The need to store all of this video data from public cameras is justified by the recent use of video to identify the Boston Marathon bombers. Just imagine being able to see really clear images of their faces instead of grainy images because new higher-resolution cameras were used.

The more cynical readers might say, “We’re going to store a lot of data—so what?” In classic Abbott and Costello fashion, my answer is “yes.”

Let me explain. The precise reason we are storing the data is to use it. It’s not simply a matter of taking a data stream and looking at it, making some notes and then erasing it. It’s a matter of taking a huge amount of data that is all probably different and creating information from it (the so-called “Big Data” approach).

There is no way a single person or even a small group could know everything about the data. In the case of surveillance video cameras, we will need to know what time the video was taken. Where was the camera located? What was the angle of the camera? When was the camera last serviced? What is the camera model? And so on. It is impossible for one person to know all of this data, so we need some additional information about these cameras and possibly other sources. All of this additional information beyond just the raw video stream is what we call metadata.

Introduction to Metadata

Metadata is data about the data. With POSIX file systems, you get some standard metadata, such as:

File ownership (User ID and Group ID)
File permissions (world, group, user)
File times (atime, ctime, mtime)
File size
File name
Is it a true file or directory?

There are several other standard bits of metadata, such as links, which I didn’t mention but are used by the file system. All of this information or data describes the data contained in the file.
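
As a rough illustration of the list above, here is a minimal Python sketch that reads this standard metadata through the operating system's stat call. The function name posix_metadata and the dictionary keys are my own choices for illustration, and the exact meaning of some fields (ctime in particular) depends on the platform.

```python
import os
import stat
import tempfile


def posix_metadata(path):
    """Return the standard file system metadata for path as a plain dict."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "uid": st.st_uid,                           # owning user ID
        "gid": st.st_gid,                           # owning group ID
        "permissions": stat.filemode(st.st_mode),   # e.g. "-rw-r--r--"
        "size_bytes": st.st_size,
        "atime": st.st_atime,                       # last access
        "mtime": st.st_mtime,                       # last content modification
        "ctime": st.st_ctime,                       # inode change time on POSIX systems
        "is_directory": stat.S_ISDIR(st.st_mode),
        "hard_links": st.st_nlink,                  # number of links to the file
    }


if __name__ == "__main__":
    # Create a throwaway file so the example does not depend on any real path.
    with tempfile.NamedTemporaryFile(suffix=".dat") as tmp:
        tmp.write(b"raw video bytes")
        tmp.flush()
        for key, value in posix_metadata(tmp.name).items():
            print(f"{key}: {value}")
```

Notice that nothing in this output says anything about what the file actually contains.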

But what if we want, or need, to add metadata to the file or somehow associate additional information with the file itself? In the case of video data, you may want to add some information to the video file about the general weather during the time the video was taken. Or you may want to add some data about how to determine the weather for that time. Or you may want to point to data about activities in the area, such as festivals or conventions, during the time of the video. Or you may want to point to police data or sales data in the case of a camera being used by a business, or whatever data you think is important and relevant. Or you may want to attach notes about events in the video. The point is that you may want to add or associate data or information to the raw data itself. This is metadata.
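
One simple way to associate extra information with a file, sketched here under my own assumptions, is a "sidecar" file that sits next to the data and holds the tags as JSON. The ".meta.json" suffix, the function names, and the example tags (weather, notes) are hypothetical and chosen only for illustration; extended attributes or a database would be other options. Note that the sketch only stores what a person types in. It does not create any metadata on its own.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(data_file):
    """Hypothetical convention: metadata lives in <data file name>.meta.json."""
    return Path(str(data_file) + ".meta.json")


def read_metadata(data_file):
    """Return the tags stored for data_file, or an empty dict if there are none."""
    path = sidecar_path(data_file)
    return json.loads(path.read_text()) if path.exists() else {}


def add_metadata(data_file, **tags):
    """Merge new tags into the sidecar file and return the merged dict."""
    merged = read_metadata(data_file)
    merged.update(tags)
    sidecar_path(data_file).write_text(json.dumps(merged, indent=2, sort_keys=True))
    return merged


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        video = Path(workdir) / "camera17.mp4"
        video.write_bytes(b"raw video bytes")  # stand-in for a real video stream
        add_metadata(video, weather="light rain")
        add_metadata(video, notes="street festival two blocks north")
        print(read_metadata(video))
```

The raw data file is never modified; the tags travel alongside it, which also means they can get lost if someone copies the data without the sidecar.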

Here is the rub. Despite what some people want, there is no magic pixie dust that creates the metadata. Many people discuss how to search data using metadata tags and how glorious it would be to have the ability to start developing graph representations of data to help create information, and ultimately some sort of knowledge. However, all of these discussions are based on the premise that the metadata is there and that it is meaningful and accurate.

In my opinion this can’t be done by automated tools. The only way to get this metadata attached to, or associated with, a data stream, is to have someone do the work.

About a year ago, I went to a National Science Foundation (NSF) conference that brought librarians and high-performance computing (HPC) people together to discuss how to handle metadata for digital files. It was a very interesting and useful conference, and it brought home a few points.

The first point was that the librarians thought that the metadata categories would be the same for all data since that is the way things happen in their world. But I would hate to have a biologist try to figure out how to cram useful information into the same tags that climate researchers use.

The second point was that the motivation for HPC researchers to add metadata to their files was as close as you could get to zero. These observations started me thinking about how and why people add metadata to their existing data.

I’m not saying that HPC researchers are unmotivated or lazy; the opposite is true. However, when a research project is finished, the researchers and, most importantly, their management want to move on to the next one, so metadata tagging is very low on the list of priorities. As things stand today, if the researchers leave without tagging the data and without leaving behind extensive notes, the data becomes somewhat worthless because no one knows what it is. Eventually, it gets squirreled away or even erased.

To solve this problem, some people have suggested that there are ways to automate the metadata tagging of data. For example, if you take an HDF5 data file, you can scan the data file for the headers and then use them as metadata. However, I think it’s fairly obvious that would not work. The headers may only have meaning to some researchers and not to others. Or there may not even be useful headers in the data file.

Despite the desire of some people to believe that you can scan a data file to gather metadata, it’s just not possible. There is no magic pixie dust to sprinkle on the file to generate the metadata. It will require human intervention to create the tags.

Motivation – Carrots, Sticks, and Games

Researchers aren’t motivated to tag their data files, but without their help, the data will quickly become worthless. As a result, we have a dilemma.

But there are projects trying to solve this problem. One of them is a basic carrot-and-stick model. You reward good behavior, such as adding metadata to files, with a carrot, and you punish bad behavior, such as not adding metadata to the file, with a stick. A carrot could be more allocation on the HPC system or more storage space, or even something as basic as a monetary reward or a reward of food. (Having been a graduate student, I can safely say that we would do anything for free food.)

In the case where files are not tagged, a stick could be used. This could be a reduction in HPC allocation or a reduction in storage space.

There are variations on the carrot and stick theme, such as using just a carrot approach or using just a stick approach. The efficacy of using this approach in your research groups is really up to you, but the approach has been studied before and can lead to interesting results.

Another, more recent, approach is called gamification. Gamification is fairly easy to understand: take the techniques people use in playing games and use them to solve non-game problems. My favorite example of this is using the game Doom to kill system processes. The idea is that your character in Doom kills system processes the same way it would kill zombies (or whatever they are in Doom).

An additional example is that of Stack Overflow. It allows people to post questions and then other people provide answers. Based on the answers and how they are disseminated, people answering questions gain “points” and/or “badges.” They can be earned in a variety of ways, such as simply posting links to the questions and answers via Facebook and Twitter. Moreover, people’s answers can be rated by other people, with higher ratings earning more “status” within the community. As a user’s reputation points go past certain levels, they gain additional privileges. At the higher end, they get the privilege of moderating the Stack Overflow site.

The challenge in gamification becomes how to use these techniques, among others, to motivate people to tag their data files. Ideally, you want to employ the technique(s) while the researchers are creating the data. Otherwise, you end up using a large stick approach at the end of the project (e.g., metadata tag your data or you can’t move on to the next project).

You also have to be very careful about using any of these techniques because it could encourage people to just enter anything for the metadata, which doesn’t help anyone. In some ways, I think of this as “metadata malware” because you are doing damage to the search and ultimately any analysis done with the search data.

Consequently, you need to have a review system that checks the quality of the metadata. This could be a part of the gamification process as well. This checking process would result in something like Wikipedia where there are people who check entries and tag them if there are flaws, all in the interest of improving the quality of the entry.

I think checking the quality of the metadata could be a much more important task than one might think. For example, if the metadata is incorrect and someone uses it, it could invalidate their search or any conclusions drawn from it. The only way around this is to check every piece of data you use prior to using it, which defeats the purpose of creating metadata tags.
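
To make the review idea a little more concrete, here is a small Python sketch of a first-pass check that flags records for a human reviewer. It is only a sketch under my own assumptions: the required field names and the list of placeholder values are hypothetical, and a check like this can only catch obviously empty or junk entries. It cannot tell whether a plausible-looking tag is actually correct, so it supports the people doing the review rather than replacing them.

```python
# Hypothetical fields a surveillance video record is expected to carry.
REQUIRED_FIELDS = ("camera_id", "location", "recorded_at", "notes")

# Hypothetical junk values people type just to get past a required field.
PLACEHOLDERS = {"", "n/a", "na", "none", "tbd", "todo", "asdf", "test", "xxx"}


def review_flags(metadata):
    """Return a list of problems in one metadata record, in field order."""
    flags = []
    for field in REQUIRED_FIELDS:
        if field not in metadata:
            flags.append(f"missing field: {field}")
            continue
        value = str(metadata[field]).strip().lower()
        if value in PLACEHOLDERS:
            flags.append(f"placeholder value in field: {field}")
    return flags


if __name__ == "__main__":
    record = {"camera_id": "cam-17", "location": "asdf", "recorded_at": "TBD"}
    for problem in review_flags(record):
        print(problem)
```

Records that come back with flags would go to a reviewer, much like a flagged Wikipedia entry; records with no flags still need spot checks, because passing this test says nothing about accuracy.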

Summary

Fundamentally, metadata is difficult. You have to begin by designing and testing what data needs to be saved as metadata for a particular file, and this takes time and experimentation. You may have to develop tools as part of the process to help users easily add metadata to the files, again adding time. And finally, you have to develop the process where users create metadata tags for their files. This process includes quality checking as well.

Moreover, you need to find ways of motivating researchers to tag their data. The general consensus seems to be that asking researchers to go through all of their data at the end of a project, just to metadata tag it, is an exercise in frustration for everyone involved. Consequently, we need some motivation techniques to help researchers keep up with their metadata tagging as they create the data and use it. In my opinion, you may even need a psychiatrist to help with these steps.

There is no magic pixie dust that can be spread on the metadata tagging process to make it better. There is no magic way to ensure that the creation of useful metadata happens automatically. And finally, there is no magic that can make this all go away.

You can’t avoid the need for adding meaningful, useful and accurate metadata tags to files. And that means you have to put in the effort.