Everyone has seen the eye-popping charts that show the rate of data creation growing almost exponentially. As an example, last year I heard a talk at the IEEE Conference on Massive Data Storage that explained that filming the television program Deadliest Catch creates almost 1PB per episode!
In addition, an article I recently wrote pointed out that switching to higher-resolution video monitoring cameras can increase storage requirements by at least a factor of 6 and more likely a factor of 13.5. Multiply that by the number of public and private video cameras, and you can easily see that storing this data is going to be a grand challenge.
The need to store all of this video data from public cameras is justified by the recent use of video to identify the Boston Marathon bombers. Just imagine being able to see really clear images of their faces instead of grainy images because new higher-resolution cameras were used.
The more cynical readers might say, "We're going to store a lot of data, so what?" In classic Abbott and Costello fashion, my answer is "yes."
Let me explain. The precise reason we are storing the data is to use it. It's not simply a matter of taking a data stream and looking at it, making some notes and then erasing it. It's a matter of taking a huge amount of data that is all probably different and creating information from it (the so-called "Big Data" approach).
There is no way a single person or even a small group could know everything about the data. In the case of surveillance video cameras, we will need to know what time the video was taken. Where was the camera located? What was the angle of the camera? When was the camera last serviced? What is the camera model? And so on. It is impossible for one person to know all of this data, so we need some additional information about these cameras and possibly other sources. All of this additional information beyond just the raw video stream is what we call metadata.
Introduction to Metadata
Metadata is data about the data. With POSIX file systems, you get some standard metadata such as:
- File ownership (User ID and Group ID)
- File permissions (world, group, user)
- File times (atime, ctime, mtime)
- File size
- File name
- File type (regular file or directory)
There are several other standard bits of metadata, such as links, which I didn't mention but are used by the file system. All of this information or data describes the data contained in the file.
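To make the list above concrete, here is a minimal Python sketch that reads this standard metadata using the standard library's `os.stat`. The function name `posix_metadata` and the dictionary keys are my own choices for illustration, not part of any standard. Note also that on non-POSIX systems some fields (such as the user and group IDs) may simply come back as zero.

```python
import os
import stat


def posix_metadata(path):
    """Return the standard metadata the file system keeps for `path`.

    The key names are illustrative only; the values come from os.stat().
    """
    st = os.stat(path)
    return {
        "name": os.path.basename(path),            # file name
        "uid": st.st_uid,                          # owning user ID
        "gid": st.st_gid,                          # owning group ID
        "permissions": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "size_bytes": st.st_size,                  # file size
        "atime": st.st_atime,                      # last access time
        "mtime": st.st_mtime,                      # last content modification
        "ctime": st.st_ctime,                      # last metadata change (POSIX)
        "is_directory": stat.S_ISDIR(st.st_mode),  # file or directory?
        "hard_links": st.st_nlink,                 # number of hard links
    }
```

Calling `posix_metadata("some_file")` on any existing path returns a dictionary you can print or store. Everything in it describes the file, and none of it describes what the contents actually mean.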
But what if we want, or need, to add metadata to the file or somehow associate additional information with the file itself? In the case of video data, you may want to add some information to the video file about the general weather during the time the video was taken. Or you may want to add some data about how to determine the weather for that time. Or you may want to point to data about activities in the area, such as festivals or conventions, during the time of the video. Or you may want to point to police data or sales data in the case of a camera being used by a business, or whatever data you think is important and relevant. Or you may want to attach notes about events in the video. The point is that you may want to add or associate data or information to the raw data itself. This is metadata.
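One simple way to associate this kind of information with a file is a "sidecar" file that sits next to the raw data. The sketch below is only one possible approach under my own assumptions: the `.meta.json` naming convention, the function names, and the example field names are all hypothetical, not an established standard.

```python
import json
from pathlib import Path


def sidecar_path(data_path):
    """Hypothetical convention: metadata for X lives in X.meta.json."""
    p = Path(data_path)
    return p.with_name(p.name + ".meta.json")


def write_metadata(data_path, metadata):
    """Store user-supplied metadata next to the data file as JSON."""
    side = sidecar_path(data_path)
    side.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return side


def read_metadata(data_path):
    """Load the metadata for a data file; empty dict if none was written."""
    side = sidecar_path(data_path)
    if not side.exists():
        return {}
    return json.loads(side.read_text())
```

For a surveillance clip, someone might call `write_metadata("cam12.mp4", {"camera_model": "...", "weather": "...", "notes": "..."})` with whatever fields matter to them. The code is the easy part. A person still has to supply values that are meaningful and accurate.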
Here is the rub. Despite what some people want, there is no magic pixie dust that creates the metadata. Many people discuss how to search data using metadata tags and how glorious it would be to have the ability to start developing graph representations of data to help create information, and ultimately some sort of knowledge. However, all of these discussions are based on the premise that the metadata is there and that it is meaningful and accurate.
In my opinion, this can't be done by automated tools. The only way to get this metadata attached to, or associated with, a data stream is to have someone do the work.
About a year ago, I went to a National Science Foundation (NSF) conference that brought librarians and high-performance computing (HPC) people together to discuss how to handle metadata for digital files. It was a very interesting and useful conference, and it brought home a few points.
The first point was that the librarians thought that the metadata categories would be the same for all data since that is the way things happen in their world. But I would hate to have a biologist try to figure out how to cram useful information into the same tags that climate researchers use.
The second point was that the motivation for HPC researchers to add metadata to their files was as close as you could get to zero. These observations started me thinking about how and why people add metadata to their existing data.
I'm not saying that HPC researchers are not motivated or that they are lazy; the opposite is true. However, when a research project is finished, the researchers, and most importantly their management, want them to move on to the next project, so metadata tagging is very low on the list of priorities. As things stand today, if the researchers leave without metadata tagging and without leaving behind extensive notes, the data becomes somewhat worthless because no one knows what it is. Eventually, it gets squirreled away or even erased.