The amount of data we want to collect, archive, and search is staggering. Metadata allows us to quickly find the data files we are interested in. However, storing and searching the metadata alone has become a “big data” problem in its own right. One important aspect of this is where you store the metadata.
The craze of “big data” is upon us, and the world is adjusting to what it means, how to use it, and how to build systems for it. Feeding the big data beast is a sea of sensors. For example, there are video cameras everywhere—outside stores, inside stores, at intersections, on helicopters, in dash cams, and in the phones people carry. There are road sensors, sensors in our cars, sensors scattered through wooded areas, and sensors along bridges. There are also domain-specific sensors, such as those for the power grid, the oil and gas industry, hospitals, ISPs, websites, weather, the oceans, the military, and on and on.
All of this data has a common thread—the need for metadata.
Metadata is simply data about data. For example, metadata can include where a sensor is located (GPS coordinates), the time period of a specific recording, the direction the sensor was facing, the sensor’s firmware version, its model, and more.
You can also “tag” files with new metadata describing information that is typically discovered by post-processing the data. In the case of a camera, these tags could be time stamps marking where something interesting happens (perhaps along with a note describing the event itself). Other tags could be pointers to related sources of information, such as other cameras or weather data.
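As a concrete, purely hypothetical illustration, the metadata for a single camera recording might look something like the following Python sketch; the field names and values are made up for this example and do not follow any standard schema:

    # Hypothetical metadata for one camera recording; field names and values
    # are illustrative only, not a standard schema.
    recording_metadata = {
        "sensor_model": "ACME-CAM-2000",
        "firmware": "1.4.2",
        "gps": (38.8977, -77.0365),            # where the sensor is located
        "start_time": "2014-06-01T14:00:00Z",  # time period of the recording
        "end_time": "2014-06-01T15:00:00Z",
        "heading_degrees": 270,                # direction the sensor was facing
        # Tags added later by post-processing:
        "tags": [
            {"timestamp": "2014-06-01T14:23:10Z", "note": "vehicle stops at curb"},
            {"related_source": "camera-17"},   # pointer to a related source
        ],
    }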
It’s obvious that the usefulness of metadata depends on its quality. If the metadata isn’t accurate, then using the associated raw data will result in a degraded—if not outright failed—analysis. Some metadata has to be created by people and cannot be automated, so there is always the possibility of error.
Understanding which metadata is important for a specific data file, and how to make it useful to researchers, is an extremely important question. It is also probably a question that has not only a technological solution but also a sociological and psychological one.
But one seemingly simple question has a huge impact on the use of metadata: where do you store the metadata?
A Place for Your Stuff
When I originally investigated the question of where to place metadata, I explored two options. The first was to put the metadata in a central location for all of the data. The second was to store the metadata with the data itself.
The first option is the one used by many search and archive systems. The idea is fairly simple—gather metadata about a specific file and store it, typically in a database. You can then search the database as you like, hopefully finding the files that contain the information you are interested in (assuming the metadata is correct—but that’s another story).
One of the outputs from the search should be the location of the file(s) of interest, that is, the fully qualified name containing the full path to each file. You can then copy the file(s) into some sort of working storage and have at it.
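As a rough sketch of how such a centralized index might work, the following uses SQLite; the schema, column names, and query are illustrative assumptions, not a description of any particular product:

    import sqlite3

    # Minimal sketch of a centralized metadata index (illustrative schema).
    conn = sqlite3.connect("metadata_index.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS metadata (
            path       TEXT PRIMARY KEY,   -- fully qualified file name
            sensor     TEXT,
            start_time TEXT,
            end_time   TEXT,
            tags       TEXT
        )
    """)

    # A search returns the full paths of the matching files, which can then
    # be copied into working storage.
    rows = conn.execute(
        "SELECT path FROM metadata WHERE sensor = ? AND start_time >= ?",
        ("ACME-CAM-2000", "2014-06-01T00:00:00Z"),
    ).fetchall()
    for (path,) in rows:
        print(path)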
The dangers of centralizing the metadata mostly concern keeping the mapping between the metadata and the files consistent. For example, you need a mechanism to update the centralized metadata server when the metadata for the various files changes. Ideally, this update mechanism should be fairly fast; otherwise, search results can be out of date. How you define “fast” is up to you, based on your users and usage model.
Buried in this update mechanism is a problem: What happens if the database and the files are no longer in sync? For example, what happens if a file is moved so that its full path in the database is no longer valid?
The result is obvious: the database is no longer valid, at least with respect to that file. Hopefully, the update mechanism can tell the database that the file has moved and either create new metadata for the new location or update the existing metadata to reflect it. In either case, the update window affects how current the database is.
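For example, a sketch of the kind of update hook you would need when a file moves, assuming the SQLite index sketched above, might look like this; the longer it takes for something like this to run after the move, the longer searches return stale paths:

    import sqlite3

    def record_file_move(db_path, old_path, new_path):
        """Keep the central index in sync when a file is moved (sketch only)."""
        conn = sqlite3.connect(db_path)
        conn.execute("UPDATE metadata SET path = ? WHERE path = ?",
                     (new_path, old_path))
        conn.commit()
        conn.close()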
A third aspect that needs attention is the data integrity of the database itself. You will need to protect the database with backups, copies, or something similar. Don’t forget that the database is primarily used for reads, so you need to pay attention to the size of the database and the rate of read errors. Building an index on consumer SATA drives, as some manufacturers do, means that when you read as little as 100GB you are likely to hit a read error, which forces a rebuild if you built the storage with RAID controllers and can cause further problems during the rebuild.
The second option, storing the metadata with the data, is very appealing because now you have to worry about the data integrity of just one system rather than two. If the file moves, the metadata moves with it, and you can add metadata to a file at any time because it stays with the file.
Ideally, with good tools for copying and moving metadata along with files, you could easily copy or move the data files somewhere else. For example, if you copied a file to working storage, its metadata would come with it. You could then update the file’s metadata and copy the file back, taking the updated metadata with it.
One way to achieve this is with extended attributes (xattrs). Many file systems support extended attributes, and there are ways for users to add metadata to a file via xattrs and ways to read it back. Some file systems impose limitations on extended attributes, such as how much data can be stored in them, but others do not. Regardless, being able to store metadata along with the data is a very attractive proposition.
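As a small sketch of what this looks like in practice, Python exposes xattrs on Linux through os.setxattr() and related calls; the attribute names and values below are made up, and user-namespace xattrs must be supported and enabled on the file system:

    import os

    path = "recording_0001.mp4"   # example data file

    # Attach metadata directly to the file via user-namespace extended
    # attributes (Linux-specific; values must be bytes).
    os.setxattr(path, "user.sensor_model", b"ACME-CAM-2000")
    os.setxattr(path, "user.start_time", b"2014-06-01T14:00:00Z")

    # Read the metadata back.
    for name in os.listxattr(path):
        print(name, os.getxattr(path, name).decode())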
But just like anything else, there are downsides to storing the metadata with the data. The first one that comes to mind is: how do you efficiently search the metadata?
Searching basically boils down to walking the file system tree, examining the metadata of each file, and returning any matches. Depending on the number of files and the structure of the tree, this could take a long time, much of it unnecessary.
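A minimal sketch of such a search, assuming the metadata lives in user-namespace xattrs as above, makes the cost obvious: every file in the tree has to be visited:

    import os

    def search_by_xattr(root, attr, wanted):
        """Walk the tree under root and return files whose xattr matches (sketch)."""
        matches = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    value = os.getxattr(path, attr)
                except OSError:        # attribute missing or not supported
                    continue
                if value.decode() == wanted:
                    matches.append(path)
        return matches

    # Example: find every file tagged with a particular sensor model.
    print(search_by_xattr("/archive", "user.sensor_model", "ACME-CAM-2000"))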
Even once you have found the files you need by searching their metadata, there are unfortunately very few tools and techniques for copying a file, along with its metadata, to some other (working) storage. For example, NFS does not transmit a file’s xattrs, so you cannot use it to copy a file from an archive or collection to a working location without losing the metadata. There are ways to copy both the file and the metadata, but you need to pay very close attention to ensure that they work as expected.
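One way to be explicit about it is to copy the xattrs yourself rather than trusting the transport to do it; the following is a sketch of that idea (tools such as rsync with its --xattrs option, or Python’s shutil.copy2(), attempt something similar, but it is worth verifying the behavior on your systems):

    import os
    import shutil

    def copy_with_xattrs(src, dst):
        """Copy a file and explicitly carry its extended attributes along (sketch)."""
        shutil.copyfile(src, dst)                    # copies the file data only
        for name in os.listxattr(src):
            os.setxattr(dst, name, os.getxattr(src, name))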
Better Approach
I admit that I thought the second approach was better because the metadata always stayed with the data, so you could always access it. With a centralized metadata index, you have to spend additional resources to protect that database and ensure its accuracy, in addition to paying attention to the data protection of the data itself. However, given the incredible size of some data collections, coupled with the projections for new collections and archives, I now believe that neither approach by itself is sufficient.
Rather, I believe a hybrid approach is a more desirable one.
I still fundamentally believe that metadata needs to be stored with the data. The basic reason for this is that the metadata is about the data—it is really all part of the same thing. Separating them makes no sense to me.
But at the same time, walking a tree with potentially billions of files is not practical. To overcome this, we need a centralized index of the metadata in a framework that is designed for searching. But the authority of all metadata is still the file itself.
If we create an archive and start pushing data into it, that gives us an opportunity to grab the metadata and store it in a centralized index. We can gather the metadata for a file when the file is moved into the archive.
Hopefully, the archive has mechanisms that indicate when a file in the archive has changed, kicking off a metadata acquisition step that updates the central index. With archives that have a REST interface, however, there is no obvious way to “update” a file. If you “get” a file from the archive and then change it, most of the time you must “put” the entire file into the archive again; to the archive it is a “new” file. Some archives do allow you to update files in place, but the mechanisms are not easy to use, and it’s usually simpler to “get” the file, modify it, and “put” the changed file back into the archive. In these cases, the “put” operation initiates the metadata acquisition step, making life much easier.
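A sketch of that metadata acquisition step, assuming the xattr names and the SQLite index schema used in the earlier sketches, might be a small hook that runs after every “put”:

    import os
    import sqlite3

    def index_on_put(db_path, archived_path):
        """After a file is put into the archive, harvest its xattrs into the
        central index (sketch; attribute names and schema are illustrative)."""
        def attr(name):
            try:
                return os.getxattr(archived_path, name).decode()
            except OSError:
                return None

        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS metadata (path TEXT PRIMARY KEY,"
                     " sensor TEXT, start_time TEXT, end_time TEXT, tags TEXT)")
        conn.execute(
            "INSERT OR REPLACE INTO metadata (path, sensor, start_time, end_time, tags)"
            " VALUES (?, ?, ?, ?, ?)",
            (archived_path,
             attr("user.sensor_model"),
             attr("user.start_time"),
             attr("user.end_time"),
             attr("user.tags")))
        conn.commit()
        conn.close()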
The authority for all metadata is the files themselves, which carry the metadata with them (using xattrs, in my opinion). If the centralized index fails for some reason, you have to spend some time walking the file system to “re-gather” the metadata, but you haven’t lost the metadata for good.
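Re-gathering is then just a tree walk that replays the same acquisition step; for example, reusing the index_on_put() sketch from above:

    import os

    def rebuild_index(db_path, root):
        """Rebuild the central index from the authoritative xattrs (sketch;
        reuses index_on_put() from the previous example)."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                index_on_put(db_path, os.path.join(dirpath, name))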
Summary
Who would have thought that something as simple as where to store metadata could result in such a lengthy discussion? But in today’s world of millions to billions of files in individual data collections and archives, where you store your metadata is a key component in making the data actually usable in a practical way.
Putting all of the metadata in a centralized index alone is impractical because you now have to provide good data protection for both the data and the index. Simply moving a file without updating the index could wreak havoc with someone’s search.
Similarly, putting the metadata with the data using something like xattrs is not as desirable, because each data search means that all of the metadata has to be collected via a tree-walk and searched. While keeping the metadata with the data is very natural, it could easily result in much longer searches than necessary.
I believe the better solution is to combine aspects of both into a single solution. Leave the metadata with the data, but as the data is pushed into the collection/archive, you can extract the metadata for use in a centralized index. Then the index is used for data searching.
With a simple REST interface, it is fairly easy to keep the index up to date. But in the event that the index is lost or corrupted, it is possible to walk the tree structure of the collection/archive and recover the metadata.
Even if you don’t agree with these concepts, at the very least I hope this article has caused some people to think about where they want to store their metadata and how it is used for searching. If possible, write about your solution and the path you took to get there. I’m sure there are many people who would like to learn from what you have done.