The Metadata Storage Problem - Page 2
But just like anything else, there are some downsides to storing the metadata with the data. The first one that comes to mind is how do you effectively search the metadata?
It basically boils down to walking the file system tree structure, reading the metadata of each file, and returning the matches. Depending upon the number of files and the shape of the tree, this could take a long time, much of it wasted on files that do not match.
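To make the cost concrete, here is a minimal sketch of such a tree-walk search in Python. It assumes a Linux system where the metadata lives in xattrs; the function names and the `user.project` key in the usage note are made up for illustration, and the metadata reader is a parameter so that some other source could be swapped in.

```python
import os
from typing import Callable, Dict, Iterator


def read_xattrs(path: str) -> Dict[str, bytes]:
    """Return the xattrs of a file, or {} if none can be read."""
    if not hasattr(os, "listxattr"):  # os.listxattr is only available on Linux
        return {}
    try:
        return {name: os.getxattr(path, name) for name in os.listxattr(path)}
    except OSError:  # e.g. the file system does not support xattrs
        return {}


def search_tree(root: str, key: str, value: bytes,
                read_meta: Callable[[str], Dict[str, bytes]] = read_xattrs
                ) -> Iterator[str]:
    """Yield every file under root whose metadata has key == value."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Every file is touched, whether or not it ends up matching.
            if read_meta(path).get(key) == value:
                yield path
```

A call such as `list(search_tree("/archive", "user.project", b"alpha"))` reads the metadata of every file under the root, so the run time grows with the total number of files rather than with the number of matches.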
If you find the files you need by searching the metadata of all of the files, unfortunately, there are very few tools and techniques for copying a file, including its metadata, to some other storage (working storage). For example, NFS has traditionally not transmitted the xattrs of a file, preventing you from using it to copy a file from an archive/collection to a working location. There are ways to copy the file and the metadata, but you need to pay very close attention to ensure that it works as anticipated.
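One way to "pay very close attention" is to verify the metadata after every copy. The sketch below is an illustration, not a recommended tool: it uses Python's `shutil.copy2`, which on Linux also copies xattrs where possible, and then compares the xattrs of source and destination. The function names are hypothetical.

```python
import os
import shutil
from typing import Callable, Dict, List


def get_xattrs(path: str) -> Dict[str, bytes]:
    """Return the xattrs of a file, or {} if none can be read."""
    if not hasattr(os, "listxattr"):  # Linux-only API
        return {}
    try:
        return {name: os.getxattr(path, name) for name in os.listxattr(path)}
    except OSError:  # e.g. the target file system has no xattr support
        return {}


def copy_with_metadata(src: str, dst: str,
                       read_meta: Callable[[str], Dict[str, bytes]] = get_xattrs
                       ) -> List[str]:
    """Copy src to dst and return the metadata keys that did not survive."""
    shutil.copy2(src, dst)  # data, timestamps and (on Linux) xattrs where possible
    before, after = read_meta(src), read_meta(dst)
    # A key counts as lost if it is missing or its value changed.
    return sorted(k for k, v in before.items() if after.get(k) != v)
```

If the returned list is not empty, the copy should be treated as incomplete, because the destination file system or transport silently dropped some of the metadata.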
I admit that I thought the second approach was better because the metadata always stayed with the data, allowing you to always access the metadata. For the option of a centralized metadata index, you had to spend additional resources to protect that database and ensure its accuracy, in addition to paying attention to the data protection of the data itself. Given the incredible size of some data collections, coupled with the projections on the size of the new collections/archives, I now believe that either approach by itself is not sufficient.
Rather, I believe a hybrid approach is a more desirable one.
I still fundamentally believe that metadata needs to be stored with the data. The basic reason for this is that the metadata is about the data—it is really all part of the same thing. Separating them makes no sense to me.
But at the same time, walking a tree with potentially billions of files is not practical. To overcome this, we need a centralized index of the metadata in a framework that is designed for searching. But the authority of all metadata is still the file itself.
If we create an archive and start pushing data into it, that gives us an opportunity to grab the metadata and store it in a centralized index. We can gather the metadata for a file when the file is moved into the archive.
Hopefully, the archive has mechanisms that indicate when a file in the archive has changed, kicking off a metadata acquisition step that updates the central index. However, with archives that have a REST interface, there is no obvious way to "update" a file. If you "get" a file from the archive and then change it, most of the time you must "put" the entire file into the archive again. To the archive it is a "new" file. There are some archives that allow you to update files, but the mechanisms are not easy to use, and it's much easier to just "get" the file, modify it, and then "put" the changed file back into the archive. For these cases, the "put" operation initiates a metadata acquisition step, which makes life much easier.
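The "put triggers metadata acquisition" idea can be sketched with a toy, in-memory archive. Nothing here is a real archive API: the `IndexedArchive` class, its methods, and the table layout are all assumptions for illustration, with SQLite standing in for the central index and a plain dictionary standing in for the metadata attached to each object.

```python
import sqlite3
from typing import Dict, List


class IndexedArchive:
    """Toy object store whose put() also refreshes a central metadata index."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.metadata: Dict[str, Dict[str, str]] = {}  # stands in for xattrs
        self.index = sqlite3.connect(":memory:")
        self.index.execute(
            "CREATE TABLE meta (name TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (name, key))")

    def put(self, name: str, data: bytes, metadata: Dict[str, str]) -> None:
        # The object and its attached metadata remain the authority.
        self.objects[name] = data
        self.metadata[name] = dict(metadata)
        # Metadata acquisition step: replace any stale index rows.
        with self.index:  # commits on success, rolls back on error
            self.index.execute("DELETE FROM meta WHERE name = ?", (name,))
            self.index.executemany(
                "INSERT INTO meta (name, key, value) VALUES (?, ?, ?)",
                [(name, k, v) for k, v in metadata.items()])

    def get(self, name: str) -> bytes:
        return self.objects[name]

    def search(self, key: str, value: str) -> List[str]:
        rows = self.index.execute(
            "SELECT name FROM meta WHERE key = ? AND value = ? ORDER BY name",
            (key, value))
        return [row[0] for row in rows]
```

Because a changed file always comes back through `put`, the index is refreshed at the one place where the archive learns about the change, and searches never have to touch the objects themselves.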
The authority for all metadata is the files themselves, which have the metadata attached (using xattrs, in my opinion). If the centralized index fails for some reason, you have to spend some time walking the file system to "re-gather" the metadata, but you haven't fundamentally lost your index for good.
Who would have thought that something as simple as where to store metadata could result in a lengthy discussion? But in today's world of millions to billions of files in individual data collections and archives, where you store your metadata is a key component in making the data actually usable in a practical way.
Putting all of the metadata in a centralized index is impractical because you now have to provide good data protection on the data and on the centralized index. Simply moving a file to a new location without updating the index could wreak havoc with someone's search.
Similarly, putting the metadata with the data using something like xattrs is not as desirable, because each data search means that all of the metadata has to be collected via a tree-walk and searched. While keeping the metadata with the data is very natural, it could easily result in much longer searches than necessary.
I believe the better solution is to combine aspects of both into a single solution. Leave the metadata with the data, but as the data is pushed into the collection/archive, you can extract the metadata for use in a centralized index. Then the index is used for data searching.
With a simple REST interface, it is fairly easy to keep the index up to date. But in the event that the index is lost or corrupted, it is possible to walk the tree structure of the collection/archive and recover the metadata.
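The recovery path can be sketched in a few lines of Python. This is only an illustration under assumptions: the metadata is read from xattrs on Linux, the index is a SQLite table whose layout is invented here, and the function names are hypothetical.

```python
import os
import sqlite3
from typing import Callable, Dict


def read_xattrs(path: str) -> Dict[str, bytes]:
    """Return the xattrs of a file, or {} if none can be read."""
    if not hasattr(os, "listxattr"):  # Linux-only API
        return {}
    try:
        return {name: os.getxattr(path, name) for name in os.listxattr(path)}
    except OSError:
        return {}


def rebuild_index(root: str,
                  read_meta: Callable[[str], Dict[str, bytes]] = read_xattrs
                  ) -> sqlite3.Connection:
    """Walk root and build a fresh metadata index from the files themselves."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meta (path TEXT, key TEXT, value BLOB, "
                 "PRIMARY KEY (path, key))")
    with conn:  # one transaction for the whole walk
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                conn.executemany(
                    "INSERT INTO meta (path, key, value) VALUES (?, ?, ?)",
                    [(path, k, v) for k, v in read_meta(path).items()])
    return conn
```

The walk is slow, but it only has to run when the index is lost or suspect; day-to-day searches become a query such as `SELECT path FROM meta WHERE key = ? AND value = ?` against the rebuilt index.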
Even if you don't agree with these concepts, at the very least I hope this article has caused some people to think about where they want to store their metadata and how it is used for searching. If possible, write about your solution and the path you took to get there. I'm sure there are many people who would like to learn from what you have done.