To solve this problem, some people have suggested that there are ways to automate the metadata tagging of data. For example, if you take an HDF5 data file, you can scan the data file for the headers and then use them as metadata. However, I think it's fairly obvious that would not work. The headers may only have meaning to some researchers and not to others. Or there may not even be useful headers in the data file.
Despite the desire of some people to believe that you can scan a data file to gather metadata, it's just not possible. There is no magic pixie dust to sprinkle on the file to generate the metadata. It will require human intervention to create the tags.
Motivation - Carrots, Sticks, and Games
Researchers aren't motivated to tag their data files, but without their help, the data will quickly become worthless. As a result, we have a dilemma.
But there are projects trying to solve this problem. One of them is a basic carrot-and-stick model. You reward good behavior, such as adding metadata to files, with a carrot, and you punish bad behavior, such as not adding metadata to the file, with a stick. A carrot could be more allocation on the HPC system or more storage space, or even something as basic as a monetary reward or a reward of food. (Having been a graduate student, I can safely say that we would do anything for free food.)
In the case where files are not tagged, a stick could be used. This could be a reduction in HPC allocation or a reduction in storage space.
There are variations on the carrot and stick theme, such as using just a carrot approach or using just a stick approach. The efficacy of using this approach in your research groups is really up to you, but the approach has been studied before and can lead to interesting results.
Another, more recent, approach is called gamification. Gamification is fairly easy to understand: take the techniques that keep people engaged in games and apply them to non-game problems. My favorite example of this is using the game Doom to kill system processes. The idea is that your character in Doom kills system processes the same way it would kill zombies (or whatever they are in Doom).
An additional example is that of Stack Overflow. It allows people to post questions, and then other people provide answers. Based on the answers and how they are disseminated, people answering questions gain "points" and/or "badges." These can be earned in a variety of ways, such as simply posting links to the questions and answers via Facebook and Twitter. Moreover, other people can rate the answers, and higher ratings earn the author more "status" within the community. As a user's reputation points go past certain levels, they gain additional privileges. At the higher end, they get the privilege of moderating the Stack Overflow site.
The challenge in gamification becomes how to use these techniques, among others, to motivate people to tag their data files. Ideally, you want to employ the technique(s) while the researchers are creating the data. Otherwise, you end up using a large stick approach at the end of the project (e.g., metadata tag your data or you can't move on to the next project).
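To make the points-and-privileges idea concrete, here is a minimal sketch of how a site might award reputation at the moment a researcher saves a tagged file. Everything in it is an assumption for illustration: the `Researcher` class, the point values, and the privilege thresholds are hypothetical and are not taken from Stack Overflow or any existing tool.

```python
from dataclasses import dataclass

# Hypothetical point values and thresholds; tune these for your own site.
POINTS_PER_TAGGED_FILE = 10
POINTS_PER_REVIEW = 5
PRIVILEGES = [
    (0, "tag own files"),
    (50, "review others' tags"),
    (200, "moderate tag vocabulary"),
]


@dataclass
class Researcher:
    name: str
    points: int = 0

    def record_tagged_file(self):
        """Award points when a file is saved with metadata attached."""
        self.points += POINTS_PER_TAGGED_FILE

    def record_review(self):
        """Award points for reviewing someone else's metadata."""
        self.points += POINTS_PER_REVIEW

    def privileges(self):
        """Return every privilege whose threshold has been reached."""
        return [name for threshold, name in PRIVILEGES if self.points >= threshold]


if __name__ == "__main__":
    alice = Researcher("alice")
    for _ in range(5):
        alice.record_tagged_file()
    print(alice.points, alice.privileges())
```

Awarding the points when the file is written, rather than at the end of the project, is the design choice that matters here. The numbers themselves are placeholders.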
You also have to be very careful about using any of these techniques because it could encourage people to just enter anything for the metadata, which doesn't help anyone. In some ways, I think of this as "metadata malware" because you are doing damage to the search and ultimately any analysis done with the search data.
Consequently, you need to have a review system that checks the quality of the metadata. This could be a part of the gamification process as well. This checking process would result in something like Wikipedia where there are people who check entries and tag them if there are flaws, all in the interest of improving the quality of the entry.
I think checking the quality of the metadata could be a much more important task than one might think. For example, if the metadata is incorrect and someone uses it, it could invalidate their search or any conclusions drawn from it. The only way around this is to check every piece of data you use prior to using it, which defeats the purpose of creating metadata tags.
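A human review step can be preceded by a cheap automated screen that catches the most obvious "enter anything" entries. The sketch below is only that, a sketch: the required keys, the placeholder list, and the minimum description length are hypothetical choices, and passing the screen says nothing about whether the tags are actually correct. That judgment still needs a person.

```python
# Hypothetical schema and thresholds, for illustration only.
REQUIRED_KEYS = ("project", "instrument", "description")
PLACEHOLDERS = {"n/a", "na", "none", "todo", "tbd", "test", "asdf", "xxx"}
MIN_DESCRIPTION_WORDS = 3


def screen_metadata(tags):
    """Return a list of problems; an empty list means 'send on to human review'."""
    problems = []
    for key in REQUIRED_KEYS:
        value = str(tags.get(key, "")).strip()
        if not value:
            problems.append(f"missing required tag: {key}")
        elif value.lower() in PLACEHOLDERS:
            problems.append(f"placeholder value for {key}: {value!r}")

    # A description that is present and not a placeholder can still be too thin.
    description = str(tags.get("description", "")).strip()
    if (
        description
        and description.lower() not in PLACEHOLDERS
        and len(description.split()) < MIN_DESCRIPTION_WORDS
    ):
        problems.append("description is too short to be searchable")
    return problems


if __name__ == "__main__":
    junk = {"project": "wing-cfd", "instrument": "asdf", "description": "run"}
    for problem in screen_metadata(junk):
        print(problem)
```

For the `junk` entry above, the screen reports a placeholder value for `instrument` and a description that is too short. Files that fail the screen can go back to the researcher immediately, while those that pass go on to the reviewers.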
Fundamentally, metadata is difficult. You have to begin by designing and testing what data needs to be saved as metadata for a particular file, and this takes time and experimentation. You may have to develop tools as part of the process to help users easily add metadata to the files, again adding time. And finally, you have to develop the process where users create metadata tags for their files. This process includes quality checking as well.
Moreover, you need to find ways of motivating researchers to tag their data. The general consensus seems to be that asking researchers to go through all of their data at the end of a project, just to tag it with metadata, is an exercise in frustration for everyone involved. Consequently, we need some motivation techniques to help researchers keep up with their metadata tagging as they create and use the data. In my opinion, you may even need a psychiatrist to help with these steps.
There is no magic pixie dust that can be spread on the metadata tagging process to make it better. There is no magic way to ensure that the creation of useful metadata happens automatically. And finally, there is no magic that can make this all go away.
You can't avoid the need for adding meaningful, useful and accurate metadata tags to files. And that means you have to put in the effort.