You may have heard that metadata is data about data (if that makes any sense). But how about metadata about metadata (data about data about data)?
Actually, this is an important topic: understanding and monitoring how and when files are created, used, modified, and removed. This information can tell you a great deal about what's happening with your data.
Quick questions: Can you quickly tell me the average file size on your storage system? Can you tell me the average file age?
Being able to answer these questions quickly can help you determine whether you need new storage, what type of storage you need, and whether you should archive old data, move it to something cheaper, or perhaps just get rid of it (I know, heresy, right?). Gathering information about the files on a file system, that is, gathering metadata about the metadata, is a critical and very neglected task in the storage world.
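As a quick illustration of how you might answer those two questions, here is a minimal sketch that walks a file tree and computes both averages. The function name avg_size_and_age is my own for illustration, not part of any tool discussed below:

```python
import os
import time

def avg_size_and_age(root):
    """Walk a file tree; return (average size in bytes, average age in days)."""
    sizes, ages = [], []
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip broken symlinks, permission errors
            sizes.append(st.st_size)
            ages.append((now - st.st_mtime) / 86400.0)  # seconds -> days
    if not sizes:
        return 0.0, 0.0
    return sum(sizes) / len(sizes), sum(ages) / len(ages)
```

Age here is measured from the modify time (mtime); you could just as easily use the access time, depending on what "old" means for your site.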
Suppose you are responsible for the budget for data storage. An admin or a user comes to you and says they are running out of space or that the storage performance is slow and they need new storage. I don't know about you, but I would want them to explain why. That means they need to explain what's going on with the storage. A friend of mine calls it "showing your math." If you are the administrator, you will need to present some understandable and useful information. For example:
- How quickly is the capacity being consumed? Be sure to show this with some detail (not just a global view). For example, a simple chart that illustrates which user or group of users is consuming capacity the fastest is important.
- How much "old" data is there? (Maybe create a histogram showing the age of the files.)
- Who has the oldest files? (histogram by user and/or group)
- If you want a faster storage solution, can you show why you need one? Maybe you could show that you have a large number of files, lots of them are really small or really large, etc.
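To make the age histogram idea above concrete, here is a rough sketch. The bucket boundaries are arbitrary examples (adjust them to your retention policy), and age_histogram is a hypothetical helper name:

```python
import os
import time
from collections import Counter

# Example age buckets in days -- purely illustrative boundaries.
BUCKETS = [(30, "< 30 days"), (180, "30-180 days"), (365, "180-365 days")]

def age_histogram(root):
    """Count files per age bucket, using mtime as the file's age."""
    hist = Counter()
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files we cannot stat
            age_days = (now - st.st_mtime) / 86400.0
            for limit, label in BUCKETS:
                if age_days < limit:
                    hist[label] += 1
                    break
            else:
                hist["> 365 days"] += 1  # older than the last bucket
    return hist
```

The same loop could key the Counter on the file's owner instead of (or in addition to) the age bucket to produce the per-user breakdowns mentioned above.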
Plus, if I were responsible for the purse strings for new storage, I would not want to be blindsided by such a request at the last minute. I would like to have some idea that the request is coming so I can prep my management and also prep the funding channels. Therefore, it is a good idea to present this information (trends) to management over time, maybe with a quick presentation once a quarter or so.
There are some commercial tools that can do some of this. For example, SpaceObServer can provide a great deal of metadata about Windows file systems, with the ability to sort and view it. But it can't capture everything and isn't flexible enough (plus it doesn't handle non-Windows file systems).
There are some open-source tools that might help as well. For example, fsstats is a Perl script that walks file trees and gathers some file system information and presents it to the user with a breakdown of the file data.
These tools, while providing some information, didn't provide everything I wanted in a way that was useful to me. Therefore, I decided to write my own tools so that I can understand a bit more about my data. My goal is to help you learn a bit about how important metadata of your metadata can be.
Tools, Tools, Tools
I chose to divide the tool into two pieces. The first one just walks a file tree and gathers metadata. The second piece takes two or more of these collections and performs a statistical analysis on them and creates a report. These are not intended to be final tools for production use, but rather starting points for developing your own tools or your own approach.
I've chosen to write the tools in Python because there is a wide variety of libraries, modules and tools that can be used, making my life easier. A few years ago, I wrote a simple tool, FS_scan, that would walk a file tree, gather the file statistics, and create a Comma Separated Values (CSV) file that you could use in a spreadsheet. I'm going to start with that tool but separate the fundamental tasks of (1) gathering the data, and (2) processing the scan(s) for statistical information. The tools can be downloaded from this page.
Just a quick note about the tools and the programming style. I'm sure my Python coding style isn't really what is considered "pythonic." I've tried to use more modern features of Python such as iterators, but my overall style is not typical of Python developers. Plus, I use comments to indicate the end of a loop or an if statement. I've found that these help my coding. Comments and suggestions about the coding are always appreciated.
Gathering the Data
I won't spend too much time on the data gathering tool since much of the detail is in the older article. But for completeness' sake, let me explain a bit about it.
Python has a module called os that allows you to walk a file system and also gather "stat()" information on the file using the os.stat function (method). As a result, the tool can easily gather the following information:
- The size of the file in bytes
- The ctime of the file (change time)
- The mtime of the file (modify time)
- The atime of the file (access time)
- The uid and gid of the file
The three "times" are output as seconds since the epoch but are easy enough to convert to something more meaningful using the Python time module.
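Putting those pieces together, the gathering loop might look something like this sketch (scan_tree is an illustrative name, not the actual tool):

```python
import os

def scan_tree(root):
    """Walk a file tree and collect the per-file metadata listed above."""
    records = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            records.append({
                "path": path,
                "size": st.st_size,    # size in bytes
                "ctime": st.st_ctime,  # change time, seconds since the epoch
                "mtime": st.st_mtime,  # modify time
                "atime": st.st_atime,  # access time
                "uid": st.st_uid,
                "gid": st.st_gid,
            })
    return records

# time.ctime(record["mtime"]) from the time module converts the epoch
# seconds into a human-readable timestamp string.
```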
The uid and gid are converted to "real" names, if possible, using the Python module/functions, pwd.getpwuid and grp.getgrgid. I like to do this because if you just store the numeric values and process the files on a different system, you may get different mappings from uid/gid to actual names. So I like to do the uid/gid "decoding" before I store the data.
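A sketch of that decoding step, for Unix-like systems (the pwd and grp modules are not available on Windows). Falling back to the numeric value when no name exists is my own assumption about sensible behavior:

```python
import pwd
import grp

def decode_ids(uid, gid):
    """Map numeric uid/gid to names, falling back to the number if unknown."""
    try:
        user = pwd.getpwuid(uid).pw_name
    except KeyError:
        user = str(uid)  # no matching passwd entry on this system
    try:
        group = grp.getgrgid(gid).gr_name
    except KeyError:
        group = str(gid)  # no matching group entry
    return user, group
```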
The data gathering code, which I refer to as fsscan, stores the data in what is called a pickle file. Pickling takes Python objects and converts them to byte streams for writing; reading the byte stream back is just the reverse. Pickling allows you to take Python data structures and write them to a file with a single function call.
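A minimal sketch of that pickling step (save_scan and load_scan are illustrative names):

```python
import pickle

def save_scan(records, filename):
    """Write the collected metadata to a pickle file in one call."""
    with open(filename, "wb") as f:
        pickle.dump(records, f)

def load_scan(filename):
    """Read the metadata back; the exact reverse of save_scan."""
    with open(filename, "rb") as f:
        return pickle.load(f)
```

One caveat worth remembering: pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.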