Gathering and Analyzing Metadata About Metadata: Page 2 - EnterpriseStorageForum.com


Fsscan takes two options, "-d" and "-o". The first specifies the root directory for the scan; you simply pass the full path to the starting directory. If you don't specify a directory, the code uses the current working directory (cwd). The second option specifies the name of the output pickle file. By default it uses "file.pickle" and writes it to the directory where the code is executed.
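To make the scan step concrete, here is a minimal sketch of what a scanner like fsscan could look like (the function name, record fields, and layout are my assumptions for illustration, not the actual fsscan source): walk the tree from the root directory, collect os.stat() metadata for each file, and pickle the result.

```python
# Illustrative sketch of a file tree scan (not the actual fsscan code):
# walk a directory tree, gather per-file stat() metadata, pickle the list.
import os
import pickle

def scan_tree(root, out="file.pickle"):
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or are unreadable mid-scan
            records.append({"path": path, "size": st.st_size,
                            "atime": st.st_atime, "mtime": st.st_mtime,
                            "ctime": st.st_ctime, "uid": st.st_uid,
                            "gid": st.st_gid})
    with open(out, "wb") as f:
        pickle.dump(records, f)  # one pickle file per tree scan
    return records
```

Passing a different `root` for each subtree is what lets you split a file system into pieces and scan them independently.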

The advantage of this scanning code is that you can break up a file system into a number of pieces and either scan each piece at the same time or scan them one at a time to reduce the load on the file system hardware.

Processing the Data

This is where I want to spend a majority of the explanation in this article, discussing how I process the file system metadata and what type of metadata I want to create from it. As an example of what one could do, I wrote a simple postprocessing code in Python as well because I want an easy way to create plots. The analysis code, which I call mdpostp (metadata post-processing), reads in a file that contains a list of the pickle files to be analyzed. It then reads each pickle file in turn, doing a statistical analysis on each one (recall that a single pickle file is a file tree scan). At this time the following aspects are analyzed by mdpostp:

  • Mtime age statistics, where mtime age is the time difference between when the analysis is run and the mtime (modify time) of the particular file. The age is presented in days.
    • The oldest file
    • The youngest file
    • The average file age
    • The standard deviation of mtime age
    • Intervals for file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the mtime age of all files
  • Ctime age statistics, where ctime age is the time difference between when the analysis is run and the ctime (change time) of the particular file. The age is presented in days.
    • The oldest file
    • The youngest file
    • The average file age
    • The standard deviation of ctime age
    • Intervals for file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the ctime age of all files
  • Ctime–Mtime time difference statistics (difference between the two times). The differences are presented in days.
    • The oldest file based on ctime-mtime
    • The youngest file based on ctime-mtime
    • The average file age based on ctime-mtime
    • The standard deviation of ctime-mtime
    • Intervals for ctime-mtime file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the ctime-mtime age of all files
  • Atime age statistics, where atime age is the time difference between when the analysis is run and the atime (access time) of the particular file. The age is presented in days.
    • The oldest file
    • The youngest file
    • The average file age
    • The standard deviation of atime age
    • Intervals for file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the atime age of all files
  • Largest files statistics:
    • The smallest file (in KB)
    • The largest
    • The average file size in KB
    • Intervals for file size (in KB)
    • A list of the Top 10 largest files (the number of files in the "top" list is controllable in the script).
    • A histogram of all the file sizes
  • Biggest users list. This is the Top 10 biggest users in terms of capacity (the number of files in the "top" list is controllable in the script).
  • Biggest group users list. This is the Top 10 biggest group users in terms of capacity (the number of files in the "top" list is controllable in the script).
  • Duplicate list. The analysis code will search the scan file for duplicate files. It determines if the files are the same by comparing the file name and the file size in bytes. If both match, the file is said to be a duplicate. The output lists the "root file," which is the first file in the list, and then the duplicate files that match the "root file."
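The age statistics above all follow the same pattern; a hedged sketch for the mtime case looks something like the following (the function name and record fields are assumptions, not the actual mdpostp source). Ages are in days, measured from the moment the analysis runs, so the largest age corresponds to the oldest file.

```python
# Sketch of mtime-age statistics over one scan (a list of stat records),
# assuming each record carries a "mtime" epoch timestamp and a "path".
import time

def mtime_age_stats(records, now=None):
    now = time.time() if now is None else now
    ages = [(now - r["mtime"]) / 86400.0 for r in records]  # seconds -> days
    n = len(ages)
    mean = sum(ages) / n
    var = sum((a - mean) ** 2 for a in ages) / n  # population variance
    top10 = sorted(records, key=lambda r: r["mtime"])[:10]  # oldest first
    return {"oldest": max(ages), "youngest": min(ages),
            "mean": mean, "stddev": var ** 0.5, "top": top10}
```

The ctime, atime, and ctime-mtime variants differ only in which timestamp (or timestamp difference) feeds the `ages` list.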

This information is just a starting point, but it gives me a good snapshot of what is happening in the file system before I dive in deeper. There are lots of other statistics we could develop, but those start to get more specific to your needs. I hope the Python scripts are easy to understand so that you can add your own spin on things.
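As one example of adding your own spin, the name-plus-size duplicate check described above can be sketched in a few lines (the helper name is an assumption): group records by (basename, file size), and any group with more than one entry lists the first file as the "root file" and the rest as its duplicates.

```python
# Sketch of the duplicate check: two files are "the same" if they share
# a file name and a size in bytes, per the article's definition.
import os
from collections import defaultdict

def find_duplicates(records):
    groups = defaultdict(list)
    for r in records:
        key = (os.path.basename(r["path"]), r["size"])
        groups[key].append(r["path"])
    # keep only the keys that matched more than one file
    return {k: v for k, v in groups.items() if len(v) > 1}
```

A stricter version could hash file contents, but that requires re-reading every candidate file rather than working from the scan alone.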

The code produces some output to stdout, but it also creates an HTML file that contains all of the same data, as well as some plots. The file, report.html, is written to a subdirectory, HTML_REPORT; just open it with your browser to read the report. While I like stdout for immediate results, I also like having an HTML file for a more detailed report.
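The report step can be sketched simply as well. This is a minimal illustration of writing report.html into an HTML_REPORT subdirectory as the article describes; the table layout and function name are my own, not mdpostp's actual markup.

```python
# Illustrative sketch: dump a dict of summary statistics into
# HTML_REPORT/report.html as a simple two-column table.
import os

def write_report(stats, outdir="HTML_REPORT"):
    os.makedirs(outdir, exist_ok=True)
    rows = "".join("<tr><td>%s</td><td>%.2f</td></tr>" % (k, v)
                   for k, v in stats.items())
    html = ("<html><body><h1>File system report</h1>"
            "<table>%s</table></body></html>" % rows)
    path = os.path.join(outdir, "report.html")
    with open(path, "w") as f:
        f.write(html)
    return path
```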

Let Loose the Hounds!

The point of this article is not to develop tools for analyzing the file system, but to actually start analyzing the file system. As an example of this, I wanted to analyze my home directory on my home system. It might not be very interesting in some respects because it's a single user, but I think it serves the purpose of exploring how one looks at file system information.

I have an external USB drive that I use for booting Linux on a laptop (I actually like it better than running Linux in a VM). I do some coding, article writing, etc. on the drive, so I will use it as my test example. I ran fsscan on my home directory and then processed it with mdpostp. The first bit of output from the analysis reveals that there were 54,036 files in my home directory (didn't know I had that many).

