Data Storage: How to Save Space Frugally - Page 3



Then notice that it says the file system size is about 11.88 MB. If you scan further in the output you will see that it found 115 duplicate files, 6,854 inodes, 190 fragments, 172 directories, and 1 ids (unique uid + gid). Also notice that it took about 10 seconds for the command to complete (adding real, user, and sys time).

I checked the size of the resulting squashfs file to see how much compression I achieved.

[laytonjb@test8 Documents]$ ls -lstarh /squashfs/laytonjb/Documents.sqsh
12M -rwx------. 1 root root 12M Oct 19 12:27 /squashfs/laytonjb/Documents.sqsh

The original directory used a total of 58MB. The compression ratio is about 4.83:1.

I then wanted to mount the squashfs file in place of the original directory. So I moved the original directory, "Documents", to "Documents.original" and then created the "Documents" directory again which will be the mount point for the squashfs file.

[laytonjb@test8 ~]$ mv Documents/ Documents.original
[laytonjb@test8 ~]$ mkdir Documents

Finally, I mounted the squashfs file and then checked if the data was actually there.

[root@test8 laytonjb]# mount -t squashfs /squashfs/laytonjb/Documents.sqsh /home/laytonjb/Documents -o loop
[root@test8 laytonjb]# mount
/dev/sda3 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext2 (rw)
/dev/sda5 on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/squashfs/laytonjb/Documents.sqsh on /home/laytonjb/Documents type squashfs (rw,loop=/dev/loop0)

All of the files were there (I checked them as a user rather than root but I didn't show the output because the output of "ls" is not that interesting).
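If you would rather not eyeball "ls" output, a recursive diff between the original directory and the mounted image is one way to confirm nothing was lost. The following is only a sketch: compare_trees is a hypothetical helper name, and it assumes a diff that supports the -r and -q options.

```shell
# compare_trees: hypothetical helper (not part of squashfs-tools).
# Prints "identical" when two directory trees hold the same files with the
# same contents, otherwise prints "differs" and returns non-zero.
compare_trees() {
    if diff -r -q "$1" "$2" >/dev/null; then
        echo "identical"
    else
        echo "differs"
        return 1
    fi
}

# Example call with the paths used in this article:
# compare_trees /home/laytonjb/Documents.original /home/laytonjb/Documents
```

Only delete the original directory after a check like this passes.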

As mentioned in a study, users may insist that their data be online all of the time, but in actuality a great deal of data is accessed very infrequently. This data is rarely used, yet the user insists that it stay online. A simple way to reduce the storage requirements of this type of data is to copy it to a subdirectory, create a squashfs file from it and mount it, and then symlink the original file and directory locations to the squashfs mount point. It can sound a little complicated, so let's walk through an example.

In my account I have a bin subdirectory that contains applications that I have built and installed into my account for various projects. I still use those binaries and I don't change them, so they are a perfect candidate for being put into a squashfs image. The process I will follow is the following:

  • Copy the subdirectories from /home/laytonjb/bin/ to /home/laytonjb/.ARCHIVE_data. The directory /home/laytonjb/.ARCHIVE_data is where I store the data before creating the squashfs image.
  • Create symlinks from subdirectories in /home/laytonjb/bin/ to /home/laytonjb/.ARCHIVE. /home/laytonjb/.ARCHIVE is the mount point for the squashfs image, so you want the original file/directory locations to point to the mount point.
  • Create the squashfs image and store it in /home/laytonjb/SQUASHFS/
  • Mount the squashfs image to /home/laytonjb/.ARCHIVE/
  • Check if all files are there
  • Erase /home/laytonjb/.ARCHIVE_data

I hope these steps clarify my intent but to summarize I want to take old data, copy to a specific location, symlink the original file and directory locations to the squashfs image mount point, create the squashfs image and mount it.

The first step is to create a storage directory and a mount point. I've chosen to use directories that begin with "." so they are not visible to a plain "ls", but you could use any directories you want.

[laytonjb@test8 ~]$ mkdir .ARCHIVE
[laytonjb@test8 ~]$ mkdir .ARCHIVE_data

The first directory is the squashfs mount point and the second one is where I store the data that goes into the squashfs file.

My bin subdirectory looks like the following.

[laytonjb@test8 ~]$ cd bin
[laytonjb@test8 bin]$ ls -s
total 28
4 hdf5-1.8.7-gcc-4.4.5    4 openmpi-1.4.5-gcc-4.4.5  4 open-mx-1.5.0-gcc-4.4.5 
4 zlib-1.2.5-gcc-4.4.5    4 netcdf-1.4.3-gcc-4.4.5  4 openmpi-1.5.4-gcc-4.4.5
4 parallware

Now I want to move all of these directories to /home/laytonjb/.ARCHIVE_data and replace each one with a symlink into /home/laytonjb/.ARCHIVE. An example of this for one directory is the following.

[laytonjb@test8 bin]$ mv hdf5-1.8.7-gcc-4.4.5/ ~/.ARCHIVE_data/
[laytonjb@test8 bin]$ ln -s ~/.ARCHIVE/hdf5-1.8.7-gcc-4.4.5 .

It's a pretty easy process that is very amenable to automation (cron job using bash or python).
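As a rough illustration of that automation, the loop below repeats the mv and ln -s pair for every real subdirectory. It is a minimal sketch rather than a finished tool: archive_dirs is a hypothetical helper name, and it assumes the staging directory and the mount point already exist and that no directory of the same name is already staged.

```shell
# archive_dirs: hypothetical helper (not part of squashfs-tools).
# Moves every real subdirectory of SRC into STAGE and leaves behind a
# symlink that points at the same name under MOUNTPOINT.
archive_dirs() {
    local src="$1" stage="$2" mountpoint="$3"
    local entry name
    for entry in "$src"/*/; do
        entry=${entry%/}                # strip the trailing slash
        [ -d "$entry" ] || continue     # the glob matched nothing
        [ -L "$entry" ] && continue     # already a symlink, so skip it
        name=$(basename "$entry")
        mv "$entry" "$stage/" || return 1
        ln -s "$mountpoint/$name" "$src/$name" || return 1
    done
}

# Example call with the paths used in this article:
# archive_dirs /home/laytonjb/bin /home/laytonjb/.ARCHIVE_data /home/laytonjb/.ARCHIVE
```

Until the squashfs image is created and mounted, the new symlinks are dangling, so anything that needs those directories will fail in the meantime.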

When you are finished the bin subdirectory should be populated with symlinks.

[laytonjb@test8 bin]$ ls -s
total 0
0 hdf5-1.8.7-gcc-4.4.5    0 openmpi-1.4.5-gcc-4.4.5  0 open-mx-1.5.0-gcc-4.4.5
0 zlib-1.2.5-gcc-4.4.5    0 netcdf-1.4.3-gcc-4.4.5   0 openmpi-1.5.4-gcc-4.4.5
0 parallware

The squashfs image is then created.

[laytonjb@test8 ~]$ time mksquashfs /home/laytonjb/.ARCHIVE_data /home/laytonjb/SQUASHFS/ARCHIVE.sqsh
Parallel mksquashfs: Using 4 processors
Creating 4.0 filesystem on /home/laytonjb/SQUASHFS/ARCHIVE.sqsh, block size 131072.
[============================================================================/] 4274/4274 100%
Exportable Squashfs 4.0 filesystem, data block size 131072
compressed data, compressed metadata, compressed fragments
duplicates are removed
Filesystem size 72740.63 Kbytes (71.04 Mbytes)
21.25% of uncompressed filesystem size (342305.09 Kbytes)
Inode table size 23311 bytes (22.76 Kbytes)
32.54% of uncompressed inode table size (71646 bytes)
Directory table size 19765 bytes (19.30 Kbytes)
43.35% of uncompressed directory table size (45599 bytes)
Number of duplicate files found 257
Number of inodes 1909
Number of files 1803
Number of fragments 128
Number of symbolic links  0
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 106
Number of ids (unique uids + gids) 1
Number of uids 1
laytonjb (500)
Number of gids 1
laytonjb (500)

real     0m20.382s
user     1m19.682s
sys      0m0.559s

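Notice that the creation time was very fast (about 20 seconds of wall-clock time; the "user" time of a little over a minute is CPU time summed across the four processors) and that of the 1,803 files, 257 were duplicates.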

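Mounting a squashfs image with the "mount" command normally requires root access, as in the earlier example. Fortunately there is a FUSE-based tool called squashfuse that allows us to mount squashfs images in user space, so root access is not needed. I used that tool to mount the ARCHIVE.sqsh image.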

[laytonjb@test8 ~]$ squashfuse /home/laytonjb/SQUASHFS/ARCHIVE.sqsh /home/laytonjb/.ARCHIVE
[laytonjb@test8 ~]$ mount
/dev/sda3 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext2 (rw)
/dev/sda5 on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
squashfuse on /home/laytonjb/.ARCHIVE type fuse.squashfuse (rw,nosuid,nodev,user=laytonjb)

We can check if the files are there pretty easily.

[laytonjb@test8 ~]$ cd bin
[laytonjb@test8 bin]$ cd openmpi-1.5.4-gcc-4.4.5/
[laytonjb@test8 openmpi-1.5.4-gcc-4.4.5]$ ls -s
total 0
0 bin  0 etc  0 include  0 lib  0 share

The files are definitely there so we know the process was successful.
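Checking every symlink by hand gets tedious, so a quick scan for dangling links can help. The sketch below assumes a find that supports -maxdepth (GNU, BSD, and busybox find do); dangling_links is a hypothetical helper name.

```shell
# dangling_links: hypothetical helper (not part of squashfs-tools).
# Prints every symlink directly inside DIR whose target does not
# currently exist, for example because the image is not mounted.
dangling_links() {
    find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}

# Example call with the path used in this article:
# dangling_links /home/laytonjb/bin
```

No output means every symlink in the directory resolves to something under the mount point.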

The goal was to reduce the amount of space the files used so let's check if this was successfully achieved.

[laytonjb@test8 ~]$ cd .ARCHIVE_data
[laytonjb@test8 .ARCHIVE_data]$ du -sh
339M     .
[laytonjb@test8 .ARCHIVE_data]$ ls -sh ~/SQUASHFS/ARCHIVE.sqsh
72M /home/laytonjb/SQUASHFS/ARCHIVE.sqsh

The compression ratio is about 4.7:1 which I believe is pretty good.
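The same ratio can be computed from the sizes that mksquashfs reported. This is just an illustrative sketch that assumes awk is available; ratio is a hypothetical helper name.

```shell
# ratio: hypothetical helper that prints ORIGINAL/COMPRESSED as "N.NN:1".
ratio() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.2f:1\n", orig / comp }'
}

# Uncompressed and compressed sizes in Kbytes, as reported by mksquashfs.
ratio 342305.09 72740.63    # prints 4.71:1
```

The result agrees with the "21.25% of uncompressed filesystem size" line in the mksquashfs output.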

Data Storage Squeeze

Considering that the world is creating more data and that we want to keep most or all of it, we need somewhere to store it. One way to help ourselves is to use efficient storage mechanisms such as compressed file systems. Several options are available, including file systems such as zfs or btrfs. There are also options that use FUSE for user-space file systems. One example, archivemount, was examined in this article.

The advantage of archivemount is that it can be controlled by users and doesn't need administrator intervention. Also, when the archive is unmounted the compressed archive is updated and saved to the underlying file system. However, the disadvantage of using archivemount is that no updates to the underlying compressed archive happen until it is unmounted. If anything happens to the archive while it is mounted then there is the definite possibility of losing data.

If you don't want to use a compressing file system, you can use read-only compressed images to achieve some of the same results. I bet that users have files that are old (using your definition of old) and haven't been accessed in a long time. It's very simple to compress these little-used files using squashfs and to symlink the original file and directory locations to the mounted image. This is a simple method to save capacity, and it can easily be scripted to look for older files, create the compressed image, mount it, and create the symlinks.
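As a starting point for such a script, the sketch below lists candidate directories by access time. It is only one possible approach: stale_dirs is a hypothetical helper name, the 365-day threshold is an arbitrary example, and it assumes the file system records access times (a mount with noatime will not give useful answers).

```shell
# stale_dirs: hypothetical helper (not part of squashfs-tools).
# Prints each immediate subdirectory of DIR in which every regular file
# was last accessed more than DAYS days ago. Empty directories also count
# as stale because they contain no recently accessed file.
stale_dirs() {
    local dir="$1" days="$2"
    local d recent
    for d in "$dir"/*/; do
        d=${d%/}                    # strip the trailing slash
        [ -d "$d" ] || continue     # the glob matched nothing
        [ -L "$d" ] && continue     # skip directories already symlinked
        # One file accessed inside the window disqualifies the directory.
        recent=$(find "$d" -type f ! -atime +"$days" -print | head -n 1)
        if [ -z "$recent" ]; then
            echo "$d"
        fi
    done
    return 0
}

# Example call: directories under bin untouched for more than a year.
# stale_dirs /home/laytonjb/bin 365
```

The directories it prints are candidates for the move, symlink, and squashfs steps shown earlier.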

Hopefully you've realized there are many ways to save capacity. With tight budgets you need to make the best use of what you have. You have to be frugal. Go forth and compress!

