Data Storage: How to Save Space Frugally
In general, people tend to keep all data. If you have been a system administrator, you know the constant stream of user requests for more space that results from this behavior. To help the situation, we need to revisit a time when we didn't have much storage – and start compressing data the way we did back then.
At the risk of revealing my age, when I first started working on computer systems, storage space was at a premium. This was the very beginning of the post-punch card era (I actually learned programming on punch cards), and magnetic storage (hard drives) was rare and had very limited capacity.
To save space we did everything possible in our code to save memory, and we kept what we stored to a bare minimum. For example, we would link our object files immediately and then erase them before running the executable. Then we would erase the executable after our run. If the application's output files were useful, we made a printout and kept the files; otherwise we erased them. We compressed every file that wasn't in use: input files, source code, output files, makefiles, and so on. It was the storage capacity jihad but it was very effective.
These habits die hard, and they came in handy when I moved from minicomputers to microcomputers (PCs), where we had very small hard drives and floppy drives. Early in the PC era, I compressed every file I could. Storage was not cheap, but I won't tell stories about my first giant 20MB hard drive or the boxes of cheap, mail-order floppies I used instead of hard drives. Even today, with very large 4TB drives readily available, I still feel the urge to save as much storage space as possible by erasing files and compressing the remaining ones if I'm not using them (hint: 7-Zip is your friend when you want to save as much space as possible).
I think a fundamental axiom of IT is that users never have enough space. If you've been a system administrator, you remember the constant stream of requests for more storage space. Even if you start with 1 PB of storage, it will fill up faster than you anticipated (nature abhors unused storage capacity?). At the same time, if you ask the users to maybe delete their older data, they tell you that the data is needed and they can't possibly erase it.
What do you do? One answer, and the one that is typically taken, is to just add more storage. But perhaps there are different methods to help the situation.
Compressed File Systems
One way that we can save space is to use a compressed file system. The idea is that the users don't have to consciously compress the files; the file system automatically compresses the data for them. There are several file system options, including btrfs, ZFS, and Reiser4, that can compress data as it is written to them and decompress it as it is read. There are also FUSE (Filesystem in Userspace) options such as fusecompress, avf, and to some degree archivemount.
A compressed file system compresses the data as it writes to the data blocks or extents and decompresses the data blocks or extents when they are read. There are many methods for doing this, including holding a large number of data blocks or extents before writing them so that you can get much larger compression ratios. But you have to pay careful attention to the file system's logs and you need larger buffers. This approach uses more computing resources than a non-compressed file system, but given the large number of cores on systems today – coupled with larger amounts of memory – this is probably not an issue (except maybe if you use an ARM processor). The details really depend upon the file system and the implementation.
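To give a feel for why buffering more data before compressing helps, here is a minimal sketch that compresses the same data two ways: as many independent 4 KB chunks, and as one large buffer. It uses gzip on generated sample text purely as a stand-in; the chunk size, the sample data, and the variable names are assumptions for illustration, and this is not how any particular file system implements compression.

```shell
# Sketch: many small compressed chunks vs. one large compressed buffer.
workdir=$(mktemp -d)

# Generate repetitive sample text (hypothetical data, roughly 1 MB).
i=0
while [ "$i" -lt 20000 ]; do
    echo "sample log line $i: status=OK user=alice path=/home/alice/data"
    i=$((i + 1))
done > "$workdir/sample.txt"

# Case 1: split into 4 KB pieces and compress each piece on its own,
# summing the compressed sizes.
split -b 4096 "$workdir/sample.txt" "$workdir/chunk."
chunked_size=0
for f in "$workdir"/chunk.*; do
    n=$(( $(gzip -c < "$f" | wc -c) ))
    chunked_size=$((chunked_size + n))
done

# Case 2: compress the whole file as a single buffer.
whole_size=$(( $(gzip -c < "$workdir/sample.txt" | wc -c) ))
raw_size=$(( $(wc -c < "$workdir/sample.txt") ))

printf 'raw bytes:                 %s\n' "$raw_size"
printf 'compressed as 4 KB chunks: %s\n' "$chunked_size"
printf 'compressed as one buffer:  %s\n' "$whole_size"

rm -rf "$workdir"
```

Each small chunk pays its own header cost and starts with an empty dictionary, so the single large buffer comes out smaller. The trade-off is the one described above: the larger the buffer, the more memory and bookkeeping the file system needs.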
An example of a file system that offers compression is btrfs. To use compression you use a mount option. For example you can use "compress=zlib" or "compress=lzo" or "compress=no". The last option disables compression. There is also a mount flag "compress-force=
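As a rough sketch of what mounting with compression looks like on the command line, the snippet below builds the mount command for a chosen method. Note that the btrfs mount option is spelled "compress=". The device /dev/sdb1, the mount point /mnt/data, and the RUN dry-run variable are assumptions for illustration only; by default the command is printed rather than executed, since a real mount needs root and a real btrfs device.

```shell
# Hypothetical device and mount point; substitute your own.
DEVICE="${DEVICE:-/dev/sdb1}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/data}"

# Dry run by default: the command is echoed, not executed.
# Set RUN="" (as root) to actually run it.
RUN="${RUN-echo}"

mount_compressed() {
    # $1 is the compression method: zlib, lzo, or no
    $RUN mount -t btrfs -o "compress=$1" "$DEVICE" "$MOUNTPOINT"
}

# With the defaults above this prints:
#   mount -t btrfs -o compress=lzo /dev/sdb1 /mnt/data
mount_compressed lzo
```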
Btrfs applies compression on a per-file basis. That is, a file is either compressed or not compressed, although internally the compression works on extents rather than whole files. Btrfs can handle file systems that have some files that are compressed and some that are not. It can also handle files that are compressed with either of the two methods (zlib or lzo). You can read about the file compression options at this FAQ.
To illustrate how a compressed file system might be used, I want to discuss archivemount. It is a FUSE-based file system that allows you to mount a compressed archive file such as a .tar.gz file and read and write data to it as though it were a directory tree. You can also create sub-directories, delete them, and even compress files within the compressed archive. While not strictly a compressed file system, it gives you a flavor of how one works.
Even better, since archivemount is a FUSE-based file system, users can mount and unmount it whenever they want or need to. This means less intervention by the system administrator. The example I will use is creating an archive, mounting it, adding files and directories, and then unmounting it. I'll do all of this as a user.
The first step is to create a compressed tar file. I'll create a tar file using a simple text file and then use bzip2 to compress it (I like the higher compression that bzip2 offers over gzip).
[laytonjb@test8 ~]$ tar cf NOTES.tar jeffy.out.txt
[laytonjb@test8 ~]$ bzip2 NOTES.tar
[laytonjb@test8 ~]$ ls -sh NOTES.tar.bz2
12K NOTES.tar.bz2
I created a simple archive using tar with a small text file I had ("jeffy.out.txt"). Then I compressed it using bzip2. The resulting file is 12KB in size.
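For readers who want to try this without the original file, here is a small sketch that repeats the two-step tar and bzip2 sequence on a generated stand-in file, and also shows the equivalent one-step form using tar's -j option. The generated contents and the NOTES2.tar.bz2 name are assumptions for illustration; everything happens in a temporary directory that is removed at the end.

```shell
workdir=$(mktemp -d)

# Stand-in for jeffy.out.txt (hypothetical contents).
i=0
while [ "$i" -lt 2000 ]; do
    echo "note $i: remember to compress this file"
    i=$((i + 1))
done > "$workdir/jeffy.out.txt"

# Two-step form: create the tar file, then compress it.
tar -cf "$workdir/NOTES.tar" -C "$workdir" jeffy.out.txt
bzip2 "$workdir/NOTES.tar"      # replaces NOTES.tar with NOTES.tar.bz2

# One-step form: -j asks tar to run the archive through bzip2.
tar -cjf "$workdir/NOTES2.tar.bz2" -C "$workdir" jeffy.out.txt

# List the contents of each archive without extracting anything.
listing=$(tar -tjf "$workdir/NOTES.tar.bz2")
listing2=$(tar -tjf "$workdir/NOTES2.tar.bz2")
echo "$listing"
echo "$listing2"

# Show the on-disk size of both archives.
ls -sh "$workdir"/NOTES*.tar.bz2

rm -rf "$workdir"
```

Both forms should produce an archive holding the same single file; the one-step form just saves a command.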
I'm using a fresh CentOS 6.4 system where I installed fuse, fuse-devel, libarchive, and libarchive-devel. I then downloaded the latest version of archivemount which is 0.8.2, coincidentally uploaded the day I wrote this article – it’s so fresh I can smell the electrons.
I built and installed archivemount which is a pretty easy sequence of "./configure", "make", and "make install". Then I created a mount point in my account and mounted the .tar.bz2 archive file.
[laytonjb@test8 ~]$ mkdir ARCHIVE
[laytonjb@test8 ~]$ archivemount NOTES.tar.bz2 ARCHIVE
fuse: missing mountpoint parameter
[laytonjb@test8 ~]$ mount
/dev/sda3 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext2 (rw)
/dev/sda5 on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
archivemount on /home/laytonjb/ARCHIVE type fuse.archivemount (rw,nosuid,nodev,user=laytonjb)
I'm not sure why the "fuse: missing mountpoint parameter" message appears when the archive is mounted, but it didn't stop the archive from being mounted, as you can see in the last line of the mount output.
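Because the message printed at mount time can be misleading, it may be more reliable to check the mount table directly. The sketch below does that and then unmounts the archive with fusermount -u, the usual way for a regular user to unmount a FUSE file system. The helper name is_archive_mounted and the ARCHIVE_FILE and MOUNT_DIR variables are hypothetical; the grep pattern is based on the "type fuse.archivemount" entry in the mount output above. If archivemount or the archive file is not present, the script does nothing beyond printing a note.

```shell
ARCHIVE_FILE="${ARCHIVE_FILE:-NOTES.tar.bz2}"
MOUNT_DIR="${MOUNT_DIR:-ARCHIVE}"

# Succeeds only if the directory appears in the mount table as a
# fuse.archivemount file system.
is_archive_mounted() {
    dir=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    mount | grep -qF " on $dir type fuse.archivemount "
}

if command -v archivemount >/dev/null 2>&1 && [ -f "$ARCHIVE_FILE" ]; then
    mkdir -p "$MOUNT_DIR"
    archivemount "$ARCHIVE_FILE" "$MOUNT_DIR"
    if is_archive_mounted "$MOUNT_DIR"; then
        echo "mounted: $MOUNT_DIR"
        ls "$MOUNT_DIR"
        fusermount -u "$MOUNT_DIR"    # unmount as a regular user
    else
        echo "mount of $ARCHIVE_FILE did not appear in the mount table"
    fi
else
    echo "archivemount or $ARCHIVE_FILE not available; nothing mounted"
fi
```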