Now we need functions that allow us to do byte-range IO to the file. You can probably imagine the first IO functions, which are "read" and "write". Below is an excerpt of the POSIX read function manpage for Linux:
ssize_t read(int fd, void *buf, size_t count);
read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf.
On files that support seeking, the read operation commences at the current file offset, and the file offset is incremented by the number of bytes read. If the current file offset is at or past the end of file, no bytes are read, and read() returns zero.
If count is zero, read() may detect the errors described below. In the absence of any errors, or if read() does not check for errors, a read() with a count of 0 returns zero and has no other effects.
If count is greater than SSIZE_MAX, the result is unspecified.
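Because read() may return fewer bytes than requested, callers that need an exact byte range usually loop. The following is a minimal sketch of that pattern, not part of the man page: read_full is a hypothetical helper name, and retrying on EINTR is an assumption about how the caller wants interrupted calls handled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a POSIX function): keep calling read() until
 * count bytes have arrived, end of file is reached, or an error occurs.
 * Returns the number of bytes stored in buf, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;
    char *p = buf;

    assert(buf != NULL || count == 0);

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n == 0)               /* at or past end of file: nothing more to read */
            break;
        if (n < 0) {
            if (errno == EINTR)   /* interrupted before any data arrived: retry */
                continue;
            return -1;
        }
        total += (size_t)n;       /* the file offset advanced by n as well */
    }
    return (ssize_t)total;
}
```

A return value smaller than count here means end of file was reached, which matches the man page's statement that read() returns zero at or past the end of the file.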
In addition to the read() function, there should also be a write() function. Below is an excerpt of the Linux man page for the write() function:
ssize_t write(int fd, const void *buf, size_t count);
write() writes up to count bytes from the buffer pointed to by buf to the file referred to by the file descriptor fd.
The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)
For a seekable file (i.e., one to which lseek(2) may be applied, for example, a regular file) writing takes place at the current file offset, and the file offset is incremented by the number of bytes actually written. If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.
POSIX requires that a read(2) which can be proved to occur after a write() has returned returns the new data. Note that not all file systems are POSIX conforming.
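Since write() can also stop short, a caller that must get every byte out typically wraps it in a loop. The sketch below illustrates that idea under stated assumptions: write_all is a hypothetical helper name, and treating a zero-byte result as an error is a design choice made here to avoid spinning, not something the man page mandates.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a POSIX function): loop until all count bytes
 * have been written, because a single write() may write fewer.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    assert(buf != NULL || count == 0);

    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)   /* interrupted before writing anything: retry */
                continue;
            return -1;
        }
        if (n == 0)               /* no progress: give up rather than loop forever */
            return -1;
        p += n;                   /* skip past the bytes already written */
        count -= (size_t)n;
    }
    return 0;
}
```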
At this point there are open() and close() functions for an object, and there are also read() and write() functions. The last function needed to complete basic byte access is lseek(), which moves the file pointer to a new position. This IO function is both a curse and a blessing. The blessing is that it is very useful for byte-range access within a file and for dealing with complicated data patterns.
An excerpt of the man page for lseek for Linux is below:
off_t lseek(int fd, off_t offset, int whence);
The lseek() function repositions the offset of the open file associated with the file descriptor fd to the argument offset according to the directive whence as follows:
- SEEK_SET - The offset is set to offset bytes.
- SEEK_CUR - The offset is set to its current location plus offset bytes.
- SEEK_END - The offset is set to the size of the file plus offset bytes.
The lseek() function allows the file offset to be set beyond the end of the file (but this does not change the size of the file). If data is later written at this point, subsequent reads of the data in the gap (a "hole") return null bytes ('\0') until data is actually written into the gap.
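The hole behavior described above can be observed with a few calls. This is an illustrative sketch only: first_hole_byte is a hypothetical function name, the path is supplied by the caller, and the file is assumed to live on a file system that follows the POSIX semantics quoted here.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: write one byte, seek gap bytes past it, write another
 * byte, then read back the first byte of the gap.
 * Returns that byte's value (expected to be 0), or -1 on any failure. */
int first_hole_byte(const char *path, off_t gap)
{
    unsigned char c = 0xFF;
    int result = -1;

    assert(path != NULL && gap > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, "A", 1) == 1 &&                /* offset is now 1 */
        lseek(fd, gap, SEEK_CUR) == 1 + gap &&   /* move beyond end of file */
        write(fd, "B", 1) == 1 &&                /* file size becomes gap + 2 */
        lseek(fd, 0, SEEK_END) == gap + 2 &&     /* SEEK_END reports the new size */
        lseek(fd, 1, SEEK_SET) == 1 &&           /* back to the start of the hole */
        read(fd, &c, 1) == 1)
        result = c;                              /* a byte inside the hole */

    close(fd);
    return result;
}
```

The three whence values each appear once: SEEK_CUR to jump past the end, SEEK_END to check the resulting size, and SEEK_SET to return to an absolute position.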
With these few IO functions, the world of byte-range access to object files is opened. Now you have the option of reading only part of an object instead of having to GET the entire file. This opens a vast new set of possibilities.
For example, an application might need to read the first n bytes of a number of files to gather information and present it to the user. With a pure object storage system, every file must first be downloaded in full from the object store to a POSIX file system and then read there. That is a tremendous amount of data movement. With an object-POSIX combination, the files could remain in the object store and only the first n bytes of each file would be read.
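On the POSIX side, the "first n bytes" pattern might look like the sketch below. It does not describe any particular object store: read_head is a hypothetical helper, and it assumes the combined system exposes each object under an ordinary path that open() and read() can use.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read at most n bytes from the start of the file at
 * path into buf. Returns the number of bytes read (fewer than n if the
 * file is shorter), or -1 on error. */
ssize_t read_head(const char *path, void *buf, size_t n)
{
    size_t total = 0;
    char *p = buf;

    assert(path != NULL && (buf != NULL || n == 0));

    int fd = open(path, O_RDONLY);   /* a fresh descriptor starts at offset 0 */
    if (fd < 0)
        return -1;

    while (total < n) {
        ssize_t got = read(fd, p + total, n - total);
        if (got == 0)                /* file is shorter than n bytes */
            break;
        if (got < 0) {
            if (errno == EINTR)      /* interrupted: retry */
                continue;
            close(fd);
            return -1;
        }
        total += (size_t)got;
    }

    close(fd);
    return (ssize_t)total;
}
```

Calling read_head once per file touches only n bytes of each, which is the access pattern the example above relies on.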
You can actually put numbers to this example. Imagine having 100 files that are 100MB each. If you need to read the first 16KB of each file, then you are only reading a total of 1.6MB of data. However, if the files are stored in an object store, you first have to copy 10GB of data from the object store to a POSIX file system and then read the 1.6MB of data. That's 1.6MB versus 10,001.6MB, a difference of nearly four orders of magnitude.
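The arithmetic can be checked with a few lines of code. This is only a back-of-the-envelope sketch: the function names are hypothetical, and decimal units are assumed (1KB = 1,000 bytes, 1MB = 1,000,000 bytes) so that 100 reads of 16KB come to 1.6MB.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved when only the first head_bytes of each file are read in place. */
uint64_t bytes_read_in_place(uint64_t files, uint64_t head_bytes)
{
    assert(files == 0 || head_bytes <= UINT64_MAX / files);  /* product fits */
    return files * head_bytes;
}

/* Bytes moved when every file is first copied in full and its head is then read. */
uint64_t bytes_copy_then_read(uint64_t files, uint64_t file_bytes,
                              uint64_t head_bytes)
{
    assert(files == 0 || file_bytes <= UINT64_MAX / files);  /* product fits */
    return files * file_bytes + bytes_read_in_place(files, head_bytes);
}
```

For 100 files of 100MB each and 16KB heads, the first function gives 1,600,000 bytes and the second gives 10,001,600,000 bytes, a ratio of 6,251.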
This is a wonderful result of adding byte-range functions to object storage: the amount of data accessed or touched is much smaller than with a pure object storage solution.
The Case for Combining Object Storage and POSIX Storage
People who use object storage want it to behave more like POSIX storage, but they also want to keep the storage costs at an object level and improve the performance. People who use POSIX file systems like the simplicity of object storage systems and also want the price to come down to object storage levels. In other words, both sides want the rainbow unikitty butterfly.
I think it's important to consider what these two groups want. Having an object storage system that allows byte-range access is very appealing. It means that rewriting applications to access object storage becomes a far easier task. It can also mean that the amount of data touched when reading just a few bytes of a file is greatly reduced (by several orders of magnitude).
Think about it. Conceptually the idea has great appeal. Because I'm not a file system developer I can't work out the details, but the end result could be something amazing.