Storage Technology in Depth - DAFS Page 2
VM System Support
The machine-dependent physical mapping (pmap) module maintains the virtual-to-physical address mappings and page access rights used by the main CPU's memory-management hardware. Low-level machine-independent kernel code, such as the buffer cache, kernel malloc, and the rest of the VM system, uses pmap to add or remove address mappings and to alter page access rights.
Symmetric multiprocessor (SMP) systems that share main memory can use a single pmap module as long as the translation lookaside buffers (TLBs) on each CPU are kept consistent. Pmap operations apply to page tables shared by all CPUs. A TLB miss exception raised by a CPU results in a lookup for the mapping in the shared page tables. Invalidations of mappings are applied to all CPUs.
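The SMP behavior described above can be illustrated with a toy model. This is a minimal sketch, not kernel code: the class names (`Pmap`, `CPU`) and the method names (`enter`, `protect`, `remove`) are hypothetical, and Python dictionaries stand in for the page tables and TLBs. It only shows the consistency rule: misses are served from the shared page table, and every change to a mapping is invalidated on all CPUs.

```python
class CPU:
    """One processor with a private TLB backed by the shared page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # shared dict: vpn -> (ppn, rights)
        self.tlb = {}                 # private cache of translations
        self.misses = 0

    def translate(self, vpn):
        if vpn not in self.tlb:
            # TLB miss: look the mapping up in the shared page table.
            # A KeyError here models a fault on an unmapped page.
            self.misses += 1
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn]


class Pmap:
    """Single pmap instance serving all CPUs (hypothetical interface)."""

    def __init__(self, ncpus):
        self.page_table = {}
        self.cpus = [CPU(self.page_table) for _ in range(ncpus)]

    def _invalidate_all(self, vpn):
        # Invalidations are applied to every CPU's TLB.
        for cpu in self.cpus:
            cpu.tlb.pop(vpn, None)

    def enter(self, vpn, ppn, rights):
        self.page_table[vpn] = (ppn, rights)
        self._invalidate_all(vpn)

    def protect(self, vpn, rights):
        ppn, _ = self.page_table[vpn]
        self.page_table[vpn] = (ppn, rights)
        self._invalidate_all(vpn)

    def remove(self, vpn):
        self.page_table.pop(vpn, None)
        self._invalidate_all(vpn)
```

After `protect` or `remove`, no CPU can keep using a stale cached translation, because the entry is dropped from every TLB before the call returns.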
Memory-to-memory NICs store virtual-to-physical address translations and access rights for all user and kernel memory regions that the NIC can address and access directly. Main CPUs use their on-chip translation lookaside buffer (TLB) to translate virtual to physical addresses. A typical TLB entry includes the physical page number along with a number of bits, such as valid (V) and access control (ACC), which indicate whether the page translation is valid and what the access rights to the page are. A miss on a TLB lookup requires a page table lookup in main memory. NICs on the Peripheral Component Interconnect (PCI) bus, or another I/O bus, have their own translation and protection tables (TPTs). Each TPT entry includes bits enabling RDMA Read or Write operations on the page (i.e., the W bit in the diagram), the physical page number, and a Ptag value identifying the process that owns the page (or the kernel). Whereas the TLB is a high-speed associative memory, the TPT is usually implemented as a dynamic random access memory (DRAM) module on the NIC board. To accelerate lookups in the TPT, remote memory access requests carry a Handle index that helps the NIC find the right TPT entry.
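The TPT lookup can be sketched as a table indexed directly by the handle. This is a simplified model under stated assumptions: the names `TptEntry`, `Tpt`, `register`, and `check` are hypothetical, one entry covers one page, and the handle is taken to be the entry's index. A real NIC keeps this table in on-board memory and performs the check in hardware.

```python
from dataclasses import dataclass


@dataclass
class TptEntry:
    ppn: int          # physical page number
    ptag: int         # protection tag of the owning process (or the kernel)
    rdma_read: bool   # RDMA Read enabled on this page
    rdma_write: bool  # RDMA Write enabled on this page (the W bit)


class Tpt:
    """Translation and protection table, indexed by handle (assumed layout)."""

    def __init__(self):
        self.entries = []

    def register(self, ppn, ptag, rdma_read, rdma_write):
        """Add a page and return the handle that later requests must carry."""
        self.entries.append(TptEntry(ppn, ptag, rdma_read, rdma_write))
        return len(self.entries) - 1

    def check(self, handle, ptag, write):
        """Validate a remote access and return the physical page number."""
        if not 0 <= handle < len(self.entries):
            raise PermissionError("unknown handle")
        entry = self.entries[handle]  # direct index: no associative search
        if entry.ptag != ptag:
            raise PermissionError("protection tag mismatch")
        if not (entry.rdma_write if write else entry.rdma_read):
            raise PermissionError("operation not enabled on this page")
        return entry.ppn
```

Because the request carries the handle, the lookup is a single indexed access instead of a search, which matters when the table lives in DRAM rather than in associative memory.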
Buffer Cache Locking
In an RDMA-based data transfer, the server sets up the RDMA transfer in the context of the requesting RPC. Once issued, the RDMA proceeds asynchronously to the RPC, which does not wait for RDMA completion. To serialize concurrent access to shared files in the face of this asynchrony, the vnode (vp) of a file needs to be locked for the duration of the RPC, while the data buffers (bp's) being transferred need to be locked for the full duration of the RDMA. Locking the vp (i.e., the entire file) for the duration of the RDMA would also work, but it would limit performance when files are shared, since requests for non-overlapping regions of a file would have to be serialized.
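The difference in lock lifetimes can be sketched as follows. This is an illustrative model only: `CachedFile`, `serve_read_rpc`, and `rdma_complete` are hypothetical names, `threading.Lock` objects stand in for the kernel's vnode and buffer locks, and a request that finds a buffer busy simply fails instead of blocking or being queued.

```python
import threading


class CachedFile:
    def __init__(self, nblocks):
        self.vnode_lock = threading.Lock()  # whole-file lock (vp)
        self.buf_locks = [threading.Lock() for _ in range(nblocks)]  # bp's


def serve_read_rpc(f, blocks):
    """Set up an RDMA for the given blocks; return them, or None if busy.

    The vnode lock is held only while the RPC runs. The buffer locks stay
    held after this function returns, until rdma_complete() is called.
    """
    with f.vnode_lock:
        held = []
        for b in sorted(blocks):
            if not f.buf_locks[b].acquire(blocking=False):
                # Buffer already involved in a transfer: back out.
                for h in held:
                    f.buf_locks[h].release()
                return None
            held.append(b)
    return held  # the RDMA is now in flight


def rdma_complete(f, held):
    """Completion handler: unlock the buffers once the RDMA has finished."""
    for b in held:
        f.buf_locks[b].release()
```

Two requests for disjoint blocks can both have transfers in flight, while a request that overlaps an in-flight transfer is held back. With a single file-wide lock held until RDMA completion, the second request would have had to wait as well.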
A multithreaded, event-driven kernel server that uses the buffer cache directly and does event processing in kernel process context faces problems in two circumstances: when a thread tries to lock a buffer that it has already locked (because a transfer is in progress on that buffer), expecting to block until the lock is released by some other thread; and when a buffer is released by a different thread than the one that locked it.
Transferring lock ownership to the kernel during asynchronous network I/O does not help, since the lock release is done by some kernel process (whichever happens to have polled for that particular event) rather than by the kernel itself. The solution presently used is for the kernel process that issued an RDMA operation to wait until the transfer is done and then release the lock itself. This also prevents that process from trying to lock the same buffer again, which would cause a deadlock panic. A better solution is to enable recursive locking and to allow the lock to be released by any of the server threads.
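The proposed alternative, a recursive lock that any server thread may release, can be sketched in user-space Python. This is not the kernel's buffer lock: `ServerBufLock` and its `owner` argument are hypothetical, and the idea shown is only that ownership belongs to the server as a whole (identified by a token) rather than to the individual thread that acquired the lock.

```python
import threading


class ServerBufLock:
    """Recursive lock owned by a token (e.g., the server), not by a thread."""

    def __init__(self):
        self._cond = threading.Condition()
        self.owner = None  # token of the current owner, or None if free
        self.depth = 0     # recursion depth

    def acquire(self, owner):
        with self._cond:
            # Block only if a *different* owner holds the lock.
            while self.depth > 0 and self.owner != owner:
                self._cond.wait()
            self.owner = owner
            self.depth += 1

    def release(self, owner):
        # Any thread may call this, as long as it presents the owner token.
        with self._cond:
            if self.depth == 0 or self.owner != owner:
                raise RuntimeError("lock is not held by this owner")
            self.depth -= 1
            if self.depth == 0:
                self.owner = None
                self._cond.notify_all()
```

With such a lock, the thread that issued the RDMA no longer has to wait for the transfer: whichever server thread polls the completion event can release the buffer, and a second lock attempt by the server on the same buffer increments the depth instead of deadlocking.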
Device Driver Support
Memory-to-memory network adapters virtualize the NIC hardware and are directly accessible from user space. One example is the virtual interface (VI) architecture, in which the NIC implements a number of VI contexts. Each VI is the equivalent of a socket in traditional network protocols, except that a VI is directly supported by the NIC hardware and usually has a memory-mapped interface rather than a system call interface. The need to create multiple logical instances of a device, each with its own private state (separate from the usual device softcopy state), and to map those devices into user address spaces requires new support from Berkeley Software Distribution (BSD) kernels.
Network Driver Model
Finally, network drivers in BSD systems are traditionally accessed through sockets and do not appear in the file-system name space (i.e., under /dev). User-level libraries for memory-to-memory network transports require these devices to be opened and closed multiple times, with each opened instance appearing as a separate logical device that maintains private state and can be memory-mapped.
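The requirement that every open yields a separate logical device with private state can be pictured with a small model. This is only a sketch of the idea: `ViDevice`, `ViInstance`, and their fields are hypothetical names, and no real driver interface or memory mapping is involved.

```python
import itertools


class ViInstance:
    """One opened logical device with state private to this open."""

    def __init__(self, unit):
        self.unit = unit
        self.state = {}  # per-instance state, separate from the device's


class ViDevice:
    """Physical device; each open() returns a new logical instance."""

    def __init__(self, name):
        self.name = name
        self.softc = {"opens": 0}  # shared, device-wide state
        self._units = itertools.count()
        self.instances = {}

    def open(self):
        inst = ViInstance(next(self._units))
        self.instances[inst.unit] = inst
        self.softc["opens"] += 1
        return inst

    def close(self, inst):
        del self.instances[inst.unit]
```

Two opens of the same device return instances whose `state` dictionaries are independent, whereas a traditional driver would hand every opener the same device-wide state.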
Summary And Conclusions
As previously explained, the Direct Access File System (DAFS) is an emerging commercial standard for network-attached storage on server cluster interconnects. The DAFS architecture and protocol leverage network interface controller (NIC) support for user-level networking, remote direct memory access, efficient event notification, and reliable communication. This article demonstrated how the current server structure can attain read throughput of more than 100 MB/s over a 1.25 Gb/s network, even for small (i.e., 4K) block sizes, when prefetching and using an asynchronous client API. Finally, to reduce multithreading overhead, the NIC should be integrated with the host virtual memory system.
About the Author: John Vacca is an information technology consultant and author. Since 1982, John has authored 36 technical books, including The Essential Guide To Storage Area Networks, published by Prentice Hall. John was the computer security official for NASA's space station program (Freedom) and the International Space Station Program from 1988 until his early retirement from NASA in 1995. John can be reached at email@example.com.