Online Data Migration Performance: Preparing a Game Plan
Modern computer systems are expected to be up continuously: even planned downtime for system reconfiguration is becoming unacceptable, so more and more changes must be made to live systems that are running production workloads.
One of those changes is data migration: moving data from one storage area network (SAN) device to another for load balancing, system expansion, failure recovery, or a myriad of other reasons. Traditional methods for achieving this either require application downtime or severely impact the performance of foreground applications -- neither a good outcome when performance predictability is almost as important as raw speed.
This article presents one solution to this problem. It will show you how to use a control-theoretical approach to statistically guarantee a bound on the amount of impact on foreground work during a data migration, while still accomplishing the data migration in as short a time as possible. (Here, a "bound" is simply a limit on how much the migration is allowed to degrade foreground performance.) The result is better quality of service (QoS) for the end users, less stress for the system administrators, and systems that can be adapted more readily to meet changing demands.
Online Data Migration Architecture
Current enterprise computing systems store tens of terabytes of active, online data in dozens to hundreds of disk arrays, interconnected by storage area networks (SANs) like Fibre Channel or Gigabit Ethernet. Keeping such systems operating in the face of changing access patterns (whether gradual, seasonal, or unforeseen), new applications, equipment failures, new resources, and the need to balance loads to achieve acceptable performance requires that data be moved, or migrated, between SAN components -- sometimes on short notice. Creating and restoring online backups can also be viewed as a particular case of data migration, one in which the source copy is not erased.
Existing approaches to data migration either take the data offline while it is moved or allow the I/O resource consumption engendered by the migration process itself to interfere with foreground application accesses and slow them down -- sometimes to unacceptable levels. The former is clearly undesirable in today's global, always-on Internet environment, where people around the world are accessing data day and night. The latter is almost as bad, given that the predictability of information-access applications is almost as much a prerequisite for the success of a modern enterprise as is their raw performance.
The Data Migration Problem
The data migration problem is formalized as follows: The data to be migrated is accessed by client applications that continue to execute in the foreground in parallel with the migration. The inputs to the migration engine are (1) a migration plan, which is a sequence of data moves that rearranges the data placement on the system from an initial state to the desired final state; and (2) client application quality-of-service (QoS) demands, which are I/O performance specifications that must be met while migration takes place.
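The two inputs described above can be pictured as plain data structures. The following is a minimal sketch, not the migration engine's actual interface: the class names (Move, QoSDemand, MigrationInputs), the helper final_placement, and the choice of a latency limit as the QoS metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Move:
    """One step of the migration plan (hypothetical representation)."""
    store: str        # name of the data being moved
    source: str       # device that currently holds it
    destination: str  # device that should hold it afterwards


@dataclass(frozen=True)
class QoSDemand:
    """An I/O performance specification to be met during migration."""
    store: str
    max_latency_ms: float  # assumed metric; the article does not fix one here


@dataclass(frozen=True)
class MigrationInputs:
    plan: List[Move]          # ordered: moves are applied in sequence
    demands: List[QoSDemand]


def final_placement(initial: Dict[str, str], plan: List[Move]) -> Dict[str, str]:
    """Apply the plan to an initial placement and return the final one."""
    placement = dict(initial)  # leave the caller's mapping untouched
    for move in plan:
        if placement.get(move.store) != move.source:
            raise ValueError(f"{move.store} is not on {move.source}")
        placement[move.store] = move.destination
    return placement


# Example with made-up store and device names.
initial = {"orders_db": "array_a", "media_fs": "array_b"}
inputs = MigrationInputs(
    plan=[Move("orders_db", "array_a", "array_c"),
          Move("media_fs", "array_b", "array_a")],
    demands=[QoSDemand("orders_db", 10.0), QoSDemand("media_fs", 25.0)],
)
final = final_placement(initial, inputs.plan)
```

Treating the plan as an ordered list matters: the second move above is only valid because the first one has already been applied.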
Highly variable service times in SANs (e.g., due to unpredictable positioning delays, caching, and I/O request reordering) and workload fluctuations on arbitrary time scales make it difficult to provide absolute guarantees, so statistical guarantees are preferable unless gross over-provisioning can be tolerated. The data-migration problem, then, is to complete the data migration in the shortest possible time that is compatible with maintaining the QoS goals.
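To make the trade-off concrete, here is a minimal sketch of a feedback loop that speeds migration up when foreground latency has headroom and slows it down when the limit is exceeded, together with a check for a statistical (rather than absolute) guarantee. The controller form, the gain, the rate limits, and the 5 percent violation allowance are illustrative assumptions, not the design or parameters of any particular migration engine.

```python
from typing import Sequence


def next_migration_rate(rate_mbps: float,
                        observed_latency_ms: float,
                        target_latency_ms: float,
                        gain: float = 0.5,
                        min_rate: float = 0.0,
                        max_rate: float = 100.0) -> float:
    """One step of a simple integral-style controller (an assumption).

    A positive error means foreground I/O has headroom, so the migration
    rate rises; a negative error means the limit is exceeded, so it falls.
    """
    error = target_latency_ms - observed_latency_ms
    new_rate = rate_mbps + gain * error
    return max(min_rate, min(max_rate, new_rate))


def violation_fraction(latencies_ms: Sequence[float], bound_ms: float) -> float:
    """Fraction of sampling intervals whose latency exceeded the bound."""
    if not latencies_ms:
        return 0.0
    violations = sum(1 for x in latencies_ms if x > bound_ms)
    return violations / len(latencies_ms)


def meets_statistical_guarantee(latencies_ms: Sequence[float],
                                bound_ms: float,
                                max_violation_fraction: float = 0.05) -> bool:
    """True if the bound was exceeded in at most the allowed share of intervals."""
    return violation_fraction(latencies_ms, bound_ms) <= max_violation_fraction


# Latency 2 ms over a 10 ms target: the rate is reduced by gain * 2.
slower = next_migration_rate(50.0, observed_latency_ms=12.0, target_latency_ms=10.0)
# Latency 2 ms under the target: the rate is increased by the same amount.
faster = next_migration_rate(50.0, observed_latency_ms=8.0, target_latency_ms=10.0)
```

An absolute guarantee would require violation_fraction to be exactly zero over every window, which is what the variability described above makes impractical without over-provisioning.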
QoS Goals Formulation
One of the keys to the problem is a useful formalization of the QoS goals. The basic unit is the store: a logically contiguous array of bytes, such as a database table or a file system; its size is typically measured in gigabytes. Stores are accessed by streams, which represent I/O access patterns. Each store may have one or multiple streams. The granularity of a stream is somewhat at the whim of the definer, but usually corresponds to some recognizable entity such as an application.
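The relationship between stores and streams can be sketched as a small data model. The class and attribute names below, including the request-rate field, are hypothetical and chosen only to illustrate the one-store, one-or-more-streams structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stream:
    """An I/O access pattern, typically one per application."""
    name: str
    request_rate_iops: float  # hypothetical attribute of the access pattern


@dataclass
class Store:
    """A logically contiguous array of bytes, e.g. a table or a file system."""
    name: str
    size_gb: float
    streams: List[Stream] = field(default_factory=list)

    def total_request_rate(self) -> float:
        """Aggregate demand placed on this store by all of its streams."""
        return sum(s.request_rate_iops for s in self.streams)


# A store accessed by two applications, each modeled as its own stream.
orders = Store("orders_db", size_gb=200.0, streams=[
    Stream("order_entry", 400.0),
    Stream("reporting", 150.0),
])
```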
QoS guarantees can be specified at several levels. Global QoS guarantees bound the aggregate performance of I/Os from all client applications in the system, but do not guarantee the performance of any individual store or application. They are seldom sufficient for realistic application mixes, especially since access demands on different stores may be significantly different during migration. Stream-level guarantees have the opposite difficulty: they can proliferate without bound, and so run the risk of scaling poorly due to management overhead.
At the intermediate level (the one adopted by an online data migration architecture), the goal is to provide store-level guarantees. In practice, this has effects similar to those of stream-level guarantees for real-life workloads, because the data-gathering system that is normally used to generate workload characterizations creates one stream for each store by default. Such QoS specifications may be derived from application requirements (e.g., based on the timing constraints and buffer size of a media-streaming server), specified by hand, or derived empirically from workload monitoring and measurements.
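A small numerical sketch shows why a global guarantee can be met while an individual store suffers. The store names, latencies, request rates, and the 10 ms limit are invented for illustration, and the request-weighted average is just one assumed way of defining aggregate performance.

```python
from typing import Dict, List


def global_guarantee_met(latency_ms: Dict[str, float],
                         iops: Dict[str, float],
                         bound_ms: float) -> bool:
    """Check a bound on the request-weighted average latency of all stores."""
    total_iops = sum(iops.values())
    weighted = sum(latency_ms[s] * iops[s] for s in latency_ms) / total_iops
    return weighted <= bound_ms


def violating_stores(latency_ms: Dict[str, float],
                     bound_ms: Dict[str, float]) -> List[str]:
    """Return the stores whose own latency exceeds their own bound."""
    return sorted(s for s, lat in latency_ms.items() if lat > bound_ms[s])


# Made-up measurements taken during a migration.
latency = {"orders_db": 6.0, "media_fs": 18.0}   # average latency per store, ms
rate = {"orders_db": 900.0, "media_fs": 100.0}   # requests per second per store

# Weighted average is (6*900 + 18*100) / 1000 = 7.2 ms, under a 10 ms limit.
aggregate_ok = global_guarantee_met(latency, rate, bound_ms=10.0)

# Checked store by store against the same limit, media_fs is in violation.
offenders = violating_stores(latency, {"orders_db": 10.0, "media_fs": 10.0})
```

The busy store dominates the average, so the lightly loaded one can miss its target without the global figure showing it; a store-level check catches this with only one specification per store.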