Nearly all large enterprises have implemented shared storage based on Fibre Channel SAN technology. Typically, these installations began as fairly modest configurations, with a few dozen servers and one or more large SAN-attached storage arrays. But with SAN technology demonstrating its end-user value and ability to more efficiently manage storage data, data center SANs in many cases have quickly grown to encompass more servers and more storage capacity.
Although thankfully McDonald's has sworn off super-sizing fast food, data centers worldwide continue to super-size their SANs in an attempt to connect hundreds, and in some cases thousands, of servers and storage devices in a single storage network.
Standards vs. Reality
According to ANSI standards, a Fibre Channel SAN may have as many as 239 switches in a single fabric and support roughly 15.5 million devices. These limits derive from the number of bits available in the three-byte (24-bit) Fibre Channel address, which carries the unique Domain ID of each switch along with the device identifiers for attached nodes such as servers and storage targets. Although for competitive reasons all switch vendors proclaim their support for these impressive figures, none has been able to implement them in a real product.
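The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustration only: it assumes the conventional split of the 24-bit address into one byte each for domain, area, and port, with 239 usable Domain IDs (0x01 through 0xEF). The function names are hypothetical and not drawn from any vendor tool.

```python
# Sketch of the 24-bit Fibre Channel address space.
# Assumption: one byte each for Domain, Area, and Port, with
# usable Domain IDs running from 0x01 to 0xEF (239 values).

AREA_BITS = 8
PORT_BITS = 8
FIRST_DOMAIN = 0x01
LAST_DOMAIN = 0xEF
USABLE_DOMAINS = LAST_DOMAIN - FIRST_DOMAIN + 1  # 239


def max_fabric_addresses(usable_domains=USABLE_DOMAINS):
    """Theoretical number of device addresses in one fabric."""
    return usable_domains * (1 << AREA_BITS) * (1 << PORT_BITS)


def pack_fcid(domain, area, port):
    """Combine domain, area, and port bytes into a 24-bit address."""
    if not FIRST_DOMAIN <= domain <= LAST_DOMAIN:
        raise ValueError("domain outside the usable switch range")
    if not (0 <= area <= 0xFF and 0 <= port <= 0xFF):
        raise ValueError("area and port must each fit in one byte")
    return (domain << 16) | (area << 8) | port


def unpack_fcid(fcid):
    """Split a 24-bit address back into (domain, area, port)."""
    return (fcid >> 16) & 0xFF, (fcid >> 8) & 0xFF, fcid & 0xFF


if __name__ == "__main__":
    print(USABLE_DOMAINS)            # 239
    print(max_fabric_addresses())    # 15663104
    print(hex(pack_fcid(1, 2, 3)))   # 0x10203
```

The product 239 x 256 x 256 comes to 15,663,104, which is where the "15½ million" figure originates.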
Supposing for a moment that a vendor could actually support these standards limits, the time required to stabilize a single SAN composed of hundreds of switches and the complexity of registering tens of thousands of devices would be unacceptable in any enterprise environment. Nonetheless, many customers are attempting to build large fabrics to support their business applications, either by connecting director-class fabrics or by building complex meshes of departmental switches.
Why Build Big SANs?
Storage managers are attempting to build large SAN fabrics for numerous reasons. Typically, the purpose of a large storage network is not to provide any-to-any connectivity. Rather, in most cases, a large multi-switch fabric results from an attempt to share one or more storage assets among hundreds of servers: a many-to-few connection strategy.
A large enterprise, for example, may have invested hundreds of thousands of dollars in a large robotic tape library. In order to share this single resource with hundreds of devices, it is necessary to attach multiple directors or switches by interswitch links (ISLs).
Likewise, sharing several large storage arrays among 500 or more servers may require multiple ISLs between ten or more director-class fabrics. From a customer standpoint, the business objective of more efficiently sharing storage assets is fairly straightforward. From a technology perspective, however, achieving this business goal can be very complex.
Hefty Issues for Hefty SANs
Building a large multi-switch SAN requires careful network design to ensure that sufficient bandwidth is allocated between switches to optimize application performance. In addition, safeguarding against failed links requires a meshed design to provide alternate paths through the fabric.
Allocating ISLs for both performance and alternate pathing consumes expensive fabric ports and reduces the total port count for servers and storage targets. So the more fully meshed the fabric, the lower the total productive population of the SAN.
This becomes painfully obvious when customers attempt to build large fabrics with 16- or 32-port fabric switches. In such configurations, a third or more of the total port count may be devoted to ISLs. In general, directors with higher port counts are far more efficient when scaling to large SANs, since more ports per chassis are available for device attachment. In addition, new 10 Gbps ISL options simplify switch-to-switch connectivity and avoid multi-ISL trunking issues such as potential out-of-order frame delivery.
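The port cost of meshing is easy to estimate. The sketch below is a rough illustration under stated assumptions: every switch in the fabric is identical and dedicates the same number of ports to ISLs, and the specific counts in the example (sixteen 16-port switches with six ISL ports each) are hypothetical rather than taken from any real deployment.

```python
# Rough estimate of how many fabric ports remain for servers and
# storage once ISLs are allocated. All figures are hypothetical.

def productive_ports(switches, ports_per_switch, isl_ports_per_switch):
    """Return (device_ports, isl_fraction) for a uniform fabric."""
    if switches < 1 or ports_per_switch < 1:
        raise ValueError("need at least one switch with one port")
    if not 0 <= isl_ports_per_switch <= ports_per_switch:
        raise ValueError("ISL ports cannot exceed ports per switch")
    total = switches * ports_per_switch
    isl = switches * isl_ports_per_switch
    return total - isl, isl / total


if __name__ == "__main__":
    # Sixteen 16-port switches, six ports per switch used for ISLs.
    device_ports, isl_fraction = productive_ports(16, 16, 6)
    print(device_ports, isl_fraction)   # 160 0.375

    # Two 128-port directors joined by four ISLs per chassis.
    device_ports, isl_fraction = productive_ports(2, 128, 4)
    print(device_ports, isl_fraction)   # 248 0.03125
```

Both hypothetical fabrics start with 256 raw ports, but the meshed small-switch design gives up more than a third of them to ISLs, while the two-director design gives up about three percent.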
Building large SANs has several unintended consequences that may affect fabric stability. Due to inherent architectural characteristics of Fibre Channel, as well as vendor-specific product implementations, connecting eight or more switches in a single fabric may result in erratic behavior. Fibre Channel is a link layer architecture, much like bridged LANs. A layer 2 network gives optimum performance and the lowest protocol overhead, which aligns nicely with the performance requirements of block data over a channel.
Connecting multiple fabric switches therefore extends a flat network space that, like bridged LANs, may be vulnerable to network-wide disturbances. In the bridged LAN environment, broadcast storms can negatively impact all attached nodes. In Fibre Channel SANs, the equivalent disruption may be due to state change notification broadcasts and occasional fabric reconfigurations caused by unexpected changes in the fabric (e.g., plugging a live switch into a large operational fabric). As discussed below, SAN routing addresses such large fabric issues via network segmentation.
In addition, as more switches are connected to a single fabric, more switch-to-switch communication is required to properly allocate unique address blocks, resolve zoning information, add entries into simple name server (SNS) tables, and exchange routing information. In some cases, limited SNS capacity may restrict the number of devices that can be supported in a single fabric.
In most cases, as the fabric grows past 1,000 devices, the convergence time required to stabilize the network may become quite lengthy if a disruption occurs. The switch-to-switch chatter required for initial fabric building and registration of servers and storage devices increases in volume as more switches are added to the fabric. If the SNS table capacity is inadvertently exceeded, the fabric may eventually stabilize, but not all devices will be recognized.
Getting Outsized SANs Under Control
The good news is that resolving SNS, ISL, convergence, and broadcast issues for large SANs has largely been facilitated by new fabric technologies such as 10 Gbps interswitch links, dynamic partitioning, and SAN routing. Using one or more 10 Gbps ISLs between directors not only simplifies the cabling scheme, but also overcomes trunking and load balancing issues.
Fabric vendors have been struggling for some time over the question of how to increase performance on ISL links while ensuring in-order delivery of frames. A single sequence of frames sent over four trunked interswitch links, for example, might arrive at the destination switch in a random order. Without complex queuing algorithms, it is not possible to leverage multiple interswitch links to increase bandwidth.
Alternatively, sending each sequence only over a specific ISL may result in over-utilization of some ISLs and under-utilization of others. Simply replacing multiple 2 Gbps ISLs with a single 10 Gbps ISL resolves both the performance and frame delivery issues.
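The utilization problem with pinned ISLs can be illustrated with a toy model. The sketch below is not how any particular switch assigns traffic: it assumes flows are pinned to ISLs in simple round-robin order, uses nominal line rates in Mbps with no allowance for encoding overhead, and the traffic demands are invented for the example.

```python
# Toy model: pin each flow to one ISL so its frames stay in order,
# then compare per-link load with a single faster link.
# Assumptions: round-robin pinning, nominal line rates, made-up demands.

def pin_flows(flow_mbps, isl_count):
    """Assign flow i to ISL (i % isl_count); return load per ISL."""
    if isl_count < 1:
        raise ValueError("need at least one ISL")
    loads = [0] * isl_count
    for index, demand in enumerate(flow_mbps):
        loads[index % isl_count] += demand
    return loads


def oversubscribed(loads, capacity_mbps):
    """Return the indices of links whose load exceeds capacity."""
    return [i for i, load in enumerate(loads) if load > capacity_mbps]


if __name__ == "__main__":
    flows = [1800, 200, 100, 300, 1500, 100, 200, 200]  # hypothetical

    trunk = pin_flows(flows, 4)            # four 2 Gbps ISLs
    print(trunk)                           # [3300, 300, 300, 500]
    print(oversubscribed(trunk, 2000))     # [0]

    single = pin_flows(flows, 1)           # one 10 Gbps ISL
    print(single)                          # [4400]
    print(oversubscribed(single, 10000))   # []
```

In this example the four-link trunk has 8 Gbps of nominal capacity for 4.4 Gbps of demand, yet one link is overloaded while the other three sit nearly idle. The single 10 Gbps link carries the same traffic with room to spare and no reordering concern.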
Dynamic partitioning (pioneered by Sanera and now part of the McDATA product line) addresses the problem of swollen SNS tables by providing hardware-based segmentation of a single large director into separate SAN partitions. A 256-port director, for example, can be divided into isolated partitions that service separate departments or applications. Since each partition has a smaller SNS and operates independently of the other partitions, the convergence time for each partition is improved.
Dynamic partitioning also allows each partition to run a different microcode version, and allows individual partitions to be reset without affecting the entire switch. A software-based scheme using frame tagging (e.g., Cisco VSANs) provides separation of traffic within a switch, but does not accommodate multiple microcode versions or selective reset.
SAN routing (pioneered by Nishan Systems) resolves large fabric issues by providing a layer 3 routing function for connecting SANs. Instead of building a single large fabric with the accompanying SNS, convergence, and reconfiguration issues, SAN routing maintains the autonomy of each SAN segment that is connected while simultaneously allowing designated storage conversations between SANs to occur.
SAN routing aligns with the customer requirement to share storage assets between hundreds or thousands of devices, but avoids the inherent flat network issues that large meshed SANs imply. Just as IP routing solved the problem of broadcast storms for bridged LANs, SAN routing enables customers to build very large and stable storage networks out of multiple, separate SANs.
Collectively, these new options can help storage managers achieve their business goals of sharing and managing their storage assets more efficiently and at lower cost. The requirements of Fibre Channel architecture can thus be fulfilled without forcing the customer to adapt their business objectives to the peculiar behavior of the underlying SAN technology.