A guest Post By Michael Richtberg The first blog in this series, “Resiliency Explained — Understanding PowerFlex’s Self-Healing, Self-Balancing Architecture,” covered an overview of how the PowerFlex system architecture provides […]
A guest Post By Michael Richtberg
The first blog in this series, “Resiliency Explained — Understanding PowerFlex’s Self-Healing, Self-Balancing Architecture,” covered an overview of how the PowerFlex system architecture provides superior performance and reliability. Today, we’ll take you through another level of detail with specific examples of recoverability.
Warning: Information covered in this blog may leave you wanting for similar results from other vendors.
PowerFlex possess some incredible superpowers that deliver performance results that run some of the world’s most demanding applications. But what happens when you experience an unexpected failure like losing a drive, a node, or even a rack of servers? Even planned outages for maintenance can result in vulnerabilities or degraded performance levels, IF you use conventional data protection architectures like RAID.
Just a reminder, PowerFlex is a high-performance software defined storage system that delivers the compute and storage system in a unified fabric with the elasticity to scale either compute, storage or both to fit the workload. PowerFlex uses all-flash direct attached media located on standard x86 servers utilizing industry standard HBA adapters and 10 Gb/s or higher ethernet NICs that interconnect servers. The systems scale from 4 nodes to multi-rack 1000+ nodes while increasing capacity, linearly increasing IOPS, all while sustaining sub-millisecond latency.
PowerFlex takes care of dynamic data placement that ensures there are NO hot spots, so QoS is a fundamental design point and not an after-thought bolt-on “fix” for a poor data architecture scheme; there’s no data locality needed. PowerFlex handles the location of data to ensure there are no single points of failure, and it dynamically re-distributes blocks of data if you lose a drive, add a node, take a node off line, or have a server outage (planned or unplanned) containing a large number of drives. It automatically load balances the placement of data as storage use changes over time or with node expansion.
The patented software architecture underlying PowerFlex doesn’t use a conventional RAID protection mechanism. RAID serves a purpose, and even options like erasure coding have their place in data protection. What’s missing in these options? Let’s use a couple of analogies to compare traditional RAID and PowerFlex protection mechanisms:
Think of RAID as a multi-cup layout where you’re looking to ensure each write places data in multiple cups. If you lose a cup, you don’t necessarily re-arrange the data. You’re protected from data loss, but without the re-distribution, you’re still operating in a deprecated state and potentially vulnerable to additional failures until the hardware replacement occurs. If you want more than one level of cup failure, you have multiple writes to get multiple cups which creates more overhead (particularly in a software-defined storage versus a hardware RAID controller-based system). It still only takes care of data protection and not necessarily performance recovery.
Think of the architectural layout of data like a three-dimensional checkerboard where we ensure the data placement keeps your data safe. In the checkerboard layout, we can quickly re-arrange the checkers if you lose a box on the board or a row/column or even a complete board of checkers. Re-arranging the data to ensure there’s always two copies of the data for on-going protection and restoration of performance. The three-dimensional aspect comes from all nodes and all drives participating in the re-balancing process. The metadata management system seamlessly orchestrates re-distribution and balancing data placement.
Whether the system has a planned or unplanned outage or a node upgrade or replacement, this automatic rebalancing happens rapidly because every drive in the pool participates. The more nodes and the more drives, the faster the process of reconstituting any data rebuilding processes. In the software defined PowerFlex solution there’s no worrying about a RAID level or the performance trade-offs, it’s just taken care of for you seamlessly in the background without any of the annoying complications RAID often introduces or the need any specialized hardware controllers and associated cost.
PowerFlex looks at actual data stored on each drive rather than treating the whole drive capacity as what needs recovering. In this example, a drive failure occurs. The data levels illustrated here represent the total used capacity in these 6, 9 or 12 node configuration examples (we can scale to over 1,000 nodes). The 25%, 50% and 75% levels show relative rebuild times for this 960GB SAS SSD to return to restore the data to a full heathy state (re-protected).
We’re showing you a rebuild scenario to emphasize the performance, but taking it to another level, you wouldn’t be urgently needing to replace the drive as we leverage the data redistribution to other drives for protection and sustaining performance while using virtual spare space provided by all of the drives to pick up the gap. Unlike RAID, we don’t need to replace the drive to return the system to full health. You can replace the drive when it’s convenient.
Notice a few things:
- More nodes = less rebuild time! Try this if you scale out alternative options and I think you’ll find the inverse.
- The near linear rebuild performance improves as you add more drives and nodes. Imagine if this showed even more nodes participating in the rebuild process!
- More data density doesn’t result in a linear increase in the rebuild time. As you see in the 12-node configuration, it starts to converge on a vanishing point.
This illustrates what happens when you have 35, 53, and 71 drives participating in the parallel rebuild process for the six, nine and twelve node configurations, respectively.
Node Rebuild (6 drives)
Here we show an example using a similar load level of data on the nodes. The nodes each contain six drives with a maximum of 5.76TB to be rebuilt. The entire cluster of drives participates in taking over the workloads, automatically rearranging the data placement and making sure the cluster always has two copies of the data residing on different nodes. Just as in the above drive rebuild example, the process leverages all the remaining drives from the cluster to take on the rebuild process to return to a fully protected state. That means for the six-node configuration there are 30 drives participating in the parallelized rebuild, 48 drives in the nine-node configuration and 66 drives in the twelve nodes.
Notice again the near linear improvement in rebuild times as you increase the number of nodes and drives. As in the drive rebuild scenario, the node rebuild time observed also tends to approach a vanishing point for the varying data saturation levels.
As mentioned previously, PowerFlex scales to 1000+ nodes. Take a scenario where you need to affect an entire rack of servers and remain operational and recoverable (unthinkable in conventional architectures) and you see why our largest customers depend on PowerFlex.
If the above tests were done just to show off the best rebuild times, we would just run these systems without any actual other work occurring. However, that wouldn’t reflect a real-world scenario where the intention is to continue operating gracefully and still recover to full operational levels.
These tests were done with the PowerFlex default rebuild setting of one concurrent I/O per drive. For customers with more aggressive needs to return to fully protected, PowerFlex can be configured to accelerate rebuilds as a priority. To optimize rebuilds even more than illustrated, you can set the number of concurrent I/Os per drive to two or more or even unlimited. Since changing the number of I/Os per drive does affect latency and IOPS, which could adversely impact workloads, we chose to illustrate our default example that intentionally balances keeping workload performance high while doing the rebuild.
Using FIO* as a storage I/O generator, we ran these rebuild scenarios with ~750k random IOPS of activity on the 12 node configuration, ~600k random IOPS on the 9-nodes and ~400k on the 6-nodes, all while sustaining 0.5mS latency levels (cluster examples here can drive well over 1M IOPS at sub-mS levels). This represents a moderately heavy workload operating while we performed these tests. Even with the I/O generator running and the rebuild process taking place, the CPU load was approximately 20%. The I/O generator alone only consumed 8 to 10% of the available CPU capacity. Both CPU utilization figures underscores the inherent software defined infrastructure efficiency of PowerFlex that leaves a lot of available capacity to host application workloads. In this test case scenario, both the compute and storage occupied the same node (hyperconverged), but remember that we can also run a in 2-layer configuration using compute only and storage only nodes for asymmetrical scaling.
The systems used for these tests had the following configuration. Note that we used six drives per node in the R740xd chassis that can hold 24 drives, which means there were another 18 slots available for additional drives. As noted previously, more drives mean more parallel capabilities for performance and rebuild velocity.
- 12x R740xd nodes with 2 sockets Intel Xeon Gold 2126 2.6Ghz (12 cores /socket)
- Six had 256GB RAM & six utilized 192GB RAM
PowerFlex delivers cloud scale performance with unrivaled grace under pressure reliability for delivering a software defined block storage product with six nines of availability. Be sure to read Part 1 of this blog “Resiliency Explained — Understanding PowerFlex’s Self-Healing, Self-Balancing Architecture” to see the other protection architecture elements not covered here. For more information on our validated mission critical workloads like Oracle RAC, SAP HANA, MySQL, MongoDB, SAS, Elastic, VDI, Cassandra and other business differentiating applications, please visit our PowerFlex product site.
* FIO set for 8k, 20% random write, 80% random reads