It might not be obvious from the announcement what we got the award for. We, and a few others, received the award for HSM (Hierarchical Storage Management), which is a fancy name for what we usually call tiering.
Tiering, on a scale-out cluster, is the process of storing more important data on higher-performing, more expensive nodes, while putting less important data on lower-performing, less expensive nodes. This has cost and performance benefits, letting customers get the best performance for their high-access working data set without having to pay the penalty of placing all the data they’re retaining on the most expensive storage. Which data is important changes over time, so tiering is a continuing process of migrating data to the right locations.
Having a mix of different node types is natural for clustered storage, as it’s built into the upgrade process. Tiering requires grouping those nodes into sensible tiers, making layout decisions so data is written to the right tier, and discovering data that needs to be migrated as the important data set changes.
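To make the grouping step concrete, here is a minimal sketch of nodes grouped into pools and pools grouped into named tiers. The pool names, node ids, and media types are illustrative assumptions, not the actual product's configuration model.

```python
from dataclasses import dataclass

@dataclass
class NodePool:
    name: str
    nodes: list   # node ids with matching hardware
    media: str    # e.g. "nvme" or "sata-hdd" (assumed labels)

# Tiers are just named groups of node pools that layout decisions can target.
tiers = {
    "performance": [NodePool("ssd-pool", [1, 2, 3, 4], "nvme")],
    "archive":     [NodePool("hdd-pool", [5, 6, 7, 8, 9], "sata-hdd")],
}

def nodes_in_tier(tier_name):
    """All node ids that writes targeted at this tier may use."""
    return [n for pool in tiers[tier_name] for n in pool.nodes]

print(nodes_in_tier("archive"))   # [5, 6, 7, 8, 9]
```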
The layout and discovery processes for tiering have obvious similarities to the way the repair process works when a node or drive fails. In both repair and tiering, we must scan the storage to find files that need to have their layout modified. The modifications require putting the file back through our normal layout engine, which tells us which blocks need to be moved, and where.
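A rough sketch of that shared scan-and-restripe loop is below. The `File`, `layout_engine`, and `move_block` names are hypothetical stand-ins for illustration, not the system's real API; the point is only that repair and tiering share the same "compare current placement to desired placement, then move the difference" shape.

```python
from dataclasses import dataclass, field

@dataclass
class File:
    name: str
    tier: str
    placement: dict = field(default_factory=dict)   # block id -> drive holding it now

def restripe_scan(files, layout_engine, move_block):
    """Scan files and move any block whose current location no longer matches
    what the layout engine would choose today (tier change, failed drive, etc.)."""
    for f in files:
        # Where should each block live, according to the normal layout engine?
        want = {blk: layout_engine(f, blk) for blk in f.placement}
        for blk, src in list(f.placement.items()):
            dst = want[blk]
            if dst != src:
                move_block(f, blk, src, dst)   # copy to dst, then retire the old copy
                f.placement[blk] = dst

# Example: a file tagged "archive" whose blocks still sit on performance drives.
f = File("big.dat", "archive", {0: "perf-drive-1", 1: "perf-drive-2"})
restripe_scan(
    [f],
    lambda f, blk: f"arch-drive-{blk}",
    lambda f, blk, src, dst: print(f"move {f.name}[{blk}] {src} -> {dst}"),
)
```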
What’s not as obvious is that reliability during the repair process requires grouping drives into pools, similar to what is done for tiering. Drives are individually fairly reliable, but when you get a few thousand of them together they’re like mayflies. Correct handling of multiple failures is essential to keep larger clusters functioning. Individual data stripes can’t survive large numbers of concurrent failures without adding more parity blocks, which would decrease storage efficiency and increase cost.
The solution is to align the data stripes to pools of disks, much like we do for tiering. That way multiple failures are fine, as long as we don’t have too many failures in any one pool. The disk pools we use for reliability are actually subsets of the node pools we use for tiering, as the best reliability/efficiency solution involves using disk pools just big enough to hold our maximum supported data stripe width.
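A toy illustration of why this helps: a stripe with m parity blocks survives up to m lost drives, and confining each stripe to one pool means failures in different pools can never combine against the same stripe. The stripe width, parity count, pool sizes, and random placement below are assumptions for the sketch only.

```python
import random

STRIPE_WIDTH = 8        # data + parity blocks per stripe (assumed)
PARITY = 2              # concurrent failures any single stripe can tolerate (assumed)

def place_stripe(pool):
    """Place one stripe entirely within a single disk pool."""
    return random.sample(pool, STRIPE_WIDTH)

def stripe_survives(stripe_drives, failed_drives):
    """A stripe is intact if it lost no more blocks than it has parity."""
    return len(set(stripe_drives) & failed_drives) <= PARITY

# Two pools of ten drives each -- just big enough for one stripe -- with one stripe per pool.
pools = [list(range(0, 10)), list(range(10, 20))]
stripes = [place_stripe(p) for p in pools]

# Three concurrent failures spread across the cluster: at most two land in any
# one pool here, so every stripe still has enough surviving blocks.
failed = {3, 7, 14}
print([stripe_survives(s, failed) for s in stripes])   # [True, True]
```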
Given that the repair process is necessary just to have a clustered filesystem in the first place, tiering becomes merely a matter of tagging each file with the appropriate pool and letting the disk layout logic do the rest. The added complexity of tiering comes from deciding which files should be tiered where. How the decision is made is separate from the process of tiering the data. Some ways of making the decision can be more expensive, due to extra compute cost (e.g., regular expression matching) or the need to re-evaluate periodically (e.g., file age), but none of this affects the work the system needs to do to move the data around.
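Here is a minimal sketch of that decision/mechanism split: policy rules (a path regex, an age threshold) only pick a tier tag for a file; the restriping machinery that moves blocks is the same no matter how the tag was chosen. The rule forms, paths, and tier names are assumptions for illustration, not the actual policy language.

```python
import re
import time

def tier_for(path, mtime, rules, default_tier):
    """Return the tier tag for a file; the first matching rule wins."""
    for predicate, tier in rules:
        if predicate(path, mtime):
            return tier
    return default_tier

now = time.time()
DAY = 86400

rules = [
    # Regex match on path: more compute per file, but never needs re-evaluation.
    (lambda p, m: re.search(r"\.log$", p) is not None, "archive"),
    # Age-based: cheap to check, but must be re-run periodically as files age.
    (lambda p, m: now - m > 90 * DAY, "archive"),
]

print(tier_for("/data/build/output.log", now, rules, "performance"))            # archive
print(tier_for("/data/projects/model.bin", now - 10 * DAY, rules, "performance"))  # performance
```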
This policy approach to tier management is where the real value lies. Tiering is entirely transparent to applications and users, and storage administrators can make useful decisions about where the data is stored without knowing every detail of their users’ workflow.