Back in June 2020, we introduced the PowerScale platform, a hardware and software refresh of our Isilon platform. That launch was massive, and you can read about it here.
That was only the beginning of the journey, and it is now time to take it to the next level. But before we do, I wanted to take a quick look back at where we are.
Unstructured Data Trends
If we consider the latest unstructured data trends, there are a few particularly worth noting. For example, in the coming years 75% of enterprise data will be generated outside of core data centers, contributing to the growth of the global datasphere to 175 zettabytes by 2025 – all requiring scalable storage and easy management. This kind of growth will require a strategy that ensures you can quickly adapt when these datasets inevitably expand.
Another consideration is the shift to hybrid multi-cloud environments. Already underway, this trend presents its own challenge: these large, dense unstructured datasets require massive performance capabilities. Moving them to the cloud is not easy, and in many instances it's simply not feasible or cost effective.
Leveraging all that the cloud has to offer with unstructured data, especially file-based data, requires a different approach. An approach that focuses on the Data First.
It should be no surprise that – in the data age – data growth is measured exponentially. What may surprise you is that this growth is fundamentally driven by applications and workloads, and that 80% of it is unstructured data.
What do we mean when we say, “Unstructured Data”?
Specifically, we're referring to datasets that are accessed via object or file protocols, and they fall into three main categories:
IT Workloads – such as home directories, group shares, research data, video surveillance, archive & backup, and file shares which support thousands of business applications.
High Performance Production Workloads – such as Life Sciences, Media and Entertainment, Financial Services, Manufacturing, EDA, Analytics, and Healthcare imaging.
Emerging Workloads – such as Artificial Intelligence, Smart Factory, Internet of Things, and Automated Driving Assistance.
Today, these “Unstructured” datasets constitute an estimated 80% of the world’s data.
But more importantly, they're growing faster than any other type of data, and as a result new workloads are emerging daily in this data-first era.
The Challenge – Data Gravity
Another important concept to consider is that of “data gravity”. This is a useful metaphor that works particularly well when describing the challenges with unstructured data today. There are two ways to look at this:
As a dataset grows and accumulates “mass”, it attracts more services and applications. Consequently, it pulls these lighter weight applications towards its own center of gravity. As new applications are created it’s more practical to bring the application to the dataset, rather than moving the data to the application.
As a dataset grows, it inevitably becomes far more difficult to manage. For example, simple tasks such as moving or replicating the data become progressively more challenging. Furthermore, growing, maintaining and managing the lifecycle of the hardware it’s stored on can develop into a massive cost burden with significant risk.
So, the question arises: How do you position your business to be able to rapidly adapt for – not only what comes next – but what comes next at scale? And how do you take advantage of new services and applications by providing them access to your unstructured data – whether those services are created in the cloud, or those applications are deployed within your own datacenter?
McKinsey recently conducted its "The state of AI in 2020" study, and it turns out AI is not just a buzzword anymore: 80% of respondents said that AI has helped increase revenue. However, there is a problem.
One of the biggest challenges to AI & Machine Learning adoption is scaling up. Traditional storage technologies are failing to keep up.
In addition, AI and GPU-powered workloads demand speed and scale. Today, they are fragmented in data silos across the edge, the core, and the cloud. Because of this fragmentation, customers face inconsistent performance, vendor lock-in, unpredictable economics, and availability that is often limited to just three nines.
Based on the latest research from ESG we’re seeing that customers are building their data lakes on flash storage to overcome these challenges. In fact, in a recent survey, ESG found that more than 80% of the customers with production AI/ML environments tend to have a majority of their storage footprint on flash storage.
Building on that leadership position and track record of success, in 2020 we're further decoupling OneFS from its native hardware and combining it with the world's #1 server platform in PowerEdge. This commitment to a software-defined journey will allow us to exploit hardware advancements faster than ever before. Moreover, it will allow us to expand into markets and use cases previously unable to benefit from the power of OneFS.
Many Ways to Deploy, One Powerful Experience
With the introduction of PowerScale we’re enabling our customers to consume OneFS in several powerful ways. OneFS has always been a highly flexible software platform which was designed from the ground up to provide agility when managing data at scale. In fact, OneFS has historically provided many of the benefits you would usually associate with a “software defined” solution, such as: single pool aggregation of multiple types of media, policy-based management and automation, dynamic scaling of performance and capacity, and distributed architecture – just to name a few.
One of the challenges most vendors face is a long hardware development cycle which makes it challenging to stay current and provide a tightly integrated, highly supportable and resilient storage solution. For most hardware vendors, this is the only path to an appliance-like supportability model with the levels of predictability enterprises expect.
On the other side of the spectrum, you have the purely software-only storage solutions. These solutions trade off ease of deployment, supportability and resilience for complete hardware flexibility. With a solution like this, it’s impossible to provide a consistent experience when it comes to support, deployment, or operations.
Specifically, there are three ways to deploy OneFS:
The first option is the PowerScale node, which gives you a unique advantage by combining OneFS with PowerEdge, allowing you to take advantage of new hardware faster than ever before and providing more flexibility and choice than ever – while still delivering the supportability, resiliency, and ease of deployment of an appliance.
Second, we have purpose-built hardware with best-in-class density and scale. These are the Isilon chassis-based nodes that continue to satisfy the needs of a wide variety of applications and are compatible with the new PowerScale nodes.
And finally, we provide cloud consumption options not available with scalable file solutions until now. Our multi-cloud offering provides the full-fidelity experience of OneFS as a hosted, managed service with simultaneous connectivity to Google Cloud, AWS, and Azure on the same IP range. Our native cloud integration with Google enables a cloud-native experience where Google handles all provisioning, management, and billing. All these solutions can run cloud-only or interoperate with on-prem instances of OneFS for a hybrid deployment.
Strength in the Market
For nearly two decades, we’ve been successfully helping our customers solve their unstructured data challenges, and the market data and industry analyst commentary reflect that commitment and proven track record. To date, we’ve shipped over 23 Exabytes of capacity, including a single customer managing more than 1EB of unstructured data on OneFS. The bottom-line: We have unmatched experience in addressing these challenges, and every day we think about how to help our customers overcome them in a sustainable way.
OneFS backs that experience with a full set of enterprise data services:
Backup – integration with leading backup vendors; two-way and three-way NDMP backup supported.
Replication – one-to-many or many-to-one; replicate from edge to core to cloud (or vice versa); asynchronous replication; push-button failover.
Snapshots – user self-restore; snapshots at the directory or volume level.
Access control – permissioning across all protocols; integrates with Active Directory, NIS, and more.
Resilience – can survive the failure of up to four disks or nodes.
With PowerScale, we're harnessing years of OneFS development and enterprise-class infrastructure experience to deploy on industry-leading PowerEdge servers. Our pursuit continues with efforts currently underway to provide touchless deployment and to remove significant hardware constraints, which will help us realize our ultimate software-defined goal: deployment on PowerScale-compatible compute platforms on prem or in the cloud.
The PowerScale Family
PowerScale is now the family brand that encompasses what you’ve known as Isilon since it first shipped in 2005. OneFS residing on the current Isilon 6th Generation Nodes – or on the newly introduced PowerScale Nodes (F200 and F600) – now constitutes the new PowerScale Family. At Dell Technologies, we’re doubling down on OneFS software, which has driven the solution since its inception. OneFS is still the core software that lives on all hardware in the portfolio – whether it’s Isilon chassis-based nodes, PowerEdge-based nodes, future platforms based on PowerEdge, or in the cloud.
The introduction of PowerScale marks the next evolution of OneFS, providing access to the full power of Dell Technologies and enabling us to exploit hardware advancements faster than ever before. Today, the Isilon Gen 6 nodes still provide incredible density, performance, and a wide range of options, whatever your unstructured data needs. The PowerScale F200 and F600 nodes provide compact all-flash and NVMe-based options that expand the reach of OneFS into lower storage capacities while providing the higher bandwidth performance often required in emerging use cases and verticals, such as Media and Entertainment.
One of the most important considerations regarding unstructured storage – if not THE most important – is to ensure complexity, risk, and administrative overhead don't increase while scaling performance and capacity.
It must start simple and stay simple, at ANY scale.
This is the central challenge for all enterprise-class storage systems being used to manage datasets that grow at exponential rates.
If administrative effort and complexity increase as your storage environment grows, you’re assuming risk both from a cost and complexity standpoint that can often spiral out of control.
Likewise, it's easy to back yourself into a corner, limiting your ability to quickly adapt to a new business need. Let's look at the key features of OneFS that address these concerns.
We’ll start with the architecture of a OneFS cluster.
We begin with a node: a minimum of three nodes forms a cluster. All nodes connect over an Ethernet or InfiniBand backend, and together they present a single volume and a single file system.
This three-node cluster may easily be expanded from three to 252 nodes, still with a single volume and single file system.
Each node has front-end network connectivity…
…and all nodes can serve data to clients…
…over any of the supported protocols (i.e., SMB, NFS, HTTP, FTP, HDFS, RAN, SWIFT, S3).
Architecturally, this looks very much like what you’d see in an HPC cluster.
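To make that picture concrete, here's a minimal sketch (not a definitive reference) of querying a cluster's node list over the OneFS Platform API (PAPI). The cluster address, credentials, endpoint path and version, and response fields are all assumptions you should verify against the API documentation for your OneFS release.

```python
# Hedged sketch: listing the nodes that back a single OneFS namespace via the
# Platform API. Endpoint path/version and response fields are assumptions.
import requests

CLUSTER = "https://cluster.example.com:8080"   # hypothetical cluster address
AUTH = ("admin", "password")                   # use a dedicated service account in practice

resp = requests.get(
    f"{CLUSTER}/platform/3/cluster/nodes",     # assumed endpoint and API version
    auth=AUTH,
    verify=False,                              # lab clusters often use self-signed certificates
)
resp.raise_for_status()

# Every node serves the same single volume and file system, so clients can be
# directed at any front-end interface the cluster exposes.
for node in resp.json().get("nodes", []):
    print(node.get("lnn"), node.get("id"))
```

The point is simply that any node is an equal entry point into the same file system, which is what gives the architecture its HPC-like feel.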
Single Namespace Across a Single Filesystem
So let's talk a bit about how OneFS leverages a single namespace and a single filesystem.
In OneFS, there is always ONLY one filesystem; you have just one to manage, whether the cluster has three nodes or the maximum of 252.
In most systems, as you scale, there are more points of management that administrators must deal with which ultimately introduces risk and complexity.
This highlights how other systems typically start with something more complex (i.e., multiple controllers and volumes) to make it appear as a single large system to the user.
Consequently, these systems require additional layers of software to glue them together, creating even more complexity.
Over time, systems designed like this become inherently unmanageable.
In contrast, OneFS starts with something incredibly simple – a single directory structure evenly stretched across all nodes…
And can be shared as one drive letter…
Or as any number of shares and exports, depending on application need – still based on this single file system, as sketched below…
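As a rough illustration of that flexibility, the sketch below creates an SMB share and an NFS export over the same directory of the single /ifs file system using the Platform API. The endpoint versions and payload fields are assumptions, not a definitive reference.

```python
# Hedged sketch: exposing one directory of the single file system as both an
# SMB share and an NFS export. Endpoints and payload fields are assumptions.
import requests

CLUSTER = "https://cluster.example.com:8080"   # hypothetical cluster address
PATH = "/ifs/data/projects"                    # one directory, many access points

session = requests.Session()
session.auth = ("admin", "password")
session.verify = False                         # lab-style self-signed certificate

# SMB share over the directory (assumed endpoint and payload)
session.post(f"{CLUSTER}/platform/1/protocols/smb/shares",
             json={"name": "projects", "path": PATH}).raise_for_status()

# NFS export over the same directory (assumed endpoint and payload)
session.post(f"{CLUSTER}/platform/2/protocols/nfs/exports",
             json={"paths": [PATH]}).raise_for_status()
```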
Persistence Across Generations
One important point to remember is that the software configuration of OneFS – the filesystem, the shares or exports created, along with the features and how they’re configured – all exist independently of the hardware.
In other words, while running and serving data to clients and applications, you can completely swap the underlying hardware without ever needing to change the software configuration.
When you hear that there are no data migrations with PowerScale, this configuration persistence underpins how we accomplish that design goal.
This is a concept that has been ingrained in our technology since its inception.
We have customers who have been running the same ‘instance’ of OneFS since 2005.
OneFS itself is updated, giving access to new features and performance enhancements over time, but the data was never migrated in the traditional sense; that is, there was no reconfiguration required, and applications kept accessing data with no awareness of the transition happening behind the scenes.
Nodes were added, and the old nodes were failed-out from generation to generation.
Quotas and ‘Thin Provisioning’
When you have a single large filesystem at petabyte scale, most systems expose the full capacity of the containing volume to clients whenever a share is created – and in many cases you don't want that. Our quota implementation is unique in that it serves three vital purposes:
Capacity Limiting and Masking – You can set quotas on a user, group, or directory. When applying quotas to a directory (or subdirectory), you have the option to mask the capacity of any share created on that directory. So, if you set a 10TB quota on directory A (and the directory is shared), any client or application sees only 10TB of total available capacity. Additionally, you can apply a 'default' directory quota to any subdirectory, and it will apply those same settings to all subdirectories. This matters on a large shared system, because some applications don't respond well when a share appears to have free capacity yet a hard quota blocks additional writes – in fact, this is a common cause of data loss. There are three basic quota types: Hard (with the option to also mask capacity); Soft, which alerts and then allows writes to continue for a specified grace period; and Advisory, which only alerts and never restricts. (A minimal API sketch follows this list.)
Chargeback and Capacity Accounting – Many customers also use quotas for showback or chargeback. It's very easy to create a report that shows the usage of each directory against its assigned quota; you can access it via API or CLI, or even have an XML file generated with the details. You can also choose how consumed capacity is reported for chargeback purposes: logical space consumed only, or total capacity used including parity. And you can choose whether to include snapshots in the quota calculations.
Capacity-Based Alerting – You can also configure alerts in a very flexible way, emailed to the user, the administrator, or both as thresholds are passed.
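Here's a minimal sketch of the masking behaviour described above, assuming the Platform API quota endpoint: a 10TB hard quota on a directory with the container option set, so any share on that directory reports 10TB of total capacity to clients. Field names, units, and the endpoint path are assumptions to verify against your OneFS documentation.

```python
# Hedged sketch: a 10TB directory hard quota that also masks capacity, plus an
# advisory threshold for alerting. Endpoint and field names are assumptions.
import requests

CLUSTER = "https://cluster.example.com:8080"   # hypothetical cluster address
TEN_TB = 10 * 1024**4                          # assuming thresholds are expressed in bytes

quota = {
    "path": "/ifs/data/directoryA",
    "type": "directory",
    "enforced": True,                          # hard enforcement rather than advisory-only
    "container": True,                         # mask capacity: clients see 10TB total
    "include_snapshots": False,                # choose whether snapshots count toward usage
    "thresholds": {
        "hard": TEN_TB,
        "advisory": int(TEN_TB * 0.8),         # alert-only threshold at 80%
    },
}

resp = requests.post(f"{CLUSTER}/platform/1/quota/quotas",
                     json=quota, auth=("admin", "password"), verify=False)
resp.raise_for_status()
```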
Implementing a Snapshot with OneFS is straightforward.
You simply select the directory you want to snap; define a policy which includes how frequently the snapshot should occur and how long it should be maintained; and you’re done.
There is a per-directory default of 1,024 snapshots, and there are guidelines for the total number of snapshots on a cluster, but this is not a hard limit. The guidelines are soft limits that provide useful guardrails to ensure good design practices are considered. If your snapshot needs exceed them, we'll work with you to design snapshot policies that best meet your requirements without adding unnecessary overhead.
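As a sketch of what such a policy might look like through the Platform API (the schedule grammar, retention field, and endpoint are assumptions), this creates an hourly snapshot schedule on one directory with seven days of retention:

```python
# Hedged sketch: snapshot one directory every hour and keep each snapshot for
# seven days. Schedule grammar, fields, and endpoint are assumptions.
import requests

CLUSTER = "https://cluster.example.com:8080"   # hypothetical cluster address

schedule = {
    "name": "projects-hourly",
    "path": "/ifs/data/projects",              # the directory to snap
    "schedule": "every 1 hours",               # how frequently to take the snapshot
    "duration": 7 * 24 * 60 * 60,              # retention, assumed to be in seconds
    "pattern": "projects_%Y-%m-%d_%H-%M",      # naming pattern for each snapshot
}

resp = requests.post(f"{CLUSTER}/platform/1/snapshot/schedules",
                     json=schedule, auth=("admin", "password"), verify=False)
resp.raise_for_status()
```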
There is also considerable flexibility when it comes to recovering data from snapshots. For example, OneFS provides:
Administrators with the ability to centrally control large shared folders;
End users with the ability to recover data in their home shares; and,
Applications with the ability to recover data via an API call – completely touchless.
Replication
Let’s take a deeper look at our replication technology, SyncIQ, and how it simplifies management and automatically leverages all the nodes within a cluster.
Just like the other OneFS software features, you manage directories, and replication is no different. Within a single directory tree, you select the directory (at any level in the hierarchy), define a policy that includes how often you want it to synchronize, select the target directory and which pool of nodes it should use, and specify any throttles you want to implement. That's it!
From there, OneFS will automatically break the replication job into threads and spread them across all the nodes that are participating in the replication job. The throttles are important, because we generate a tremendous amount of bandwidth when all nodes are participating. You can throttle and control impact based on two dimensions: Network Bandwidth (MB/sec) or CPU Throttling (Files/sec).
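To ground that workflow, here's a hedged sketch of a SyncIQ policy and a bandwidth throttle created through the Platform API. The endpoints, field names, schedule grammar, and throttle units are assumptions; treat it as an outline rather than a reference implementation.

```python
# Hedged sketch: replicate one directory to a target cluster on a schedule, and
# cap replication bandwidth during business hours. Fields are assumptions.
import requests

CLUSTER = "https://cluster.example.com:8080"   # hypothetical source cluster
AUTH = ("admin", "password")

policy = {
    "name": "projects-to-dr",
    "action": "sync",
    "source_root_path": "/ifs/data/projects",
    "target_host": "dr-cluster.example.com",   # hypothetical target cluster
    "target_path": "/ifs/data/projects",
    "schedule": "every 1 hours",               # how often to synchronize (assumed grammar)
}
requests.post(f"{CLUSTER}/platform/1/sync/policies",
              json=policy, auth=AUTH, verify=False).raise_for_status()

# Global throttle rule: limit replication bandwidth on weekdays 08:00-18:00.
rule = {
    "type": "bandwidth",                       # the other throttle dimension would be files/sec
    "limit": 100000,                           # assumed to be in KB/s
    "schedule": {
        "begin": "08:00",
        "end": "18:00",
        "days_of_week": ["monday", "tuesday", "wednesday", "thursday", "friday"],
    },
}
requests.post(f"{CLUSTER}/platform/1/sync/rules",
              json=rule, auth=AUTH, verify=False).raise_for_status()
```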
Efficiency: Higher storage utilization (85% on average), reduced overprovisioning, and lower overall cost per GB with PowerScale.
Management: Simplicity and automation improve administrator efficiency, resulting in one administrator managing 5 PB of data (vs. 500 TB before).
Data Center Cost: Improved storage efficiency, density, and cloud capability of PowerScale reduce data center space needs by up to 28 racks.
Business Value Add: PowerScale’s performance, resilience, and scalability provide $1 million per year in incremental business growth by Year 3.
Flexibility: Using PowerScale Data Lake, SmartDedupe, and inline data reduction results in additional cost efficiencies.
With OneFS we have always seamlessly migrated between generations of hardware. OneFS can grow the namespace and filesystem dynamically and automatically.
This enables users and applications to continue uninterrupted access to the data while nodes and capacity are being added to the cluster.
This is equally important for retiring nodes and for in-place migrations: OneFS can shrink the namespace and filesystem, again without disruption to users and applications.
Our multi-cloud offering makes use of Faction to achieve the benefits of a public cloud while keeping data secure and under the customer's control. By hosting the data in a co-location facility with direct connections into AWS, Azure, Google Cloud Platform, or Oracle Cloud, the data stays outside the public cloud but can quickly connect to public cloud compute services to perform complex calculations and much more.
Customers can now deploy PowerScale OneFS software natively in GCP. This empowers customers to take full advantage of our leading file storage from a familiar hyperscale cloud environment, and our native file offering in GCP now starts at a lower 25TB entry point.
Insights @ Scale
When you have data distributed across many locations – edge, cloud, on prem – and across systems using many protocols (both file- and object-based), you need a way to gain full visibility into your data footprint regardless of where the assets sit. Equally important is automated, intelligent system health information that you can access from anywhere.
To meet these needs, PowerScale leverages three key tools: DataIQ, InsightIQ, and CloudIQ.
DataIQ is a product born out of our acquisition of Data Frameworks. It provides a single-pane-of-glass view into all your unstructured data assets, regardless of where they live, and it works just as well with Dell EMC technology as with third-party and even cloud locations. It supports both file (NFS, SMB) and S3 object protocols to give a complete view of your unstructured data estate. You can easily add cost profiles to each storage target to gain insight into your costs for storing content, and you can tag data as it's scanned to give a cross-platform view of assets related to a specific project, department, application, or workflow.
DataIQ allows end users to understand where their content is located and how much it costs to store; it also provides the ability to move that data to a lower cost archive location directly from the DataIQ interface. This requires no administrator intervention and users can be empowered to take control of their own assets.
InsightIQ has long served as a performance and file system analytics tool, providing performance insights from the client through the network and cache, all the way to the disk within a node in a cluster.
Over time, these analysis capabilities will be moving to DataIQ, but in the interim, this free tool for our customers still has much value to provide.
Delivered by a desktop or mobile app, CloudIQ provides actionable insights on infrastructure health anywhere, anytime for our customers. These insights include predictive analysis tools for capacity consumption, performance anomaly detection and impact analysis across the customer’s Dell infrastructure.
We've had several customer successes with PowerScale, but we've picked out a few recent ones that stand out.
The University of Pisa develops, tests, and runs innovative algorithms through its HPC labs.
Like a Photo completed two animated features two weeks early by transforming its business to remote work, and realized a 20x increase in read/write performance.
Nature Fresh Farms is enabling greenhouse innovation for fresher produce, accelerated with AI, robotics, and automation.
Best of all, our industry professionals and best practices are available for every single deployment. Here’s what we can do for you:
We assess your existing business objectives and infrastructure
We help you plan, design and configure an architecture that meets said objectives
We help you implement the solution and assist in deployment
We monitor the success criteria to ensure the infrastructure is running as expected
We continually advise your teams on how to tune and optimize the infrastructure
And ultimately, we help you innovate with data and uncover new opportunities to utilize data within your business
But this is just the beginning. In the next post, we are going to discuss the —redacted—. Stay tuned!