Dell EMC Streaming Data Platform: Architecture, Configuration, and Considerations

Dell EMC™ Streaming Data Platform (SDP) is a scalable solution used to ingest, store, and analyze streaming data in real time. This paper describes the solution components, logical and physical infrastructure, configuration details, and considerations for selecting and deploying a solution.

The Internet of Things (IoT) brings the promise of new possibilities, but to unlock them, organizations must change how they think about data. With the emergence of IoT, there is a new class of applications that processes streaming data from sensors and devices spread around the globe. In theory, the solution is simple: turn massive amounts of data into real-time insights by immediately processing and analyzing it in a continuous and infinite fashion. However, managing streaming IoT data is not that simple. Legacy infrastructure is not built to support IoT data streaming from millions of data sources with varying data types. The world of streaming IoT requires a shift to real-time applications consuming continuous and infinite streams.

Today, there are hundreds of applications trying to solve different pieces of the IoT puzzle. This makes it difficult to build a full, end-to-end solution, because the applications keep changing, have varying interoperability requirements, and require their own infrastructure. Managing such a complex system is costly, time-consuming, and maintenance-heavy. Dell EMC Streaming Data Platform is designed to solve these problems. It is an enterprise solution designed to address a wide range of use cases by simplifying the infrastructure stack. The solution described in this document helps customers bring the Operational Technology (OT) and IT worlds back together by providing the following key features:


• Ingest data, whether it is IoT data, sensor data, video data, log files, or high-frequency data, at the edge or in the core data center.
• Analyze data in real time or in batch, and alert based on that specific data set.
• Centralize data from the edge to the core data center for analysis or model development (training).

Streaming Data Platform is an elastically scalable platform for ingesting, storing, and analyzing continuously streaming data in real time. The platform can concurrently process both real-time and collected historical data in the same application. Streaming Data Platform ingests and stores streaming data from a range of sources. These sources can include IoT devices, web logs, industrial automation, financial data, live video, social media feeds, applications, and event-based streams. The platform can process millions of data streams from multiple
sources while ensuring low latencies and high availability. The platform manages stream ingestion and storage, and hosts the analytic applications that process the streams. It dynamically distributes data processing and analytical jobs over the available infrastructure. Also, it dynamically and automatically scales resources to satisfy processing requirements in real time as the workload changes. Streaming Data Platform integrates the following capabilities into a single software platform:


• Stream ingestion: The platform ingests all types of data, whether static or streaming, in real time. Even historical files of data, when ingested, become bounded streams of data.
• Stream storage: Elastic tiered storage provides instant access to real-time data and effectively infinite storage of, and access to, historical data. This loosely coupled long-term storage is what enables an unbounded digital video recorder (DVR) for all streaming data sources.
• Stream analytics: Real-time stream analysis is possible with an embedded analytics engine. Analyzing historical and real-time streaming data is unified, which simplifies the application development process.

• Real-time and historical unification: The platform can process real-time and historical data, create and store new streams, send notifications to enterprise alerting tools, and send output to third-party visualization tools.
• Platform management: Integrated management provides data security, configuration, access control, resource management, an intuitive upgrade process, health and alerting support, and network topology oversight.
• Run-time management: A web portal lets users configure stream properties, view stream metrics, run applications, and view job status.
• Application development: APIs are included in the distribution. The web portal supports application deployment and artifact storage.

In summary, the platform stores continuously streaming data, analyzes it in real time, and supports historical analysis of the stored streams. A minimal ingestion example follows.
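To make the ingestion model concrete, the following minimal sketch writes one event to a Pravega stream with the open-source Pravega Java client. The controller URI, scope name (iot-project), and stream name (sensor-readings) are placeholders, the scope and stream are assumed to already exist, and an SDP deployment additionally requires Keycloak-based client authentication that is omitted here.

    import java.net.URI;
    import io.pravega.client.ClientConfig;
    import io.pravega.client.EventStreamClientFactory;
    import io.pravega.client.stream.EventStreamWriter;
    import io.pravega.client.stream.EventWriterConfig;
    import io.pravega.client.stream.impl.JavaSerializer;

    public class SensorWriter {
        public static void main(String[] args) {
            // Placeholder controller endpoint; real deployments also need auth settings.
            URI controller = URI.create("tcp://pravega-controller:9090");
            ClientConfig config = ClientConfig.builder().controllerURI(controller).build();

            try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("iot-project", config);
                 EventStreamWriter<String> writer =
                     factory.createEventWriter("sensor-readings",
                                               new JavaSerializer<>(),
                                               EventWriterConfig.builder().build())) {
                // The routing key groups related events onto the same segment,
                // which preserves their ordering.
                writer.writeEvent("sensor-42", "{\"sensor\":42,\"temp\":21.7}").join();
            }
        }
    }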

The Streaming Data Platform architecture contains the following key components:
• Pravega: Pravega is an open-source streaming storage system that implements Streams as a first-class primitive for storing and serving continuous and unbounded data. This open-source project is driven and designed by Dell Technologies. See the Pravega site for more information.
• Unified Analytics: SDP includes the following embedded analytics engines for processing data streams:
  - Apache Flink®: Flink is a distributed computing engine for processing large-scale unbounded and bounded data in real time. Flink is the main component for performing streaming analytics in the Streaming Data Platform (a minimal job sketch follows this list). Flink is an open-source project from the Apache Software Foundation.
  - Apache Spark™: Spark is a unified analytics engine for large-scale data processing. SDP ships with images for Apache Spark.
  - Pravega Search (PSearch): PSearch provides search functionality against Pravega streams.
• Kubernetes: Kubernetes (K8s) is an open-source platform for container orchestration. K8s is distributed in two flavors: Kubespray for edge deployments and Red Hat OpenShift for core deployments.
• Management platform: The management platform is Dell Technologies™ proprietary software. It integrates the other components and adds security, performance, configuration, and monitoring features. It includes a web-based user interface for administrators, application developers, and end users.
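To illustrate how an embedded Flink engine consumes a Pravega stream (as referenced in the Flink item above), the sketch below uses the open-source Pravega Flink connector DataStream API. It is a generic, minimal example rather than SDP-specific code: the scope and stream names are placeholders, and an SDP project injects its own controller and security settings into the job at deployment time.

    import io.pravega.connectors.flink.FlinkPravegaReader;
    import io.pravega.connectors.flink.PravegaConfig;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SensorStreamJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Placeholder scope; on SDP this matches the analytics project name.
            PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                    .withDefaultScope("iot-project");

            // Source that reads the sensor-readings stream as UTF-8 strings.
            FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                    .withPravegaConfig(pravegaConfig)
                    .forStream("sensor-readings")
                    .withDeserializationSchema(new SimpleStringSchema())
                    .build();

            env.addSource(source)
               .map(event -> "processed: " + event)  // real jobs parse, window, and aggregate
               .print();

            env.execute("sensor-stream-job");
        }
    }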

Streaming Data Platform supports both edge and core deployment options. These two deployment options allow customers to stream data from the edge to the core. Figure 1 shows a high-level overview of the Streaming Data Platform architecture streaming data from the edge to the core.
SDP Edge is a small-footprint deployment. Deploying SDP at the edge, where the data is generated, has the advantage of local ingestion. In addition, SDP Edge can process, filter, or enrich the collected data at the edge, rather than sending all data upstream to the core. SDP Edge can be deployed on a single physical node for edge sites that do not require high availability (HA), or on three physical nodes for edge sites that require HA.
SDP Core provides all the advantages of on-premises data collection, processing, and storage. It provides real-time data ingestion and also accepts data collected by SDP Edge and streamed up to the core. Deployments can start with a minimum of three nodes and expand to up to 12 nodes, with built-in scaling of the added resources.



SDP Edge Architecture Overview


Note: SDP Edge only supports Dell EMC PowerScale systems for Long-Term Storage (LTS).


SDP Core Architecture Overview

Stream definition and scope
Pravega organizes data into Streams. According to the Pravega site, a Stream is a durable, elastic, append-only, unbounded sequence of bytes. Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage. When a user creates a stream in Pravega, they give it a name such as JSONStreamSensorData to indicate the type of data it stores.

Pravega organizes Streams into Scopes. A Pravega Scope provides a secure namespace for a collection of streams and can contain multiple streams. Each Stream name must be unique within the same Scope, but identical Stream names can exist in different Scopes. A Stream is uniquely identified by its name and the Scope it belongs to. Clients can append data to a Stream (writers) and read data from the same Stream (readers).

Within Streaming Data Platform, a Scope is created in the UI by creating an analytics project. A Pravega Scope is automatically created when the analytics project is created. The name of the Pravega Scope is inherited from the analytics project name, so choose the name carefully; both names are identical. In previous SDP versions, each analytics project was associated with a single Pravega scope, which meant that each project was completely isolated from the others. Since SDP version 1.2, a new feature allows members of a project to read Pravega streams from a different project. This helps data scientists who want to share streams between multiple projects without duplicating the data. This new feature is called Cross Project Pravega Scope Sharing.
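On a standalone Pravega cluster, a Scope and Stream can also be created programmatically with the Java client, as in the minimal sketch below. The controller URI and names are placeholders; on SDP, the Scope already exists because it is created automatically with the analytics project, and Keycloak-based authentication (omitted here) is required.

    import java.net.URI;
    import io.pravega.client.admin.StreamManager;
    import io.pravega.client.stream.ScalingPolicy;
    import io.pravega.client.stream.StreamConfiguration;

    public class CreateSensorStream {
        public static void main(String[] args) {
            // Placeholder controller endpoint.
            URI controller = URI.create("tcp://pravega-controller:9090");

            try (StreamManager streamManager = StreamManager.create(controller)) {
                // On SDP this scope is created for you and carries the project name.
                streamManager.createScope("iot-project");

                // Auto-scale by event rate: target 100 events/s per segment,
                // scale factor 2, minimum of 1 segment.
                StreamConfiguration config = StreamConfiguration.builder()
                        .scalingPolicy(ScalingPolicy.byEventRate(100, 2, 1))
                        .build();

                streamManager.createStream("iot-project", "JSONStreamSensorData", config);
            }
        }
    }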



Streaming Data Platform
This section provides an overview of the Streaming Data Platform and its components: Pravega, Flink, Spark and Pravega Search.


Pravega
Pravega is deployed as a distributed system; it forms the Pravega cluster inside Kubernetes. The Pravega architecture is a software-defined storage (SDS) architecture formed by Controller instances (the control plane) and Pravega Servers, also known as Pravega Segment Stores (the data plane). Figure 4 illustrates an overview of the default architecture. Most of the components can be customized, such as the volume size or the number of replicas per stateful set or replica set.



Pravega Operator
The Pravega Operator is a software extension to Kubernetes. It manages Pravega clusters and automates tasks such as creating, deleting, or resizing a Pravega cluster. Only one Pravega Operator is required per instance of Streaming Data Platform. For more details about Kubernetes operators, see the Kubernetes page Operator pattern.


Bookkeeper Operator
The Bookkeeper Operator manages Bookkeeper clusters deployed to Kubernetes and automates operational tasks such as creating and destroying a Bookkeeper cluster, resizing the cluster, and performing rolling upgrades.


Zookeeper Operator
The Zookeeper Operator manages the deployment of Zookeeper clusters in Kubernetes.


Pravega service broker
The Pravega service broker creates and deletes Pravega Scopes. It also registers them as protected resources in Keycloak along with related authorization policies.
Pravega Controller
The Pravega Controller is a core component of Pravega that implements the Pravega control plane. It acts as the central coordinator and manager for various operations performed in the Pravega cluster, such as creating, updating, sealing, scaling, and deleting streams. It is also responsible for distributing load across the different Segment Store instances. The set of Controller instances forms the control plane of Pravega. They extend the functionality to retrieve information about Streams, monitor the health of the Pravega cluster, gather metrics, and perform other tasks. Typically, multiple Controller instances (at least three are recommended) run in a cluster for high availability.


Pravega Segment Store
The Segment Store implements the Pravega data plane. It is the main access point for managing Stream Segments, which enables creating and deleting content. The Pravega client communicates with the Pravega Stream Controller to identify which Segment Store must be used. Pravega Servers provide the API to read and write data in Streams. Data storage includes two tiers:
• Tier 1: This tier provides short-term, low-latency data storage, guaranteeing the durability of data written to Streams. Pravega uses Apache Bookkeeper™ to implement tier 1 storage. Tier 1 storage typically runs within the Pravega cluster.
• Long-Term Storage (LTS): This tier provides long-term storage for Stream data. Streaming Data Platform supports Dell EMC Isilon and Dell EMC ECS to implement Long-Term Storage. LTS is commonly deployed outside the Pravega cluster. The number of Segment Store instances is customizable and can be scaled depending on the workload.


Pravega Zookeeper
Pravega uses Apache Zookeeper™ to coordinate with the components in the Pravega cluster. By default,
three Zookeeper servers are installed.


Pravega InfluxDB
Pravega InfluxDB is used to store Pravega metrics.


Pravega Grafana
Pravega Grafana dashboards show metrics about the operation and efficiency of Pravega.


Pravega Schema Registry
Pravega Schema Registry is the latest service offering from the Pravega family. The registry service is designed to store and manage schemas for the unstructured data stored in Pravega streams. The service is not limited to data stored in Pravega and can serve as a general-purpose management solution for storing and evolving schemas in a wide variety of streaming and non-streaming use cases. Schema Registry provides a RESTful interface to store and manage schemas under schema groups. Users can safely evolve their schemas within the context of a schema group, based on the desired schema compatibility policy configured at the group level. For more details about Schema Registry, see the official repository.


Pravega Bookkeeper
Pravega uses Apache Bookkeeper. It provides short-term, low-latency data storage, guaranteeing the
durability of data written to Streams. In deployment, use at least five bookkeepers (bookies): three bookies for
a quorum plus two bookies for fault-tolerance. By default, three replicas of the data must be kept in
Bookkeeper to ensure durability.
Table 1 describes the four parameters in Bookkeeper that are configured during the Streaming Data Platform installation. For more details, refer to the Installation and Administration Guide.
Bookkeeper parameters

• bookkeeper replicas: The number of bookies needed in the cluster.
• bkEnsembleSize: The number of nodes the ledger is stored on, where bkEnsembleSize = bookkeeper replicas - F and F is the number of bookie failures tolerated. For instance, to tolerate two failures, at least three copies of the data are needed (bkEnsembleSize = 3). To enable two faulty bookies to be replaced, instantiate two additional bookies, for a total of five bookkeeper replicas.
• bkWriteQuorumSize: The number of replicas of the data kept to ensure durability.
• bkAckQuorumSize: By default, bkAckQuorumSize == bkWriteQuorumSize, so the platform waits for acknowledgment from all bookies on a write before proceeding to the next write.
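The relationship between these parameters can be sanity-checked with simple arithmetic. The snippet below is purely illustrative and mirrors the five-bookie example from the table.

    public class BookkeeperSizing {
        public static void main(String[] args) {
            int bookkeeperReplicas = 5;  // bookies deployed in the cluster
            int toleratedFailures  = 2;  // F, the number of bookie failures to survive

            int bkEnsembleSize    = bookkeeperReplicas - toleratedFailures; // 3 nodes per ledger
            int bkWriteQuorumSize = 3;                  // replicas of each entry for durability
            int bkAckQuorumSize   = bkWriteQuorumSize;  // default: every replica must acknowledge

            System.out.printf("ensemble=%d, writeQuorum=%d, ackQuorum=%d%n",
                    bkEnsembleSize, bkWriteQuorumSize, bkAckQuorumSize);
        }
    }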

 

The entire white paper can be downloaded by clicking the screenshot below.

There is also a YouTube session on this same topic.
