With the advent of things like smartphones, drones, and security cameras, much of the data that exists today is naturally a stream. ‘Stream’ means that there is a constant flow, with no beginning and no end. It’s in continuous motion.
Historically, we break up that stream into files, or objects (like “last month’s data” or “last week’s data”), but when we do that, we lose significant value in that data by trying to simplify the ingestion, storage and analysis of it.
The key component of the Streaming Data Platform is this concept of a stream – and the fact that we’re consuming it, storing it, and analyzing it in its natural form – as a stream.
Data-first – having the information where you need it, at all times. Data is valuable by itself, but more so if you know how to access what you need, as soon as you need it so that you can utilize it immediately in the ways that you need – that’s where the real value lies.
With this change in the makeup of our data pool, we are looking not only at technical problems but also at data problems. With streaming data, you want to create a foundation that focuses on data first, but how do you do that when this new form of data is going to make up such a large share of your data? You need to think about this now because it’s coming down the pipeline and you want to be prepared – not only on time but ahead of the curve, so that you can get to your business insights faster and better than your competitors.
Source: IDC. Data Age 2025: The Evolution of Data to Life-Critical. 2017. https://assets.ey.com/content/dam/ey-sites/ey-com/en_gl/topics/workforce/Seagate-WP-DataAge2025-March-2017.pdf
With the rise of streaming data, many organizations have begun to move towards harnessing the potential for innovation found within it.
- Unfortunately, this typically requires standing up a new infrastructure to try to gain information faster – not necessarily always in real time, but down to seconds.
- This new infrastructure means implementing, managing, securing, and scaling multiple (now siloed) infrastructures, making the overall system complex and hard to manage.
- It also creates data inconsistency risks – as multiple infrastructures mean that no one platform is managing the data ingestion and ensuring every piece of data is counted once, and only once.
- Multiple infrastructures also cause issues for the application developers, who now have to build the same application (and manage that application) for multiple data streams in order to get the same results that they could be getting with one, unified and consistent data stream.
- Scaling is also an issue for the same reason – it needs to happen for each pipeline.
- And security is complex because of the multi-faceted approach, opening the organization up to unnecessary risk.
- “Lambda Architecture”
- Batch pipeline created first
- Eventually, information is needed faster, so additional pipelines are created
- Causes: massive waste in data duplication, hardware costs and maintenance issues
These do-it-yourself, piecemeal infrastructures are typically built using a ‘lambda architecture’ – where a batch pipeline is set up first using one or more of the following tools. Then, at some point, someone asks “Can we have this faster? Maybe by the minute, by the second or even faster?”
It is at this point that customers set up a second pipeline… which causes massive waste in data duplication, hardware costs and maintenance issues.
Besides the hardware/software management, the other group with the biggest issue is the developers and data scientists, who now have to write different code for the batch pipeline and the real-time pipeline, and deal with the added complexity of the different paths in the data pipeline.
The infrastructure managers, as well, now have to install, manage, secure, and manually scale and maintain each of those pipelines as data and demand grow.
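To make the duplicated-code problem concrete, here is a minimal Python sketch. It is not taken from any specific product; the names `batch_average`, `StreamingAverage` and `run_streaming` are hypothetical and simply stand in for the two code paths a lambda architecture makes a team maintain for one and the same metric.

```python
from typing import Iterable, List


def batch_average(readings: List[float]) -> float:
    """Batch path: needs the whole data set (e.g. last week's file) up front."""
    return sum(readings) / len(readings)


class StreamingAverage:
    """Real-time path: the same metric, re-implemented incrementally."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def update(self, reading: float) -> float:
        """Fold one new reading into the running average and return it."""
        self.count += 1
        self.total += reading
        return self.total / self.count


def run_streaming(readings: Iterable[float]) -> float:
    """Drive the real-time path over a sequence of readings."""
    avg = StreamingAverage()
    result = 0.0
    for reading in readings:
        result = avg.update(reading)
    return result
```

Both paths compute the same number, but any change to the metric has to be made, tested and deployed twice.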
The Streaming Data Platform combines pipelines in order to solve problems for both the infrastructure team as well as the developers.
These are the key pillars around which the Streaming Data Platform is built:
- Handles completely UNBOUNDED DATA
- Unifies the concept of historical and real-time data so that developers don’t have to code applications differently against live off-the-wire data vs. old data files. Like a DVR or TiVo, it pulls back past data and treats it as though it is live/real-time. Writing code that can analyze either a live stream OR old data that is streamed by as if it WERE live is what simplifies the process of building streaming applications on the Streaming Data Platform.
- The Streaming Data Platform is able to monitor ingestion and scale out across the entire cluster as demands increase and scale down when the needs are no longer high, all without developer input.
- Because of its ability to deliver exactly-once semantics, the platform is able to deliver consistency even in extremely high-volume situations, creating new best practices for organizations that are looking to innovate within their industries.
- By delivering the ability to unify both batch and streaming data, infrastructures can now combine both in a single pipeline, analyzing any data type side by side and recalling historical batch information to do so.
- The Streaming Data Platform is designed as an enterprise-ready solution, taking a cue from other Dell EMC solutions such as Isilon, on which customers can build mission-critical solutions.
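The DVR idea in the pillars above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the platform’s API: `ReplayableStream`, `read_from` and `running_max` are hypothetical names, and everything lives in memory. The point is only that a reader started at an old position runs through stored history and then keeps going into newly arriving events, so the application logic never has to distinguish the two.

```python
from typing import Iterable, Iterator, List


class ReplayableStream:
    """Toy append-only stream: stored history plus whatever arrives later."""

    def __init__(self) -> None:
        self._events: List[float] = []

    def append(self, event: float) -> None:
        """Add one event at the tail of the stream."""
        self._events.append(event)

    def read_from(self, position: int = 0) -> Iterator[float]:
        """Yield events from `position` up to the current tail.

        The tail is re-checked on every step, so events appended while
        the reader is still running are picked up as well.
        """
        while position < len(self._events):
            yield self._events[position]
            position += 1


def running_max(events: Iterable[float]) -> List[float]:
    """Application logic: it never asks whether an event is old or new."""
    out: List[float] = []
    best = float("-inf")
    for event in events:
        best = max(best, event)
        out.append(best)
    return out
```

A reader opened with `read_from(0)` replays everything; one opened at a later position only sees the more recent part, and both feed the same `running_max` function.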
The Streaming Data Platform is replacing DIY, piecemeal infrastructures with an out-of-the-box, pure software solution that enables organizations to address four key points.
- Minimizing time to insights by providing real-time analysis of massive amounts of streaming data.
- Simplifying infrastructure by removing redundant pipelines and clustered silos.
- Enabling high-volume use cases with auto-scaling capabilities and massive data ingest capabilities.
- Cost effectiveness – By unifying all of the data into one stream and reducing the need for duplication, the Streaming Data Platform reduces the time, management effort, security work, and money that the organization needs in order to gain more business intelligence.
Let me show you exactly what that looks like:
The Streaming Data Platform targets Batch and Real-time use-cases and also has the unique ability to ingest and analyze pure byte streams for those use-cases where that is ideal.
Originally, much of the analysis was done in batch (typically handled in the Hadoop platform). Part of the platform’s goal is to streamline platforms so that batch no longer has to be a separate pipeline.
Today, analysis is focused on high speed events (typically using a combination of Hadoop, for batch jobs, and Spark or Kafka for high speed events). Having these separate systems means reconfiguring the infrastructure when ingestion rates exceed what was originally planned for.
So, besides covering those two (batch and high speed events) – the engine inside the Streaming Data Platform, called Pravega, allows us to capture a pure byte stream – so we don’t need that concept of events or structured data coming down the pipeline, which is really what Spark and Kafka are all about. We can actually capture pure byte streams – so that opens up the potential for additional use cases and capabilities that many infrastructures can’t currently support. The Streaming Data Platform covers all of these and simplifies the infrastructure while also reducing the complexity of writing applications.
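As a rough illustration of what “pure byte stream” means here, the following Python sketch models a stream that stores raw bytes with no notion of events or records. It does not use or mirror the Pravega client API; `ByteStream`, `append` and `read` are hypothetical names chosen for the example.

```python
class ByteStream:
    """Toy append-only byte stream with no notion of events or records."""

    def __init__(self) -> None:
        self._data = bytearray()

    def append(self, chunk: bytes) -> int:
        """Append raw bytes and return the offset where the chunk starts."""
        offset = len(self._data)
        self._data.extend(chunk)
        return offset

    def read(self, offset: int, length: int) -> bytes:
        """Read any byte range, regardless of how the writer chunked it."""
        return bytes(self._data[offset:offset + length])

    def __len__(self) -> int:
        return len(self._data)
```

A reader can ask for bytes 2 to 3 even though they were written in two separate chunks, which is what makes this model a natural fit for data such as video that does not arrive as discrete events.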
Pravega is the key engine within the Streaming Data Platform. It’s an open-source platform created by Dell EMC that allows for the ingestion and storage of streaming data.
The platform also uses Flink for analytics: This allows data scientists, application developers, infrastructure managers, and more to analyze data in consolidated streams for easier and faster access to the insights they need. The Streaming Data Platform, as a whole, provides an open ecosystem of community-developed integrations and technologies. While that is Flink right now, it will also encompass Spark in version 2.
The platform itself is within the white box in the middle. At its core, again, is Pravega, and it uses Flink for analysis – but the platform itself provides an out-of-the-box, enterprise-grade, pure software platform to combine streams for more streamlined analysis and output. The out-of-the-box solution provides security, integration, efficiency and scalability that, before, required a manual process for each infrastructure. And the entire platform is, of course, serviced by Dell.
The software is built upon Kubernetes, which allows a wide variety of complementary solutions to run on the platform, similar to an OS.
You’ll also see, at the bottom, Tier 2 storage. While we’re using Isilon as the Tier 2 storage now, ECS will also be an available option in the next version. These tier 2 storage solutions provide near-infinite and transparent “DVR” streaming storage with instant recall into the Streaming Data Platform’s engine so that historical data can be analyzed alongside the real-time streaming data.
And finally, the output – next level business insights: Those business insights become valuable data for future organizational decisions and innovation that enables them to stay ahead of competitors and take full advantage of their massive sea of data – whether real-time or historical.
Our own customers have shared experiences with us that detail positive changes in their ROI and manageability, such as:
- While they’ve seen a reduction in their physical infrastructure, they also mentioned that streaming large amounts of high resolution industrial data through the network to the cloud datacenters is just not optimal, or even possible. The Streaming Data Platform made streaming those large amounts of data possible for them.
- With this particular beta customer, from an IT perspective, they quoted that 10 TB of class 1 storage (backed up, etc.) cost about $28k. Therefore, the duplication they were able to avoid was key to using capital efficiently to serve more internal users and projects.
- Finally, they saw a reduction in management through both what they needed to operate the software system, as well as the stack. This speaks volumes to the manageability and ease of use within the platform.
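The storage figure quoted above lends itself to a quick back-of-the-envelope calculation. The Python sketch below assumes the cost scales linearly with capacity, which the customer did not state; the function name and the 50 TB data set in the usage note are hypothetical.

```python
COST_PER_10_TB_USD = 28_000  # figure quoted by the beta customer


def duplicate_storage_cost(dataset_tb: float, extra_copies: int) -> float:
    """Cost in USD of the extra copies a second pipeline would keep.

    Assumes linear pricing, i.e. $2,800 per TB of class 1 storage.
    """
    cost_per_tb = COST_PER_10_TB_USD / 10
    return dataset_tb * extra_copies * cost_per_tb
```

Under that assumption, keeping one extra copy of a hypothetical 50 TB data set would cost `duplicate_storage_cost(50, 1)`, or $140,000, before any backup or management overhead.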
As discussed, the platform is very much not built around a specific industry or use case. The foundation of the platform was created in a way that makes it very flexible – allowing our customers to use it in whatever way works best for them, to solve whatever problems they may have.
Just as an example, many of the use cases that we’ve seen so far involve a combination of IoT sensor data, log files, images and video. Our use cases have such a wide range in terms of capabilities that we’re seeing beta sites using drone data to monitor the health of their cattle – and we’re also seeing the ingestion engine used to handle the ever-changing volume and capacity at SpaceX – and everything in between, with just a few examples shown here.
Industries from utilities to healthcare to retail to construction are all finding ways to utilize the platform to make their large data sets, along with their current streaming data, valuable on a daily basis by gathering new business insights.
Just to provide a little bit more detail into the cattle use case I just mentioned – I always find it helpful to talk about examples when discussing the platform because it helps people understand its value better:
That beta customer used drones in their field to check on the health of their cattle – the data streams in from that drone video feed to deliver information on any abnormalities in their activity, eating habits, or physical (visible) health and reports back with the information, in real-time – allowing the client to maintain healthy livestock easily, without having to assign a human resource to do so.
We’ve also seen a similar use case except the drones are checking the health of airplanes after they land and before they take off again – to detect any maintenance that needs to occur – again, alleviating a team of humans from having to check an entire airplane on foot.
Another example – and we’ll take a deeper look at this one in a minute – of how it can be used is in manufacturing – sensing different anomalies or abnormalities along a manufacturing line. The system can detect if a temperature is incorrect or a speed is too high – and, eventually, it can even auto-adjust that issue in order to keep things running smoothly – again, with very little human interaction.
We’ll take a more in-depth look at these use cases in just a minute.
And that is why it is so relevant to so many different industries – it can be used in so many ways to solve so many different problems. The platform really does provide a basic foundation for innovation and problem solving using real-time data.
In all of these use cases, we’ll see that their current systems are set up to use old, historical data, which delivers delayed results and produces only approximate estimations for attempted business and operational forecasting. To date, these systems have been considered ‘good enough’ and are likely close to what you see in your own infrastructure.
Then, we’ll take a look at what those same systems look like after the installation of the Streaming Data Platform. They are now able to analyze historical and streaming data together, receiving immediate results from the data streaming in, as that data streams in, and they have the accuracy guarantee of knowing that each data point is counted once and only once – meaning their results are much more reliable than an estimation.
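The difference between “counted once and only once” and an estimate can be shown with a small Python sketch. The talk does not describe how the platform achieves this guarantee; deduplicating by an event ID is just one familiar technique, used here only to show the effect. The `(event_id, amount)` layout and the function name are hypothetical.

```python
from typing import Iterable, Set, Tuple

Event = Tuple[str, float]  # (event_id, amount); both fields are hypothetical


def total_exactly_once(events: Iterable[Event]) -> float:
    """Sum amounts while ignoring redelivered events (same event_id)."""
    seen: Set[str] = set()
    total = 0.0
    for event_id, amount in events:
        if event_id in seen:
            continue  # duplicate delivery, e.g. after a retry
        seen.add(event_id)
        total += amount
    return total
```

If event `a` is delivered twice because of a retry, a naive sum over-counts it, while the deduplicated total reflects each data point once.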
Let’s take a look at two specific use-cases – one of which is the manufacturing anomaly detection use case mentioned earlier, and the other being the use of drones at a construction site. You’ll see the themes just described appearing in both as a common thread, but with two very different and unique sets of outcomes. First, we’ll take a look at a manufacturing facility that is using the Streaming Data Platform to find anomalies along the production line and abnormalities in the product – working to eventually fix those problems as they arise, sometimes without any human interaction.
The original design of their platform was a cumbersome and wasteful “Lambda Architecture” with separate batch and real-time pipelines, which came with all of the typical issues described previously.
Typically, in a manufacturing use case like we see here (which, in this case, is sheet metal going through multiple stages of a process that has IoT sensors on the machines and cameras along the line), the goal is to reduce the number of anomalies and improve the quality of the product coming out at the end of the assembly.
A batch process was originally stood up by the customers along with a real-time process (not on purpose; they started with batch and then realized they needed results faster), resulting in the typical problems such as duplicate data, a complex infrastructure and a mismatch in the timing and quality of what comes out of batch as opposed to real-time.
The objective with this use case was to create a simpler infrastructure that auto-scales and makes development easier, using the “Kappa Architecture”. With the resulting process, everything that is coming in off the factory floor – all of that data, whether it’s images, video or IoT sensor data – gets ingested and thought of as streams. That data is automatically tiered off to tier 2 so they can both analyze it in-flight and play it back in the future for any additional analysis or comparison. This unified processing means that when applications are written for the streaming data coming off the live assembly line and then tested against last week’s data, the code is exactly the same – it’s no different at all.
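A minimal Python sketch of that last point follows, assuming simple threshold checks on temperature and speed. The `Reading` fields, the sensor names and both limits are hypothetical; a real line would use values from the process engineers and likely a more sophisticated model. What matters is that the detector takes any iterable of readings, so it cannot tell a live feed from a replay of last week’s data.

```python
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    sensor: str
    temperature_c: float
    speed_rpm: float


# Hypothetical limits; real values would come from the process engineers.
MAX_TEMPERATURE_C = 250.0
MAX_SPEED_RPM = 1200.0


def detect_anomalies(readings: Iterable[Reading]) -> Iterator[str]:
    """Yield one alert per limit a reading exceeds, in arrival order."""
    for r in readings:
        if r.temperature_c > MAX_TEMPERATURE_C:
            yield f"{r.sensor}: temperature {r.temperature_c} C too high"
        if r.speed_rpm > MAX_SPEED_RPM:
            yield f"{r.sensor}: speed {r.speed_rpm} rpm too high"
```

The same `detect_anomalies` call works on a list loaded from storage and on an iterator fed by live sensors, which is the property the Kappa-style design relies on.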
Connected cars are a great example of the efficacy of this platform. Connected cars, in development, must constantly ingest information and adjust accordingly. Let’s take a look at how that’s done.
Use cases where the analysis and/or consumption of the data is not done in one location still fit into the pipeline. Even where the data needs to go somewhere else, that data (in this case, the telemetry data along with streaming video) can be analyzed in the same way we saw in the last use case. But in this new model, the results created in the analysis phase can be generated and delivered onto, in this case, an autonomous vehicle, which goes out and captures more data.
So, this is a case where the vehicles are being used to train a new model. This is different than a connected car where you’re just consuming the data of live vehicles – although we do have use cases where that is done, as well – where you’re trying to leverage that live stream off of the multiple vehicles out there to enhance the intelligence for all of the vehicles out there. Both of those use cases can take advantage of all of the key components of the Streaming Data Platform.
You can see that the parallel ingestion streams of sensors and videos are analyzed and stored simultaneously. They are then sent to tier 2 storage where stream ‘DVR’ playback allows for easy testing of new code on historical data, so that the AI model of the car can continuously utilize that historical data alongside the real-time streaming data. In this case, there are also many different teams using the data for different pieces of the project. Once the data is unified and processed, securely isolated teams can leverage the same platform – knowing that their information is separate and secure.
Next, we’ll take a look at a streaming video analytics use case that uses drones to monitor facilities – in our case, in agriculture, but it could also be used in airlines, automotive, etc. We’ll see the live streaming data from the drones delivering data to the Streaming Data Platform for immediate information on the status of each necessary component.
When you look at other use cases, they all rhyme. In this example, we have drones in flight that are streaming 4K video, as well as other telemetry about temperature, location, and things like that. Again, those parallel unbounded streams coming into the system are all ingested in the exact same way – they’re all thought of as streams.
Afterwards, they’re all tiered off, and you get that exact same value from them. They are unified and sent to tier 2 storage with the ability to recall whenever necessary, with near infinite “DVR-like” capabilities of video streams and telemetry.
After that historical data is pulled back from tier 2 storage, the unified processing step unifies batch and real-time applications for easier development and maintenance across those multiple data types.
We’ve seen drones used for many different use cases – from monitoring the health of a herd of cattle to inspecting airplanes in between flights for maintenance needs.
Market data ingestion is no different: a lot of data needs to be processed, analyzed, reported (and acted upon) very quickly, and traditional engines cannot cope with the load in real time.
Our approach: guaranteed durability ensures exactly-once storage and analysis, so no transactions are lost or double counted. Unlimited retention with time transparency allows apps to leverage data live now or from months ago in the same way, and cross-system or multi-party transactions are guaranteed with the state synchronizer.
In creating and testing the Streaming Data Platform, we have found that there are four key components that make the Streaming Data Platform valuable and enterprise ready.
Much of the power of the Streaming Data Platform is its ability to focus on multiple use-cases within one organization. To that end, the platform allows for enterprise-level management capabilities – by allowing multiple groups to live simultaneously and harmoniously on the platform. So the finance team can do hourly, weekly, monthly batch jobs on the same infrastructure on which the engineering team is running IoT data streaming jobs.
So the teams can be created, the team members can be placed on the teams, and teams can build their applications independently of each other – and this is where I can keep engineering data separate from, say, finance data.
The auto-scaling is, of course, a big one. When data is coming in, this means being able to configure the system to say, ‘this stream needs to be dynamic’ and tell the system that it needs to monitor the ingestion rate and scale appropriately.
We make that easy to do and there are a lot of different visualizations that will show this scaling going on. Here, you can see the number of stream segments automatically increase/decrease as demands change to optimize performance and throughput.
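The scaling behavior described above can be approximated with a simple rule: keep enough segments that each one stays at or below a target ingestion rate. The Python sketch below is an assumption-laden simplification, not the platform’s actual policy, which the talk does not detail; `target_segments` and its parameters are hypothetical.

```python
import math


def target_segments(
    events_per_sec: float,
    target_rate_per_segment: float,
    min_segments: int = 1,
) -> int:
    """Number of segments needed so no segment exceeds the target rate.

    Never drops below `min_segments`, even when the stream is idle.
    """
    needed = math.ceil(events_per_sec / target_rate_per_segment)
    return max(min_segments, needed)
```

With a target of 100 events per second per segment, an ingest rate of 250 events per second gives 3 segments, a spike to 2,500 gives 25, and the count falls back towards the minimum once the rate drops.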
In the past, many of our clients had to shut the system down just to alter it for changes in their business or in the market, which is not a very efficient way to do this. We’ve seen this problem with infrastructure systems across many different industries. All have described this as a key issue they are experiencing, and it’s something that the Streaming Data Platform can solve.
For IT manageability, the value lies in the fact that an enterprise can have security, scalability and manageability. Organizations can manage this platform in the same way they’re used to managing their other infrastructure. This is quite different from a do-it-yourself Apache project that you stand up to solve one use-case; instead, it creates an infrastructure that the whole organization can count on and stand up applications on. The system allows you to not only create these teams but also control who gets to run these applications.
When you look down here below, you’ll see the retention policy. Within that tier 2, you get to control how much of that data is kept. For every stream, you can choose a different retention policy. So, you may have a stream coming in whose data you don’t need to keep – you’re just counting it as it comes in for some simple purpose, so you don’t need to retain any of it. You might have some data where you want to keep the last 500 GB of it – or you may retain it by an amount of time, like the last seven days or the last seven years. In this case, it’s set to infinite. From a manageability point of view, because the stream is no longer a pile of files or objects, the stream can have this attribute of how long it’s going to be retained. This allows organizations in industries where data must be kept for regulatory reasons to rest assured that not only can the system scale as they add more tier 2 storage, but that it can be configured so that each stream knows its own appropriate retention policy to support whatever the use case is.
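The per-stream retention choices described above (by size, by time, or infinite) can be modeled with a short Python sketch. This is an in-memory toy under stated assumptions, not the platform’s implementation; `RetainedStream`, `max_bytes` and `max_age_s` are hypothetical names, and truncation here simply runs on every append.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class RetainedStream:
    """Toy stream that trims its oldest events according to its own policy."""

    def __init__(
        self,
        max_bytes: Optional[int] = None,
        max_age_s: Optional[float] = None,
    ) -> None:
        self.max_bytes = max_bytes  # None means no size limit
        self.max_age_s = max_age_s  # None means no time limit
        self._events: Deque[Tuple[float, bytes]] = deque()
        self._size = 0

    def append(self, timestamp: float, payload: bytes) -> None:
        """Store one event, then drop whatever the policy no longer allows."""
        self._events.append((timestamp, payload))
        self._size += len(payload)
        while self._events and self._oldest_expired(now=timestamp):
            _, old = self._events.popleft()
            self._size -= len(old)

    def _oldest_expired(self, now: float) -> bool:
        oldest_ts, _ = self._events[0]
        too_big = self.max_bytes is not None and self._size > self.max_bytes
        too_old = self.max_age_s is not None and now - oldest_ts > self.max_age_s
        return too_big or too_old

    def payloads(self) -> List[bytes]:
        """Return the payloads currently retained, oldest first."""
        return [payload for _, payload in self._events]
```

Leaving both limits as `None` corresponds to the infinite setting shown on the slide; setting `max_bytes` or `max_age_s` gives the size-based and time-based variants, and each stream object carries its own values.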
The Streaming Data Platform is an analytics platform along with storage. Included out of the box are some monitoring tools, such as Grafana – in this case, monitoring InfluxDB measures of the reads and writes (or the publish and subscribe, depending on the terminology you want to use) on these streams, showing how the data is being consumed, ingested or created. You can see that per stream or per project – and that’s extremely valuable for these teams who are trying to solve their problems. They want to know how it’s being pulled. In many cases, a stream will come in and it will be consumed by many consumers. So, being able to monitor that, see the changes, and see the number of consumers is important while they’re writing their applications.
If you look historically at the batch market, or specifically the Hadoop market, the challenges that they have dealt with in many cases have come from the complexity of leveraging the data that is in those platforms. It’s easy to get the data in, but it’s very difficult to get any value out of it. When creating the Streaming Data Platform, when talking to customers and partners, this was one of the key things they wanted to avoid. Making it easy to stream the data in, and making it easier for developers to get results out, quickly turns into ROI.
This video below is an example of one way that we like to solve problems for developers in unifying the concept of time and data. You’ll see that it takes a look at a taxi cab service that uses tip rates to monitor the quality of service – both at different times of the day and in different areas. I’ll talk through this one here.
This is a dashboard of results coming out of the Streaming Data Platform and our dashboard has some wrong numbers. These are calculations of tips given to taxi drivers, used as a way to measure the quality of service. But we see a problem in the code – a problem in the results coming out of our application. Typically, you’d have to go and edit the historical code and the real-time code. But in this case, you can see that when we find a problem in the business logic, we simply have to change the calculation and re-deploy the code. Because it’s a stream, the system can play back that data like a DVR, back-fill the results and also fix the new results coming in. This is very different than if you had to go analyze historical data from 6 months ago and fix that code as well as fixing the real-time code. So, what you’re seeing after re-deploying is that we’re able to just have the system stream back the data – so it’s going to correct it historically as well as correct the data that the platform continues to consume. This is just one simple example of why the Streaming Data Platform is important: it makes it easier to get to the insights by taking the complex batch vs. real-time infrastructure problems away from the infrastructure manager and the development team. By doing so, you can get a lot more done, which drives more adoption of a platform like this.
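The fix-and-replay workflow from the demo can be sketched in Python as follows. The demo does not say what the actual bug was, so the faulty formula here (dividing the tip by fare plus tip) is purely hypothetical, as are the `Ride` fields and function names. The sketch only shows the shape of the workflow: one calculation is swapped out, and the same aggregation is re-run over the replayed stream.

```python
from typing import Callable, Iterable, NamedTuple


class Ride(NamedTuple):
    zone: str
    fare: float
    tip: float


def tip_rate_buggy(ride: Ride) -> float:
    """Hypothetical bug: divides by fare plus tip instead of the fare."""
    return ride.tip / (ride.fare + ride.tip)


def tip_rate_fixed(ride: Ride) -> float:
    """Corrected business logic: tip as a fraction of the fare."""
    return ride.tip / ride.fare


def average_tip_rate(rides: Iterable[Ride], rate: Callable[[Ride], float]) -> float:
    """Single aggregation used for both replayed and newly arriving rides."""
    rates = [rate(ride) for ride in rides]
    return sum(rates) / len(rates)
```

After re-deploying with `tip_rate_fixed`, replaying the stored rides through `average_tip_rate` back-fills the corrected history, and new rides flow through the very same function; there is no separate batch job to patch.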
For more info, please visit our portal at Dell EMC Streaming Data Platform | Dell Technologies Israel
Dell Technologies – Real-Time Object Detection with Pravega and Flink