A guest post by Erik Zandboer
In order to fully understand metro configurations, let’s first look at the history of things. The earliest way of recovering from datacenter failures (apart from restoring all of your tapes 😉 ) was replication. Later, metro also entered the scene, and there are some distinct differences between these solutions. Let’s dive in and look at what is what.
(And if you are not a fan of reading, you can watch the video below or HERE.)
Most people will know the abbreviation “DR” and the fact that it stands for “Disaster Recovery”. It is all about a disaster striking: workloads go down, and now you need to recover from that disaster. This may simply mean gathering all your backup tapes and restoring everything elsewhere, or something smarter like replication (as we will see next).
The abbreviation “DA” is far less known. It stands for “Disaster Avoidance” and is more about proactively moving workloads away from a failING (not failED) datacenter so that nothing actually goes down. Rising water levels at a datacenter that are about to cause short circuits are an excellent example here.
In order to do “DA” with minimal impact, you need a solution that can move workloads between DCs without any interruption. This is where metro setups come into play. But first, let’s look at replication.
The first idea from an array perspective to save customers from datacenter failure was the introduction of replication. Replication creates a “standby” copy of a volume in the other datacenter, and as writes come into the primary site, replication makes sure those writes also get applied to the standby volume. How replication does this depends on the type of replication: asynchronous or synchronous. I’ll briefly describe both:
With asynchronous replication, the primary site accepts writes to the volume(s). At certain time intervals the storage system takes all writes that occurred in the past interval and copies them over to the other datacenter. The writes get applied there, and the data is now safe in both locations.
This mode of operation has a big advantage: there is no limit to the distance between the two datacenters, as workload performance is not impacted by the latency between the sites. This is not the case for synchronous replication, as we will see next.
However, asynchronous replication also has a downside: as the writes are copied over to the second site at regular intervals, chances are high that a certain amount of data gets lost when a failure occurs. In most cases some writes will have occurred (and been acknowledged back) without yet being replicated to the other site. If a failure now occurs, the systems writing to the storage layer will think those writes took place, but they were actually lost because they had not been copied over yet. The Recovery Point Objective (RPO) is not zero in this case.
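To make the non-zero RPO concrete, here is a minimal Python sketch of interval-based asynchronous replication. It is a toy model, not how any particular array implements it: the class name `AsyncReplicatedVolume` and its methods are made up for illustration, and the replication interval is represented by an explicit `replicate()` call.

```python
class AsyncReplicatedVolume:
    """Toy model of asynchronous replication (hypothetical names, not a real API)."""

    def __init__(self):
        self.primary = []   # writes acknowledged at the primary site
        self.standby = []   # writes that have reached the other datacenter

    def write(self, data):
        # The write is acknowledged right away, without waiting for the remote site.
        self.primary.append(data)
        return "ack"

    def replicate(self):
        # Runs once per replication interval: ship everything not yet copied over.
        self.standby.extend(self.primary[len(self.standby):])

    def lost_on_primary_failure(self):
        # Acknowledged writes that exist only at the primary site.
        return self.primary[len(self.standby):]


vol = AsyncReplicatedVolume()
vol.write("w1")
vol.write("w2")
vol.replicate()          # interval elapses: w1 and w2 are now safe on both sites
vol.write("w3")
vol.write("w4")          # acknowledged, but not yet replicated

# If the primary site fails now, these acknowledged writes are gone.
print(vol.lost_on_primary_failure())   # ['w3', 'w4']
```

The longer the interval between `replicate()` calls, the more acknowledged writes can pile up on the primary side only, which is exactly the data at risk.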
Another mode of operation is what we call “synchronous” replication. In this mode each and every write that goes into the storage system is immediately copied over to the secondary site, written there and acknowledged back before the host gets its acknowledgement. This means no data is lost in the event of a failure, which is a good thing.
But there is also a downside to synchronous replication: because each write has to traverse the WAN twice (the write itself and the acknowledgement back), distance induces latency on writes. As we put the datacenters further apart, latency rises. This is why the maximum allowable latency is normally around 5ms Round-Trip Time (RTT), which in practice limits the distance to around 50kms / 30miles.
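A bit of arithmetic shows why the round trip matters so much. The sketch below is a simplification under stated assumptions: both arrays commit the write in parallel and take the same time to do so, so the host sees the commit time plus one round trip. The function names and the 0.5 ms commit time are illustrative values, not vendor figures.

```python
def sync_write_latency_ms(commit_ms: float, rtt_ms: float) -> float:
    """Host-visible latency of one synchronous write.

    Assumes the local and remote commits overlap and take equally long,
    so the remote path (round trip + commit) determines the total.
    """
    return commit_ms + rtt_ms


def serial_writes_per_second(latency_ms: float) -> float:
    """Upper bound for a single writer that waits for each acknowledgement."""
    return 1000.0 / latency_ms


COMMIT_MS = 0.5  # assumed time for an array to commit a write

for rtt_ms in (0.0, 1.0, 5.0, 10.0):
    latency = sync_write_latency_ms(COMMIT_MS, rtt_ms)
    rate = serial_writes_per_second(latency)
    print(f"RTT {rtt_ms:4.1f} ms -> write latency {latency:4.1f} ms, "
          f"about {rate:6.0f} serial writes/s")
```

With these assumed numbers, a 5 ms RTT turns a 0.5 ms write into a 5.5 ms write, so a single writer that waits for every acknowledgement drops from 2000 to roughly 180 writes per second. Applications with many parallel writers are hit less hard, but every individual write still pays the round trip.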
So what is this “metro” thing?
After replication was introduced, people started thinking about stretching workloads across datacenters. With that, metro was born. It is not so much replication (which always goes from one side to the other), but rather two-way synchronous replication, better known as “mirroring”.
A metro setup mirrors volumes across (usually) two datacenters, and each datacenter perceives the volume as being the same, single, read/write-capable volume.
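The difference with one-way replication can be sketched in a few lines of Python. This is only an illustration of the "same volume on both sides" idea: the `MetroVolume` class and the site names `dc-a` and `dc-b` are hypothetical, and the model deliberately ignores site failures and the ordering of concurrent writes.

```python
class MetroVolume:
    """Toy model of a volume mirrored across two sites (hypothetical names)."""

    def __init__(self, sites=("dc-a", "dc-b")):
        # Each site holds its own full copy of the volume: block number -> data.
        self.copies = {site: {} for site in sites}

    def write(self, via_site, block, data):
        if via_site not in self.copies:
            raise KeyError(f"unknown site: {via_site}")
        # Whichever site receives the write, it is mirrored to every copy
        # before the host gets its acknowledgement.
        for copy in self.copies.values():
            copy[block] = data
        return "ack"

    def read(self, via_site, block):
        # Reads are served from the local copy at the site that was asked.
        return self.copies[via_site][block]


vol = MetroVolume()
vol.write("dc-a", block=0, data="written in A")
vol.write("dc-b", block=1, data="written in B")

# Both sites accept writes, and both see the same content.
print(vol.read("dc-b", 0))   # written in A
print(vol.read("dc-a", 1))   # written in B
```

Unlike the primary/standby model, there is no "direction" here: either datacenter can take reads and writes for the volume, which is what makes moving workloads between sites possible without interruption.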
This concludes the first episode on storage metro. Stay tuned for more! In case you hate reading and would rather watch a video, please check below: