The private cloud remains the best choice for many applications for a variety of reasons, while the public cloud has become a more cost-effective choice for others, writes David Bermingham, Technical Evangelist at SIOS Technology.

This split has resulted, intentionally or not, in the vast majority of organizations now having a hybrid cloud. But there are many ways to leverage the versatility and agility of a hybrid cloud environment, especially when it comes to the high availability and disaster recovery protections that different applications need.

This article examines the hybrid cloud from the perspective of high availability (HA) and disaster recovery (DR), and offers some practical suggestions for avoiding potential pitfalls.


Caveat Emptor in the Cloud

The carrier-class infrastructure implemented by cloud service providers (CSPs) gives the public cloud a resiliency that is far superior to what could be justified for a single enterprise.

Redundancies within every data center, multiple data centers in every region and multiple regions around the globe give the cloud unprecedented versatility, scalability and reliability. But failures can and do occur, and some of them cause downtime at the application level for customers who have not made special provisions to assure high availability.

While all CSPs define “downtime” somewhat differently, all exclude certain causes of downtime at the application level. In effect, the service level agreements (SLAs) only guarantee the equivalent of “dial tone” for the physical server or virtual machine (VM), or specifically, that at least one instance will have connectivity to the external network if two or more instances are deployed across different availability zones.

Here are just a few examples of some common causes of downtime excluded from SLAs:

  • The customer’s software, or third-party software or technology, including application software (e.g. SQL Server or SAP).
  • Faulty input or instructions, or any lack of action when required (which covers the mistakes inevitably made by mere mortals).
  • Factors beyond the CSP’s reasonable control (e.g. carrier network outages).

It is reasonable, of course, for CSPs to exclude these and other causes of downtime that are beyond their control. It would be irresponsible, however, for IT professionals to use these exclusions as excuses for not providing adequate HA and/or DR protections for critical applications.

Accommodating Differences Between HA and DR

Properly leveraging the cloud’s resilient infrastructure requires understanding some important differences between “failures” and “disasters” because these differences have a direct impact on HA and DR configurations. Failures are short in duration and small in scale, affecting a single server or rack, or the power or cooling in a single datacenter. Disasters have more enduring and more widespread impacts, potentially affecting multiple data centers in ways that preclude rapid recovery.

The most consequential effect involves the location of the redundant resources (systems, software and data), which can be local — on a Local Area Network — for recovering from a localized failure. By contrast, the redundant resources required to recover from a widespread disaster must span a Wide Area Network.

For database applications that require high transactional throughput, the ability to replicate the active instance’s data synchronously across the LAN enables the standby instance to be “hot” and ready to take over immediately in the event of a failure. Such rapid, automatic recovery should be the goal of all HA provisions.

Data is normally replicated asynchronously in DR configurations to prevent the WAN’s latency from adversely impacting the throughput of the active instance. This means updates are applied to the standby instance only after they have been applied to the active instance, making the standby “warm” and introducing an unavoidable delay when recovery relies on a manual process.
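To make the distinction concrete, the following minimal Python sketch models the two replication modes; the class and function names, and the one-millisecond round trip, are illustrative assumptions rather than the behavior of any particular replication product.

```python
import time

# A minimal, hypothetical sketch of the two replication modes described above.
# Names and timings are illustrative only, not any vendor's implementation.

class Replica:
    def __init__(self):
        self.data = {}       # this replica's copy of the data
        self.pending = []    # changes shipped but not yet applied (async only)

    def apply(self, key, value):
        self.data[key] = value

def synchronous_commit(active, standby, key, value, rtt=0.001):
    """Acknowledge the write only after the standby has it: a 'hot' standby."""
    active.apply(key, value)
    time.sleep(rtt)                 # cross-zone round trip, roughly 1 ms
    standby.apply(key, value)       # standby never trails an acknowledged commit
    return "acknowledged"

def asynchronous_commit(active, standby, key, value):
    """Acknowledge immediately and ship the change later: a 'warm' standby that lags."""
    active.apply(key, value)
    standby.pending.append((key, value))   # drained in the background over the WAN
    return "acknowledged"                  # the standby may not have this change yet

def catch_up(standby):
    """Background catch-up: apply whatever has arrived so far."""
    while standby.pending:
        standby.apply(*standby.pending.pop(0))
```

In the synchronous case, every acknowledged change is already on the standby, which is what allows immediate, automatic takeover; in the asynchronous case, the standby can trail by whatever is still in flight.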

HA in the Cloud

All three major CSPs accommodate these differences with redundancies both within and across data centers. Of particular interest is the variously named “availability zone” that makes it possible to combine the synchronous replication available on a LAN with the geographical separation afforded by the WAN. The zones exist in separate data centers that are interconnected via a low-latency, high-throughput network to facilitate synchronous data replication. With latencies around one millisecond, the use of multi-zone configurations has become a best practice for HA.
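A quick back-of-the-envelope calculation shows why roughly one millisecond of cross-zone latency is tolerable for synchronous replication. The local commit time used here is an assumed figure for illustration only.

```python
# Rough effect of cross-zone latency on synchronous commits.
# The ~1 ms round trip comes from the text above; the local commit time is assumed.

zone_rtt_s = 0.001        # round trip between availability zones
local_commit_s = 0.0002   # assumed local commit time without replication

per_commit_s = local_commit_s + zone_rtt_s
print(f"added latency per commit: {zone_rtt_s * 1000:.1f} ms")
print(f"max serial commits/sec:  {1 / per_commit_s:.0f}")   # roughly 833 for one session
# Concurrent sessions commit in parallel, so aggregate throughput is much higher;
# the point is that ~1 ms is small enough for synchronous replication to be practical,
# whereas tens of milliseconds across a WAN would not be.
```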

IT departments that run applications on Windows Server have long depended on Windows Server Failover Clustering (WSFC) to provide high availability. But WSFC requires a storage area network (SAN) or some other form of shared storage, which is not available in the public cloud. Microsoft addressed this issue in Windows Server 2016 Datacenter Edition and SQL Server 2016 with the introduction of Storage Spaces Direct (S2D). But S2D has its own limitations, most notably an inability to span multiple availability zones, which makes it unsuitable for multi-zone HA.

The lack of shared storage in the cloud has led to the advent of purpose-built failover clustering solutions capable of operating in private, public and hybrid cloud environments. These application-agnostic solutions provide real-time data replication and continuous monitoring capable of detecting failures at the application or database level, thereby filling the gap left by the “dial tone” nature of the CSPs’ SLAs. Versions available for Windows Server normally integrate seamlessly with WSFC, while versions for Linux provide their own SANless failover clustering capability. Both normally make it possible to configure different failover/failback policies for different applications.
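As a rough illustration of what application-level monitoring with per-application policies involves, here is a conceptual Python sketch; the policy table, service names and health probe are hypothetical and do not represent the interface of any particular clustering product.

```python
import subprocess
import time

# Conceptual sketch only: continuous, application-level health monitoring with
# per-application failover policies. All names and thresholds are hypothetical.

POLICIES = {
    "sql-server": {"check_interval_s": 5,  "failures_before_failover": 3},
    "web-front":  {"check_interval_s": 10, "failures_before_failover": 2},
}

def healthy(service):
    """Probe the application itself, not just the VM. Here we only check that the
    OS service is active; a real cluster agent would also run a test query."""
    return subprocess.call(["systemctl", "is-active", "--quiet", service]) == 0

def monitor(service, trigger_failover):
    policy = POLICIES[service]
    failures = 0
    while True:
        if healthy(service):
            failures = 0
        else:
            failures += 1
            if failures >= policy["failures_before_failover"]:
                trigger_failover(service)   # promote the standby node for this application
                return
        time.sleep(policy["check_interval_s"])
```

The point of the sketch is that the probe targets the application or database itself, not merely the VM, which is exactly the coverage the SLAs exclude.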

More information about SANless failover clustering is available in Ensure High Availability for SQL Server on Amazon Web Services. While this article is specific to AWS, the cluster’s basic operation is the same in the Google and Azure clouds.

It is worth noting that hypervisors also provide their own “high availability” features to facilitate a reasonably quick recovery from failures at the host level. But they do nothing to protect against failures of the VM, its operating system or the application running in it. Just like the cloud itself, these features only assure “dial tone” to a VM.

DR in the Cloud

For DR, all CSPs have ways to span multiple regions to afford protection against widespread disasters that could affect multiple zones. Some of these offerings fall into the category of DIY (Do-It-Yourself) DR guided by templates, cookbooks and other tools. DIY DR might be able to leverage the backups and snapshots routinely made for all applications, but neither backups nor snapshots provide the continuous, real-time data replication needed for HA. For databases, mirroring and log shipping provide more up-to-date copies of the database and transaction logs, respectively, but both still lag the data in the active instance because best practice is to locate the standby DR instance across the WAN in another region.
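The practical difference between these options is how much data can be lost, i.e. the recovery point objective (RPO). The intervals below are assumptions chosen only to illustrate the comparison, not vendor defaults.

```python
# Rough worst-case data-loss (RPO) comparison for the DR approaches mentioned above.
# All intervals are illustrative assumptions.

worst_case_rpo_s = {
    "nightly backup":               24 * 60 * 60,  # up to a day of changes lost
    "hourly snapshot":              60 * 60,       # up to an hour
    "log shipping every 5 minutes": 5 * 60,        # up to the last unsent log
    "continuous async replication": 5,             # typically seconds of WAN lag
}

for option, rpo in sorted(worst_case_rpo_s.items(), key=lambda kv: kv[1]):
    print(f"{option:30s} worst-case data loss ~ {rpo:>6d} s")
```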

Microsoft and Amazon now have managed DR as a Service (DRaaS) offerings: Azure Site Recovery and CloudEndure Disaster Recovery, respectively. These services support hybrid cloud configurations and are reasonably priced. But they are unable to replicate HA clusters and normally have bandwidth limitations that may preclude their use for some applications.

HA and DR in a Hybrid Cloud

One common use case for a hybrid cloud is to have the public cloud provide DR protection for applications running in the private cloud. This form of DR protection is ideal for enterprises that have only a single datacenter and it can be used for all applications, whether they have HA protection or not. In the enterprise datacenter, it is possible to have a SAN or other form of shared storage, enabling the use of traditional failover clustering for HA protection. But given the high cost of a SAN, many organizations are now choosing to use a SANless failover clustering solution.

The diagram below shows one possible way to configure a hybrid cloud for HA/DR protection. The use of SANless failover clustering for both HA and DR has the additional benefit of providing a single solution to simplify management. Note the use of separate racks in the enterprise data center to provide additional resiliency, along with the use of a remote region in the public cloud to afford better protection against widespread disasters.

This hybrid HA/DR configuration is ideal for enterprises with only a single datacenter.
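Expressed as data, the topology described above could be captured roughly as follows; the node names, sites and cloud region are placeholders, not configuration syntax for any particular product.

```python
# Placeholder description of the three-node hybrid HA/DR topology described above:
# two synchronously replicated nodes on separate racks in the enterprise data center
# and an asynchronously replicated DR node in a remote public cloud region.

hybrid_cluster = {
    "node-a": {"site": "enterprise-dc",       "rack": 1,    "role": "active",
               "replication": None},
    "node-b": {"site": "enterprise-dc",       "rack": 2,    "role": "standby",
               "replication": "synchronous"},    # LAN: hot standby for HA
    "node-c": {"site": "cloud-remote-region", "rack": None, "role": "dr",
               "replication": "asynchronous"},   # WAN: warm standby for DR
}
```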

This configuration can also be “flipped” with the HA cluster in the cloud and the DR instance in the enterprise datacenter. While it would also be possible — and even preferable — to use the cloud for both HA and DR protection, this hybrid configuration does at least provide some level of comfort to risk-averse executives reluctant to commit 100% to the cloud. Note how using SANless failover clustering software makes it easy to “lift and shift” HA configurations when migrating from the private to a public cloud.

Confidence in the Cloud

With multiple availability zones and regions spanning the globe, all three major CSPs have infrastructure that is eminently capable of providing carrier-class HA/DR protection for enterprise applications. And with a SANless failover clustering solution, such carrier-class high availability need not mean paying a carrier-like high cost. Because SANless failover clustering software makes effective and efficient use of the cloud’s compute, storage and networking resources, and is easy to implement and operate, it keeps ongoing costs low and makes robust HA and DR protection more affordable than ever before.

David Bermingham is Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past eight years: six years as a Cluster MVP and two years as a Cloud and Datacenter Management MVP. David holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare and education.
