
Disaster Recovery: Achieving Instantaneous Hot-Hot with OpenShift

Written by Derrick Sutherland | Jan 30, 2025 5:00:00 AM

The biggest challenge for disaster recovery in traditional environments is that every environment looks and feels different.  

If you're moving from a managed colocation facility to VMware, the cloud, or a different VMware data center, each target behaves differently.

Even if you're using the same virtualization provider, storage is probably handled differently. Copying data from one environment to another takes a lot of planning and strategy because it's not something natively built into your virtualization provider.

You also have to figure out networking, DNS, routing, etc.

With Red Hat OpenShift, all of these components are software-defined. You get software-defined DNS, networking, and storage layers all because it’s based on Kubernetes.

Out of the box, OpenShift won't "automagically" fail over without additional configuration, much like VMware or other traditional infrastructure environments. But the tools required are mostly there for you, either pre-packaged and open source or robustly tested and standard in the space.

You can set up OpenShift in any environment because the least common denominator for everything, no matter where you are (bare metal or virtual), is the operating system.

This is where OpenShift lives.

That’s why it is the perfect tool for building a truly agile infrastructure that can move freely between environments, providing instantaneous disaster recovery for businesses.

Infrastructure Agility Through the Disaster Recovery Lens

There are different benefits to infrastructure agility. If you're super automated, you optimize costs, improve security, etc. Disaster recovery, however, is the biggest and most important tenet because it focuses on what happens if an entire environment goes offline or a massive, critical piece of infrastructure goes down.

In traditional models, there's usually a downtime window where you have to mirror elements over, bring them back up, get them reconnected, and turn them back on.

That downtime window affects your external-facing SLAs: How long are you allowed to be offline before you start incurring costs either to your customers or to yourself directly?

Infrastructure agility with OpenShift allows you to shift between environments with relative ease in an automated fashion, so disaster recovery evolves into a high-availability model where a massive piece of important infrastructure is truly hot-hot.

It’s the starting point.

Automating Disaster Recovery with OpenShift

Disaster recovery in OpenShift isn’t a bolt-on. It’s a core part of the platform.

OpenShift provides native tools and operator-based automation to simplify disaster recovery planning across hybrid and multi-cloud environments.
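
For example, Red Hat's data protection tooling (OADP, the OpenShift APIs for Data Protection) is installed the same declarative way as everything else on the platform. The manifest below is a minimal sketch; treat the channel and catalog names as assumptions to verify against your cluster's operator catalog.

```yaml
# Minimal sketch: install the OADP operator (backup/restore tooling often
# used in DR plans) declaratively via an OperatorGroup and a Subscription.
# The channel and catalog source are assumptions; verify them in your
# cluster's OperatorHub before applying.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-adp
  namespace: openshift-adp
spec:
  targetNamespaces:
    - openshift-adp
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: redhat-oadp-operator
  namespace: openshift-adp
spec:
  channel: stable
  name: redhat-oadp-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```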

With OpenShift, you’re not stitching together disparate Kubernetes components across AWS, Azure, or VMware. Instead, you’re working with a consistent platform that includes integrated logging, monitoring, CI/CD, and security features—all managed through OpenShift Operators.

Whether you deploy OpenShift on bare metal, in the cloud, or at the edge, you get the same software-defined layer for networking, storage, and identity management.

That gives you a lot of autonomy.

But you still have to configure the system to be ready for disaster recovery. You still have to mirror your environment. The only limitation for OpenShift is how you’ve set it up.

1. Networking and Scaling Automation

We can achieve a seamless, automated disaster recovery protocol by reconfiguring the network.

In a disaster recovery situation, you need to ensure systems are publicly accessible once they are back online.

  • Load Balancers: Which ones were impacted? Are they still accessible from where they sit, or do they need to be reset? Reconfigure them.
  • DNS Entries: Update these to point to the new load balancer endpoints (a minimal Ansible sketch follows this list).
  • Ingress Routing: Set it up so traffic can enter the environment and reach the new or reconfigured load balancer.
  • Network-Wide Validation: Confirm traffic can route end to end, from the public entry point all the way to the user interface.
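
To make the DNS piece concrete, here is a minimal Ansible sketch. It assumes Route 53 as the DNS provider and uses hypothetical hostnames; the same pattern applies to any DNS or load balancer API that Ansible can reach.

```yaml
# Minimal sketch: repoint a public DNS record at the failover load balancer
# and verify the application answers through the new path. The zone, record,
# endpoint, and TTL values are hypothetical.
- name: Repoint DNS at the disaster recovery load balancer
  hosts: localhost
  gather_facts: false
  vars:
    public_zone: example.com
    app_record: app.example.com
    dr_lb_endpoint: dr-lb.example.com
  tasks:
    - name: Update the CNAME to point at the DR load balancer
      amazon.aws.route53:
        state: present
        zone: "{{ public_zone }}"
        record: "{{ app_record }}"
        type: CNAME
        ttl: 60
        value: "{{ dr_lb_endpoint }}"
        overwrite: true

    - name: Wait until the application answers through the new path
      ansible.builtin.uri:
        url: "https://{{ app_record }}/healthz"
      register: health
      until: health.status == 200
      retries: 10
      delay: 30
```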

There are numerous challenges in setting all this up.

First of all, did these supporting systems survive? The DNS server that was managing DNS and the load balancer that was allowing traffic in: did they go offline with the original system? If so, we need to rebuild them. If not, we have to reconfigure them.

There are two different ways of looking at that. How much rework needs to be done? Are we changing a couple of entries, or are we resetting everything from scratch?

The benefit of having these configurations survive is less downtime.

But it might also introduce additional complexity from an automation perspective. If you're building everything from the ground up, you know exactly what you're getting out of the box. If you're reconfiguring something that already exists, it's easy to miss a piece because you weren't thinking about it in your automation rules, and it turns out that's the one thing keeping systems from coming back online.

In the worst case, you're essentially rebuilding the environment from scratch.

What do we need to configure so that any system knocked offline can return automatically? And if a supporting system wasn't taken offline, how do we reconfigure it to talk to the new environment and expose that environment to our end customers?

2. Monitoring

The monitoring challenge comes second, but it's no less crucial.

If we're using an observability solution like Dynatrace to track system health, we need to address two key concerns:

  • Can the monitoring system detect when a system goes offline and trigger infrastructure agility processes to replicate necessary components?
  • How do we ensure the failover environment is properly monitored?

In traditional disaster recovery, many organizations use a hot-cold setup. The primary (hot) environment is warmed and ready to go. All data goes there. It’s being actively monitored. The secondary (cold) environment may have basic monitoring in place for failover purposes. In this case, we simply adjust settings to prioritize alerts from the failover environment once it's active.

However, if you only set up monitoring for the hot environment, you need to deploy agents or reconfigure the monitoring system in the new environment.

This can be seamless with the right observability tool, especially if we're using configuration management or doing this in OpenShift.

You can deploy an operator to the new OpenShift cluster in a couple of seconds, and the monitoring agents will be automatically rolled out across the environment.

If it's configuration management with Puppet or Ansible, you can push those configurations into the environment and automatically have those agents set up.
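
As a sketch of what that can look like, the play below applies a monitoring operator's custom resource to the failover cluster using the kubernetes.core collection. The DynaKube example assumes the Dynatrace operator is already installed there; the exact fields and API URL depend on your tenant and operator version, so treat them as placeholders.

```yaml
# Minimal sketch: push a monitoring custom resource to the failover cluster
# so agents roll out automatically. The DynaKube spec is illustrative; the
# API URL and field names are assumptions to verify against your Dynatrace
# environment and operator version.
- name: Re-establish observability on the failover cluster
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Apply the DynaKube resource so OneAgent rolls out to every node
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: dynatrace.com/v1beta1
          kind: DynaKube
          metadata:
            name: failover-cluster
            namespace: dynatrace
          spec:
            apiUrl: https://example.live.dynatrace.com/api  # hypothetical tenant URL
            oneAgent:
              classicFullStack: {}
```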

However, if the monitoring tool is not as easily configurable (from an initial pulling of metrics and data), there can be significant rework. You may need to set up new sensors, pull new metrics, rebuild Prometheus, set up Grafana, and manually reconfigure everything. It can be a long list.

Depending on the monitoring solution you're using, it can be very easy to get your systems back online and observed again, or the process of reestablishing observability can become much more complicated.

3. User Validation

The best validation that a system is working comes from end users.  

Once the failover environment is up, observability is in place, and users confirm they can access the system, everything seems great. However, there are edge cases that require thorough monitoring, and catching them can make a big difference.

Internally, strong QA testing helps ensure core functionality. A user clicks a button, and the right action occurs. A workflow is executed, and the expected result appears. However, once you start to scale real users back in, does the system perform as it should? Does autoscaling behave the same way it did before?
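
One concrete reason autoscaling can behave differently: the scaling policies are just Kubernetes objects, so they only exist in the failover cluster if you mirrored them there. Below is a minimal sketch of the kind of HorizontalPodAutoscaler that has to travel with the workload; the deployment name, namespace, and thresholds are hypothetical.

```yaml
# Minimal sketch: an autoscaling policy that must also exist in the failover
# cluster, or the application will scale differently under real load.
# The target deployment, namespace, and thresholds are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```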

Geography also plays a role. Users in Japan, India, or Hawaii may route traffic differently. A robust monitoring tool can use synthetic tests to simulate users from different regions while tracking internal request flows. This allows teams to verify that all systems are green across the board: we can see requests travel from one point to another and check whether the autoscaling rules respond properly.
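
A commercial observability tool handles multi-region synthetics for you, but even a crude stand-in illustrates the idea: run the same scripted check from probe hosts in different regions and fail if any region sees a slow or unhealthy path. A minimal Ansible sketch, assuming a hypothetical synthetic_probes inventory group and /healthz endpoint:

```yaml
# Minimal sketch: the same scripted check executed from probe hosts in
# different regions. The inventory group, endpoint, and latency budget
# are hypothetical.
- name: Regional reachability and latency check
  hosts: synthetic_probes
  gather_facts: false
  vars:
    app_url: https://app.example.com/healthz
    latency_budget_seconds: 2.0
  tasks:
    - name: Request the health endpoint from this region
      ansible.builtin.uri:
        url: "{{ app_url }}"
      register: probe

    - name: Fail if the response is slow or unhealthy from this region
      ansible.builtin.assert:
        that:
          - probe.status == 200
          - probe.elapsed | float <= latency_budget_seconds
        fail_msg: "{{ inventory_hostname }} sees a degraded path to {{ app_url }}"
```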

There are multiple facets of user validation:

  • Internal QA testing: Done manually or automated with tools like BrowserStack or Selenium (depending on the environment).
  • Individual user testing: Done by a single person.
  • Observability solutions: Validate how users access the systems, address complaints as they occur, and use synthetic tests to predict potential problems.

It is not always as black-and-white as "we're back up and running." A system may appear fully operational but still have hidden failures. Some issues only become apparent under real-world load. Maybe everything functions fine in low-traffic conditions, but as user numbers increase, performance begins to degrade because of overlooked scaling differences.

Tooling That Supports Infrastructure Agility

There's plenty of tooling that can help support infrastructure agility. And there is a right tool and a wrong tool, depending on the scenario.

The most common and easiest way to do this is with an enterprise Kubernetes platform built for on-prem and hybrid cloud, like OpenShift. It makes achieving agility easier because it adds management and security functionality on top of bare-bones Kubernetes.

You should also consider Red Hat Ansible for automating network reconfiguration, updating operating systems (for example, if they need to talk to new DNS servers), and updating route tables to talk to switches directly if necessary.

If you need to bring up new physical infrastructure, you can do this with Ansible and Terraform.
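
One way those two fit together is Ansible driving a Terraform project, so a single playbook both provisions the replacement infrastructure and then configures it. A minimal sketch using the community.general.terraform module, with a hypothetical project path:

```yaml
# Minimal sketch: have Ansible apply a Terraform project that defines the
# replacement infrastructure, then continue configuring whatever it created.
# The project path is hypothetical.
- name: Stand up replacement infrastructure for failover
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Apply the Terraform project for the DR site
      community.general.terraform:
        project_path: ./terraform/dr-site
        state: present
        force_init: true
      register: tf

    - name: Show the outputs that later plays will configure against
      ansible.builtin.debug:
        var: tf.outputs
```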

You can automate this end-to-end. You just need to know what to automate and build the code out for it.

If you're not using an observability solution and instead rely on more traditional infrastructure tooling to monitor your networks, you should consider an observability tool to get more robust data. Dynatrace provides information across the board to get you up and running again. There are other open-source tools you can use, but they may require more manual intervention.

Where Can OpenShift Take Your Infrastructure?

Let's say we have a strong OpenShift disaster recovery model setup.  

We have a cluster in a local data center and another in a separate data center connected by dark fiber (latency is under 10 milliseconds).

Both clusters are well configured, using a tool like Portworx for storage replication. Workloads can fail over. You've automated the networking side of things, including load balancer and DNS reconfiguration, using Ansible. An observability tool like Dynatrace informs decision-making about when failover should occur.

What else can we do? 

If we applied the same model but across an on-premise and cloud environment, what additional benefits could we gain?

One option is adding automation to optimize costs. Instead of always keeping the failover environment fully scaled, we could scale nodes down to save costs when failover isn’t needed.

Autoscaling could be implemented to adjust capacity dynamically. If AWS costs spike, a Dynatrace alerting rule could monitor EC2 instance pricing using custom metrics and trigger workflow automation with Event-Driven Ansible to scale down the AWS environment and redirect traffic to the on-premise environment.
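
A rulebook sketch of that flow is below. It assumes the Dynatrace alerting rule can call a webhook when the custom cost metric breaches its threshold, and that a hypothetical playbook exists to scale down the AWS side and repoint traffic; the payload field and names are illustrative.

```yaml
# Minimal sketch: an Event-Driven Ansible rulebook that listens for a cost
# alert webhook (e.g., from a Dynatrace alerting rule) and shifts load to
# the on-premise cluster. Payload fields and the playbook name are
# hypothetical.
- name: Shift workloads when cloud costs spike
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: Scale down AWS and fail traffic back on-premise
      condition: event.payload.alert == "ec2-cost-spike"
      action:
        run_playbook:
          name: playbooks/scale_down_aws_and_repoint.yml
```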

This approach reduces AWS costs during expensive months and allows traffic to shift back when prices drop.

The same principle applies to security.

If a cloud provider experiences a security breach, automated failover could redirect workloads to a safer environment. Systems could immediately switch to an alternative infrastructure, performing security checks to ensure no threats transfer during the migration.

This protects customers from major security incidents tied to a hosting provider rather than the organization itself.

With the right automation, numerous opportunities exist to optimize costs, enhance security, and experiment in areas that make your infrastructure more resilient beyond disaster recovery.

---

About the author

Derrick Sutherland - Chief Architect at Shadow-Soft

Derrick is a T-shaped technologist who can think broadly and deeply simultaneously. He holds a master's degree in cyber security, develops applications in multiple languages, and is a Kubernetes expert.