Manufacturer Mitigates Production Risk With Kubernetes Roadmap

Written by Shadow-Soft Team | Feb 10, 2026 5:10:00 PM

Summary:

A regulated manufacturer ran 24/7 production across multiple sites on a 20-year-old Windows stack that introduced unacceptable operational risk. Manual deployments created inconsistencies and slowed outage diagnosis. Shadow-Soft assessed the platform and delivered a defensible, phased roadmap to Kubernetes to ensure business continuity.

The Challenge

With demand ramping up, the manufacturer had to expand capacity and support the workload over the next three to five years while running 24/7 production.

The operation ran with near-zero downtime tolerance, since even a few hours of disruption could scrap high-value batches and wipe out utilization.

A legacy Windows extension layer ran hundreds of scripts and database procedures that teams deployed manually at each site. That model created a brittle, inconsistent environment across facilities. With weak observability, any outage threatened to become a prolonged, high-impact event, posing a significant risk to utilization and revenue.

The organization needed a Kubernetes platform and operating model that met regulatory requirements and long-term cost constraints.

Meanwhile, internal teams pulled in different directions on standardization and pricing. They needed a forcing mechanism to converge on a platform decision before committing to a build.

Our Solution

We helped the client choose a Kubernetes platform (Red Hat OpenShift) and define the operating model for its use in regulated manufacturing.

The work covered platform selection, along with the supporting decisions that make the platform hold up in production: storage and backup, observability, security controls, and CI/CD.

We packaged that into a phased roadmap that moved from proof of concept to initial production and then to rollout across multiple facilities.

The roadmap also defined ownership and enablement: a central platform team runs the cluster baseline, application teams own services and delivery, and local site IT holds break-glass responsibilities, backed by training to standardize the operating rhythm.

We built the recommendation around day-two operations. The roadmap set standards for long-term management, vendor support expectations, drift control, and a GitOps model that keeps deployments consistent across sites.

We matched the plan to the workload profile and controlled complexity. Most target workloads skew stateless, so the roadmap starts with simpler storage and backup for proof of concept and early production, then defines explicit triggers for when requirements justify heavier storage platforms.

Finally, we sequenced observability the same way: faster troubleshooting first, then a fuller multi-cluster stack once the team hardens operations.

Our Process

We used our eight-step assessment framework to align cross-functional teams, run a defensible platform evaluation, and turn the decision into a phased plan that de-risks the move from proof of concept to 24/7 production.

Align stakeholders on non-negotiables (downtime tolerance, change control, support ownership, security constraints).
Run the eight-pillar interview framework to capture the current state and future requirements across architecture, workloads, CI/CD, security, storage and backup, disaster recovery, and observability.
Convert platform preferences into shared decision criteria, so the group can discuss tradeoffs and choose the best path.
Evaluate OpenShift and Rancher against those criteria, while mapping the supporting stack decisions that would make either option operable across sites (monitoring, storage, backup, secrets, GitOps).
Pressure-test the platform paths against rollout economics as sizing is clarified, with scenarios for both early rollout and operation at scale.
Present recommendations, collect pushback tied to existing tooling and constraints, then tighten the selection matrix so internal teams can standardize without tool sprawl.
Define a PoC and a next-step plan that tests the architecture and operations, then sequence the move from proof of concept to production rollout across multiple facilities.
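
To make the idea of shared decision criteria and a selection matrix concrete, here is a minimal sketch of a weighted scoring matrix in Python. It is an illustration only, not the matrix used in the engagement: the criteria names, weights, option labels, and scores are all hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def weighted_score(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Return the weight-normalized score for one option."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank_options(
    weights: Dict[str, float], options: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Rank options from highest to lowest weighted score."""
    scored = [(name, weighted_score(weights, s)) for name, s in options.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical criteria and weights (higher weight = more important).
WEIGHTS = {
    "day_two_operations": 3,
    "vendor_support": 3,
    "licensing_cost": 2,
    "multi_site_gitops": 2,
}

# Hypothetical 1-5 scores for two unnamed options; not real evaluation data.
OPTIONS = {
    "option_a": {
        "day_two_operations": 4,
        "vendor_support": 5,
        "licensing_cost": 2,
        "multi_site_gitops": 4,
    },
    "option_b": {
        "day_two_operations": 3,
        "vendor_support": 3,
        "licensing_cost": 4,
        "multi_site_gitops": 4,
    },
}

if __name__ == "__main__":
    for name, score in rank_options(WEIGHTS, OPTIONS):
        print(f"{name}: {score:.2f}")
```

Agreeing on the weights before scoring any option is what turns platform preferences into criteria: the debate moves from which tool a team likes to how much each requirement matters.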

The Roadblocks

Early interviews missed a few key stakeholders, so several questions stayed open longer than planned. A follow-up session brought the right owners into the room.

The group worked through the platform trade-offs using shared decision criteria, even when a supporting external architect pushed alternate recommendations.

Cost modeling slowed platform commitment because the team didn’t know the initial and future environment sizing early enough to pressure-test licensing and long-term run costs.

The team treated sizing uncertainty as a gating risk, firmed up assumptions in ranges, and kept the plan phased so the client could move forward without committing to economics that wouldn’t scale.
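
As a rough illustration of modeling cost over sizing ranges rather than point estimates, the Python sketch below computes a low and high annual cost per phase and applies a conservative gate. Every number is a hypothetical placeholder, and the linear per-core pricing model is an assumption for illustration only, not any vendor's actual licensing terms.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SizingRange:
    """Low and high estimates of cores for one rollout phase."""
    low_cores: int
    high_cores: int


def annual_cost_range(
    sizing: SizingRange, cost_per_core: float, fixed_cost: float = 0.0
) -> Tuple[float, float]:
    """Return (low, high) annual cost under an assumed linear per-core price."""
    if sizing.low_cores > sizing.high_cores:
        raise ValueError("low_cores must not exceed high_cores")
    low = fixed_cost + sizing.low_cores * cost_per_core
    high = fixed_cost + sizing.high_cores * cost_per_core
    return (low, high)


def within_budget(cost_range: Tuple[float, float], budget: float) -> bool:
    """Conservative gate: pass only if the high end of the range fits the budget."""
    return cost_range[1] <= budget


# Hypothetical phases and sizing ranges; placeholders, not client data.
PHASES = {
    "proof_of_concept": SizingRange(16, 32),
    "first_site": SizingRange(64, 128),
    "multi_site": SizingRange(256, 512),
}

if __name__ == "__main__":
    for phase, sizing in PHASES.items():
        # cost_per_core, fixed_cost, and budget are made-up example values.
        low, high = annual_cost_range(sizing, cost_per_core=100.0, fixed_cost=1000.0)
        ok = within_budget((low, high), budget=20000.0)
        print(f"{phase}: {low:.0f} to {high:.0f} per year, within budget: {ok}")
```

Gating on the high end of each range is one way to keep a phased plan honest: a phase proceeds only when the pessimistic sizing still fits the budget.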

The Toolstack

Red Hat OpenShift: To reduce the day-two operations burden and meet regulated manufacturing support expectations across sites.

Portworx: To standardize persistent storage and support backup and DR across mixed storage backends and multiple facilities.

GitOps controller (Argo CD or Flux): To prevent site drift and enforce repeatable, auditable deployments across clusters.

Prometheus, Grafana, and Loki: To reduce diagnostic time and eliminate manual log hunting during incidents, with APM as an option for black-box workloads that demand it.

Velero plus CSI snapshots: To support Kubernetes-native backup and restore, with restore testing as a requirement.

Akeyless, with Vault: To provide a standardized secrets layer that avoids site-by-site variance and keeps GitOps workflows clean.

The Results

The engagement replaced an internal platform stalemate with a decision to run the PoC on Red Hat OpenShift, backed by decision criteria leaders could defend.

The client left phase one with a documented operating model that covers deployment, security, storage, backup, and observability for 24/7 regulated manufacturing.

Cost modeling forced a reality check before the build started.

The team tested licensing and run-cost scenarios as sizing firmed up, then delivered a phased roadmap from proof of concept to production and multi-site rollout, with gates that prevent a messy pilot from becoming the long-term platform.

Key Results

Replaced platform deadlock with an OpenShift direction leaders could back.
Shared a roadmap that made cost and operability tradeoffs explicit.
Surfaced licensing and run cost breakpoints early, before the economics failed at scale.
Defined a GitOps rollout standard that removes a major source of production risk by eliminating manual, inconsistent deployments.
Set production readiness gates to validate proof of concept, operations, and recovery paths.
Moved into a pilot SOW to build and validate the chosen platform path.

What’s Next?

With platform selection and cost modeling finalized, the client is kicking off a proof of concept to validate the target architecture and day-two operations in a regulated, 24/7 manufacturing context.

In parallel, the client will refactor hundreds of legacy scripts and stored procedures into a small set of containerized API services, treating the work as a greenfield rebuild rather than a lift-and-shift.

From there, they’ll rebuild the proof of concept into a production foundation, run a first site pilot, then scale to additional sites.
