Large Manufacturer Eliminates Manual Incident Response with Event-Driven Ansible and Dynatrace
Summary: Large Manufacturer needed to reduce manual remediation toil, replace aging Puppet workflows, and connect their observability platform to their automation tooling. Shadow-Soft delivered a working Dynatrace→EDA→Ansible self-healing architecture in six weeks, with an enterprise governance framework built to scale to ~600 nodes across multiple plants.
The Challenge
Every time Dynatrace flagged a problem — a service degradation, disk pressure, an application going unstable — someone had to act on it. That meant pulling engineers from other work, introducing a window of exposure, and watching mean time to remediation climb.
The team was also carrying legacy Puppet workflows that had grown difficult to maintain as their infrastructure expanded. And when HashiCorp Terraform finished provisioning new infrastructure, the handoff to configuration management was still a manual step.
The client needed a platform that could close all three gaps at once: eliminate manual response for known incident classes, standardize configuration management without starting from scratch, and connect provisioning to configuration in a single pipeline — all with the governance controls a distributed manufacturing operation requires.
Our Solution
Shadow-Soft designed and delivered a solution centered on Red Hat Ansible Automation Platform 2.x with Event-Driven Ansible at the core. The goal was to prove real business value quickly — not with a broad rollout, but with a tightly scoped, outcome-tied engagement delivered in six weeks.
The architecture connects Dynatrace to EDA via webhook. When Dynatrace fires a problem event, an EDA rulebook evaluates it against defined conditions and automatically triggers the appropriate AAP job template — no human in the loop. We called it a closed-loop remediation flow.
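A rulebook of this shape can be sketched as follows. This is a minimal illustration, not the delivered rulebook: the webhook port, the Dynatrace payload fields (`problemTitle`, `state`), and the job template names are assumed placeholders.

```yaml
# Illustrative EDA rulebook: listen for Dynatrace problem webhooks and
# dispatch matching AAP job templates. Names and fields are placeholders.
- name: Dynatrace problem remediation
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: Restart a degraded service
      condition: event.payload.state == "OPEN" and event.payload.problemTitle is search("service", ignorecase=true)
      action:
        run_job_template:
          name: Restart Service        # hypothetical AAP job template
          organization: Default
    - name: Clear disk pressure
      condition: event.payload.problemTitle is search("disk", ignorecase=true)
      action:
        run_job_template:
          name: Reclaim Disk Space     # hypothetical AAP job template
          organization: Default
```

Each rule pairs a condition on the incoming event with a `run_job_template` action, which is what makes the detection-to-remediation loop fully automatic.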
Alongside the self-healing use case, we validated a migration path off Puppet using an idempotent Ansible baseline role, and wired a Terraform pipeline to invoke AAP job templates via REST API for post-provision configuration.
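The post-provision hook relies on AAP Controller's REST API, which exposes a launch endpoint per job template. Terraform typically makes this call from a provisioner or HTTP client; the equivalent request is sketched below as an Ansible `uri` task for readability. The hostname, template ID, token variable, and extra vars are illustrative assumptions.

```yaml
# Sketch of the post-provision call Terraform makes against AAP Controller.
# POST /api/v2/job_templates/{id}/launch/ starts the configuration job.
- name: Trigger baseline configuration after provisioning
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Launch the AAP job template for newly provisioned hosts
      ansible.builtin.uri:
        url: "https://aap.example.com/api/v2/job_templates/42/launch/"  # hypothetical host and template ID
        method: POST
        headers:
          Authorization: "Bearer {{ aap_token }}"   # assumed OAuth2 token
        body_format: json
        body:
          extra_vars:
            target_hosts: "{{ new_host_group }}"    # assumed inventory group from Terraform
        status_code: 201
```

Because the launch is just an authenticated POST, it slots into the same pipeline run that created the infrastructure, removing the manual handoff.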
Our Process
The engagement had a six-week delivery window, so we structured the work to prove value early and build governance in parallel.
- Confirmed access to non-production VMs, validated networking and firewall configs, and connected AAP to Active Directory, Git, and the Dynatrace webhook before writing a single playbook
- Collaborated with the client team to select the first incident class for self-healing remediation and identify the pilot host group for Puppet replacement
- Built an EDA rulebook that receives Dynatrace problem events and triggers AAP job templates automatically for service failures, disk thresholds, and application recycling scenarios
- Developed a custom Ansible role delivering idempotent baseline configuration — users, packages, files, and services — and validated it against 5–10 target nodes that previously ran Puppet-managed config
- Implemented a Terraform→AAP API integration so newly provisioned infrastructure is automatically configured as part of the same pipeline run
- Configured Automation Hub as the client's curated content repository with approved collections and execution environments, and set up AD-integrated RBAC with team and role structures and full audit logging
- Closed with a live stakeholder demo — a real-time Dynatrace→EDA→AAP remediation run with execution logs — and delivered a findings report and sizing guide for the path to ~600 nodes
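The baseline role described above covers users, packages, files, and services with idempotent modules, so repeated runs converge rather than reapply. A minimal sketch of its task file, with assumed variable names, might look like this:

```yaml
# Sketch of roles/baseline/tasks/main.yml — idempotent baseline configuration.
# The baseline_* variables are illustrative, not from the delivered role.
- name: Ensure baseline users exist
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    state: present
  loop: "{{ baseline_users }}"

- name: Ensure baseline packages are installed
  ansible.builtin.dnf:
    name: "{{ baseline_packages }}"
    state: present

- name: Deploy managed configuration files
  ansible.builtin.template:
    src: "{{ item.src }}"
    dest: "{{ item.dest }}"
    owner: root
    mode: "0644"
  loop: "{{ baseline_files }}"

- name: Ensure core services are running and enabled
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop: "{{ baseline_services }}"
```

Every module here reports "changed" only when the target drifts from the declared state, which is what makes the role a safe drop-in replacement for Puppet-managed hosts.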
The Roadblocks
The engagement scope was clear from day one, but the environment introduced one complexity that required careful handling: integrating EDA with Dynatrace's webhook meant validating the full event-to-remediation chain in a non-production environment that didn't perfectly mirror production alerting conditions. We worked with the client's team to simulate representative incident classes and confirm the rulebook logic held before the stakeholder demo.
The bigger challenge was governance. Building RBAC, credential segregation, and Automation Hub content controls that would actually hold at scale — across multiple teams and eventually multiple plants — required more architectural rigor than a typical PoC. We treated it like a production foundation, not a prototype.
The Toolstack
Red Hat Ansible Automation Platform 2.x (Controller, Event-Driven Ansible, Automation Hub): The core automation engine. Controller ran job templates and workflows, EDA handled event-driven remediation, and Automation Hub provided the curated content governance layer.
Dynatrace: The observability source. Problem events from Dynatrace's API triggered EDA rulebooks via webhook, closing the loop between detection and remediation.
Red Hat Enterprise Linux 8/9: Base OS for AAP nodes and target systems throughout the solution build.
HashiCorp Terraform: Existing provisioning pipeline. We wired it to invoke AAP job templates via REST API as a post-provision configuration step.
Active Directory: Integrated with AAP for SSO, LDAP-based RBAC, credential segregation, and audit trail.
Git: Source control for all AAP Projects and Execution Environments.
The Results
The client went from manual incident response to a closed-loop remediation architecture that detects, evaluates, and fixes covered incident classes without human intervention — in the time it takes a webhook to fire.
The Puppet replacement pilot gave the team a repeatable, idempotent Ansible baseline they can extend to the full node inventory without rebuilding from scratch for each host group. The Terraform integration eliminated the manual gap between provisioning and configuration. And the governance framework — AD/RBAC, Automation Hub, credential management, audit logging — is built to hold as the platform scales.
Key Results:
- Delivered a working Dynatrace→EDA→AAP self-healing flow in 3 weeks
- Validated Puppet replacement on 5–10 nodes with a reusable Ansible baseline role
- Eliminated the manual post-provision configuration step via Terraform→AAP API integration
- Deployed enterprise RBAC with AD/LDAP auth, team structures, and full audit trail
- Produced architecture documentation and sizing guidance for expansion to ~600 nodes
- Closed with a live stakeholder demonstration and accepted findings report
What's Next?
The client is using the solution build as the foundation for a broader rollout. The immediate priorities are extending EDA remediation to additional incident classes and onboarding more teams under the existing RBAC framework.
The longer-term roadmap includes migrating remaining Puppet-managed workloads to Ansible, scaling AAP toward the ~600-node target across multiple plants, and building out additional Terraform→AAP pipeline integrations as provisioning patterns expand.
Shadow-Soft remains engaged as the automation architecture grows.
Client Overview
The client is a Fortune 500 company operating dozens of manufacturing facilities across the United States.
- Industry: Manufacturing
- Size: Enterprise, Fortune 500
- Location: United States