Day 2 Operations on OpenShift’s Kubernetes Platform
Thinking about containers? Chances are you’ve been considering a way to manage the development and operations surrounding them, and you’ve likely evaluated OpenShift as the platform to do just that. You’ve had the POC and have a production-ready cluster stood up with workloads running, but now what? What about “Day 2”? We at Shadow-Soft look to answer that question and share some lessons learned from running production clusters.
Observability
Information and transparency are paramount when it comes to operations. Proper observability measures provide demonstrable benefits to both the organization and the new OpenShift platform. Observability can be broken down into three core areas: logging, metrics, and tracing.
OpenShift ships with built-in logging and metrics infrastructure. An EFK (Elasticsearch, Fluentd, Kibana) stack provides the logging functionality, and by default OpenShift retains a maximum of 14 days or 200GB of application logs, whichever comes first. Metrics monitoring provides the ability to view CPU, memory, and network-based metrics in the OpenShift web console. The metrics stack consists of the following components:
- Hawkular Metrics: A metrics engine that stores the data persistently in a Cassandra database. When this is configured, CPU, memory, and network-based metrics are viewable from the OpenShift web console and are available for use by horizontal pod autoscalers (see the example autoscaler manifest after this list).
- Heapster: A service that retrieves a list of all nodes from the master server, then contacts each node individually and scrapes the metrics for CPU, memory, and network usage. It then exports these into Hawkular Metrics.
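Since those metrics feed the horizontal pod autoscaler, a minimal autoscaler definition looks roughly like the sketch below; the project name, workload name, replica counts, and CPU threshold are illustrative placeholders, not values from a real cluster.

```yaml
# Hypothetical example: scale the "frontend" DeploymentConfig in project
# "my-project" between 2 and 10 replicas based on average CPU utilization.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
  namespace: my-project
spec:
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig
    name: frontend
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 75
```

Applied with `oc apply -f`, the autoscaler’s current and target utilization can then be checked with `oc get hpa`.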
For cluster monitoring, OpenShift clusters come with an integrated Prometheus/Grafana stack. Monitoring service containers run on each application node, Prometheus scrapes metrics from them, and the data is surfaced in Grafana dashboards.
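To show how that stack can be put to work for alerting, here is a sketch of a rule, assuming the Prometheus Operator’s PrometheusRule custom resource is available (as it is in the integrated cluster monitoring stack); the rule name, threshold, and target namespace are illustrative.

```yaml
# Hypothetical rule: warn when a node reports less than 10% available memory
# for 15 minutes (metric names assume a recent node_exporter).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-low
  namespace: openshift-monitoring
spec:
  groups:
  - name: node-memory.rules
    rules:
    - alert: NodeMemoryLow
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
      for: 15m
      labels:
        severity: warning
      annotations:
        message: "Node {{ $labels.instance }} has less than 10% memory available."
```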
These platform-provided tools, combined with additional tooling and some crafty plumbing work, can deliver an outstanding monitoring solution for your cluster and workloads. Examples of creative additions include forwarding logs via Fluentd to an enterprise-wide platform like Splunk, the ELK stack, or Sumo Logic; implementing Prometheus-based metrics monitoring; interpreting audit logs and tying them into alerting systems (like Icinga); and implementing service-mesh and tracing solutions. The ability to monitor the cluster in depth and trace down issues provides assurance for any platform or application team. Further tooling like Instana or Sysdig for APM are also good add-on choices.
Security
Thankfully, the folks at Red Hat have security-hardened OpenShift out-of-the-box, with features like RBAC, secrets management, service accounts, security context constraints, and identity management and authentication provider integrations. Despite this, the buck for security doesn’t stop there, and for many organizations, more robust security measures need to be considered.
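As a small, concrete example of the built-in RBAC, read-only access to a single project can be granted declaratively; the group and project names below are hypothetical, and the same result can be achieved imperatively with `oc adm policy add-role-to-group view dev-team -n my-project`.

```yaml
# Hypothetical RoleBinding: give the "dev-team" group read-only ("view")
# access to the "my-project" namespace using the built-in cluster role.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-view
  namespace: my-project
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: dev-team
```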
There is a wide array of security solutions that can be implemented, from scanning, verifying, and signing images and enforcing image compliance policies, to audit and security analysis, to running regular chaos or vulnerability tests, as well as adhering to industry best practices for managing microservice-oriented infrastructures.
Reducing the attack surface is the goal of most security personnel and the DevOps team at large, so getting a grip on vulnerabilities at the host, platform, and container levels, backed by regular updates and patching, will be key. Additional tooling, like Sysdig for container security, provides deep insight into your containers, structured policy and access controls within OpenShift, and security forensics on containers in the event of a breach or attack. When it comes to security, there is no such thing as “too many” defensive measures.
Backup, Restore, Upgrades
Another important part of your Day 2 considerations is how you process, store, and secure information in the event of mistakes or disaster. Do you have a backup plan and a plan to restore data? Do you have a way to ensure upgrades are performed successfully with little to no downtime? Backups are critical to any infrastructure platform, and with OpenShift, multi-tenancy gives you the ability to create a shared responsibility model for data. Depending on your persistent storage choices, backing up the registry and various infrastructure components may be a breeze and simply part of the regular backup procedure or policy. The cluster definition itself can be backed up entirely via an etcd backup, although regular testing of the backup and restore procedure should be performed to ensure integrity and performance.
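For the etcd piece specifically, one way to make snapshots routine is a scheduled job pinned to a master node. The sketch below is only a starting point: the container image, certificate paths, and backup directory are placeholders for values specific to your environment, and the job’s service account needs a security context constraint that permits hostPath mounts.

```yaml
# Hypothetical nightly etcd snapshot. Image, certificate paths, and backup
# directory are placeholders; adjust them to match your masters.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: "true"
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          hostNetwork: true
          restartPolicy: OnFailure
          containers:
          - name: snapshot
            image: registry.example.com/tools/etcdctl:latest   # placeholder image containing etcdctl
            command:
            - /bin/sh
            - -c
            - >
              ETCDCTL_API=3 etcdctl
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/etcd/ca.crt
              --cert=/etc/etcd/peer.crt
              --key=/etc/etcd/peer.key
              snapshot save /backup/etcd-$(date +%Y%m%d).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd
```

The snapshot itself is only half the story; regularly restoring it into a scratch environment is what proves the backup is actually usable.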
Upgrades present another challenge to process and stability; however, with each successive release, OpenShift upgrade procedures are getting easier and simpler. The new Control Plane Migration Assistant (CPMA) for migrating your control plane, together with the Cluster Application Migration (CAM) tool for migrating your application workloads, makes the upgrade and migration process from OpenShift 3 to 4 simpler and faster than ever before.
Resource Utilization
Efficient resource utilization is probably one of the core drivers for organizations adopting microservices in the first place, so how do we ensure responsible and efficient resource usage? OpenShift affords administrators the ability to ration and distribute compute, memory, and network resources at a granular level to each tenant if need be, giving true control over the cluster and its underlying resource availability. Resource limits can also be enforced at the namespace and cluster level, safeguarding against runaway applications that can hamstring cluster resources.
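In practice those guardrails are usually expressed as a ResourceQuota and a LimitRange per project; the numbers below are illustrative starting points rather than recommendations.

```yaml
# Hypothetical per-project guardrails: cap the namespace's total requests/limits
# and give containers sane defaults when they don't specify their own.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-project
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: my-project
spec:
  limits:
  - type: Container
    default:            # applied as the limit when none is set
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # applied as the request when none is set
      cpu: 100m
      memory: 256Mi
```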
Persistent Storage
Containers and Kubernetes are ephemeral by nature, which can present challenges for stateful applications and for maintaining and storing the state of the cluster, the registry, and the underlying applications. OpenShift features a fairly robust and comprehensive persistent storage solution, in addition to the ability to link more traditional filesystem- or block-based backends into the system. While these work for most situations, they are not always the answer, or at least not the complete answer, and sometimes even more robust solutions are needed. Tried-and-true persistent storage solutions for containers, like Portworx, provide resilient, dependable, and highly available persistent and distributed storage for maintaining statefulness.
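A workload opts into that persistent storage by claiming it. In the sketch below, the storage class name is an assumption and would map to whatever backend (file, block, or a solution like Portworx) your cluster actually exposes.

```yaml
# Hypothetical claim for 10Gi of read-write storage from a named storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: my-project
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard   # placeholder; use a class your cluster provides
  resources:
    requests:
      storage: 10Gi
```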
Human Ops
You’ve achieved success with all the above components, but now what? What remains? One thing ever-present in this process is the human element. How are teams organized to run efficiently and provide adequate coverage? How, and by what process, do the teams deliver their services to tenants of the platform and to end users? How will the team handle upgrades, new features, and necessary maintenance in a dynamic landscape of ever-increasing innovation? The secret is the key ingredient of DevOps: automation. Automate where possible to remove tedious human interactions, freeing the team up to tackle real, complex problems and giving time back to creative pursuits. Google’s Site Reliability Engineering book is a fantastic reference for teams looking to implement OpenShift into their service delivery process and organization at large.
Final thoughts
Hopefully, with all this taken into consideration, your team will be able to take the leap into microservices and leverage empowering, feature-rich platforms like OpenShift to manage them. It’s not enough to just acquire technology; you also need a plan for how to continue to successfully employ and advance those tools, Day 2 and onwards…