Every business is at a different place with their monitoring systems. A surprising number of companies have no monitoring at all, while a second group has a monitoring system that they are generally unhappy with. A third group has a monitoring solution in place that is not providing the data that they need.
Businesses have mission-critical systems they use every day, applications their employees rely upon and services their customers are paying for. Keeping those systems, applications and services as available as possible is why a well-configured, relevant and maintainable monitoring solution is crucial for your business’s success.
In my experience, systems management tools like monitoring succeed and achieve maximum adoption within an organization when they not only provide the most relevant data to the operations team but are also perceived to be highly accurate and reliable. Additionally, I’ve seen many implementations go astray when the initial plan is too complicated and tries to address too many issues at once.
For these reasons, I advocate an incremental approach when implementing any monitoring system, just like I would with a software development project.
In software development, the end goal of the project is not always clear when you begin the design and coding. Often developers find that APIs and tools do not work the way they at first thought, causing them to take a different approach. Also, business requirements frequently change before even the first release of a project is ready. This is one of the reasons why Agile and sprinting have become so popular in software development departments: the ability to course-correct when technical or business objectives change over time.
Implementing a new systems management tool should be approached in a similar way. Start with a working core system with no bells or whistles. Make sure your core is rock solid before adding features or additional systems. Establish reliable working methods for each new step so that future iterations of implementation are better and more successful than the last. Take time after each iteration to make sure you understand what went well and what should be improved.
When it comes to monitoring, the same principles apply. Start by building a solid monitoring server, not one that you’ll have to recreate almost immediately because the software was not installed correctly or because the machine was improperly sized. Make sure the monitoring system monitors the most critical services on your server first: connectivity, CPU utilization, memory, and of course disk space. Having a properly working core system will save you many future headaches.
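To make the idea of a core check concrete, here is a minimal sketch of a disk space check using only the Python standard library. The threshold values and the function names `classify` and `check_disk` are assumptions for illustration; a real monitoring product will have its own plugins and its own way of setting thresholds.

```python
import shutil

# Hypothetical thresholds; tune them for your environment.
WARN_PERCENT = 80.0
CRIT_PERCENT = 90.0


def classify(used_percent, warn=WARN_PERCENT, crit=CRIT_PERCENT):
    """Map a utilization percentage to OK, WARNING, or CRITICAL."""
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


def check_disk(path="/"):
    """Return (status, used_percent) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    return classify(used_percent), used_percent


if __name__ == "__main__":
    status, pct = check_disk("/")
    print(f"DISK {status} - {pct:.1f}% used on /")
```

The same three-state pattern (OK, WARNING, CRITICAL) can be reused for CPU and memory once the disk check has proven reliable.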
Select a few different roles from the different servers you need to monitor, ideally from a lab or development environment. Each of these roles is likely to have different resources you need to monitor (for instance, you want to make sure your web servers are listening on ports 80 and 443, or you want your MySQL server to be running the “mysqld” process). Pick a few of the most important roles, maybe three or so, and determine what the critical services are to monitor for each of them.
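As an illustration of a role-specific check, the sketch below probes whether a web server is accepting TCP connections on ports 80 and 443. It uses only the standard `socket` module; the function names are hypothetical, and a successful connection only shows that something is listening, not that the web application behind it is healthy.

```python
import socket

# Ports a web server role is expected to listen on (from the example above).
WEB_PORTS = (80, 443)


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_web_server(host, ports=WEB_PORTS):
    """Return a dict mapping each expected port to True (open) or False."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `check_web_server` with the hostname of one of your lab web servers returns one result per port, which makes it easy to report exactly which listener is missing.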
This is the place where a project can quickly get too complicated for the initial iteration: do not try to implement every possible service, and do not try to integrate non-stock plugins or packages at this point. It is good to know which of these plugins and packages you want to integrate, but put them on your project backlog. Get the core working first.
Once you have selected a few roles and a few services to monitor for each role, select a small representative set of servers to monitor. Build out the monitoring for each of those servers based on the role and service structure that you have already determined.
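One way to picture the role and service structure is as two small tables: services per role, and servers per role. The sketch below expands them into a flat list of what to monitor. The hostnames and service labels are made up for illustration and are not the configuration syntax of any particular monitoring product.

```python
# Critical services per role. The web and database entries follow the
# examples in the text; the labels are plain strings, not a real tool's syntax.
ROLE_SERVICES = {
    "web": ["tcp/80", "tcp/443"],
    "database": ["process/mysqld"],
}

# Core checks every server gets, regardless of role.
CORE_SERVICES = ["ping", "cpu", "memory", "disk"]

# Hypothetical lab inventory: hostname -> role.
LAB_SERVERS = {
    "web01.lab.example": "web",
    "web02.lab.example": "web",
    "db01.lab.example": "database",
}


def build_plan(servers, role_services, core_services):
    """Return a sorted list of (host, service) pairs to monitor."""
    plan = []
    for host, role in servers.items():
        for service in core_services + role_services[role]:
            plan.append((host, service))
    return sorted(plan)


if __name__ == "__main__":
    for host, service in build_plan(LAB_SERVERS, ROLE_SERVICES, CORE_SERVICES):
        print(f"{host}: {service}")
```

Keeping the role definitions separate from the server list means that adding a server later is a one-line change, and adding a service to a role applies it to every server in that role.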
Make all of that work, then take a breath. You did it. You made a working monitoring system. Almost.
This is a good time to take a break from the full-bore implementation and see whether the monitoring system is doing what it is supposed to do. Watch the dashboards and the alarm streams. Double-check that what the system is reporting to you is actually true. You should be skeptical. It may be the best system you have ever seen, but you should not believe it until you have verified that it is reporting truthfully. There may be problems in how you have clients set up on your monitored systems, or how you have defined the services in the monitoring server. This is the best time to figure out those problems, before the system grows large enough that it is more difficult to manage.
Also, this is a great time to implement notifications, probably by email, since this will help you keep an eye on your new server that you’re still a little skeptical about. Tinker with the notifications to make sure they are reporting what you really want to see in an alarm, that they are readable and that they will be meaningful to the person who will be receiving them.
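As a sketch of what a readable notification might look like, the example below builds an alert email with Python's standard `email.message.EmailMessage`. The function name, the addresses, and the message layout are assumptions; most monitoring products have their own notification templates, and the point here is only that the subject line should say what is wrong and where.

```python
from email.message import EmailMessage


def build_alert(host, service, status, detail, sender, recipient):
    """Build an alert email whose subject states the status, service, and host."""
    msg = EmailMessage()
    msg["Subject"] = f"[{status}] {service} on {host}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Host:    {host}\n"
        f"Service: {service}\n"
        f"Status:  {status}\n"
        f"Detail:  {detail}\n"
    )
    return msg


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    alert = build_alert(
        host="web01.lab.example",
        service="disk",
        status="CRITICAL",
        detail="92% used on /",
        sender="monitoring@lab.example",
        recipient="ops@lab.example",
    )
    # Printing instead of sending; delivery would need an SMTP server,
    # for example via smtplib.SMTP(...).send_message(alert).
    print(alert)
```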
With your small group of servers, cause actual problems, such as stopping processes that should be running or nearly filling a disk. Try shutting a server down entirely. (This is why you should implement with lab systems first.) Make sure your monitoring system accurately reports these failures.
Once you have gained confidence that your monitoring is working, only then should you broaden your implementation. Add more servers for the roles you have and maybe add another role. Then take another breath and observe.
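A simple way to keep score during such a drill is to write down the failures you caused and compare them with the alarms the monitoring system raised. The sketch below does that comparison; the function name and the sample data are hypothetical, and in practice the reported list would come from your monitoring system's alarm stream.

```python
def verify_drill(injected, reported):
    """Compare failures you caused with the alarms that were raised.

    Both arguments are iterables of (host, service) pairs. Returns the
    failures that went unreported and the alarms nobody caused.
    """
    injected, reported = set(injected), set(reported)
    return {
        "missed": sorted(injected - reported),
        "unexpected": sorted(reported - injected),
    }


if __name__ == "__main__":
    # Placeholder drill data for illustration.
    injected = [("db01.lab.example", "process/mysqld"),
                ("web01.lab.example", "disk")]
    reported = [("db01.lab.example", "process/mysqld"),
                ("web02.lab.example", "cpu")]
    result = verify_drill(injected, reported)
    print("Missed alarms:    ", result["missed"])
    print("Unexpected alarms:", result["unexpected"])
```

Anything in the missed list points to a check that is not working; anything in the unexpected list is either a real problem you did not know about or a false alarm worth investigating.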
Taking an incremental approach to implementing your new monitoring system or any other systems management tool may seem like it is taking longer than a “full speed ahead” implementation, but the payback will become evident when your system works more accurately and reliably because you took the time to do it right.