Monitoring Butterflies and Haystacks
March 20, 2011
Tom recently commented that one of my standard technology slides was out of date. Apparently we don't track 5,000 service checks and metrics anymore. We track 10,000! That is a lot of data and a lot of visibility into our infrastructure and platform operations.
This is a great indicator of how much work continues to go into improving our monitoring and notification systems. This is a critical part of the work our technology teams do, and while the implementation work has been primarily a TechOps task, the overall development and refinement of our platform monitoring capabilities has truly been a cross-team effort. I can't remember a Tuesday Architecture Meeting that has not identified some new metric or service check to improve our platform management capabilities.
Monitoring and notification is much, much more than just paging someone if a service is down. It is about maintaining a high quality of service, reducing the risk while enhancing the effectiveness of change, and understanding the underlying dynamics of the platform for both short- and long-term planning. Some key goals of our monitoring systems:
- The Butterfly Effect - The Clickability Platform is complex on several dimensions. It is a complex application, a complex infrastructure, and customers do complex things on the platform. Finding the root causes of problems is only possible by having detailed metrics across the entire platform and an ability to correlate events. Sometimes it truly feels like finding the butterfly that flapped his wings in China and made it rain in New York (from the origins of Chaos Theory).
- Needle in the haystack - 10,000 metrics is a lot to sort through to investigate an issue! Add in all the log files we collect and the task can be daunting. Being able to find the proverbial needle in the haystack is only possible by having broad and deep monitoring coverage, well-organized tools that provide high-level dashboard views and drill-down capabilities to each individual metric, and a knowledge and familiarity with the platform to sniff out whatever is going wrong.
- Early warning system - Proper notification is about detecting issues before they become service problems. Obviously, detecting service problems is critical, but the ability to identify conditions that may lead to service issues before they happen is even more mature and powerful. That way, proactive measures can be taken to correct whatever is happening before customers are affected, and our high quality of service is maintained.
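For anyone curious what "early warning" looks like in practice, here is a minimal sketch of a two-level threshold check. It is not one of our actual checks: the `evaluate` function, the disk-usage readings, and the 80/95 percent thresholds are all made up for illustration. The only thing borrowed from the real tooling is the standard Nagios plugin status convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from enum import IntEnum


class Status(IntEnum):
    # Standard Nagios plugin return codes
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def evaluate(value, warn, crit):
    """Classify a metric where higher values are worse (hypothetical helper)."""
    if value is None:
        # No reading at all is its own condition, not "OK"
        return Status.UNKNOWN
    if value >= crit:
        return Status.CRITICAL
    if value >= warn:
        # The early warning: unhealthy trend, but no service impact yet
        return Status.WARNING
    return Status.OK


# Illustrative disk-usage readings (percent); thresholds are assumptions
for pct in (62.0, 85.5, 97.1):
    status = evaluate(pct, warn=80.0, crit=95.0)
    print(f"DISK {status.name} - {pct:.1f}% used")
```

The point of the WARNING band is that someone can act on the 85.5% reading well before it turns into the 97.1% one.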
If you have a few minutes, drop by the TechOps area sometime and ask them to show you Cacti (metric tracking and graphing) and Nagios (notifications). I think you will be amazed at the depth of information available, the immediate availability of key metrics from top to bottom in the platform, and the amount of work that has gone into developing these finely tuned, critical technology management tools.