I have a saying: “You can’t manage what you don’t know.” And what we need to know to manage IT effectively, to prevent downtime, maximize utilization (cost efficiency), and deliver maximum reliability, is everything.
Knowing everything can be done. The challenge is doing it cost effectively and efficiently, and the solution to that challenge is Enterprise Monitoring Architecture (EMA). EMA works by defining everything that needs to be monitored: diagram the IT architecture by OSI layer, then assign monitoring services to each layer (storage, networking, etc.). These are the EMA Sources. With all of the Sources defined, monitoring services can be evaluated to cover them as comprehensively as possible with as few services as possible. Using the smallest number of monitoring services is usually the most cost effective and the most efficient to configure and manage.
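To make this concrete, here is a minimal sketch of an EMA Source inventory in Python. The layer names, device names, and service assignments are hypothetical examples, not a prescribed catalog; the point is that every Source belongs to a layer and every layer has a monitoring service assigned to it.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str                   # e.g. "network", "storage"
    sources: list[str]          # every device, system, and port in the layer
    monitoring_service: str     # the service assigned to watch this layer

# Hypothetical inventory, diagrammed roughly by OSI layer.
# Note the monitoring systems themselves are listed as Sources.
inventory = [
    Layer("network", ["fw-01", "sw-01:ports1-48", "sw-02:ports1-48"], "net-monitor"),
    Layer("storage", ["san-01", "nas-01"], "storage-monitor"),
    Layer("servers", ["web-01", "web-02", "db-01"], "host-monitor"),
    Layer("monitors", ["net-monitor", "storage-monitor", "host-monitor"], "meta-monitor"),
]

def coverage_report(layers: list[Layer]) -> None:
    """Show which service covers which Sources and how many services are in use."""
    for layer in layers:
        print(f"{layer.name}: {len(layer.sources)} sources -> {layer.monitoring_service}")
    services = {layer.monitoring_service for layer in layers}
    total = sum(len(layer.sources) for layer in layers)
    print(f"{len(services)} monitoring services cover {total} sources")

coverage_report(inventory)
```

Evaluating candidate monitoring services against an inventory like this makes the trade-off explicit: fewer distinct services covering the same Sources means fewer tools to configure and manage.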
Enterprise monitoring applications have also historically been very expensive, in my opinion even unreasonably so in terms of ROI. I have had quotes of up to $600k, but newer vendors have started offering solutions at much lower and more reasonable cost.
Back to the EMA premise: by defining a holistic monitoring architecture, all architecture and engineering teams (servers, storage, etc.) understand what monitoring is required and can configure systems monitoring consistently, preferably during installation.
Engineering teams should even consider monitoring functionality and cost as an evaluation criterion, for example the ease of monitoring a storage system’s IOPS in addition to its capacity. I would even select a system with better monitoring functionality over one with more capacity or a lower price. This is an Operations-centric view of IT implementation, with a goal not of 99.99% or the mythical Five Nines but of continuous 100% availability. An example of EMA working: I started getting reports of internet access failures, and within a minute the firewall sent alerts that its TCP connection limit was intermittently maxing out. I emailed all staff and found that somebody was capacity testing from the office. The testing was stopped and the issue was resolved in under five minutes. Without this level of monitoring, days could have been spent troubleshooting for circuit errors.
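The firewall story is ordinary threshold alerting, and it is worth seeing how little machinery that requires. Below is a minimal sketch; poll_tcp_connections() and send_alert() are hypothetical stand-ins for whatever collector and dispatcher your monitoring system actually provides, and the limit and threshold values are illustrative.

```python
import random

TCP_CONN_LIMIT = 10000    # hypothetical firewall connection-table limit
ALERT_THRESHOLD = 0.95    # alert when within 5% of the limit
POLL_SECONDS = 60         # a one-minute poll gives roughly one-minute detection

def poll_tcp_connections(source: str) -> int:
    """Stand-in collector; a real one would query the firewall via SNMP or its API."""
    return random.randint(9000, 10000)    # simulated reading for the sketch

def send_alert(target_group: str, message: str) -> None:
    """Stand-in dispatcher; a real one would mail the 24x7 Alerts target group."""
    print(f"ALERT -> {target_group}: {message}")

def check_firewall(source: str = "fw-01") -> None:
    count = poll_tcp_connections(source)
    if count >= TCP_CONN_LIMIT * ALERT_THRESHOLD:
        send_alert("alerts-network",
                   f"{source} TCP connections at {count}/{TCP_CONN_LIMIT}")

check_firewall()    # a real monitor would run this every POLL_SECONDS
```

The value is not in the check itself but in having checks like it on every Source, wired to the right Target, before the incident happens.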
Enterprise Monitoring Architecture consists of EMA Sources, Systems, Clients, and Targets. Sources are everything: all systems, firewalls, servers, storage, every port on every switch; even monitor the monitoring systems. I was working at a SaaS company that delivered its product to the clients’ FTP servers, and I monitored even the clients’ FTP servers. Imagine the clients’ response when we called to tell them their systems were down before they even knew.
- EMA Systems are monitoring systems that generate reports in addition to alerts.
- EMA Clients are devices that receive monitoring information like tablets, phones, consoles, etc.
- EMA Targets are mail groups. Three types of Target are defined for each application, project, group, etc.; these EMA Target types are Alerts, Notifications, and Reports. Alerts go to cell phones and mail 24×7. Notifications go to cell phones during business hours and to mail 24×7. Reports go to mail 24×7. Mail group membership is again holistic and should include the Engineering, Operations, and Business owners specific to the target. This is the major efficiency enabler: editing one mail group in effect reconfigures dozens, if not hundreds, of EMA Sources and staff at once. This is how EMA is done efficiently.
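Here is a minimal sketch of those three Target types as routing rules. The group names, addresses, and business-hours window are hypothetical; what matters is that Sources report to a Target type, so editing one mail group changes delivery for every Source behind it.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Target:
    mail_group: list[str]    # Engineering, Operations, and Business owners
    cell_group: list[str]
    cell_24x7: bool          # False = cell delivery only during business hours

# Hypothetical Targets for one application.
targets = {
    "alerts":        Target(["eng@example.com", "ops@example.com"],
                            ["oncall@example.com"], cell_24x7=True),
    "notifications": Target(["eng@example.com", "ops@example.com", "biz@example.com"],
                            ["oncall@example.com"], cell_24x7=False),
    "reports":       Target(["eng@example.com", "ops@example.com", "biz@example.com"],
                            [], cell_24x7=True),
}

def deliver(target_type: str, message: str, now: datetime) -> None:
    target = targets[target_type]
    business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    for addr in target.mail_group:    # mail always goes 24x7
        print(f"mail {addr}: {message}")
    if target.cell_group and (target.cell_24x7 or business_hours):
        for addr in target.cell_group:
            print(f"cell {addr}: {message}")

deliver("alerts", "fw-01 TCP connections near limit", datetime.now())
```

Adding a new business owner to the notifications group here is one edit, yet it changes what that person receives from every Source that reports to that Target.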
EMA is also a replacement for troubleshooting; it is automatic troubleshooting. If you implement EMA well, it will report issues to the right people at the right time.
I consider EMA mandatory for a Cloud Architecture because of the higher number of applications sharing the same hardware. If there is a hardware issue, more applications are affected, so monitoring is critical to preventing issues and responding to them as quickly as possible. Capacity Reporting is also critical to a Cloud Architecture and is an EMA reporting requirement.
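As a sketch of what an EMA capacity report could look like on shared cloud hardware, the following flags hosts approaching a utilization ceiling. The hosts, metrics, and 80% threshold are illustrative assumptions, not fixed recommendations.

```python
CAPACITY_WARN = 0.80    # illustrative warning threshold

# Simulated per-host utilization; a real report would pull this from the
# monitoring systems covering the virtualization hosts.
utilization = {
    "vmhost-01": {"cpu": 0.72, "memory": 0.88, "storage": 0.61},
    "vmhost-02": {"cpu": 0.41, "memory": 0.55, "storage": 0.83},
}

def capacity_report(data: dict[str, dict[str, float]]) -> list[str]:
    """List each host that is over the warning threshold on any metric."""
    lines = []
    for host, metrics in sorted(data.items()):
        flagged = [f"{name} {value:.0%}" for name, value in metrics.items()
                   if value >= CAPACITY_WARN]
        if flagged:
            lines.append(f"{host}: over {CAPACITY_WARN:.0%} on " + ", ".join(flagged))
    return lines

# Mailed on a schedule to the Reports target group (24x7 mail).
print("\n".join(capacity_report(utilization)))
```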
EMA isn’t just a theory. I have put it into practice across hundreds of servers and multiple data centers for over five years, delivering exceptional results: services running for over two years with 100% availability. A major reason for this was a highly virtualized, fully redundant Hybrid RAID Cloud Architecture built with tier 1 vendors (but that’s another article), and a little bit of luck. Another saying I like is “You make your own luck.”
Rick Parker has a wide range of experience in the IT industry, including serving as the IT Director of Fetch Technologies, a provider of solutions that extract and analyze website data for enterprise clients. He has over 25 years of IT experience as a manager, consultant, and network administrator. At Fetch, over the last two years, Rick designed and built the network that became the Fetch Private Cloud. Previously he was the founder and CTO of Bedouin Networks, one of the first Infrastructure-as-a-Service providers, and the IT Director for Vendare Media, a successful internet startup.
Follow Rick online via Twitter.