You’re developing the “Next Great Service X” and are knee-deep in the logic managing reads and writes to a data store. Following good engineering practices, you log each read and write operation and cover the nominal use cases. But DevOps need to be notified when access to the data store is lost in production. Is “Service X”, in addition to logging the error, also responsible for raising an “Alert” with the Operations or DevOps team? The answer is no. Services and applications need to be in the business of logging. A “Monitoring & Alerting” system should be in the business of interpreting these log records using business rules, and raising a suitable “Alert” when appropriate.
We make the distinction between the process of “logging” and the act of “alerting” because a service cannot have sufficient context to determine whether any particular condition warrants an alert, whether that alert should be raised or suppressed, or who or what systems should be notified. This context lies solely within the domain of the Monitoring & Alerting system, because that is the system into which the logs from all applications, services and infrastructure are fed.
To the service, the inability to access a data store may be a fatal condition. But suppose the service is deployed into a QA environment: any alert must be suppressed, because we don’t want to pull DevOps out of bed at 3 am due to Continuous Integration testing. Similarly, failure conditions may be logged by services in Production, but Alerts may be routed only to devs and QA because the specific failures relate to an “alpha” feature exercised by employee test accounts. Or, operations may have created a rule that generates a new service ticket if a work unit fails in a processing workflow. But should the failure of a shared resource cause the creation of hundreds of new service tickets? The Monitoring & Alerting system may instead alert DevOps of the resource failure and not flood Operations with hundreds of service tickets for the failed work units until another rule determines that SLAs are being missed.
This leads to the following use cases:
- The need for consistent behavior,
- The need for a consistent alerting and notification framework,
- Empowering others within the organization,
- New alerts from existing logs,
- Logging for development versus logging for monitoring and alerting, and
- Non-functional requirements
The need for consistent behavior: Applications and services must have consistent behavior across deployment environments. A service should not know to suppress notifications and alerts when deployed into a QA environment because this means that critical alerting logic can only be “tested” in Production. Which is really bad. Again, applications and services should just log.
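As a rough illustration of where the suppression decision could live, the sketch below keeps the environment check inside a rule owned by the Monitoring & Alerting system, while the service emits the same record everywhere. The `LogRecord` fields, the event name `datastore_unreachable` and the team names are hypothetical, chosen only to mirror the QA and “alpha” examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical shape of a record as seen by the alerting system."""
    service: str
    environment: str            # e.g. "qa" or "production"
    event: str                  # e.g. "datastore_unreachable"
    feature_stage: str = "ga"   # "alpha" for features used by test accounts


def route_alert(record: LogRecord) -> Optional[str]:
    """Return the team to notify, or None when the alert is suppressed."""
    if record.event != "datastore_unreachable":
        return None             # this rule only covers data store outages
    if record.environment != "production":
        return None             # QA and CI failures never page anyone
    if record.feature_stage == "alpha":
        return "dev-and-qa"     # alpha failures go to devs and QA only
    return "devops"
```

The service never sees this logic. Changing who gets paged, or when, means editing the rule, not redeploying the service.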
The need for a consistent alerting and notification framework: The Monitoring & Alerting platform is responsible for providing a consistent framework through which all alerts should be raised. Having some alerts raised directly by each service and other alerts raised by the Monitoring & Alerting system imposes tight functional coupling by distributing alerting rules and logic throughout all services and infrastructure.
Empowering others in the organization: Others in the organization should have the capability to create alerts and notifications according to their own non-functional requirements. If alerts are raised directly by each service, then adding or changing the alerting rules requires a code change. By encapsulating the alerting function within the domain of the Monitoring & Alerting system we allow anyone with the appropriate authorization to create alerts for themselves.
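One way to picture this is to treat alerting rules as data rather than code. The following is a minimal sketch, assuming a hypothetical rule format in which each rule lists the field values a record must match and the group to notify; no real product's rule syntax is implied.

```python
from typing import Dict, List

# Hypothetical rules, each owned by a different group in the organization.
RULES: List[Dict] = [
    {"match": {"event": "datastore_unreachable", "environment": "production"},
     "notify": "devops"},
    {"match": {"event": "work_unit_failed", "environment": "production"},
     "notify": "operations"},
]


def notifications_for(record: Dict[str, str], rules: List[Dict]) -> List[str]:
    """Return the groups whose rules match every listed field of the record."""
    return [
        rule["notify"]
        for rule in rules
        if all(record.get(key) == value for key, value in rule["match"].items())
    ]
```

Adding an alert for another group then amounts to appending an entry to the rule list, with no change to any service.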
New alerts from existing logs: Because the Monitoring & Alerting system ingests logs from all services and infrastructure, existing log records can be interpreted in different ways to raise new alerts or change the behavior of existing Alerts - reducing the incidence of “false positives” or “false negatives”.
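To make this concrete, here is a small sketch of a rule added after the fact over records the services were already emitting. It groups work-unit failures by the shared resource they name, so a single resource alert can stand in for many individual ones. The field names (`event`, `resource`) and the threshold are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def failing_resources(records: Iterable[Dict[str, str]], threshold: int = 3) -> List[str]:
    """Return resources named in at least `threshold` work-unit failures."""
    counts = Counter(
        record["resource"]
        for record in records
        if record.get("event") == "work_unit_failed" and "resource" in record
    )
    return sorted(name for name, count in counts.items() if count >= threshold)


# Existing records; nothing about how the services log has changed.
SAMPLE_RECORDS = [
    {"event": "work_unit_failed", "resource": "orders-db"},
    {"event": "work_unit_failed", "resource": "orders-db"},
    {"event": "work_unit_failed", "resource": "orders-db"},
    {"event": "work_unit_failed", "resource": "image-cache"},
    {"event": "work_unit_completed", "resource": "orders-db"},
]
```

Calling `failing_resources(SAMPLE_RECORDS)` flags only `orders-db`, because it is the only resource that reaches the threshold of three failures.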
Logging for development versus logging for monitoring and alerting: It should by now be obvious that logging for monitoring and alerting requires a different type of log record from the one a developer may create when accustomed to grepping through log files after the fact to determine the cause of a production issue. Which is why it is important to log using a consistent format (hello architectural guidelines) such that the underlying platform can transport and route appropriately and the Monitoring & Alerting system can efficiently process these logs to enable real-time monitoring and alerting.
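A consistent format could be as simple as one JSON object per log line. The sketch below uses Python's standard `logging` module with a custom formatter. The field set (`service`, `event`) is an assumption for illustration, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as a single JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # `service` and `event` are hypothetical fields supplied via `extra=`.
            "service": getattr(record, "service", "unknown"),
            "event": getattr(record, "event", "unspecified"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("service-x")
    log.error(
        "Lost connection to data store",
        extra={"service": "service-x", "event": "datastore_unreachable"},
    )
```

Because every record carries the same keys, a downstream system can filter on `event` or `service` directly instead of pattern-matching free text.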
Non-functional requirements: Developers are not mind-readers. While devs are quite capable of logging for the kinds of things that they care about, to log for alerting, reporting and monitoring, devs need good non-functional requirements that describe what the rest of the organization is looking for. Which means the non-functional reporting, monitoring and alerting requirements need to arrive from these other groups before development starts, not after a feature is rolled out to production.