Monitoring and observability allow teams to watch, debug, and understand the state of their systems.
Clone this repo and document your monitoring strategy here:
Content
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.
Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
Monitoring and observability solutions are designed to do the following:
- Provide leading indicators of an outage or service degradation.
- Detect outages, service degradations, bugs, and unauthorized activity.
- Help debug outages, service degradations, bugs, and unauthorized activity.
- Identify long-term trends for capacity planning and business purposes.
- Expose unexpected side effects of changes or added functionality.
From Google - How to implement monitoring and observability
Monitoring is used in combination with a working optimization setup and an incident management procedure.
- Add tracing to your systems
- Add logging to your systems
- Monitor the four golden signals (latency, error rate, traffic, saturation)
- As simple as possible, no simpler
- Create useful dashboards, not impressive dashboards
- Avoid false positives at all costs; all alerts must be actionable
- Formalize your optimization strategy
- Formalize your incident management procedures
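
As a starting point for your own strategy, the golden signals and the "all alerts must be actionable" rule from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Request`, `golden_signals`, `firing_alerts`, and the threshold values are hypothetical names and numbers, not part of any monitoring product. It also assumes that a status code of 500 or higher counts as an error and that saturation can be approximated as busy workers divided by total workers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP-style status code


def golden_signals(requests: List[Request], window_s: float,
                   busy_workers: int, total_workers: int) -> Dict[str, float]:
    """Summarize latency, traffic, error rate and saturation for one window."""
    saturation = busy_workers / total_workers
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * len(durations) + 99) // 100)
    # Assumption: only server-side failures (status >= 500) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": saturation,
    }


def firing_alerts(signals: Dict[str, float],
                  thresholds: Dict[str, float]) -> List[str]:
    """Return only the signals that exceed an explicitly chosen threshold."""
    return [name for name, limit in thresholds.items()
            if signals[name] > limit]


if __name__ == "__main__":
    sample = [Request(100.0, 200)] * 18 + [Request(900.0, 500),
                                           Request(1200.0, 500)]
    signals = golden_signals(sample, window_s=10.0,
                             busy_workers=3, total_workers=4)
    # Example thresholds; pick values that someone can actually act on.
    print(firing_alerts(signals, {"error_rate": 0.05, "saturation": 0.9}))
```

Keeping the thresholds in an explicit dictionary means every alert has a documented limit, which makes it easier to review whether each one is actionable. In a real system these numbers would come from your metrics backend rather than an in-memory list.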