Monitoring and Alerting for Teams That Just Inherited Production

observability, monitoring, alerting, platform engineering

At some point, many teams hear the same sentence: “You own production now.”

The migration is complete. The system is live. The architecture diagram looks clean. And then someone asks the obvious question:

“How are we going to monitor this?”

Observability is often presented as a tooling problem. Install something. Ship logs. Add dashboards. Configure alerts. Done.

In reality, monitoring and alerting are design decisions. They require clarity on what matters, what impact looks like, and what action should follow when something breaks.

This article outlines a practical framework for building a sane monitoring and alerting strategy — starting with definitions, then moving into impact, thresholds, and ultimately actionable alerts.


Monitoring? Alerting? Observability?

These terms are sometimes used interchangeably. Some people say monitoring and alerting but mean observability; others really do want just monitoring. So what’s the difference?

Monitoring

Is the what. What is my CPU utilisation for the last hour? How much memory does the system have available? Monitoring is going to tell you something is wrong but not much else.

Alerting

The alarm. The notification. The thing that’s going to get your attention when something is going wrong.

Observability

This is the why. Observability helps us understand why a system behaves in the way it does. Observability done well tells you what’s wrong, why and how to fix it.

Can I monitor but not alert or observe? Yes. Do I have to alert? No. Do I have to do full end-to-end observability like Google does? No. Whilst reading this, keep asking “what is important to me and my team? What do I want to be getting a page at 3am for?”

Before we dive into where we even start with all this, some key terms to understand.

The Three Pillars of Observability

Logs

If you are reading this you probably do not need much of an explanation as to what a log is. They can be infrastructure-generated logs or application logs, but keep in mind that the quality of the logs is important. A custom application spitting out logs that mean nothing not only hinders troubleshooting when something goes wrong but also makes those logs useless for observability.

2026-02-15T13:58:01Z 10.0.4.12:443 192.168.12.44:51522 GET /api/orders/123 HTTP/1.1 200 842 0.021 0.022 "Mozilla/5.0" trace_id=4d9c0e3b2a1
2026-02-15T13:58:02Z 10.0.4.12:443 192.168.12.77:44110 POST /api/login HTTP/1.1 401 112 0.003 0.003 "curl/8.5.0" trace_id=ab22ff981aa
2026-02-15T13:58:03Z 10.0.4.12:443 192.168.12.19:59001 GET /health HTTP/1.1 200 32 0.001 0.001 "kube-probe/1.30" trace_id=33bfa1dbe01
2026-02-15T13:58:05Z 10.0.4.12:443 192.168.12.91:38844 GET /api/products HTTP/1.1 200 5421 0.041 0.042 "Mozilla/5.0" trace_id=98af2c7d991
2026-02-15T13:58:07Z 10.0.4.12:443 192.168.12.62:44192 PUT /api/orders/123 HTTP/1.1 500 241 0.210 0.211 "Mozilla/5.0" trace_id=0ffbc1e4d21
2026-02-15T13:58:08Z 10.0.4.12:443 192.168.12.10:32910 GET /favicon.ico HTTP/1.1 404 0 0.000 0.000 "Mozilla/5.0" trace_id=de93aa77200
2026-02-15T13:58:09Z 10.0.4.12:443 192.168.12.33:61044 POST /api/cart HTTP/1.1 200 133 0.018 0.019 "Mozilla/5.0" trace_id=9aafef23101
2026-02-15T13:58:11Z 10.0.4.12:443 192.168.12.84:48771 GET /api/orders HTTP/1.1 502 301 0.098 0.099 "Mozilla/5.0" trace_id=71cbbad9aa2
2026-02-15T13:58:12Z 10.0.4.12:443 192.168.12.73:39021 GET /api/users/me HTTP/1.1 200 731 0.015 0.016 "Mozilla/5.0" trace_id=33a8fe12cd3
2026-02-15T13:58:13Z 10.0.4.12:443 192.168.12.17:51932 DELETE /api/cart/55 HTTP/1.1 204 0 0.006 0.006 "Mozilla/5.0" trace_id=bb01cffa231
2026-02-15T13:58:14Z 10.0.4.12:443 192.168.12.65:45610 POST /api/payment HTTP/1.1 500 412 0.322 0.323 "Mozilla/5.0" trace_id=91da2f001ab
2026-02-15T13:58:15Z 10.0.4.12:443 192.168.12.98:33440 GET /api/products/999 HTTP/1.1 404 88 0.004 0.004 "Mozilla/5.0" trace_id=ee12a0a9123
2026-02-15T13:58:16Z 10.0.4.12:443 192.168.12.56:52910 GET /metrics HTTP/1.1 200 2210 0.010 0.011 "Prometheus/2.51" trace_id=aa01cce9912
2026-02-15T13:58:17Z 10.0.4.12:443 192.168.12.44:51528 GET /api/orders/124 HTTP/1.1 200 844 0.020 0.021 "Mozilla/5.0" trace_id=5dd12ac01ef
2026-02-15T13:58:18Z 10.0.4.12:443 192.168.12.72:37111 PATCH /api/users/me HTTP/1.1 400 129 0.007 0.007 "Mozilla/5.0" trace_id=be98f1aa000
2026-02-15T13:58:19Z 10.0.4.12:443 192.168.12.23:44002 GET /api/reports HTTP/1.1 503 502 1.102 1.104 "Mozilla/5.0" trace_id=fa7729bc123
2026-02-15T13:58:20Z 10.0.4.12:443 192.168.12.30:51033 GET /api/products HTTP/1.1 200 5421 0.039 0.040 "Mozilla/5.0" trace_id=1d22beac011
2026-02-15T13:58:21Z 10.0.4.12:443 192.168.12.11:60944 POST /api/login HTTP/1.1 200 188 0.011 0.012 "Mozilla/5.0" trace_id=aa9d0012bc1
2026-02-15T13:58:22Z 10.0.4.12:443 192.168.12.49:38890 GET /api/orders HTTP/1.1 200 2101 0.027 0.028 "Mozilla/5.0" trace_id=dc812aa5512
2026-02-15T13:58:23Z 10.0.4.12:443 192.168.12.90:47821 POST /api/payment HTTP/1.1 500 415 0.295 0.296 "Mozilla/5.0" trace_id=991abcde912

The above log snippet is from a local NGINX server. When considering logs through the lens of observability, we can look for patterns and build monitoring, and then alerting, from them (log-based metrics). We can also use them in a wider solution to tell us about events that don’t fit neatly into a percentage. For example, this log shows not only the number of 4xx and 5xx status codes over time but also the API endpoint generating them. We can graph this information over time to understand when certain endpoints start failing, and supplement it with other data to understand why.
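As a rough sketch of log-based metrics, the Python below counts 4xx and 5xx responses per endpoint using a handful of the lines above. The field layout (server address, client address, request time, upstream time) and the group names in the regular expression are assumptions inferred from the sample, not a documented NGINX log format.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample lines:
# timestamp server client method path protocol status bytes req_time upstream_time "agent" trace_id=...
LINE_RE = re.compile(
    r'^(?P<ts>\S+) (?P<server>\S+) (?P<client>\S+) '
    r'(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>\S+) '
    r'(?P<status>\d{3}) (?P<bytes>\d+) (?P<req_time>[\d.]+) (?P<upstream_time>[\d.]+) '
    r'"(?P<agent>[^"]*)" trace_id=(?P<trace_id>\S+)$'
)

SAMPLE = """\
2026-02-15T13:58:01Z 10.0.4.12:443 192.168.12.44:51522 GET /api/orders/123 HTTP/1.1 200 842 0.021 0.022 "Mozilla/5.0" trace_id=4d9c0e3b2a1
2026-02-15T13:58:07Z 10.0.4.12:443 192.168.12.62:44192 PUT /api/orders/123 HTTP/1.1 500 241 0.210 0.211 "Mozilla/5.0" trace_id=0ffbc1e4d21
2026-02-15T13:58:14Z 10.0.4.12:443 192.168.12.65:45610 POST /api/payment HTTP/1.1 500 412 0.322 0.323 "Mozilla/5.0" trace_id=91da2f001ab
2026-02-15T13:58:15Z 10.0.4.12:443 192.168.12.98:33440 GET /api/products/999 HTTP/1.1 404 88 0.004 0.004 "Mozilla/5.0" trace_id=ee12a0a9123
2026-02-15T13:58:23Z 10.0.4.12:443 192.168.12.90:47821 POST /api/payment HTTP/1.1 500 415 0.295 0.296 "Mozilla/5.0" trace_id=991abcde912
"""


def count_errors(lines):
    """Count 4xx and 5xx responses per (path, status class)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        status = int(match["status"])
        if status >= 400:
            counts[(match["path"], f"{status // 100}xx")] += 1
    return counts


error_counts = count_errors(SAMPLE.splitlines())
for (path, status_class), total in sorted(error_counts.items()):
    print(path, status_class, total)
```

Bucketing these counts by minute rather than over the whole sample would give a time series you can graph per endpoint.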

Metrics

Metrics can be considered those things we can directly measure and fit into a nice numerical format. They are quantifiable. Counting our log entries above creates a metric for us. CPU utilisation is a metric.

Traces

Deserve a post to themselves to really do them justice. A trace allows you to follow a request from its inception all the way through a system, and traces are of particular use in distributed systems. If you are dealing with proper microservices at scale, traces will become your best friend very quickly.


Aggregations

The number of available aggregations differs between tools, datasources and plugins with some being very specific to certain use cases. Here we will go over the common ones.

Combining Aggregations on Visualisations

Some will say never have more than one on a visualisation, others will say go nuts. The reality is that it depends on what you are trying to show; it may make total sense to show the min, max and current available replicas for a Kubernetes deployment, in other cases it may not. Do what works for your team and provides value. If the visualisation no longer makes sense, you went too nuts.

Percentiles

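These can be hard to get your head around, so they are worth explaining in a bit more depth. A percentile is the value that a given percentage of measurements fall at or below: a p95 latency of 400ms means 95% of requests completed in 400ms or less. Percentiles therefore show how many users experience degraded service, so they are often used for latency, but they can be adopted elsewhere depending on what you consider your user to be.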

Average latency can be misleading because a small number of slow requests can distort the overall picture. Percentiles instead show how performance is distributed across users.

As an example, consider this scenario:

Say you are in your coffee chain of choice in a queue of 25 people.

  • One person is ordering their super skinny latte with caramel swirls and cream. That person has to wait 10,000ms for their order.
  • Four want a cappuccino even though it’s after 11am and wait 400ms.
  • Everyone else just getting an espresso only waits 100ms.

This equates to:

  • p50 = 100ms, espresso lovers are happy.
  • p95 = 400ms (95% of 25 = 23.75, so round up to the 24th person in order of wait time), we can start to see an issue affecting users.
  • p99 = 10,000ms.

The average wait is only 544ms. If you were happily monitoring just the average, and 500+ ms is within your tolerance, you are missing those nasty spikes that an SRE would be all over. They happened, therefore they generally mean something. Maybe not immediately, but the system is doing something out of the ordinary that should be investigated.
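The coffee queue numbers can be reproduced with a few lines of Python. This is a minimal sketch using the nearest-rank method (round the rank up, as in the example); many monitoring tools interpolate or estimate percentiles from histograms instead, so their figures may differ slightly. The function name is illustrative.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the value at position ceil(p * n / 100) in sorted order."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]  # ranks are 1-based, list indexes are 0-based


# 1 latte at 10,000ms, 4 cappuccinos at 400ms, 20 espressos at 100ms
waits_ms = [10_000] + [400] * 4 + [100] * 20

mean_ms = sum(waits_ms) / len(waits_ms)
p50 = percentile_nearest_rank(waits_ms, 50)
p95 = percentile_nearest_rank(waits_ms, 95)
p99 = percentile_nearest_rank(waits_ms, 99)

print(f"mean={mean_ms}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```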

p70, p80, p85, p90 and p99.9 are also used, but unless you are a hyperscaler or have a solid use case you are unlikely to need or see them in production.

Sampling Intervals

Or sampling window. Use this to determine how often you want to check the condition, typically in chunks of 1 or 5 minutes. You can of course get more granular by going lower or more coarse by going higher but keep in mind there is a computational and financial cost to this — the finer the grain the more expensive it is.

Intervals are important because they help reduce outlier noise. It is normal for CPU utilisation to go up and down over the course of operations: a spike is normal, whereas sustained high CPU utilisation (outside of expectations) probably is not. Sampling over time smooths out these expected spikes. You wouldn’t monitor CPU utilisation on a 1 second interval, because any alerting strategy off the back of it would create a lot of noise chasing spikes, but sustained CPU utilisation over 5 minutes could be an issue.
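To illustrate the difference, the sketch below compares per-second readings with 5-minute averages. The CPU figures are made up for the example: five minutes of 40% baseline with three short spikes, followed by five minutes of sustained 90%.

```python
from statistics import mean


def window_averages(samples, window):
    """Average consecutive, non-overlapping chunks of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical CPU readings, one per second.
spiky = [40.0] * 300
for start in (30, 120, 250):
    spiky[start:start + 5] = [95.0] * 5  # three 5-second spikes to 95%
sustained = [90.0] * 300
cpu = spiky + sustained

THRESHOLD = 80.0

# Checking every second: each spike sample looks like a problem.
per_second_breaches = sum(1 for value in spiky if value > THRESHOLD)

# Checking 5-minute (300 sample) averages: only the sustained load stands out.
five_minute = window_averages(cpu, 300)
breaching_windows = [avg > THRESHOLD for avg in five_minute]

print(per_second_breaches, five_minute, breaching_windows)
```

The per-second view flags 15 readings in the first five minutes even though nothing is wrong, while the windowed view only flags the second, genuinely busy period.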

Indicator (of)

When considering what to monitor, it is critical to understand what the metric is actually an indicator of. If it is not clear to everyone what the metric means, and what an increase or decrease in it indicates, then it doesn’t belong in your monitoring.

For example, CPU utilisation. CPU utilisation is often, but not always, a symptom of something else. In Java applications, CPU utilisation can increase to the point of exhaustion because the database connection pool is full and threads are waiting. In this scenario, monitoring just CPU utilisation would cause a delay in detection — if connection pool or other DB metrics were monitored instead, you would catch the problem earlier.

This is why metrics should be defined by the teams that build and are responsible for the app or infra and not another team doing it for them. The closest team to the thing being monitored understands what is actually important, and should also be on call for it.

Impact

If you know what your metric indicates, how often you need to sample and what your aggregation is, the next thing to understand is the impact should the increase or decrease of your metric go unchecked. Using database connection pools as an example again, a saturated pool left unchecked will degrade user experience and eventually cause an outage. This is impact. If you cannot define what impact a change in metric is going to have, you don’t need it and it is just noise.


Actually Defining Your Strategy

If you made it this far, you are ready to start defining your strategy. Some people won’t like it but it is imperative going forward that you document this. You must define what failure actually means, what your failure paths actually are and what impact they have to your end users.

Grab that architecture diagram and compare it to what you now own. Do they match? If not, update it. There is no point in a team where some people think a system looks one way and another group think it looks totally different. You need at least a high-level diagram that everyone agrees is an accurate representation of the system.

From this start asking yourselves “What happens if this fails here? What about if responses are slow from here? How does a failure of this component impact the overall user experience?”

Your documentation does not have to be anything crazy. A simple table will do:

| Component | Metric | Why It Matters | Impact If Ignored |
| --- | --- | --- | --- |
| API server | 5xx error rate | Direct user-facing failures | Users cannot complete requests |
| DB connection pool | Pool saturation % | Precursor to application failure | Cascading timeouts and outage |
| Payment service | Response latency p99 | Regulatory and UX requirement | Failed transactions, lost revenue |

Do it as a group or do it in isolation and present it back to the team, discuss it, iron out any assumptions. The goal is to document what is important, why it’s important and what happens if nothing is done about it. Every entry in the table has to matter, no “just in case” allowed — if there is no impact or no one is sure what it indicates, get rid of it.
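If you keep the table somewhere machine-readable, the “no just in case” rule can be checked automatically. The sketch below is one possible shape, not a prescribed format: the class and function names are illustrative, and the final “Cache” entry is a made-up example of something that should be challenged or removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyEntry:
    component: str
    metric: str
    why_it_matters: str
    impact_if_ignored: str


ENTRIES = [
    StrategyEntry("API server", "5xx error rate",
                  "Direct user-facing failures", "Users cannot complete requests"),
    StrategyEntry("DB connection pool", "Pool saturation %",
                  "Precursor to application failure", "Cascading timeouts and outage"),
    StrategyEntry("Payment service", "Response latency p99",
                  "Regulatory and UX requirement", "Failed transactions, lost revenue"),
    # Hypothetical "just in case" entry with no stated reason or impact.
    StrategyEntry("Cache", "Eviction count", "", ""),
]


def entries_without_justification(entries):
    """Return entries that are missing a reason or an impact."""
    return [
        entry for entry in entries
        if not entry.why_it_matters.strip() or not entry.impact_if_ignored.strip()
    ]


unjustified = entries_without_justification(ENTRIES)
for entry in unjustified:
    print(f"Challenge or remove: {entry.component} / {entry.metric}")
```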

You may even discover some items that are currently unavailable but also important. This is normal and to be expected as part of iteration. Make sure there is an item in the backlog to address it and it is clear why exactly doing this work is valuable and the risk of not completing it.

Adding Alerting

Alerts follow the same logic as everything discussed — only alert on what is important and what you can actually action, otherwise you will burn out your team. Before we go into defining alerts, some more terms to understand.

Rules

A rule is the condition you want to alert on. It can be as simple as “metric is greater than X” through “count of 500 error codes in log over 5 minutes” to checking SSL cert expiry.

If you have read this far, you can probably see the recurring theme is selecting what is important, and rule selection is no different. If an alert rule says “greater than 95%” but there’s not actually a problem when a metric reaches that level (because the system self-heals, for example), then there’s no need to set your rule at that level unless you want an informational alert that doesn’t demand immediate action.

Threshold and Duration

This indicates how long your rule must keep evaluating to true before an alert is triggered. Sometimes a one-off condition or outlier is not worth chasing. If you have a spike in CPU utilisation, your rule may state “greater than 80%”, but for how many consecutive evaluations should this condition be true before action, and therefore an alert, is required? This is your threshold, or duration.

For example, if your sampling interval is 1 minute, your threshold could be 5 minutes, therefore the rule has to evaluate as true 5 times to fire an alert.

Thresholds can go both ways and can auto-cancel alerts, but if you are alert thrashing (firing, resolved, firing, resolved) then either your rule or threshold should be reevaluated.
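A minimal sketch of this behaviour, assuming one evaluation per sampling interval: the alert fires after a set number of consecutive breaching evaluations and resolves after the same number of consecutive healthy ones. Using the same count in both directions is an assumption for the example; real alerting tools usually let you configure them separately. The class name and readings are illustrative.

```python
class DurationRule:
    """Change state only after `required` consecutive evaluations disagree with it."""

    def __init__(self, limit, required):
        self.limit = limit
        self.required = required
        self.streak = 0        # consecutive evaluations contradicting the current state
        self.firing = False

    def evaluate(self, value):
        breaching = value > self.limit
        if breaching != self.firing:
            self.streak += 1
        else:
            self.streak = 0    # a single agreeing sample resets the count
        if self.streak >= self.required:
            self.firing = breaching
            self.streak = 0
        return self.firing


# Rule "greater than 80%", sampled every minute, threshold of 5 minutes.
rule = DurationRule(limit=80, required=5)
readings = [85, 90, 70, 85, 86, 87, 88, 89, 60, 85, 50, 50, 50, 50, 50]
states = [rule.evaluate(value) for value in readings]
print(states)
```

The two early breaches and the single dip to 60% do not change the alert state, which is what keeps thrashing down.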

Severity

This is another one of those things that some people are totally against and others will insist upon. Again it depends on your use case and team culture, but the two schools of thought are “if it’s not critical, I don’t want an alert” and “even if it’s not critical I want to know so I can look later”. If you are in the second camp, you need severities.

Not all alerts are equal — some indicators are a slow burn and something should be done, others need immediate intervention. Alerts should therefore be classified with a severity and routed in such a way that it is clear to everyone when there is a drop-everything situation. For example, you may have multiple channels in Slack where different severities of alerts go. Maybe non-critical stuff goes straight to Jira. Whichever way you go, everyone needs to know when something needs to be acted upon and what can wait.
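Routing by severity can be as simple as a lookup table. The sketch below is illustrative only: the severity names, channel names and Jira project key are all hypothetical placeholders for whatever your team actually uses.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical destinations; replace with your own channels and trackers.
ROUTES = {
    Severity.CRITICAL: ["pager", "slack:#alerts-critical"],
    Severity.WARNING: ["slack:#alerts-warning"],
    Severity.INFO: ["jira:OPS"],
}


def route(severity):
    """Return the destinations for a severity; an unmapped severity raises KeyError."""
    return ROUTES[severity]


# Guard against adding a severity without deciding where it goes.
unrouted = [severity for severity in Severity if severity not in ROUTES]

print(route(Severity.CRITICAL), unrouted)
```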

Alerting Strategy

To keep documenting your strategy, add additional columns to the previous table to show your rule, threshold and severity (if using):

| Component | Metric | Rule | Threshold | Severity |
| --- | --- | --- | --- | --- |
| API server | 5xx error rate | > 1% | 5 minutes | Critical |
| DB connection pool | Pool saturation % | > 80% | 10 minutes | Warning |
| Payment service | Response latency p99 | > 2000ms | 5 minutes | Critical |

Alert Formatting

Any alert, regardless of where it is sent, should contain just enough information for the receiver to act upon it. It should not contain an entire dump of the log entry that triggered it. Each alert should contain:

  • Title — Simple summary of what is wrong.
  • Description — Longer explanation of what and where failed. This could be the rule that triggered the alert: “CPU utilisation on [node_name] greater than 90% for 5 minutes.”
  • Link to Dashboard — Link to whatever tools you are using with a chart related to the alert. If you aren’t monitoring it, you shouldn’t be alerting on it.
  • Link to Runbook — Your wiki page with detailed troubleshooting steps.
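The four fields above could be modelled like this. It is a sketch rather than any particular tool's payload format; the URLs, node name and helper function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    title: str
    description: str
    dashboard_url: str
    runbook_url: str

    def render(self):
        """Plain-text body with just enough information to act on."""
        return "\n".join([
            self.title,
            self.description,
            f"Dashboard: {self.dashboard_url}",
            f"Runbook: {self.runbook_url}",
        ])


def cpu_alert(node_name, limit_pct, minutes):
    """Build a CPU alert; the example.com URLs are placeholders."""
    return Alert(
        title=f"High CPU on {node_name}",
        description=(
            f"CPU utilisation on {node_name} greater than "
            f"{limit_pct}% for {minutes} minutes."
        ),
        dashboard_url=f"https://dashboards.example.com/nodes/{node_name}/cpu",
        runbook_url="https://wiki.example.com/runbooks/high-cpu",
    )


alert = cpu_alert("node-a", 90, 5)
print(alert.render())
```

Making the dashboard and runbook links required fields means an alert cannot be defined without them.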

Acting on Alerts

Every alert should have a runbook to accompany it. Anyone who is on call and receives an alert should be able to reference a document that explains in detail not only what the alert is telling them, but also what to do about it.

If you cannot write a runbook to resolve the alert, you must ask yourself if you are the appropriate person to be defining the alert, or whether it is even worth alerting for. Again, this is why it is important for the teams responsible for assessing and resolving the issue to be the ones defining everything from the monitoring up.

Each team, if they do not operate an on-call system, should also have a method whereby everyone knows who is dealing with an alert. Having no way to communicate who is working on a resolution causes problems where two people might be trying to fix the problem in isolation and tripping over each other.


Documenting Your Strategy

As the previous sections show, documenting monitoring and alerting helps to start forming an observability strategy. It shows us not only what is important but also why it impacts our users and to what degree. It helps us rationalise what we are doing.

Documentation does not need to be excessive, but should at least contain something like the tables used as examples in earlier sections.

When to Add More or Change

Finally, the temptation will be great to keep extending. Don’t. The only times where additional monitoring and alerting should be added are:

  1. A postmortem found that an incident could have been avoided or resolved quicker if a metric was available or an alert triggered earlier.
  2. A change to infrastructure or the application adds a new critical path (new DB replicas, for example), or existing paths change due to continual improvement.
  3. Business changes what is important to them or the customers — for example, business requires a UI response time of no more than 400ms during peak periods.

Remember — if you add an alert or change one, you must also update your runbooks.