If you're new to Prometheus and wondering how it can notify you when something goes wrong, this post is for you. I’ll explain what alerting means, how Prometheus and Alertmanager work together, and how you can write your first alert rule.
1. What is Alerting?
Alerting means getting notified when something unusual happens in your system, such as when a server goes down or CPU usage is too high.
Prometheus helps with alerting by checking if certain conditions are true. You can define these conditions using something called PromQL (Prometheus Query Language). If the condition becomes true, Prometheus will trigger an alert.
Example
up{job="node_exporter"} == 0
The above PromQL expression evaluates to true when node_exporter is down.
Example Alerts
Here are some common alert conditions; rough PromQL sketches for a couple of them follow the list:
· CPU usage is more than 80%
· There are 5 or more stale (unused) connections in the database pool
· Memory consumption is 75% or more.
· Total active users reach a certain threshold (for example, more than a lakh, i.e. 100,000)
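To make the first and third conditions concrete, here are rough PromQL sketches, assuming the standard node_exporter metrics node_cpu_seconds_total, node_memory_MemAvailable_bytes, and node_memory_MemTotal_bytes; adjust the metric names and thresholds to your environment.

# CPU usage is more than 80% (averaged per instance over the last 5 minutes)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

# Memory consumption is 75% or more
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 >= 75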
How are alerts sent to users?
This is where Alertmanager comes in. Here's how it works step-by-step:
· Prometheus reads alert rules written in .yml (YAML) files.
· If any condition becomes true, Prometheus fires the alert.
· The alert is sent to Alertmanager (a minimal configuration for this is sketched after this list).
· Alertmanager takes that alert and sends notifications to places like:
o Email
o Slack
o PagerDuty
o or any other service you connect
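For Prometheus to forward fired alerts to Alertmanager at all, prometheus.yml needs an alerting section that points at your Alertmanager instance. A minimal sketch, assuming Alertmanager is running locally on its default port 9093:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']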
Alertmanager performs several key functions to reduce alert fatigue and ensure relevant notifications reach the appropriate recipients. Alert fatigue occurs when system administrators, developers, or on-call engineers are bombarded with excessive alerts from monitoring systems. This often leads to stress, burnout, or even ignoring alerts altogether. Over time, critical alerts can be overlooked or delayed in response, increasing the risk of downtime or unresolved issues.
Alertmanager performs the following activities:
· Grouping: It intelligently groups related alerts together to avoid overwhelming users with a flood of individual notifications.
· Deduplication: It eliminates duplicate alerts that may be triggered by the same underlying issue.
· Silencing: It allows users to temporarily mute alerts for known issues that are already being addressed, preventing unnecessary noise.
· Throttling: It controls the frequency of notifications to avoid sending too many alerts in a short period.
· Notification Routing: It sends alerts to various configured destinations such as email, Slack, PagerDuty, or XMatters, based on rules and receiver preferences.
Overall, Alertmanager plays a key role in ensuring that alerts are actionable, relevant, and delivered in a timely and manageable way.
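The grouping, throttling, and routing behaviour described above is configured in alertmanager.yml. Here is a minimal sketch; the group_by labels, the timing values, the Slack webhook URL, and the channel name are placeholders you would adapt to your own setup.
alertmanager.yml
route:
  receiver: 'slack-notifications'   # default receiver for all alerts
  group_by: ['alertname', 'job']    # grouping: related alerts are bundled into one notification
  group_wait: 30s                   # wait before sending the first notification for a new group
  group_interval: 5m                # wait before sending updates about the same group
  repeat_interval: 4h               # throttling: how often a still-firing alert is re-sent

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook URL
        channel: '#alerts'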
Naming Conventions for Alerts
Always use clear and descriptive names for your alerts.
For example:
· HighCpuUsage
· StaleDbConnections
· ServiceDown
This helps in understanding the purpose of each alert quickly.
How to Write Alert Rules?
Alert rules are written in YAML files. They include:
· A name for the alert
· A condition (written in PromQL)
· Labels and annotations for extra info
Writing Your First Alert
Here’s an example alert rule to check if a service (like node_exporter) is down.
app_down_alerts.yaml
groups:
  - name: app_down_alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
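In practice you will usually also add a for duration (so the alert only fires after the condition has held for some time; while it is waiting, the alert shows as PENDING), plus labels and annotations as mentioned earlier. Here is a sketch of the same rule with illustrative values; the 1m duration, the severity label, and the summary text are just examples:

groups:
  - name: app_down_alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 1m                        # condition must hold for 1 minute before the alert fires
        labels:
          severity: critical           # illustrative label, useful for routing in Alertmanager
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is down"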
Reference the rule file (app_down_alerts.yaml) in the rule_files section of prometheus.yml.
prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - "app_down_alerts.yaml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
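Before reloading, you can optionally validate the rule file with promtool, which ships with Prometheus:

promtool check rules app_down_alerts.yaml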
Reload the Prometheus configuration by executing the command below (this endpoint only works if Prometheus was started with the --web.enable-lifecycle flag).
curl -X POST http://localhost:9090/-/reload
Navigate to the Prometheus Alerts page; you can see that the alert named NodeExporterDown has been added.
Click on the alert ‘NodeExporterDown’ to see the PromQL expression attached to it.
As you can see in the snippet above, the alert status is INACTIVE, which means that the condition defined in the alert rule is currently not true.
Let me stop the node_exporter process and wait for 1 minute (the default evaluation_interval, if you do not configure one).
Reload the http://localhost:9090/alerts page after a minute; you can see that the alert is now in the FIRING state.
When an alert is in the FIRING state, it means that the alert condition is true.
Once in the FIRING state, Prometheus sends alert data to Alertmanager. Alertmanager sends notifications to:
· Slack
· PagerDuty
· Any other channel you configured