Tuesday, 17 June 2025

Stop False Alerts: How the for Clause Works in Prometheus

When you create alerts in Prometheus, you don’t always want them to fire immediately. Sometimes issues are just temporary, like a short network glitch or a momentary CPU spike. This is where the for clause helps.

In this post, you’ll learn what the for clause does in alerting rules, why it’s important, and how to use it correctly with an example.

 

1. Why Not Fire Alerts Instantly?

Let’s say you're monitoring a service like node_exporter to check whether it's running. If Prometheus scrapes it once and sees it down, should it immediately fire an alert?

 

Probably not. The issue could be a temporary network glitch, a slow response, or a one-time blip.

 

If you alert too quickly, you might end up with a lot of false alarms, which can lead to alert fatigue and people ignoring important notifications.

 

2. What is the for Clause?

With the for clause, you can tell Prometheus to fire an alert only if the condition stays true for a specific amount of time.

 

So Prometheus waits and keeps watching the condition. If the problem goes away before the time is up, no alert is fired and the timer resets.

 

3. Example Alert Rule Using for

 

alert_using_for.yaml

groups:
  - name: example_alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter is down"
          description: "No data received from node_exporter for more than 1 minute"

The rule checks whether node_exporter is down. Since for is set to 2m, Prometheus waits for 2 minutes before firing the alert: it must see the service down continuously for 2 minutes before sending a notification.
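Before wiring the rule file into Prometheus, you can validate its syntax with promtool, the command-line tool that ships with Prometheus (the exact output varies slightly between versions):

$promtool check rules alert_using_for.yaml

If the YAML is malformed or a field name is misspelled, promtool reports the error up front instead of Prometheus failing to load the rules at startup.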

 

prometheus.yaml

 

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_using_for.yaml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
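You can also validate this configuration, including the rule file it references, with promtool before starting Prometheus:

$promtool check config prometheus.yaml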

Start Prometheus by executing the command below.

prometheus --config.file=./prometheus.yaml --web.enable-lifecycle
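The --web.enable-lifecycle flag enables the HTTP lifecycle endpoints, so if you edit the rule file later you can reload the configuration without restarting Prometheus:

$curl -X POST http://localhost:9090/-/reload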

Navigate to the Prometheus alerts section (http://localhost:9090/alerts).

 


As you can see in the screenshot above, the NodeExporterDown alert is in the INACTIVE state, as node_exporter is currently running fine.
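You can also check the raw up metric that the alert expression is based on, either in the expression browser or through the HTTP API (assuming Prometheus is on its default port 9090). A value of 1 means the last scrape succeeded, 0 means the target is down:

$curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node_exporter"}'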

 

Let me kill the node_exporter process. 

$ps ax | grep node_exporter
24384 s001  S+     0:00.71 node_exporter
30047 s002  S+     0:00.00 grep node_exporter
$
$kill -9 24384

Wait for about 15 seconds (one scrape_interval) and reload the alerts page.

 


As you can see in the screenshot above, the alert is in the PENDING state. PENDING means the condition has just become true and Prometheus is waiting for the for duration to elapse before firing.
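You can also follow this transition in the expression browser. Prometheus exposes every active alert as the built-in ALERTS time series, whose alertstate label is 'pending' while the for timer is running and 'firing' afterwards:

ALERTS{alertname="NodeExporterDown"}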

 

If node_exporter stays down for 2 minutes continuously, the alert fires and a notification is sent to Alertmanager.
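For the firing alert to actually reach Alertmanager, prometheus.yaml also needs an alerting block pointing at it. A minimal sketch, assuming Alertmanager is running locally on its default port 9093:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']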

 

You can observe that the NodeExporterDown alert moves to the FIRING state after 2 minutes.
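Besides the UI, you can list pending and firing alerts programmatically through the alerts endpoint of the HTTP API:

$curl -s http://localhost:9090/api/v1/alerts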

 


Let me start the node_exporter service again.

$node_exporter
time=2025-04-14T10:16:49.293Z level=INFO source=node_exporter.go:216 msg="Starting node_exporter" version="(version=1.9.1, branch=, revision=unknown)"
time=2025-04-14T10:16:49.293Z level=INFO source=node_exporter.go:217 msg="Build context" build_context="(go=go1.24.1, platform=darwin/arm64, user=Homebrew, date=, tags=unknown)"
time=2025-04-14T10:16:49.293Z level=INFO source=filesystem_common.go:265 msg="Parsed flag --collector.filesystem.mount-points-exclude" collector=filesystem flag=^/(dev)($|/)
time=2025-04-14T10:16:49.293Z level=INFO source=filesystem_common.go:294 msg="Parsed flag --collector.filesystem.fs-types-exclude" collector=filesystem flag=^devfs$
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:135 msg="Enabled collectors"
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=boottime
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=cpu
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=diskstats
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=filesystem
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=loadavg
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=meminfo
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=netdev
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=os
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=powersupplyclass
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=textfile
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=thermal
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=time
time=2025-04-14T10:16:49.294Z level=INFO source=node_exporter.go:141 msg=uname
time=2025-04-14T10:16:49.296Z level=INFO source=tls_config.go:347 msg="Listening on" address=[::]:9100
time=2025-04-14T10:16:49.296Z level=INFO source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9100

 

Wait for about 15 seconds (until the next evaluation_interval) and reload the alerts page. You can observe that the alert is back in the INACTIVE state.
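If you want to confirm outside Prometheus that node_exporter is serving metrics again, you can hit its metrics endpoint directly (default port 9100):

$curl -s http://localhost:9100/metrics | head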

 

  
