When you use binary operators in PromQL like +, -, *, /, and, or, Prometheus matches time series based on all their labels by default.
But sometimes, you only want to match series based on specific labels and ignore all the others.
That’s where the on keyword comes in.
The on(...) keyword restricts the matching to only the labels you specify.
Example
node_cpu_seconds_total{mode="idle"} and on(cpu) node_cpu_seconds_total{mode="system"}
Here,
· I am using the and operator to find time series that exist in both sets.
· Normally, Prometheus would try to match all labels, including mode, instance, etc.
· But with on(cpu), it only uses the cpu label to match.
· So this query returns the idle series whose cpu value also appears on a system series, even if their mode or instance labels differ.
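To make the matching rule concrete, here is a minimal Python sketch that imitates what and on(cpu) does. It is only a toy model, not how Prometheus is implemented: the helper names match_key and and_on, the label sets, and the sample values are all made up for illustration.

```python
# Toy model of `idle and on(cpu) system`. Labels and values are invented.

def match_key(labels, on):
    """Build the matching signature from only the labels listed in on(...)."""
    # A missing label is treated like an empty value.
    return tuple(labels.get(name, "") for name in on)

def and_on(left, right, on):
    """Keep left-hand series whose on-labels also appear on the right-hand side."""
    right_keys = {match_key(labels, on) for labels, _ in right}
    return [(labels, value) for labels, value in left
            if match_key(labels, on) in right_keys]

idle = [
    ({"cpu": "0", "mode": "idle", "instance": "a:9100"}, 1200.0),
    ({"cpu": "1", "mode": "idle", "instance": "a:9100"}, 1100.0),
    ({"cpu": "2", "mode": "idle", "instance": "b:9100"}, 900.0),
]
system = [
    ({"cpu": "0", "mode": "system", "instance": "a:9100"}, 40.0),
    ({"cpu": "1", "mode": "system", "instance": "b:9100"}, 35.0),
]

# Matching on cpu only: the idle series for cpu 0 and cpu 1 survive.
matched = and_on(idle, system, on=["cpu"])

# Matching on every label (the default behaviour): mode always differs,
# so nothing matches.
strict = and_on(idle, system, on=["cpu", "mode", "instance"])

print([labels["cpu"] for labels, _ in matched])
print(strict)
```

Note that the result keeps the labels and values of the left-hand (idle) series. The right-hand side only decides which of them survive.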
Why is on useful?
Because sometimes:
· The labels differ, but you still want to match.
· You want precise control over which labels are used for matching.
· You want to avoid unexpected mismatches due to extra labels like mode, instance, or job.
Understanding Prometheus Aggregation Operators
When you query metrics in Prometheus, you often get a bunch of time series, sometimes too many. Aggregation operators help you combine, group, or filter these series into something more useful.
For example,
sum(prometheus_http_requests_total)
This gives you the total number of HTTP requests across all codes, methods, or other labels.
The following table summarizes the aggregation operators supported by Prometheus.
Operator | Description
sum | Adds up values
min | Gets the smallest value
max | Gets the largest value
avg | Calculates the average
count | Counts how many values there are
stddev | Measures how much the values vary
stdvar | Measures how spread out the values are (variance)
group | Groups values together and sets them to 1
count_values("label") | Counts how many times each value appears and labels it
topk(n, ...) | Gets the top N biggest values
bottomk(n, ...) | Gets the bottom N smallest values
quantile(φ, ...) | Gets the percentile (like median or 95th)
limitk(n, ...) | Gets N random samples (experimental)
limit_ratio(r, ...) | Gets a portion of samples based on a ratio (experimental)
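If it helps to see the arithmetic behind the simpler operators, the following Python sketch computes them over a made-up list of sample values. It is a rough model for intuition only. The values are invented, and it assumes that stddev and stdvar are the population (not sample) standard deviation and variance, which is how the Prometheus documentation describes them.

```python
import statistics
from collections import Counter

# Invented sample values, one per time series in an instant vector.
values = [2.0, 4.0, 4.0, 6.0]

aggregations = {
    "sum": sum(values),
    "min": min(values),
    "max": max(values),
    "avg": statistics.fmean(values),
    "count": len(values),
    # Population statistics, not sample statistics.
    "stddev": statistics.pstdev(values),
    "stdvar": statistics.pvariance(values),
}

# count_values: how many series carry each distinct value.
value_counts = Counter(values)

for name, result in aggregations.items():
    print(name, result)
print(dict(value_counts))
```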
Examples
Example 1: Sum of all requests
sum(prometheus_http_requests_total)
Example 2: Sum of all requests grouped by response code
sum(prometheus_http_requests_total) by (code)
Example 3: Top 2 HTTP codes by request count
topk(2, sum(prometheus_http_requests_total) by (code))
Example 4: Bottom 2 HTTP codes by request count
bottomk(2, sum(prometheus_http_requests_total) by (code))
Example 5: CPU time (in seconds) grouped by mode
sum(node_cpu_seconds_total) by (mode)
Example 6: Count of elements in the vector
count(node_cpu_seconds_total)
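To see what grouping and topk/bottomk do to the data, here is a small Python sketch that mimics Examples 2 to 4 on invented series. The helper names sum_by, topk, and bottomk, the handler label, and all values are assumptions made for illustration. Real Prometheus evaluates these inside the query engine.

```python
from collections import defaultdict

# Invented series standing in for prometheus_http_requests_total.
series = [
    ({"code": "200", "handler": "/api/v1/query"}, 120.0),
    ({"code": "200", "handler": "/metrics"}, 300.0),
    ({"code": "400", "handler": "/api/v1/query"}, 7.0),
    ({"code": "503", "handler": "/api/v1/query"}, 2.0),
]

def sum_by(vector, by):
    """Model of `sum(...) by (...)`: keep only the `by` labels and add values per group."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]

def topk(k, vector):
    """Model of `topk(k, ...)`: the k series with the largest values."""
    return sorted(vector, key=lambda s: s[1], reverse=True)[:k]

def bottomk(k, vector):
    """Model of `bottomk(k, ...)`: the k series with the smallest values."""
    return sorted(vector, key=lambda s: s[1])[:k]

by_code = sum_by(series, by=["code"])   # Example 2
top2 = topk(2, by_code)                 # Example 3
bottom2 = bottomk(2, by_code)           # Example 4

print(by_code)
print(top2)
print(bottom2)
```

The handler label disappears after sum_by because only the labels named in by (...) are kept on the result.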
In summary, aggregation operators help you make sense of large amounts of data, turn chaos into clarity, and build dashboards that actually tell a story. Start with simple aggregations. Try sum and count. Then slowly explore topk, quantile, and the rest as your needs grow.
References
https://prometheus.io/docs/prometheus/latest/querying/operators/