Wednesday, 4 June 2025

Match on specific labels using the on keyword in Prometheus

When you use binary operators in PromQL like +, -, *, /, and, or, Prometheus matches time series based on all their labels by default.

But sometimes, you only want to match series based on specific labels and ignore all the others.

 

That’s where the on keyword comes in.

 

The on(...) keyword restricts the matching to only the labels you specify.

 

Example

node_cpu_seconds_total{mode="idle"} 
and on(cpu) 
node_cpu_seconds_total{mode="system"}

Here,

·      I am using the and operator to find time series that exist in both sets.

·      Normally, Prometheus would try to match all labels, including mode, instance, etc.

·      But with on(cpu), it only uses the cpu label to match.

·      So this query returns the idle series for which a system series with the same cpu value exists, even though their mode labels differ.
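To see why on is needed here, it helps to compare variants of the same query. The sketches below assume a typical node_exporter setup where node_cpu_seconds_total carries instance, cpu, and mode labels; adjust the label names if your setup differs. Without any modifier, the and returns nothing, because no idle series has exactly the same label set as a system series (mode always differs).

```promql
# Returns an empty result: mode="idle" never equals mode="system",
# and default matching compares every label.
node_cpu_seconds_total{mode="idle"}
and
node_cpu_seconds_total{mode="system"}
```

If several hosts are scraped, matching on cpu alone pairs cpu="0" on one host with cpu="0" on another. Adding instance to the list keeps the match within each host. The ignoring(...) modifier is the inverse of on(...): it matches on every label except the ones listed.

```promql
# Match per host and per core.
node_cpu_seconds_total{mode="idle"}
and on(instance, cpu)
node_cpu_seconds_total{mode="system"}
```

```promql
# Same idea expressed by excluding the one label that differs.
node_cpu_seconds_total{mode="idle"}
and ignoring(mode)
node_cpu_seconds_total{mode="system"}
```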

 



Why is on useful?

Because sometimes:

·      The labels differ, but you still want to match.

·      You want precise control over which labels are used for matching.

·      You want to avoid unexpected mismatches due to extra labels like mode, instance, or job.
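The same modifier works with arithmetic operators. As a rough sketch, assuming node_exporter metrics and that each instance and cpu pair has exactly one series per mode, this query divides system time by idle time for each core over the last 5 minutes. The 5m window is an arbitrary choice for illustration.

```promql
# Ratio of system to idle CPU time, per host and per core.
# on(instance, cpu) is required because the mode label differs
# between the two sides.
rate(node_cpu_seconds_total{mode="system"}[5m])
/ on(instance, cpu)
rate(node_cpu_seconds_total{mode="idle"}[5m])
```

For arithmetic operators, the labels listed in on(...) must identify a unique series on each side, otherwise Prometheus rejects the query with a matching error. The result carries only the instance and cpu labels.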

 

Understanding Prometheus Aggregation Operators

When you query metrics in Prometheus, you often get back many time series, sometimes too many to read. Aggregation operators help you combine, group, or filter these series into something more useful.

 

For example,

sum(prometheus_http_requests_total)

 

This gives you the total number of HTTP requests across all response codes, handlers, and any other labels.

 

The following table summarizes the aggregation operators supported by Prometheus.

 

Operator                 Description
sum                      Adds up values
min                      Gets the smallest value
max                      Gets the largest value
avg                      Calculates the average
count                    Counts how many values there are
stddev                   Measures how much the values vary (population standard deviation)
stdvar                   Measures how spread out the values are (population variance)
group                    Groups values together and sets them to 1
count_values("label")    Counts how many times each value appears and writes the value into the given label
topk(n, ...)             Gets the N largest values
bottomk(n, ...)          Gets the N smallest values
quantile(φ, ...)         Gets the φ-quantile, with 0 ≤ φ ≤ 1 (like the median at 0.5 or the 95th percentile at 0.95)
limitk(n, ...)           Gets N sampled series (experimental)
limit_ratio(r, ...)      Gets a portion of the series based on a ratio (experimental)
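The parameterized operators take their extra argument before the vector. Two hedged sketches follow, assuming Prometheus is scraping itself and a node_exporter target; the 5m window and the label name "cpus" are choices made here for illustration.

```promql
# 95th percentile of the per-series request rate across all series.
quantile(0.95, rate(prometheus_http_requests_total[5m]))
```

```promql
# How many hosts have each core count. The inner query counts cores
# per host; count_values writes that count into a new "cpus" label.
count_values("cpus", count by (instance) (node_cpu_seconds_total{mode="idle"}))
```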

 

Examples

Example 1: Sum of all requests

sum(prometheus_http_requests_total)

 


Example 2: Sum of all requests grouped by response code

sum(prometheus_http_requests_total) by (code) 
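The by clause can also be written before the expression, and without is its complement: it drops the listed labels and keeps everything else. A short sketch, assuming the metric carries a handler label as it does on a default Prometheus server:

```promql
# Same result as the query above, with the clause placed first.
sum by (code) (prometheus_http_requests_total)
```

```promql
# Drop only the handler label; code, instance, and job are kept.
sum without (handler) (prometheus_http_requests_total)
```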


 

Example 3: Top 2 HTTP codes by request count

topk(2, sum(prometheus_http_requests_total) by (code)) 


Example 4: Bottom 2 HTTP codes by request count

bottomk(2, sum(prometheus_http_requests_total) by (code))

 


Example 5: Cumulative CPU time grouped by mode

sum(node_cpu_seconds_total) by (mode)
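Because node_cpu_seconds_total is a counter, the query above adds up CPU seconds accumulated since each host started. For usage over a recent window, one common approach is to apply rate() before aggregating. The 5m window below is an assumption; pick one that suits your scrape interval.

```promql
# CPU seconds consumed per second in each mode, summed over all cores
# and hosts. A value of 2 for mode="user" means two cores' worth of
# user time.
sum by (mode) (rate(node_cpu_seconds_total[5m]))
```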

 


Example 6: Count elements in the vector.

count(node_cpu_seconds_total)
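count becomes more informative when combined with a selector and a grouping clause. As a sketch, assuming node_exporter exposes one idle series per logical CPU, this returns the number of logical CPUs on each host:

```promql
# One mode="idle" series exists per core, so counting them per
# instance yields the core count of each host.
count by (instance) (node_cpu_seconds_total{mode="idle"})
```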

 


In summary, aggregation operators help you make sense of large amounts of data, turn chaos into clarity, and build dashboards that actually tell a story. Start with simple aggregations such as sum and count. Then explore topk, quantile, and the rest as your needs grow.

 

References

https://prometheus.io/docs/prometheus/latest/querying/operators/

 
