Wednesday, 21 May 2025

Understanding Prometheus Metrics for Beginners

If you've just set up Prometheus and you're staring at the /metrics endpoint wondering "What is all this?", you’re not alone. In this post, we’ll walk through what Prometheus metrics look like, what they mean, and how they help you monitor your application.

1. prometheus.yml

Here's a minimal config file for Prometheus:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

This config has a single scrape target: Prometheus itself, at localhost:9090.

Every 15 seconds, Prometheus scrapes its own metrics. You can view the same metrics in a browser by visiting the /metrics endpoint:

 

http://localhost:9090/metrics 


 

2. What Do Prometheus Metrics Look Like?

Here’s a sample snippet from the /metrics endpoint.

# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter
go_gc_cycles_automatic_gc_cycles_total 25

Each metric block usually contains:

·      HELP: A human-readable description.

·      TYPE: The metric type (counter, gauge, summary, histogram).

·      Metric name and value: The actual data point.

 

In this example,

·      Metric name: go_gc_cycles_automatic_gc_cycles_total

·      Type: counter. It only increases (the same type is used for things like total requests).

·      Value: 25. There have been 25 automatic garbage collection cycles so far.
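To make the HELP / TYPE / value structure concrete, here is a minimal Python sketch that splits a block like the one above into its three parts. The function name parse_metric_block is made up for this post, and the parser only handles the simple lines shown here; it assumes no timestamps and is not a full implementation of the exposition format.

```python
def parse_metric_block(text):
    """Split a small block of Prometheus text format into help, type and samples.

    Simplified sketch: handles '# HELP', '# TYPE' and 'name value' lines only,
    and assumes sample lines carry no trailing timestamp.
    """
    result = {"help": {}, "type": {}, "samples": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# HELP "):
            name, _, description = line[len("# HELP "):].partition(" ")
            result["help"][name] = description
        elif line.startswith("# TYPE "):
            name, _, metric_type = line[len("# TYPE "):].partition(" ")
            result["type"][name] = metric_type
        elif line.startswith("#"):
            continue  # any other comment line is ignored
        else:
            # The value is the last space-separated field on the line.
            name, _, value = line.rpartition(" ")
            result["samples"][name] = float(value)
    return result


SAMPLE = """\
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter
go_gc_cycles_automatic_gc_cycles_total 25
"""

parsed = parse_metric_block(SAMPLE)
name = "go_gc_cycles_automatic_gc_cycles_total"
print(parsed["type"][name])     # counter
print(parsed["samples"][name])  # 25.0
```

In practice you would rarely parse this by hand, since Prometheus does it for you on every scrape, but seeing the three pieces separated can help when reading raw /metrics output.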

 

The following table summarizes the common metric types:

| Type      | Description                                                  |
|-----------|--------------------------------------------------------------|
| counter   | A value that only increases (e.g., number of HTTP requests). |
| gauge     | A value that can go up or down (e.g., current memory usage). |
| summary   | Tracks distribution of events (e.g., request durations).     |
| histogram | Similar to summary but with buckets for aggregation.         |

 

2.1 counter metric

A counter is a metric type whose value only ever increases. In other words, it represents a monotonically increasing value.

 

When Should You Use a Counter?

Use a counter when you want to track:

 

·      Number of requests served

·      Number of errors

·      Bytes sent/received

·      Memory allocations

·      Events that happen over time

 

Counters never decrease while the process is running. If you see the value drop, the counter was reset to zero, which usually means the process restarted.

 

Example

# HELP go_gc_heap_allocs_bytes_total Cumulative sum of memory allocated to the heap by the application. Sourced from /gc/heap/allocs:bytes.
# TYPE go_gc_heap_allocs_bytes_total counter
go_gc_heap_allocs_bytes_total 2.9687072e+07

 

·      Metric Name: go_gc_heap_allocs_bytes_total

·      Type: counter

·      Value: 2.9687072e+07 (i.e., 29,687,072 bytes)

 

This tells us that the Go application has allocated about 29.7 MB of heap memory in total since it started. It is a running total, not the current heap size: the value grows every time memory is allocated on the heap, memory freed by the garbage collector is never subtracted, and it resets only when the application restarts.

 

Because a counter only goes up, its raw value matters less than how fast it grows. Comparing readings over time gives you rates and long-term trends, such as bytes allocated per second.
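As a rough illustration of how a rate can be derived from counter readings, the sketch below adds up the increase between consecutive scrapes and treats any drop as a restart from zero. The readings are invented for this example, and the logic is a simplification: the rate() and increase() functions in PromQL also extrapolate to the edges of the time window, which this sketch does not.

```python
def counter_increase(samples):
    """Total increase across successive counter readings, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # A drop means the counter restarted from zero, so everything
            # counted since the restart is new.
            total += current
    return total


# Hypothetical readings taken 15 seconds apart; the process restarts
# between the third and fourth scrape.
SCRAPE_INTERVAL_SECONDS = 15
readings = [100.0, 130.0, 160.0, 20.0, 50.0]

increase = counter_increase(readings)
elapsed = SCRAPE_INTERVAL_SECONDS * (len(readings) - 1)
per_second = increase / elapsed

print(increase)  # 110.0
print(f"{per_second:.2f} per second")
```

Without the reset handling, the restart would show up as a large negative jump and the computed rate would be meaningless.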

 

2.2 gauge metric

A gauge is a metric type in Prometheus that represents a single numerical value that can go up or down over time. Think of it like a speedometer in a car: it shows the current value at a point in time.

 

Gauges are perfect for tracking values that:

 

·      Change over time

·      Can increase or decrease

·      Represent the current state

 

Common Use Cases

·      CPU usage

·      Memory usage

·      Temperature

·      Active connections

·      Configured limits or thresholds

 

Example

 

# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent.
# TYPE go_gc_gogc_percent gauge
go_gc_gogc_percent 75

 

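go_gc_gogc_percent is a gauge metric. It reports the current GOGC setting, which is the heap growth percentage that triggers garbage collection. With a value of 75, the Go runtime starts a GC cycle once the heap has grown by about 75% over the live heap left after the previous collection.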
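To put a number on that, here is a small arithmetic sketch. The 40 MiB live heap is an invented figure, and the formula is deliberately simplified: the Go runtime also accounts for GC roots such as goroutine stacks and globals, which are ignored here.

```python
def next_gc_target_mib(live_heap_mib, gogc_percent):
    """Approximate heap size at which the next GC cycle starts.

    Simplified: target = live heap * (1 + GOGC / 100), ignoring GC roots.
    """
    return live_heap_mib * (1 + gogc_percent / 100)


# Hypothetical live heap of 40 MiB after the last collection, GOGC=75.
target = next_gc_target_mib(40, 75)
print(target)  # 70.0

# With the default GOGC of 100, the heap may double before the next cycle.
default_target = next_gc_target_mib(40, 100)
print(default_target)  # 80.0
```

A lower GOGC value means more frequent collections and a smaller heap, and a higher value means the opposite.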

 

2.3 summary metric

A summary metric in Prometheus is used to:

 

·      Track individual events and their durations or sizes.

·      Provide total number of observations (_count)

·      Provide total cumulative value (_sum)

·      Optionally expose quantiles like 0.5 (median), 0.9, 0.99 — but only if explicitly configured.

 

Use a summary metric when you want to:

·      Measure latencies (e.g., API response times)

·      Measure durations of processes (e.g., GC time, disk I/O time)

·      View average durations across many occurrences

 

Example

 

# HELP prometheus_tsdb_head_gc_duration_seconds Runtime of garbage collection in the head block.
# TYPE prometheus_tsdb_head_gc_duration_seconds summary
prometheus_tsdb_head_gc_duration_seconds_sum 3.42
prometheus_tsdb_head_gc_duration_seconds_count 5

 

The following table summarizes these two series.

 

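| Metric                                         | Value | Description                                                    |
|------------------------------------------------|-------|----------------------------------------------------------------|
| prometheus_tsdb_head_gc_duration_seconds_sum   | 3.42  | Total time spent in GC is 3.42 seconds                         |
| prometheus_tsdb_head_gc_duration_seconds_count | 5     | GC (Garbage Collection) has happened 5 times in the head block |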

 

So, the average GC duration is: 3.42 / 5 = 0.684 seconds (or 684 milliseconds)
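The lifetime average can hide recent behaviour, because _sum and _count keep growing from the moment the process starts. A common approach is to compare two scrapes and divide the change in _sum by the change in _count. The sketch below shows both calculations; the earlier scrape values (2.10 and 3) are invented for illustration.

```python
def average(total, count):
    """Mean value per observation, or None when nothing was observed."""
    if count == 0:
        return None
    return total / count


# Values from the scrape shown above.
latest_sum, latest_count = 3.42, 5
# Hypothetical earlier scrape, made up for this example.
earlier_sum, earlier_count = 2.10, 3

lifetime_avg = average(latest_sum, latest_count)
recent_avg = average(latest_sum - earlier_sum, latest_count - earlier_count)

print(f"lifetime average: {lifetime_avg:.3f} s")  # lifetime average: 0.684 s
print(f"recent average:   {recent_avg:.3f} s")    # recent average:   0.660 s
```

In PromQL the same idea is usually written as rate(prometheus_tsdb_head_gc_duration_seconds_sum[5m]) / rate(prometheus_tsdb_head_gc_duration_seconds_count[5m]), where the 5m window is just an example.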

 

2.4 histogram metric

Let’s walk through a realistic example where compactions have occurred and Prometheus has collected meaningful histogram data for the metric prometheus_tsdb_compaction_chunk_range_seconds.

# HELP prometheus_tsdb_compaction_chunk_range_seconds Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range_seconds histogram
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="100"} 1
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="400"} 3
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1600"} 7
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6400"} 12
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="25600"} 17
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="102400"} 20
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="409600"} 22
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1.6384e+06"} 23
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6.5536e+06"} 23
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="2.62144e+07"} 23
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="+Inf"} 23
prometheus_tsdb_compaction_chunk_range_seconds_sum 301200
prometheus_tsdb_compaction_chunk_range_seconds_count 23

 

How to Interpret This Data?

Total chunks observed:

prometheus_tsdb_compaction_chunk_range_seconds_count = 23

 

So 23 chunks have been observed on their first compaction. This count always matches the le="+Inf" bucket.

 

Total time span of all chunk ranges combined:

prometheus_tsdb_compaction_chunk_range_seconds_sum = 301200 seconds

 

This equals approximately 83.7 hours, or about 3.5 days.

 

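Histogram Buckets

The _bucket values are cumulative: each one counts every observation less than or equal to its le bound. Subtracting the previous bucket's value gives the number of observations that fall into each range.

| Bucket (le=)       | Cumulative count | In bucket (delta) | Range (approx)                   |
|--------------------|------------------|-------------------|----------------------------------|
| 100                | 1                | 1                 | ≤ 100 sec                        |
| 400                | 3                | 2                 | 101–400 sec                      |
| 1600               | 7                | 4                 | 401–1600 sec                     |
| 6400               | 12               | 5                 | 1601–6400 sec                    |
| 25600              | 17               | 5                 | 6401–25600 sec (~7 hrs)          |
| 102400             | 20               | 3                 | 25601–102400 sec (~1.2 days)     |
| 409600             | 22               | 2                 | 102401–409600 sec (~4.7 days)    |
| 1.6384e+06         | 23               | 1                 | 409601–1.6M sec (~19 days)       |
| 6.5536e+06 to +Inf | 23               | 0                 | above 1.6M sec (no observations) |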
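Because the _bucket series are cumulative, the number of observations inside each individual range has to be derived by subtracting neighbouring buckets. A short Python sketch of that subtraction, using the bucket values from the sample output (the function name is made up for this post):

```python
# (le bound, cumulative count) pairs copied from the sample output.
CUMULATIVE = [
    ("100", 1), ("400", 3), ("1600", 7), ("6400", 12),
    ("25600", 17), ("102400", 20), ("409600", 22),
    ("1.6384e+06", 23), ("6.5536e+06", 23), ("2.62144e+07", 23),
    ("+Inf", 23),
]


def per_bucket_counts(cumulative):
    """Turn cumulative bucket counts into the count inside each bucket."""
    counts = []
    previous = 0
    for upper_bound, running_total in cumulative:
        counts.append((upper_bound, running_total - previous))
        previous = running_total
    return counts


deltas = per_bucket_counts(CUMULATIVE)
for upper_bound, count in deltas:
    print(f"le={upper_bound:<12} {count}")

# The per-bucket counts add back up to the total number of observations.
print(sum(count for _, count in deltas))  # 23
```

The busiest ranges here are 1601–6400 seconds and 6401–25600 seconds, with 5 observations each.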

 

Average chunk range = sum / count = 301200 / 23 ≈ 13095.65 seconds ≈ 3.64 hours

 

So on average, each chunk covered about 3.6 hours of time on its first compaction.
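An average can be pulled upward by a few very large observations, so it is often worth estimating a quantile from the buckets as well. The sketch below follows the general idea behind PromQL's histogram_quantile(): find the bucket that contains the requested rank and interpolate linearly inside it. It is a simplified version written for this post, it assumes the lowest bucket starts at 0, and the function name is hypothetical.

```python
import math

# (upper bound in seconds, cumulative count) from the sample output.
BUCKETS = [
    (100.0, 1), (400.0, 3), (1600.0, 7), (6400.0, 12),
    (25600.0, 17), (102400.0, 20), (409600.0, 22),
    (1638400.0, 23), (6553600.0, 23), (26214400.0, 23),
    (math.inf, 23),
]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, below = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate inside the +Inf bucket, so fall back
                # to the highest finite bound.
                return lower_bound
            in_bucket = cumulative - below
            fraction = (rank - below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, below = upper_bound, cumulative
    return math.nan


median = estimate_quantile(0.5, BUCKETS)
print(f"estimated median: {median:.0f} s")  # estimated median: 5920 s
```

The estimated median of roughly 5920 seconds (about 1.6 hours) is well below the 3.6 hour average, which suggests that a handful of long chunk ranges are raising the mean. The estimate is only as precise as the bucket boundaries allow.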

 

In summary, Prometheus metrics may look complex at first, but once you understand the structure (HELP, TYPE, and the metric value), they become a powerful tool for observing your application in real time.

 

 
