If you've just set up Prometheus and you're staring at the /metrics endpoint wondering "What is all this?", you’re not alone. In this post, we’ll walk through what Prometheus metrics look like, what they mean, and how they help you monitor your application.
1. prometheus.yml
Here's a minimal config file for Prometheus:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

This config makes Prometheus its own scrape target: it scrapes metrics from itself every 15 seconds. You can view these metrics by visiting the /metrics endpoint:
http://localhost:9090/metrics
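If you would rather pull the same text from a script than from a browser, a minimal sketch using only the Python standard library could look like this. The function names fetch_metrics and preview are made up for illustration, and the default URL assumes Prometheus is running locally on port 9090 as in the config above.

```python
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:9090/metrics",
                  timeout: float = 5.0) -> str:
    """Return the raw text served by a /metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def preview(text: str, limit: int = 10) -> list[str]:
    """Return the first `limit` non-empty lines of the metrics text."""
    return [line for line in text.splitlines() if line.strip()][:limit]
```

With Prometheus running, preview(fetch_metrics()) should return the first few HELP, TYPE, and sample lines of the output.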
2. What Do Prometheus Metrics Look Like?
Here’s a sample snippet from the /metrics endpoint:
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter
go_gc_cycles_automatic_gc_cycles_total 25
Each metric block usually contains:
· HELP: A human-readable description.
· TYPE: The metric type (counter, gauge, summary, histogram).
· Metric name and value: The actual data point.
In this example:
· Metric name: go_gc_cycles_automatic_gc_cycles_total
· Type: counter. It only increases (used for things like total requests).
· Value: 25. There have been 25 automatic garbage collection cycles.
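To see how the three parts fit together, here is a small Python sketch that groups HELP, TYPE, and sample lines by metric name. It is deliberately simplified and assumes samples carry no labels or timestamps; parse_exposition is a hypothetical helper, not part of Prometheus. For real use, the prometheus_client package ships a proper parser.

```python
def parse_exposition(text: str) -> dict:
    """Group HELP, TYPE, and sample lines by metric name.

    Simplified sketch: assumes samples have no labels or timestamps.
    """
    metrics: dict = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# HELP "):
            name, _, help_text = line[len("# HELP "):].partition(" ")
            metrics.setdefault(name, {})["help"] = help_text
        elif line.startswith("# TYPE "):
            name, _, metric_type = line[len("# TYPE "):].partition(" ")
            metrics.setdefault(name, {})["type"] = metric_type
        elif line.startswith("#"):
            continue  # any other comment line is ignored
        else:
            # Sample line: "<metric name> <value>"
            name, _, value = line.rpartition(" ")
            metrics.setdefault(name, {})["value"] = float(value)
    return metrics


SAMPLE = """\
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter
go_gc_cycles_automatic_gc_cycles_total 25
"""

metric = parse_exposition(SAMPLE)["go_gc_cycles_automatic_gc_cycles_total"]
print(metric["type"], metric["value"])  # counter 25.0
```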
The following table summarizes the common metric types:
Type | Description
counter | A value that only increases (e.g., number of HTTP requests).
gauge | A value that can go up or down (e.g., current memory usage).
summary | Tracks the distribution of observations (e.g., request durations) as a count, a sum, and optional quantiles calculated by the application.
histogram | Also tracks a distribution, but counts observations in cumulative buckets that can be aggregated across instances.
2.1 counter metric
A counter is a metric type that represents a monotonically increasing value: it can only go up, or reset to zero when the process restarts.
When Should You Use a Counter?
Use a counter when you want to track:
· Number of requests served
· Number of errors
· Bytes sent/received
· Memory allocations
· Events that happen over time
Counters never decrease during normal operation. If you see a drop, the counter was reset, typically because the process restarted.
Example
# HELP go_gc_heap_allocs_bytes_total Cumulative sum of memory allocated to the heap by the application. Sourced from /gc/heap/allocs:bytes.
# TYPE go_gc_heap_allocs_bytes_total counter
go_gc_heap_allocs_bytes_total 2.9687072e+07
· Metric Name: go_gc_heap_allocs_bytes_total
· Type: counter
· Value: 2.9687072e+07 (i.e., 29,687,072 bytes)
This tells us that the Go application has allocated about 29.7 MB of heap memory in total since it started. The value is cumulative: it grows with every heap allocation, does not shrink when memory is freed, and resets only when the application restarts.
Because counters only increase, the raw value is rarely interesting on its own. Counters are most useful for computing rates and long-term trends, for example with PromQL's rate() function, which also accounts for counter resets.
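To make the reset behaviour concrete, here is a rough Python sketch of how an increase and a per-second rate can be derived from successive counter samples. It treats any drop as a reset to zero. The sample values and the helper names are invented for illustration, and the logic is much simpler than what Prometheus does internally (it does no extrapolation, for instance).

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across consecutive counter samples, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so assume it restarted from zero:
            # everything counted since the reset is new growth.
            total += current
    return total


def per_second_rate(samples: list[float], interval_seconds: float) -> float:
    """Average per-second increase for samples taken at a fixed interval."""
    if len(samples) < 2:
        return 0.0
    elapsed = interval_seconds * (len(samples) - 1)
    return counter_increase(samples) / elapsed


# Hypothetical samples scraped every 15 seconds; the process restarts
# between the third and fourth scrape.
scraped = [100, 160, 220, 30, 90]
print(counter_increase(scraped))      # 210.0
print(per_second_rate(scraped, 15))   # 3.5
```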
2.2 gauge metric
A gauge is a metric type in Prometheus that represents a single numerical value that can go up or down over time. Think of it like a speedometer in a car: it shows the current value at a point in time.
Gauges are perfect for tracking values that:
· Change over time
· Can increase or decrease
· Represent the current state
Common Use Cases
· CPU usage
· Memory usage
· Temperature
· Active connections
· Configured limits or thresholds
Example
# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent.
# TYPE go_gc_gogc_percent gauge
go_gc_gogc_percent 75
go_gc_gogc_percent is a gauge metric. It reports the currently configured GOGC value, the heap growth percentage that triggers garbage collection. Here, the Go runtime starts a GC cycle once the heap has grown by 75% relative to the live heap left after the previous cycle.
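As a rough illustration of what a GOGC value of 75 means in practice, the following Python sketch estimates the heap size at which the next GC cycle would start. It is a simplified model that ignores GC roots and any configured memory limit, and the 40 MB live heap is an invented figure.

```python
def next_gc_target_bytes(live_heap_bytes: float, gogc_percent: float) -> float:
    """Approximate heap size that triggers the next GC cycle.

    Simplified model: live heap plus GOGC percent of the live heap.
    """
    return live_heap_bytes * (1 + gogc_percent / 100)


# Hypothetical live heap of 40 MB with the GOGC value from the sample above.
print(next_gc_target_bytes(40_000_000, 75))  # 70000000.0
```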
2.3 summary metric
A summary metric in Prometheus is used to:
· Track individual events and their durations or sizes.
· Provide the total number of observations (_count)
· Provide the cumulative sum of all observed values (_sum)
· Optionally expose quantiles like 0.5 (median), 0.9, 0.99, but only if explicitly configured.
Use a summary metric when you want to:
· Measure latencies (e.g., API response times)
· Measure durations of processes (e.g., GC time, disk I/O time)
· View average durations across many occurrences
Example
# HELP prometheus_tsdb_head_gc_duration_seconds Runtime of garbage collection in the head block.
# TYPE prometheus_tsdb_head_gc_duration_seconds summary
prometheus_tsdb_head_gc_duration_seconds_sum 3.42
prometheus_tsdb_head_gc_duration_seconds_count 5
The following table summarizes these values:
Metric | Value | Description
prometheus_tsdb_head_gc_duration_seconds_sum | 3.42 | Total time spent in GC is 3.42 seconds
prometheus_tsdb_head_gc_duration_seconds_count | 5 | GC (Garbage Collection) has happened 5 times in the head block
So, the average GC duration is: 3.42 / 5 = 0.684 seconds (or 684 milliseconds)
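The same arithmetic can be written as a short Python sketch. Since _sum and _count are both cumulative, the difference between two scrapes gives the average over just that window. The helper names and the second scrape (5.02 seconds over 7 runs) are invented for illustration.

```python
from typing import Optional


def average_observation(total_sum: float, count: float) -> Optional[float]:
    """Average value per observation, or None if nothing was observed."""
    if count == 0:
        return None
    return total_sum / count


def average_between(earlier: tuple, later: tuple) -> Optional[float]:
    """Average over the window between two (sum, count) scrapes."""
    return average_observation(later[0] - earlier[0], later[1] - earlier[1])


avg = average_observation(3.42, 5)
print(f"{avg:.3f} s ({avg * 1000:.0f} ms)")  # 0.684 s (684 ms)

# Hypothetical later scrape: two more GC runs took 1.6 seconds in total.
recent = average_between((3.42, 5), (5.02, 7))
print(f"{recent:.2f} s")  # 0.80 s
```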
2.4 histogram metric
Let’s walk through a realistic example where compactions have occurred and Prometheus has collected meaningful histogram data for the metric prometheus_tsdb_compaction_chunk_range_seconds.
# HELP prometheus_tsdb_compaction_chunk_range_seconds Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range_seconds histogram
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="100"} 1
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="400"} 3
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1600"} 7
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6400"} 12
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="25600"} 17
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="102400"} 20
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="409600"} 22
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1.6384e+06"} 23
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6.5536e+06"} 23
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="2.62144e+07"} 23
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="+Inf"} 23
prometheus_tsdb_compaction_chunk_range_seconds_sum 301200
prometheus_tsdb_compaction_chunk_range_seconds_count 23
How to Interpret This Data?
Total observations:
prometheus_tsdb_compaction_chunk_range_seconds_count = 23
So 23 chunks have been observed at their first compaction.
Total time span of all chunk ranges combined:
prometheus_tsdb_compaction_chunk_range_seconds_sum = 301200 seconds
This equals approximately 83.7 hours, or about 3.5 days.
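Histogram Buckets
Each le bucket counts every observation less than or equal to its upper bound, so the exposed counts are cumulative. The delta column is each count minus the previous bucket's count.
Bucket (le) | Cumulative count | In bucket (delta) | Range (approx.)
100 | 1 | 1 | ≤ 100 sec
400 | 3 | 2 | 101–400 sec
1600 | 7 | 4 | 401–1600 sec
6400 | 12 | 5 | 1601–6400 sec
25600 | 17 | 5 | 6401–25600 sec (up to ~7 hrs)
102400 | 20 | 3 | 25601–102400 sec (up to ~1.2 days)
409600 | 22 | 2 | 102401–409600 sec (up to ~5 days)
1.6384e+06 | 23 | 1 | 409601–1.6M sec (up to ~19 days)
6.5536e+06, 2.62144e+07, +Inf | 23 | 0 | > 1.6M sec (no observations)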
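Because the _bucket values in the raw output are running totals, the number of observations that landed in each individual bucket has to be derived by subtraction. A minimal Python sketch of that step, using the counts from the sample above (per_bucket_counts is a hypothetical helper):

```python
# (le label, cumulative count) pairs copied from the sample output above.
CUMULATIVE = [
    ("100", 1), ("400", 3), ("1600", 7), ("6400", 12), ("25600", 17),
    ("102400", 20), ("409600", 22), ("1.6384e+06", 23),
    ("6.5536e+06", 23), ("2.62144e+07", 23), ("+Inf", 23),
]


def per_bucket_counts(cumulative: list) -> list:
    """Convert cumulative bucket counts into per-bucket counts."""
    result = []
    previous = 0
    for le, count in cumulative:
        result.append((le, count - previous))
        previous = count
    return result


for le, in_bucket in per_bucket_counts(CUMULATIVE):
    print(f"le={le}: {in_bucket}")
```

The per-bucket counts add up to 23, matching the _count value.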
Average chunk range = sum / count = 301200 / 23 ≈ 13095.65 seconds ≈ 3.64 hours
So on average, each chunk covered about 3.6 hours of time at its first compaction.
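An average can be skewed by a few very large values, and buckets let you estimate quantiles as well. The Python sketch below computes the mean and then estimates the median by linear interpolation inside the bucket that contains it. This is similar in spirit to PromQL's histogram_quantile(), but it is a simplified illustration and not the exact algorithm; estimate_quantile is a hypothetical helper, and the first bucket is assumed to start at 0.

```python
import math

# (upper bound, cumulative count) pairs from the sample output above.
BUCKETS = [
    (100.0, 1), (400.0, 3), (1600.0, 7), (6400.0, 12), (25600.0, 17),
    (102400.0, 20), (409600.0, 22), (1.6384e6, 23),
    (6.5536e6, 23), (2.62144e7, 23), (math.inf, 23),
]


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative buckets by interpolation."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Nothing to interpolate over; fall back to the lower bound.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


mean_seconds = 301200 / 23
median_seconds = estimate_quantile(0.5, BUCKETS)
print(f"mean   = {mean_seconds:.2f} s")    # mean   = 13095.65 s
print(f"median = {median_seconds:.0f} s")  # median = 5920 s
```

For this data the estimated median (about 1.6 hours) is well below the mean (about 3.6 hours), because a handful of chunks with very long ranges pull the average up.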
In summary, Prometheus metrics may look complex at first, but once you understand the structure (HELP, TYPE, and value), they become a powerful tool for observing your application in real time.