Tuesday, 10 June 2025

A Beginner’s Guide to What and How to Monitor in Your Applications Using Prometheus

If you're just getting started with system monitoring, it can be tricky to figure out what to track. This guide will help you understand what to measure (instrument) in your applications and why it matters. Whether you're running microservices, batch jobs, or databases, tracking the right metrics can help you find problems early, improve performance, and keep everything running smoothly.

Before choosing specific things to track, here’s a simple rule: "Monitor anything that helps you understand how well your system is working."

 

Ask questions like:

·      Is the app up and running?

·      Is it fast or slow?

·      Are there any errors?

·      Are we using too much memory or CPU?

·      Is the system under heavy load?

·      Are things getting worse over time?

 

What you should instrument depends heavily on your application type.

 

1. Microservices

These are small, independent services that work together. The following are some common and important things you can track:

 

·      Request Count & Speed: How many requests are made and how fast they respond.

·      Errors: How many fail (like 500 errors).

·      Resource Use: CPU, memory, and network usage.

·      Queues: If using message queues (e.g., Kafka), track how full they are.

·      Business Data: Like how many users signed up or orders placed.

·      Health Endpoint: Provide a /health check to report if the service is up.

 

Focus on the 4 key areas:

·      Latency,

·      Traffic,

·      Errors, and

·      Resource Consumption, also known as saturation (such as CPU, memory, and connection pools)
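To make the first three areas concrete, here is a minimal sketch of how a service might count requests and record latency, then print the result in the Prometheus text exposition format. In a real service you would normally let the official Prometheus client library for your language do this bookkeeping. The class name `RequestMetrics`, the metric names, and the bucket boundaries are assumptions made up for this example.

```python
from collections import defaultdict


class RequestMetrics:
    """Tiny in-memory stand-in for a request counter plus a latency histogram."""

    def __init__(self, buckets=(0.1, 0.5, 1.0)):
        self.buckets = buckets                   # upper bounds, in seconds (assumed values)
        self.requests = defaultdict(int)         # (method, status) -> request count
        self.bucket_counts = [0] * len(buckets)  # per-bucket counts, not yet cumulative
        self.count = 0                           # total observations
        self.total_seconds = 0.0                 # sum of all observed latencies

    def observe(self, method, status, seconds):
        """Record one finished request: traffic, errors (via status), and latency."""
        self.requests[(method, str(status))] += 1
        self.count += 1
        self.total_seconds += seconds
        for i, bound in enumerate(self.buckets):
            if seconds <= bound:
                self.bucket_counts[i] += 1
                break

    def render(self):
        """Render the metrics as Prometheus text exposition lines."""
        lines = ["# TYPE http_requests_total counter"]
        for (method, status), n in sorted(self.requests.items()):
            lines.append(f'http_requests_total{{method="{method}",status="{status}"}} {n}')
        lines.append("# TYPE http_request_duration_seconds histogram")
        cumulative = 0
        for bound, n in zip(self.buckets, self.bucket_counts):
            cumulative += n  # histogram buckets are cumulative
            lines.append(f'http_request_duration_seconds_bucket{{le="{bound}"}} {cumulative}')
        lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {self.count}')
        lines.append(f"http_request_duration_seconds_sum {self.total_seconds}")
        lines.append(f"http_request_duration_seconds_count {self.count}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = RequestMetrics()
    metrics.observe("GET", 200, 0.05)
    metrics.observe("GET", 500, 0.7)
    print(metrics.render())
```

Errors are not a separate metric here. They are just requests with a failing status label, which keeps traffic and errors in one counter that you can slice later.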

 

2. Background Jobs (ETL, Batch Jobs)

These jobs run behind the scenes to process large data sets. The following are some common and important things you can track:
 

·      Status: Is the job running, done, or failed?

·      Time Taken: How long did it run?

·      Records Processed: How much data did it handle?

·      Errors: How many records or steps failed?

·      Resources Used: CPU and memory during the job.

·      Start/End Times: Helps with job tracking and scheduling.

 

Focus on:

·      Job health,

·      Performance, and

·      Results.
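As a rough illustration of the list above, the sketch below wraps a batch run and collects its status, duration, record counts, and start/end times. Because batch jobs are short-lived, results like these are often pushed to a Prometheus Pushgateway instead of being scraped, and that step is left out here. The function names and metric names are hypothetical.

```python
import time


def run_batch(records, process, clock=time.time):
    """Run `process` on each record and return the job's metrics as a dict."""
    start = clock()
    processed = 0
    failed = 0
    for record in records:
        try:
            process(record)
            processed += 1
        except Exception:
            failed += 1  # count the failure and keep going with the next record
    end = clock()
    return {
        "batch_job_start_timestamp_seconds": start,
        "batch_job_end_timestamp_seconds": end,
        "batch_job_duration_seconds": end - start,
        "batch_job_records_processed": processed,
        "batch_job_records_failed": failed,
        "batch_job_success": 1 if failed == 0 else 0,
    }


def to_exposition(metrics):
    """Format the metrics dict as one 'name value' line per metric."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


if __name__ == "__main__":
    result = run_batch([1, 2, 0, 4], lambda r: 1 / r)  # the 0 record fails
    print(to_exposition(result))
```

Passing the clock in as a parameter is only there to make the function easy to test. In normal use the default `time.time` is enough.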

 

3. Scheduled Jobs (CronJobs)

These run tasks at specific times. The following are some common and important things you can track:

 

·      Job Status: Did it succeed or fail?

·      Run Time: How long did it take?

·      Last Run Time: When did it last finish?

·      Failures Count: How many times did it fail?

·      Resources: CPU/memory used.

·      Delays: Is it running behind schedule?

 

Make sure scheduled jobs run on time and finish properly.
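One simple way to check "on time" is to record the timestamp of the last successful run and compare it with the current time. The sketch below shows that arithmetic. The function name, the grace period, and the sample numbers are assumptions for illustration. In Prometheus the same idea is usually written as an alert that compares `time()` with a last-success timestamp metric.

```python
def seconds_overdue(last_success, interval, now, grace=0.0):
    """Return how many seconds a job is past its expected finish (0.0 if on time).

    last_success: Unix timestamp of the last successful run
    interval:     expected seconds between runs (3600 for an hourly job)
    grace:        extra seconds to tolerate before calling the job late
    """
    deadline = last_success + interval + grace
    return max(0.0, now - deadline)


if __name__ == "__main__":
    # Hourly job that last succeeded at t=1000, checked at t=5000 with 60s grace.
    print(seconds_overdue(last_success=1000, interval=3600, now=5000, grace=60))
```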

 

4. Real-Time Streaming Apps (e.g., Kafka Streams, Flink)

These apps process data continuously as it comes in. The following are some common and important things you can track:

 

·      Processing Rate: Events per second.

·      Latency: Time taken to process an event.

·      Lag: How far behind it is in processing.

·      Errors: Failed events.

·      Resources: CPU/memory.

·      Internal Queues: How full they are.

·      Watermarks: (Advanced) Tracking event time progress.

 

Focus on data flow speed, delay, and errors.
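Lag is easier to reason about with numbers. For a Kafka-style consumer, lag per partition is the newest offset in the partition minus the offset the consumer has committed. The sketch below assumes you already have both sets of offsets as plain dictionaries, and the function name is hypothetical. Treating a partition with no committed offset as starting from 0 is also an assumption.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: newest offset minus the last committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset is read from 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(0, end - committed)
    return lag


if __name__ == "__main__":
    end = {0: 120, 1: 80, 2: 50}
    committed = {0: 100, 1: 80}
    lag = consumer_lag(end, committed)
    print(lag, "total:", sum(lag.values()))
```

A lag that keeps growing means the app is consuming slower than data arrives, even if its processing rate looks healthy on its own.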

 

5. Databases (e.g., MySQL, PostgreSQL)

Databases store your app’s data. If they're slow, everything slows down. The following are some common and important things you can track:

 

·      Query Speed: How fast are common queries?

·      Connection Usage: How many connections are active?

·      Cache Hits: Are we using the cache efficiently?

·      Transactions: How many are active?

·      Lock Waits: Are queries waiting on each other?

·      Resources: CPU, memory, disk.

·      Replication Lag: If using replicas.

·      Size: Is the DB growing too fast?

 

Monitor query time, connections, and any signs of overload.

 

6. Job Schedulers (e.g., Airflow, Kubernetes Scheduler)

These manage when and how jobs are run. The following are some common and important things you can track:

 

·      Running Jobs: How many are active?

·      Waiting Jobs: Are jobs getting delayed?

·      Success/Failure Rate: Are most jobs finishing successfully?

·      Duration: How long do jobs take?

·      Scheduler Resources: CPU and memory.

·      Queue Size: Are tasks piling up?

·      Scheduling Delay: How long before a job gets scheduled?

 

7. Frontend Applications (e.g., React, Angular, Vue)

These are user-facing apps running in a browser. Instrumentation helps you track how users interact with them and whether your app is behaving as expected. The following are some common and important things you can track:

 

·      Page Load Time: How long does it take for the app to become usable?

·      API Call Performance: Track success/failure rate and response time of backend calls.

·      JavaScript Errors: Uncaught exceptions or crashes.

·      User Interaction Metrics: Button clicks, form submissions, or other key actions.

·      Frontend Resource Usage: Memory, CPU (especially on mobile devices).

 

Use tools like Google Lighthouse, Sentry, or performance APIs to gather these metrics. Focus on user experience first.

 

8. API Gateways (e.g., Kong, API Gateway, Istio Gateway)

Gateways sit in front of your services and manage routing, rate limiting, authentication, and logging. The following are some common and important things you can track:
 

·      Request Volume: Number of incoming API calls.

·      Latency and Errors: Track how long requests take and how often they fail.

·      Rate Limiting Events: How often clients are being throttled.

·      Authentication Failures: Helps spot security or integration issues.

·      Upstream Service Latency: How fast backend services are responding.

 

Monitor both incoming and outgoing traffic to get the full picture.

 

9. Caching Systems (e.g., Redis, Memcached)

Caches are used to speed up data access. If caching goes wrong, everything can slow down. The following are some common and important things you can track:

 

·      Hit/Miss Ratio: Are your cache lookups successful?

·      Evictions: Is data getting removed too soon?

·      Latency: How quickly is data being returned?

·      Memory Usage: Is the cache close to being full?

·      Command Rate: Number of read/write operations.

 

A poor hit rate usually means your cache isn’t being used effectively.
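The hit rate itself is a simple division: hits / (hits + misses). Redis, for example, reports cumulative `keyspace_hits` and `keyspace_misses` counters in its `INFO` output. The sketch below assumes you have already read two such samples into dictionaries. The function names and the sample numbers are made up.

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from cache, or None if there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def windowed_hit_ratio(previous, current):
    """Hit ratio for the lookups that happened between two counter samples."""
    return hit_ratio(
        current["hits"] - previous["hits"],
        current["misses"] - previous["misses"],
    )


if __name__ == "__main__":
    earlier = {"hits": 1000, "misses": 100}
    later = {"hits": 1030, "misses": 170}
    print("lifetime:", hit_ratio(later["hits"], later["misses"]))
    print("recent:  ", windowed_hit_ratio(earlier, later))
```

Looking at the ratio over a recent window matters because the lifetime ratio can still look fine long after the cache has started missing.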

 

10. CI/CD Pipelines (e.g., Jenkins, GitHub Actions, GitLab CI)

Monitoring your build and deployment pipelines helps ensure fast, stable releases. The following are some common and important things you can track:

 

·      Build Duration: How long do builds and tests take?

·      Success/Failure Rate: How many builds or deploys are failing?

·      Queue Time: How long does a job wait before it starts?

·      Test Failures: What’s causing builds to fail?

 

11. Load Balancers (e.g., NGINX, HAProxy, ELB)

Load balancers route traffic to your services, and problems here impact your whole app. The following are some common and important things you can track:

 

·      Request Volume: Incoming vs outgoing traffic.

·      Error Rates: HTTP 5xx responses from upstream services.

·      Connection Metrics: Active connections, dropped connections.

·      Latency: Time taken to process and forward requests.

 

Watch for spikes in error rates; they often point to failing backend services.
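An error rate is usually computed from two ever-increasing counters: total requests and failed requests. The sketch below shows the idea on a short list of counter samples, treating any drop in a counter as a restart, which is similar in spirit to how Prometheus handles counter resets. The function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so `curr` is the increase.
        total += (curr - prev) if curr >= prev else curr
    return total


def error_ratio(request_samples, error_samples):
    """Share of requests that failed during the sampled window."""
    requests = counter_increase(request_samples)
    if requests == 0:
        return 0.0
    return counter_increase(error_samples) / requests


if __name__ == "__main__":
    requests_seen = [1000, 1100, 1200]  # cumulative request counter
    errors_seen = [10, 15, 30]          # cumulative 5xx counter
    print(error_ratio(requests_seen, errors_seen))
```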

 

In summary, instrumentation is all about watching your systems so you know how they’re doing: whether they’re healthy, fast, overloaded, or broken. No matter what kind of system you're dealing with (microservices, batch jobs, cron jobs, streaming apps, databases, or even frontend apps), there are key questions to ask:

 

·      Is it working as expected?

·      How fast is it?

·      Is anything failing?

·      Are we using too many resources?

·      Are there delays or bottlenecks?

 

Start small with the most important signals, like latency, traffic, errors, and saturation, and build from there. Use tools like Prometheus and Grafana to collect and visualize metrics, and follow simple naming conventions so your metrics make sense later.
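As an example of what "simple naming conventions" can look like, Prometheus recommends names made of letters, digits, and underscores, a `_total` suffix on counters, and base units such as seconds and bytes. The small checker below is only a sketch of those three rules. The function name and the list of discouraged unit suffixes are assumptions for illustration, not an official linter.

```python
import re

# Characters traditionally allowed in a Prometheus metric name.
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

# Illustrative, incomplete list of non-base units to flag.
NON_BASE_UNITS = ("_milliseconds", "_ms", "_minutes", "_kilobytes", "_megabytes")


def lint_metric_name(name, kind):
    """Return a list of naming problems for a metric of the given kind."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("invalid characters")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    if any(name.endswith(u) or (u + "_") in name for u in NON_BASE_UNITS):
        problems.append("use base units such as seconds or bytes")
    return problems


if __name__ == "__main__":
    for name, kind in [
        ("http_requests_total", "counter"),
        ("http_requests", "counter"),
        ("http_request_duration_ms", "histogram"),
    ]:
        print(name, "->", lint_metric_name(name, kind) or "ok")
```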

 

Always remember that you don’t have to track everything at once. Begin with a few critical metrics for each system type, and evolve your monitoring setup as your application grows.

 

By keeping it simple, consistent, and clear, you'll build a strong foundation for observability.
