Lab 8 - Prometheus

Lab Goal

This lab introduces client libraries and shows you how to use them to add Prometheus metrics to applications and services. You'll get hands-on and instrument a sample application to start collecting metrics.

Instrumenting - Generalized or specific metrics

In this workshop you've been given pre-instrumented demo applications (such as the services demo) that leverage generalized auto-instrumentation. These use exporters to provide general observability metrics, but not specific, business-guided data. To instrument your own applications and services, you'll use language-specific Prometheus client libraries to track the insights you want.

Let's start with a review of the metric types in Prometheus, look at using the Prometheus client libraries, and finally instrument an example Java application using the Prometheus Java client library.

Instrumenting - Requirements for lab

This lab concludes with an example exercise where you instrument a simple Java application with the four Prometheus metric types and collect them with a running Prometheus instance. The focus is coding instrumentation, so let's assume you have the following:

a working Prometheus instance, such as the one you used in previous labs in this workshop
a basic understanding of coding; Java skills are nice to have but not needed, as you'll be walked through everything you need to do and given a working Java project to start with

Intermezzo - Reviewing Prometheus metrics collection

Prometheus collects metrics from target systems and applications by scraping them with a pull mechanism, as follows:

Targets are scraped over HTTP; the standard path is /metrics.
Targets provide the current state of each metric, sending:

single sample for each tracked time series
metric name
label set
sample value

Each scraped sample is stored with a server-side timestamp added, building a set of time series.
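To make the division of labor concrete, here is a minimal sketch of the server side of a scrape. It is a hypothetical model, not Prometheus source code: the class ScrapeSketch and its methods are names invented for this illustration. The target hands over only a metric name, a label set, and a value, and the timestamp is attached when the sample is stored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the pull mechanism; not how Prometheus stores data internally.
class ScrapeSketch {
    // Series identity -> list of {timestampMillis, value} points, oldest first.
    private final Map<String, List<double[]>> storage = new LinkedHashMap<>();

    // A series is identified by its metric name plus its (sorted) label set.
    static String seriesKey(String name, Map<String, String> labels) {
        return name + new TreeMap<>(labels);
    }

    // The target sent only name, labels, and value; the timestamp is added here.
    void ingest(String name, Map<String, String> labels, double value, long scrapeTimeMillis) {
        storage.computeIfAbsent(seriesKey(name, labels), k -> new ArrayList<>())
               .add(new double[] {scrapeTimeMillis, value});
    }

    // Returns the stored points for one series, or an empty list if it was never scraped.
    List<double[]> series(String name, Map<String, String> labels) {
        return storage.getOrDefault(seriesKey(name, labels), new ArrayList<>());
    }

    public static void main(String[] args) {
        ScrapeSketch server = new ScrapeSketch();
        Map<String, String> ok = Map.of("status", "ok");
        // Two scrapes of the same series, 15 seconds apart.
        server.ingest("java_app_c_total", ok, 478.0, 1_000L);
        server.ingest("java_app_c_total", ok, 490.0, 16_000L);
        System.out.println(server.series("java_app_c_total", ok).size() + " points stored");
    }
}
```

Each call to ingest stands in for one sample from one scrape; repeated scrapes of the same name and label set extend the same time series.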

Intermezzo - Exposing target metrics

For Prometheus to be able to scrape a target, that target must expose metrics in the proper format over HTTP. A sample taken from the example service used later in this lab shows the format, which you can verify manually in your browser at http://localhost:7777/metrics:

# HELP java_app_c_total example counter
# TYPE java_app_c_total counter
java_app_c_total{status="error"} 239.0
java_app_c_total{status="ok"} 478.0
# HELP java_app_g_seconds is a gauge metric
# TYPE java_app_g_seconds gauge
java_app_g_seconds{value="value"} 7.29573889110867
# HELP java_app_h_seconds is a histogram metric
# TYPE java_app_h_seconds histogram
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.005"} 0
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.01"} 0
...
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="10.0"} 10
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="+Inf"} 239
java_app_h_seconds_count{method="GET",path="/",status_code="200"} 239
java_app_h_seconds_sum{method="GET",path="/",status_code="200"} 28475.853574282995
# HELP java_app_s_seconds is summary metric (request latency in seconds)
# TYPE java_app_s_seconds summary
java_app_s_seconds{status="ok",quantile="0.5"} 2.870230936180606
java_app_s_seconds{status="ok",quantile="0.95"} 4.888056778494996
java_app_s_seconds{status="ok",quantile="0.99"} 4.903344773262025
java_app_s_seconds_count{status="ok"} 239
java_app_s_seconds_sum{status="ok"} 607.9779550254922

Intermezzo - Instrumenting target for metrics

Because a target provides only the current values for the previously shared metrics, Prometheus is responsible for collecting these individual values over time and creating time series. The important part to remember is that the individual target application or service is only instrumented to keep track of the current state of its metrics and never buffers any historical metric state.

The implementation details can be found in the Prometheus exposition formats documentation. Instead of serializing the exposition format yourself, you can use one of the client libraries, which handle the protocol serialization and more. Let's look closer at how they can help.
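To get a feel for what a client library takes off your hands, here is a rough sketch that renders a single counter in the text format by hand. It is an illustration only: ExpositionSketch and renderCounter are hypothetical names, the sketch supports just one label, and it skips the escaping rules a real library applies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-rolled rendering of one counter in the text exposition format.
// Illustration only: no label escaping, a single label, counters only.
class ExpositionSketch {
    static String renderCounter(String name, String help, String labelName,
                                Map<String, Double> valuesByLabel) {
        StringBuilder out = new StringBuilder();
        out.append("# HELP ").append(name).append(' ').append(help).append('\n');
        out.append("# TYPE ").append(name).append(" counter\n");
        // One line per time series: name{label="value"} sample
        for (Map.Entry<String, Double> e : valuesByLabel.entrySet()) {
            out.append(name)
               .append('{').append(labelName).append("=\"").append(e.getKey()).append("\"} ")
               .append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> values = new LinkedHashMap<>();
        values.put("error", 239.0);
        values.put("ok", 478.0);
        System.out.print(renderCounter("java_app_c_total", "example counter", "status", values));
    }
}
```

Running main prints the same four counter lines shown in the sample output earlier in this lab.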

Instrumenting - Prometheus client libraries

The provided Prometheus client libraries assist with instrumenting your application code. From your application code, you create a metrics registry to track all metric objects, create and update those metric objects (counters, gauges, histograms, and summaries), and expose the results to Prometheus over HTTP. The client library architecture:

Instrumenting - Using a client library

Using a client library from your application code is laid out in the following overview, numbered in the order that each step would be implemented and used. The final piece of the puzzle is Prometheus scraping the /metrics endpoint:

Intermezzo - Reviewing metrics types: Counters

There are four metrics types you'll be exploring in Prometheus for this lab.

The first is a Counter. Counters track cumulative totals over time, such as the total number of seconds spent handling requests. A counter only decreases in value when the process that exposes it restarts, in which case its last value is forgotten and it is reset to zero. A counter metric is serialized like this:

# HELP java_app_c_total example counter
# TYPE java_app_c_total counter
java_app_c_total{status="error"} 239.0
java_app_c_total{status="ok"} 478.0
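Because a restart resets a counter to zero, anything that reads a series of counter samples has to treat a drop as a reset rather than a negative change. The sketch below shows that idea in its simplest form, similar in spirit to what functions such as rate() do at query time but without any of their extrapolation details. CounterResetSketch and increase are hypothetical names used only here.

```java
// Simplified reset-aware increase over consecutive counter samples.
// Assumption: at most one restart happens between two neighboring samples.
class CounterResetSketch {
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double cur = samples[i];
            // A drop means the process restarted and counted up from zero again.
            total += (cur >= prev) ? cur - prev : cur;
        }
        return total;
    }

    public static void main(String[] args) {
        // The counter restarts between the second and third sample.
        double[] samples = {470.0, 478.0, 5.0, 12.0};
        System.out.println(increase(samples)); // 8 + 5 + 7 = 20.0
    }
}
```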

Intermezzo - Reviewing metrics types: Gauges

Gauges track current tallies, things that increase or decrease over time, such as memory usage or a temperature. A gauge metric is serialized like this:

# HELP java_app_g_seconds is a gauge metric
# TYPE java_app_g_seconds gauge
java_app_g_seconds{value="value"} 7.29573889110867

Intermezzo - Reviewing metrics types: Histograms

Histograms allow you to track the distribution of a set of observed values, such as request latencies, across a set of buckets. They also track the total number of observed values and the cumulative sum of the observed values. A histogram metric is serialized as a list of counter series, one per bucket, with an le label indicating the upper bound of each bucket (here, a latency in seconds). Bucket counts are cumulative: each bucket counts every observation less than or equal to its le value.

# HELP java_app_h_seconds is a histogram metric
# TYPE java_app_h_seconds histogram
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.005"} 0
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.01"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.025"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.05"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.1"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.25"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.5"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="1.0"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="2.5"} 3
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="5.0"} 5
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="10.0"} 10
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="+Inf"} 312
java_app_h_seconds_count{method="GET",path="/",status_code="200"} 312
java_app_h_seconds_sum{method="GET",path="/",status_code="200"} 48719.75198531
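The bucket, count, and sum series above can be reproduced with a small amount of bookkeeping. The following is a minimal sketch of that bookkeeping, not the client library's implementation; HistogramSketch and its methods are hypothetical names, and the three bucket bounds are chosen only to keep the example short.

```java
import java.util.Arrays;

// Minimal histogram bookkeeping: per-bucket counts, total count, and sum.
class HistogramSketch {
    private final double[] upperBounds;  // sorted; the +Inf bucket is implicit
    private final long[] bucketCounts;   // non-cumulative, last slot is +Inf
    private long count;
    private double sum;

    HistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.bucketCounts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        int i = 0;
        // le means "less than or equal", so a value equal to a bound stays in that bucket.
        while (i < upperBounds.length && value > upperBounds[i]) {
            i++;
        }
        bucketCounts[i]++;
        count++;
        sum += value;
    }

    // Cumulative counts as exposed with le labels; the last entry is le="+Inf".
    long[] cumulative() {
        long[] out = new long[bucketCounts.length];
        long running = 0;
        for (int i = 0; i < out.length; i++) {
            running += bucketCounts[i];
            out[i] = running;
        }
        return out;
    }

    long count() { return count; }
    double sum() { return sum; }

    public static void main(String[] args) {
        HistogramSketch h = new HistogramSketch(1.0, 2.5, 5.0);
        for (double v : new double[] {0.5, 2.0, 2.5, 7.0}) {
            h.observe(v);
        }
        System.out.println(Arrays.toString(h.cumulative())); // [1, 3, 3, 4]
        System.out.println(h.count() + " observations, sum " + h.sum());
    }
}
```

Note how the +Inf bucket always equals the total count, just as the le="+Inf" and _count lines match in the serialized output.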

Intermezzo - Reviewing metrics types: Summaries

Summaries track the distribution of a set of values, such as request latencies, as a set of quantiles. A quantile is like a percentile, but expressed on a range from 0 to 1 instead of 0 to 100. For example, the 0.5 quantile is the 50th percentile. Like histograms, summaries also track the total count and cumulative sum of the observed values. A summary metric is serialized with a quantile label indicating the quantile:

# HELP java_app_s_seconds is summary metric (request latency in seconds)
# TYPE java_app_s_seconds summary
java_app_s_seconds{status="ok",quantile="0.5"} 2.209168597209208
java_app_s_seconds{status="ok",quantile="0.95"} 4.270739610746089
java_app_s_seconds{status="ok",quantile="0.99"} 4.270739610746089
java_app_s_seconds_count{status="ok"} 312
java_app_s_seconds_sum{status="ok"} 729.9408091814233
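The quantile idea can be sketched with a naive nearest-rank calculation over stored values. This is an illustration only: it assumes every observation fits in memory, whereas client libraries generally estimate quantiles from a stream of observations. QuantileSketch and quantile are hypothetical names.

```java
import java.util.Arrays;

// Naive nearest-rank quantile over a fully stored set of observations.
class QuantileSketch {
    // q must be in (0, 1]; values need not be sorted.
    static double quantile(double q, double[] values) {
        if (values.length == 0 || q <= 0.0 || q > 1.0) {
            throw new IllegalArgumentException("need at least one value and 0 < q <= 1");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latencies = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(quantile(0.5, latencies));  // 5.0
        System.out.println(quantile(0.95, latencies)); // 10.0
        System.out.println(quantile(0.99, latencies)); // 10.0
    }
}
```

With only ten observations, the 0.95 and 0.99 quantiles land on the same value, which is one way identical quantile values like those in the sample output above can arise.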

Instrumenting - Some library coding details to consider

Client libraries provide interfaces for creating and using metrics and each library can be slightly different for each type of metric.

Depending on the type of metric, constructors require different options. For example, creating a histogram involves a bucket configuration, while a counter needs no such parameters.

Metric objects also expose distinct state update methods for each type of metric. For example, counters provide methods to increment the current value but never provide a method to set the counter to an arbitrary value. Gauges on the other hand can be set to an absolute value and also provide methods to decrease the current value.
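The difference in update methods can be sketched with two tiny classes. These are not the Prometheus Java client API: SimpleCounter and SimpleGauge are hypothetical stand-ins written with JDK classes only, meant to show which operations each metric type allows.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.DoubleAdder;

// Hypothetical counter: can only go up, and offers no way to set a value.
class SimpleCounter {
    private final DoubleAdder value = new DoubleAdder();

    void inc() { inc(1.0); }

    void inc(double amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("counters cannot decrease");
        }
        value.add(amount);
    }

    double get() { return value.sum(); }
}

// Hypothetical gauge: can be set to an absolute value, increased, or decreased.
class SimpleGauge {
    // The double is stored as raw bits so it can be updated atomically.
    private final AtomicLong bits = new AtomicLong(Double.doubleToLongBits(0.0));

    void set(double v) { bits.set(Double.doubleToLongBits(v)); }

    void inc(double amount) {
        bits.updateAndGet(b -> Double.doubleToLongBits(Double.longBitsToDouble(b) + amount));
    }

    void dec(double amount) { inc(-amount); }

    double get() { return Double.longBitsToDouble(bits.get()); }
}
```

A counter here accepts inc() and inc(2.5) but rejects inc(-1.0), while a gauge happily takes set(10.0) followed by dec(3.0).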

Instrumenting - Worried about library efficiency?

Don't worry, be happy!

All official Prometheus client libraries are implemented with efficiency and concurrency safety in mind. State updates are highly optimized such that incrementing a counter millions of times a second will still perform well. Also, state updates and reads from metric states are fully concurrency-safe. This means you can update metric values from multiple threads without locking issues. Applications are able to handle multiple scrapes safely at the same time.
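If you want to see what concurrency-safe updates look like in practice, here is a small sketch that hammers a single counter from several threads. It uses the JDK's LongAdder as one common technique for contended counters; this is an assumption for illustration and says nothing about how any particular client library is implemented. ConcurrentIncSketch and countWith are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Increments one shared counter from many threads without explicit locks.
class ConcurrentIncSketch {
    static long countWith(int threads, int incrementsPerThread) {
        LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.increment();
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for all workers so the final read sees every increment.
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for workers");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 100_000)); // 400000
    }
}
```

No increment is lost even though the threads never coordinate with each other, which is the property you rely on when instrumenting multi-threaded request handlers.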

Instrumenting - What metrics to track: USE

When you are just getting started and are unsure of what metrics you want to track, a good starting point can be the USE Method. It's summarized as follows:

For every resource, check utilization, saturation, and errors.

These are a set of metrics useful for measuring things that behave like resources that are either used or unused (queues, CPUs, memory, etc.).

Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can't service, often queued
Errors: the count of error events

Instrumenting - What metrics to track: RED

The goal of the RED Method is to ensure that the software application functions properly for the end users above all else. These are the three key metrics you want to monitor for each service in your architecture:

Rate: request counters
Errors: error counters
Duration: distributions of time each request takes (histograms or summaries)

See also the Prometheus documentation on instrumentation, which covers best practices for instrumenting different types of systems.

Instrumenting - Best practices metric names

The metric name of a time series describes an aspect of the system being monitored. Metric names are not interpreted by Prometheus in any meaningful way, so here are a few best practices for naming them:

ensure human readability
ensure validity by matching the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*
ensure clarity of origin with prefix, such as prometheus_ or java_app_
ensure unit suffix adhering to base units, such as prometheus_tsdb_storage_blocks_bytes or prometheus_engine_query_duration_seconds
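The validity rule is easy to check in code. Here is a short sketch using the regular expression from the list above; MetricNameSketch and isValid are hypothetical names, and the check covers only validity, not readability, prefixes, or unit suffixes.

```java
import java.util.regex.Pattern;

// Checks a candidate metric name against the naming regular expression.
class MetricNameSketch {
    private static final Pattern VALID = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");

    static boolean isValid(String name) {
        // matches() requires the whole name to fit the pattern, not just a part of it.
        return VALID.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("java_app_c_total")); // true
        System.out.println(isValid("7days_total"));      // false: starts with a digit
        System.out.println(isValid("http-requests"));    // false: hyphen not allowed
    }
}
```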

Instrumenting - More best practices metric names

The basic metric types of counters, gauges, histograms, and summaries each have their own naming best practices, as follows:

counters are named with the suffix _total, such as prometheus_http_requests_total
gauges take no special suffix; a gauge exposing the current number of queries would be named something like prometheus_engine_queries
histograms and summaries also produce counter time series; these receive the following suffixes, which are auto-appended so you never have to specify them manually:

java_app_h_sum for total sum of observations
java_app_h_count for total count of observations
java_app_h_bucket for individual buckets of histogram

Instrumenting - Metric label dangers!

Carving up your metrics with labels might feel very useful in the beginning, but be aware that each label creates a new dimension. This means that each unique set of labels creates a unique time series to be tracked, stored, and handled during queries by Prometheus. The number of concurrently active time series is a bottleneck for Prometheus at scale (a few million is a guideline for a large server).

Label dimensions for metrics are multiplicative, so if you add status_code and method labels to your metric, the total number of series is the product of the number of different status codes and methods (all valid combinations). Then multiply that cardinality by the number of targets for the overall time series cost.
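A quick back-of-the-envelope calculation shows how fast this grows. The sketch below multiplies the distinct values per label and then the number of targets; CardinalitySketch and seriesCount are hypothetical names, and the figures (5 status codes, 4 methods, 100 targets, 10,000 user IDs) are made up for illustration. It computes the worst case, where every combination actually occurs.

```java
// Worst-case series count for one metric: product of label cardinalities times targets.
class CardinalitySketch {
    static long seriesCount(long targets, long... distinctValuesPerLabel) {
        long perTarget = 1;
        for (long n : distinctValuesPerLabel) {
            // multiplyExact throws instead of silently overflowing.
            perTarget = Math.multiplyExact(perTarget, n);
        }
        return Math.multiplyExact(perTarget, targets);
    }

    public static void main(String[] args) {
        // status_code (5 values) x method (4 values) on 100 targets
        System.out.println(seriesCount(100, 5, 4));         // 2000
        // the same metric with a user_id label of 10,000 values added
        System.out.println(seriesCount(100, 5, 4, 10_000)); // 20000000
    }
}
```

One extra unbounded label takes this metric from a couple of thousand series to tens of millions.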

Instrumenting - Avoiding metric cardinality explosions

To avoid time series explosions, also known as cardinality bombs, consider keeping the number of possible values well bounded for labels. Several really bad examples:

storing IP addresses in a label value
storing email addresses in a label value
storing full HTTP paths in a label value

especially if they contain IDs or other unbounded cardinality information

These examples create rapidly ever-increasing numbers of series that will overload your Prometheus server quickly.
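One way to keep a path label bounded is to normalize the raw path before using it as a label value. The sketch below drops the query string and replaces purely numeric path segments with a placeholder. It is a simple illustration under stated assumptions: PathLabelSketch, normalize, and the {id} placeholder are hypothetical, and IDs that are not purely numeric (UUIDs, slugs) would need additional rules or, better, the route template from your web framework.

```java
import java.util.regex.Pattern;

// Collapses numeric path segments so the label value set stays small.
class PathLabelSketch {
    // A slash followed by digits, ending at the next slash or the end of the path.
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    static String normalize(String rawPath) {
        // Query strings are unbounded too, so they never go into the label.
        int q = rawPath.indexOf('?');
        String path = (q >= 0) ? rawPath.substring(0, q) : rawPath;
        return NUMERIC_SEGMENT.matcher(path).replaceAll("/{id}");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/users/12345/orders/987?page=2")); // /users/{id}/orders/{id}
        System.out.println(normalize("/v2/items"));                      // /v2/items
    }
}
```

Every user and order now maps to the same label value, so the number of series depends on how many routes you have rather than how many IDs exist.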

Instrumenting - Example Java application

For the rest of this lab you'll be working on exercises that walk you through instrumenting a simple Java application using the Prometheus Java client library. Below you can choose to run the rest of this lab from the source project on your local machine, or generate a container image to run your instrumentation project:

Lab 8 - Instrumenting Applications