Lab 8 - Instrumenting Applications
Lab Goal
This lab introduces client libraries and shows you how to use them to add Prometheus
metrics to applications and services. You'll get hands-on and instrument a sample application
to start collecting metrics.
Instrumenting - Generalized or specific metrics
In this workshop you've been given pre-instrumented demo applications (such as the services demo)
that rely on generalized auto-instrumentation. These use exporters to provide general
observability metrics, but not data specific to your business. To instrument your own
applications and services for the insights you want, you'll use the language-specific
Prometheus client libraries.
Let's start with a review of the metric types in Prometheus, then look at using the
Prometheus client libraries, and finally instrument an example Java application
using the Prometheus Java client library.
Instrumenting - Requirements for lab
This lab concludes with an example exercise where you instrument a simple Java application with
the four Prometheus metric types and collect them with a running Prometheus instance. The focus
is coding instrumentation, so let's assume you have the following:
- a working Prometheus instance, such as the one you used in previous labs in this workshop
- a basic understanding of coding. Java skills are nice to have but not required for this
lab, as you'll be walked through everything you need to do and given a working Java project
to start with.
Intermezzo - Reviewing Prometheus metrics collection
Prometheus collects metrics from target systems and applications by scraping them,
using a pull mechanism as follows:
- Targets are scraped over the HTTP protocol; the standard path is /metrics.
- Targets provide current states for each metric, sending a single sample for each tracked
time series, consisting of:
  - metric name
  - label set
  - sample value
- Each scraped sample is stored with a server-side timestamp added, building a set of
time series.
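The pull model above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how the Prometheus server is implemented, and the names `ScrapeSketch`, `Sample`, and `scrape` are hypothetical. The target reports only current values, and the scraper attaches its own timestamp as it stores each sample.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a scraper that turns "current values" into time series.
class ScrapeSketch {
    // One stored point: the scraper's timestamp plus the value the target reported.
    static final class Sample {
        final long timestampMillis;
        final double value;

        Sample(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Series identity (metric name plus label set) mapped to its stored samples.
    final Map<String, List<Sample>> series = new LinkedHashMap<>();

    // Store one scrape: every current value gets the same server-side timestamp.
    void scrape(Map<String, Double> currentValues, long nowMillis) {
        for (Map.Entry<String, Double> entry : currentValues.entrySet()) {
            series.computeIfAbsent(entry.getKey(), key -> new ArrayList<>())
                  .add(new Sample(nowMillis, entry.getValue()));
        }
    }

    public static void main(String[] args) {
        ScrapeSketch server = new ScrapeSketch();
        String id = "java_app_c_total{status=\"ok\"}";
        server.scrape(Map.of(id, 478.0), 1_000L);   // first scrape
        server.scrape(Map.of(id, 480.0), 16_000L);  // second scrape, 15 seconds later
        System.out.println(server.series.get(id).size() + " samples stored");
    }
}
```

The target never sees the timestamps or the history; it only answers each scrape with whatever its values are at that moment.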
Intermezzo - Exposing target metrics
For Prometheus to be able to scrape a target, that target must expose metrics in the proper
format over HTTP. An example taken from the example service used later in this lab shows the
format you can manually verify in your browser at http://localhost:7777/metrics:
# HELP java_app_c_total example counter
# TYPE java_app_c_total counter
java_app_c_total{status="error"} 239.0
java_app_c_total{status="ok"} 478.0
# HELP java_app_g_seconds is a gauge metric
# TYPE java_app_g_seconds gauge
java_app_g_seconds{value="value"} 7.29573889110867
# HELP java_app_h_seconds is a histogram metric
# TYPE java_app_h_seconds histogram
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.005"} 0
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.01"} 0
...
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="10.0"} 10
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="+Inf"} 239
java_app_h_seconds_count{method="GET",path="/",status_code="200"} 239
java_app_h_seconds_sum{method="GET",path="/",status_code="200"} 28475.853574282995
# HELP java_app_s_seconds is summary metric (request latency in seconds)
# TYPE java_app_s_seconds summary
java_app_s_seconds{status="ok",quantile="0.5"} 2.870230936180606
java_app_s_seconds{status="ok",quantile="0.95"} 4.888056778494996
java_app_s_seconds{status="ok",quantile="0.99"} 4.903344773262025
java_app_s_seconds_count{status="ok"} 239
java_app_s_seconds_sum{status="ok"} 607.9779550254922
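To make the line format above concrete, here is a minimal sketch that renders a single sample by hand. `ExpositionSketch` and `formatSample` are hypothetical names, not part of any Prometheus library, and in practice a client library does this serialization for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of any Prometheus library.
class ExpositionSketch {
    // Escape backslash, double quote, and newline inside a label value.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Render one sample as: name{label="value",...} value
    static String formatSample(String name, Map<String, String> labels, double value) {
        StringJoiner labelSet = new StringJoiner(",", "{", "}");
        labelSet.setEmptyValue(""); // no braces at all when there are no labels
        for (Map.Entry<String, String> label : labels.entrySet()) {
            labelSet.add(label.getKey() + "=\"" + escape(label.getValue()) + "\"");
        }
        return name + labelSet + " " + value;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("status", "error");
        System.out.println("# HELP java_app_c_total example counter");
        System.out.println("# TYPE java_app_c_total counter");
        System.out.println(formatSample("java_app_c_total", labels, 239.0));
    }
}
```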
Intermezzo - Instrumenting target for metrics
Because a target provides only the current values for the previously shared metrics, Prometheus
is responsible for collecting these individual values over time and creating time series. The
important part to remember is that the individual target application or service is only
instrumented to keep track of the current state of its metrics and does not ever buffer any
historical metrics states.
The various implementation details can be found in the Prometheus exposition formats
documentation. Instead of serializing the exposition format yourself, you can use one of the
client libraries to handle the protocol serialization and more. Let's look closer at how they
can help.
Instrumenting - Prometheus client libraries
The provided Prometheus client libraries assist with instrumenting your application code.
From application code you create a metrics registry to track all metrics objects, create and
update metrics objects (counters, gauges, histograms, and summaries), and expose the results
to Prometheus over HTTP. Together, these pieces make up the client library architecture.
Instrumenting - Using a client library
Using a client library from your application code is laid out in the following overview, numbered
in the order that each step would be implemented and used. The final piece of the puzzle is
Prometheus scraping the /metrics endpoint:
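As a rough illustration of that flow (registry, metric object, update, HTTP exposition, scrape), the sketch below uses only the JDK's built-in `com.sun.net.httpserver.HttpServer`. It is a hand-rolled stand-in, not the Prometheus Java client library's API. The class name, the `counter` and `render` helpers, and the metric name `java_app_requests_total` are all assumptions made for this example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.DoubleAdder;

// Hand-rolled stand-in for a client library; every name here is hypothetical.
class MiniRegistrySketch {
    // The registry tracks all metric objects by name.
    static final Map<String, DoubleAdder> REGISTRY = new ConcurrentSkipListMap<>();

    // Create (or look up) a counter-like metric object in the registry.
    static DoubleAdder counter(String name) {
        return REGISTRY.computeIfAbsent(name, n -> new DoubleAdder());
    }

    // Serialize the current state of every registered metric.
    static String render() {
        StringBuilder out = new StringBuilder();
        REGISTRY.forEach((name, value) -> {
            out.append("# TYPE ").append(name).append(" counter\n");
            out.append(name).append(' ').append(value.sum()).append('\n');
        });
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Application code updates the metric as work happens.
        counter("java_app_requests_total").add(1);

        // Expose the registry over HTTP; port 0 lets the OS pick a free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();

        // Play the part of Prometheus: scrape the endpoint once, then shut down.
        URI target = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/metrics");
        try (InputStream in = target.toURL().openStream()) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } finally {
            server.stop(0);
        }
    }
}
```

A real client library adds labels, HELP text, all four metric types, and a ready-made HTTP endpoint, so you would not normally write any of this plumbing yourself.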
Intermezzo - Reviewing metrics types: Counters
There are four metric types you'll be exploring in Prometheus for this lab.
The first is the counter. Counters track cumulative totals over time, such as
the total number of seconds spent handling requests. A counter's value only decreases when the
process that exposes it restarts, in which case its last value is forgotten and it resets
to zero. A counter metric is serialized like this:
java_app_c_total{status="error"} 239.0
java_app_c_total{status="ok"} 478.0
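Because a counter restarts from zero, anything that consumes counter samples has to treat a drop in value as a reset rather than as a negative change. The sketch below shows that idea on a plain array of scraped values. `CounterResetSketch` and `increase` are hypothetical names, and the code deliberately leaves out the time-window handling that Prometheus query functions apply.

```java
// Hypothetical sketch of reset-aware counter arithmetic.
class CounterResetSketch {
    // Total increase across consecutive samples of one counter series.
    static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double delta = samples[i] - samples[i - 1];
            // A drop means the process restarted and the counter began again at
            // zero, so the whole new value counts as increase since the reset.
            total += (delta >= 0) ? delta : samples[i];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] scraped = {470, 478, 3, 9}; // restart between 478 and 3
        System.out.println(increase(scraped)); // prints 17.0
    }
}
```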
Intermezzo - Reviewing metrics types: Gauges
Gauges track current tallies, things that increase or decrease over time,
such as memory usage or a temperature. A gauge metric is serialized like this:
java_app_g_seconds{value="value"} 7.29573889110867
Intermezzo - Reviewing metrics types: Histograms
Histograms allow you to track the distribution of a set of observed
values, such as request latencies, across a set of buckets. They also track the total number of
observed values and the cumulative sum of the observed values. A histogram metric is serialized
as a list of counter series, one per bucket, with an le label indicating the upper bound of
each bucket counter (here, a latency in seconds). Bucket counts are cumulative, so the +Inf
bucket always matches the _count series:
# HELP java_app_h_seconds is a histogram metric
# TYPE java_app_h_seconds histogram
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.005"} 0
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.01"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.025"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.05"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.1"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.25"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.5"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="1.0"} 1
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="2.5"} 3
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="5.0"} 5
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="10.0"} 10
java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="+Inf"} 312
java_app_h_seconds_count{method="GET",path="/",status_code="200"} 312
java_app_h_seconds_sum{method="GET",path="/",status_code="200"} 48719.75198531
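The cumulative bucket layout shown above can be reproduced with a small sketch. `HistogramSketch` is a hypothetical, non-thread-safe class written for this lab text, and the three bucket bounds in `main` are arbitrary example values. Each observation lands in the first bucket whose upper bound is greater than or equal to the value, and the rendered counts are running totals.

```java
// Hypothetical, non-thread-safe histogram; bounds must be sorted ascending.
class HistogramSketch {
    final double[] upperBounds; // finite bucket bounds, without +Inf
    final long[] bucketCounts;  // per-bucket counts; last slot is the +Inf overflow
    long count;
    double sum;

    HistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.bucketCounts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        int i = 0;
        // "le" means less than or equal, so advance only while value exceeds the bound.
        while (i < upperBounds.length && value > upperBounds[i]) {
            i++;
        }
        bucketCounts[i]++;
        count++;
        sum += value;
    }

    // Running total of all buckets up to and including index i.
    long cumulative(int i) {
        long total = 0;
        for (int j = 0; j <= i; j++) {
            total += bucketCounts[j];
        }
        return total;
    }

    String render(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < upperBounds.length; i++) {
            out.append(name).append("_bucket{le=\"").append(upperBounds[i]).append("\"} ")
               .append(cumulative(i)).append('\n');
        }
        out.append(name).append("_bucket{le=\"+Inf\"} ").append(count).append('\n');
        out.append(name).append("_count ").append(count).append('\n');
        out.append(name).append("_sum ").append(sum).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        HistogramSketch h = new HistogramSketch(0.1, 1.0, 10.0);
        for (double seconds : new double[] {0.05, 0.5, 5.0, 50.0}) {
            h.observe(seconds);
        }
        System.out.print(h.render("java_app_h_seconds"));
    }
}
```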
Intermezzo - Reviewing metrics types: Summaries
Summaries track the distribution of a set of values, such as request
latencies, as a set of quantiles. A quantile is like a percentile, but expressed on a range
from 0 to 1 instead of 0 to 100. For example, the 0.5 quantile is the 50th percentile. Like
histograms, summaries also track the total count and cumulative sum of the observed values. A
summary metric is serialized with a quantile label indicating the quantile:
java_app_s_seconds{status="ok",quantile="0.5"} 2.209168597209208
java_app_s_seconds{status="ok",quantile="0.95"} 4.270739610746089
java_app_s_seconds{status="ok",quantile="0.99"} 4.270739610746089
java_app_s_seconds_count{status="ok"} 312
java_app_s_seconds_sum{status="ok"} 729.9408091814233
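To see what a quantile value means, the sketch below computes one exactly over a small array using the nearest-rank rule. This is a simplification chosen for clarity: `SummarySketch` is a hypothetical name, and client libraries typically estimate quantiles with streaming algorithms rather than keeping and sorting every observation.

```java
import java.util.Arrays;

// Hypothetical sketch: exact nearest-rank quantile over all observations.
class SummarySketch {
    static double quantile(double[] observations, double q) {
        if (q < 0 || q > 1) {
            throw new IllegalArgumentException("quantile must be between 0 and 1");
        }
        if (observations.length == 0) {
            return Double.NaN;
        }
        double[] sorted = observations.clone();
        Arrays.sort(sorted);
        // Nearest rank: the smallest value with at least q of the data at or below it.
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] latencies = {5, 1, 4, 2, 3};
        System.out.println(quantile(latencies, 0.5));  // prints 3.0
        System.out.println(quantile(latencies, 0.99)); // prints 5.0
    }
}
```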
Instrumenting - Some library coding details to consider
Client libraries provide interfaces for creating and using metrics, and each library can
differ slightly for each type of metric. Depending on the type of metric, constructors
require different options. For example, creating a histogram calls for a bucket
configuration, whereas a counter needs no such type-specific parameters.
Metric objects also expose distinct state update methods for each type of metric. For example,
counters provide methods to increment the current value but never provide a method to set the
counter to an arbitrary value. Gauges on the other hand can be set to an absolute value and also
provide methods to decrease the current value.
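The difference in update methods can be sketched with two tiny classes. These are hypothetical stand-ins written for illustration; the method names `inc`, `dec`, and `set` are assumptions for this example and not a statement of any particular library's API.

```java
import java.util.concurrent.atomic.DoubleAdder;

// Hypothetical stand-ins showing how the update methods differ by metric type.
class MetricApiSketch {
    static final class Counter {
        private final DoubleAdder value = new DoubleAdder();

        void inc() {
            inc(1);
        }

        // Counters only go up: there is no set() and negative amounts are rejected.
        void inc(double amount) {
            if (amount < 0) {
                throw new IllegalArgumentException("counters can only increase");
            }
            value.add(amount);
        }

        double get() {
            return value.sum();
        }
    }

    static final class Gauge {
        private double value;

        synchronized void set(double newValue) { value = newValue; }
        synchronized void inc(double amount) { value += amount; }
        synchronized void dec(double amount) { value -= amount; }
        synchronized double get() { return value; }
    }

    public static void main(String[] args) {
        Counter requests = new Counter();
        requests.inc();
        requests.inc(2);

        Gauge inFlight = new Gauge();
        inFlight.set(10);
        inFlight.dec(3);

        System.out.println(requests.get()); // prints 3.0
        System.out.println(inFlight.get()); // prints 7.0
    }
}
```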
Instrumenting - Worried about library efficiency?
Don't worry, be happy!
All official Prometheus client libraries are implemented with efficiency and concurrency
safety in mind. State updates are highly optimized
such that incrementing a counter millions of times a second will still perform well. Also, state
updates and reads from metric states are fully concurrency-safe. This means you can update metric
values from multiple threads without locking issues. Applications are able to handle multiple
scrapes safely at the same time.
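The kind of lock-free, concurrency-safe update described above can be demonstrated with the JDK's own `java.util.concurrent.atomic.LongAdder`. This sketch only illustrates the concept; it makes no claim about how any client library is implemented internally, and `ConcurrentCounterSketch` is a hypothetical name.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo: many threads incrementing one counter without explicit locks.
class ConcurrentCounterSketch {
    static long countConcurrently(int threads, int incrementsPerThread) {
        LongAdder counter = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait so the final read sees every increment
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 100_000)); // prints 400000
    }
}
```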
Instrumenting - What metrics to track: USE
When you are just getting started and are unsure of what metrics you want to track, a good starting
point can be the USE Method. It's summarized as follows:
For every resource, check utilization, saturation, and errors.
These are a set of metrics useful for measuring things that behave like resources, used
or unused (queues, CPUs, memory, etc.):
- Utilization: the average time that the resource was busy servicing work
- Saturation: the degree to which the resource has extra work which it can't service, often queued
- Errors: the count of error events
Instrumenting - What metrics to track: RED
The goal of the RED Method is to ensure that the software application functions properly for
the end users above all else. These are the three key metrics you want to monitor for each
service in your architecture:
- Rate: request counters
- Errors: error counters
- Duration: distributions of time each request takes (histograms or summaries)
See also the Prometheus documentation on instrumentation for best practices for
instrumenting different types of systems.
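A minimal sketch of RED-style instrumentation wraps each request so that all three signals are recorded in one place. `RedSketch` and `handle` are hypothetical names. For brevity the duration is kept as a running sum only, where a real setup would feed a histogram or summary.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical wrapper recording Rate, Errors, and Duration for each request.
class RedSketch {
    final LongAdder requests = new LongAdder();               // Rate: derived from this counter
    final LongAdder errors = new LongAdder();                 // Errors: failed requests
    final DoubleAdder durationSecondsSum = new DoubleAdder(); // Duration: total seconds spent

    <T> T handle(Supplier<T> request) {
        long start = System.nanoTime();
        requests.increment();
        try {
            return request.get();
        } catch (RuntimeException e) {
            errors.increment();
            throw e;
        } finally {
            // Runs for both successful and failed requests.
            durationSecondsSum.add((System.nanoTime() - start) / 1e9);
        }
    }

    public static void main(String[] args) {
        RedSketch red = new RedSketch();
        red.handle(() -> "ok");
        try {
            red.handle(() -> { throw new IllegalStateException("boom"); });
        } catch (IllegalStateException expected) {
            // The failure is already counted; nothing more to do here.
        }
        System.out.println(red.requests.sum() + " requests, " + red.errors.sum() + " errors");
    }
}
```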
Instrumenting - Best practices metric names
The metric name of a time series describes an aspect of the system being monitored. Metric
names are not interpreted by Prometheus in any meaningful way, so here are a few best
practices for metric names:
- ensure human readability
- ensure validity by matching the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*
- ensure clarity of origin with a prefix, such as prometheus_ or java_app_
- ensure a unit suffix adhering to base units, such as prometheus_tsdb_storage_blocks_bytes
or prometheus_engine_query_duration_seconds
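The validity rule is easy to check in code. The sketch below applies the regular expression from the list above with `java.util.regex`. `MetricNameSketch` and `isValid` are hypothetical names, and the check covers only syntax, not the readability, prefix, or unit conventions.

```java
import java.util.regex.Pattern;

// Hypothetical helper that checks only the syntactic rule for metric names.
class MetricNameSketch {
    static final Pattern VALID_NAME = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");

    static boolean isValid(String name) {
        // matches() requires the whole string to fit the pattern.
        return VALID_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("java_app_c_total"));  // prints true
        System.out.println(isValid("9lives_total"));      // prints false (starts with a digit)
        System.out.println(isValid("http-requests"));     // prints false (hyphen not allowed)
    }
}
```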
Instrumenting - More best practices metric names
The basic metric types of counters, gauges, histograms, and summaries have their own naming
best practices as follows:
- counters are named with the suffix _total, such as prometheus_http_requests_total
- gauges take no special suffix; one exposing the current number of queries would be named
something like prometheus_engine_queries
- histograms and summaries also produce counter time series, which receive the following
suffixes; these are auto-appended, so you never have to specify them manually:
  - java_app_h_sum for total sum of observations
  - java_app_h_count for total count of observations
  - java_app_h_bucket for individual buckets of a histogram
Instrumenting - Metric label dangers!
Carving up your metrics with labels might feel very useful in the beginning, but be aware that
each label creates a new dimension. This means that each unique set of labels creates a unique
time series to be tracked, stored, and handled during queries by Prometheus. The number of
concurrently active time series is a bottleneck for Prometheus at scale (a few million is a
guideline for a large server).
Label dimensions for metrics are multiplicative, so if you add status_code and method
labels to your metric, the total series number is the product of
the number of different status codes and methods (all valid combinations). Then multiply that
cardinality by the number of targets for the overall time series cost.
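The multiplication is worth doing once by hand. The sketch below computes the worst-case series count for one metric; `CardinalitySketch` and `seriesCount` are hypothetical names, and the numbers (5 status codes, 4 methods, 10 targets) are made-up example values. The result is an upper bound, since not every label combination necessarily occurs.

```java
// Hypothetical helper: worst-case number of time series for a single metric.
class CardinalitySketch {
    static long seriesCount(long targets, int... valuesPerLabel) {
        long series = targets;
        for (int values : valuesPerLabel) {
            series *= values; // each label multiplies the number of combinations
        }
        return series;
    }

    public static void main(String[] args) {
        // 5 status codes x 4 methods, exposed by 10 targets
        System.out.println(seriesCount(10, 5, 4)); // prints 200
    }
}
```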
Instrumenting - Avoiding metric cardinality explosions
To avoid time series explosions, also known as cardinality bombs, consider
keeping the number of possible values well bounded for labels. Several really bad examples:
- storing IP addresses in a label value
- storing email addresses in a label value
- storing full HTTP paths in a label value, especially if they contain IDs or other unbounded
cardinality information
These examples create rapidly ever-increasing numbers of series that will overload your
Prometheus server quickly.
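One common way to keep a path label bounded is to replace variable segments with a placeholder before using the path as a label value. The sketch below is one possible approach, under the assumption that purely numeric path segments are IDs. `PathLabelSketch`, `normalize`, and the `:id` placeholder are hypothetical choices for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper that collapses numeric path segments into a placeholder.
class PathLabelSketch {
    // Assumption: a segment made only of digits is an ID.
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    static String normalize(String path) {
        return NUMERIC_SEGMENT.matcher(path).replaceAll("/:id");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/users/12345/orders/987")); // prints /users/:id/orders/:id
        System.out.println(normalize("/v2/items"));               // prints /v2/items
    }
}
```

With this in place, every user and order shares one label value per route instead of creating a new series per ID.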
Instrumenting - Example Java application
For the rest of this lab you'll be working on exercises that walk you through instrumenting
a simple Java application using the Prometheus Java client library.
Below you can choose to run the rest of this lab from the source project on your local machine,
or to generate a container image to run your instrumentation project: