Lab 4 - Prometheus

Lab Goal

This lab dives into using PromQL to explore basic querying so that you can use it to visualize collected metrics data.

Basic PromQL - Review current lab setup

A quick review: Be sure you've completed the previous lab before proceeding.

If you've done that, then you now have Prometheus installed, configured, and running (maybe for some time now) collecting time series data known as metrics. You've also built and installed the services demo project to provide a service that Prometheus scrapes for a varied data collection. This involved re-configuring Prometheus to scrape the new services instance and a restart. The longer this setup is running, the more data your queries will display as the graph views can be adjusted by time intervals.

In this lab, you'll start using the query language and see what we can find out from our metrics data collection.

Basic PromQL - Starting with selecting metrics

The basic first step to querying your metrics data collection (time series data) is to select some portion of that data. You have to start somewhere, with a basic selection, before you move to transforming or performing calculations on your selected data.

When you select some metric, you are going to start by selecting all instances of that metric with no filtering at all. The second step will be to filter based on a metric NAME and one or more of its LABELS.

Basic PromQL - A few definitions before querying

Let's look at what you are going to be doing in your first basic queries:

Metric name - querying for all time series data collected for that metric

Label - using one or more assigned labels filters metric output

Timestamp - fixes a query to a single moment in time of your choosing

Range of time - setting a query to select over a certain period of time

Time-shifted data - setting a query to select over a period of time while adding an offset from the time of query execution (looking in the past)
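The five definitions above can be sketched as one query each. This is a minimal illustration that uses a metric name from the services demo. The Unix timestamp is a hypothetical value, and the @ modifier assumes a reasonably recent Prometheus version. Run each query separately in the expression browser.

```promql
# Metric name: every series collected for this metric.
demo_api_request_duration_seconds_count

# Label: only the series whose method label equals GET.
demo_api_request_duration_seconds_count{method="GET"}

# Timestamp: evaluate at one fixed moment (hypothetical Unix timestamp).
demo_api_request_duration_seconds_count @ 1700000000

# Range of time: all samples from the last 5 minutes.
demo_api_request_duration_seconds_count[5m]

# Time-shifted data: the 5 minute range as it was one hour ago.
demo_api_request_duration_seconds_count[5m] offset 1h
```

If the timestamp query returns nothing, your instance most likely has no data for that moment, so pick a timestamp within the time your environment has been running.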

Basic PromQL - Services demo metrics by name

You're going to start with queries targeting the services demo metric names you saw listed when you installed it in the previous lab. You can view all of them and their available labels in your browser at its metrics endpoint http://localhost:8080/metrics:

# HELP demo_api_http_requests_in_progress The current number of API HTTP requests in progress.
# TYPE demo_api_http_requests_in_progress gauge
demo_api_http_requests_in_progress 1
# HELP demo_api_request_duration_seconds A histogram of the API HTTP request durations in seconds.
# TYPE demo_api_request_duration_seconds histogram
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0001"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.00015000000000000001"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.00022500000000000002"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0003375"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.00050625"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.000759375"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0011390624999999999"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0017085937499999998"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0025628906249999996"} 0
demo_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0038443359374999994"} 0
...

Basic PromQL - Selecting a metric by name

Let's start by selecting the below metric name using the Prometheus expression browser which you can find at http://localhost:9090. Cut-and-paste the below line to select all time series data from our collection in Prometheus:

demo_api_request_duration_seconds_count

Basic PromQL - Results metric selection

As you can see, selecting the metric demo_api_request_duration_seconds_count returns every series with that name, each shown with all of its LABELS. In this case they are INSTANCE, JOB, METHOD, PATH, STATUS. Your results should look something like this:

Basic PromQL - Shorthand metric selection

The metric name selected was shorthand to make it easier to use. If you are interested, the full form to select a metric by name is as follows. Be sure to give it a try in your instance to see the same results as before:

{__name__="demo_api_request_duration_seconds_count"}

is the same as:

demo_api_request_duration_seconds_count

Basic PromQL - Filtering using a label

Now we can start filtering our selection output by using a specific label. You might have noticed that the output had two sorts of METHOD, either GET or POST, so we can narrow the output down by half just by selecting one of them as follows:

demo_api_request_duration_seconds_count{method="POST"}

Basic PromQL - Filtering auto-completion

Note that entering the query (typing by hand) in the expression browser includes auto-completion suggestions as you go:

Basic PromQL - Results filtering a label

The query filtered our output to only the entries containing a method with value POST, which has reduced our results listing by half:

Basic PromQL - Filtering with multiple labels

To further refine our search we can be more specific by adding more labels to our query. Try the following query with two labels filtering your results:

demo_api_request_duration_seconds_count{method="POST", path="/api/foo"}

Basic PromQL - Results multiple labels

This query with filtering using two labels narrowed our results some (note that all labels must match to select a metric):

Basic PromQL - Filtering with multiple labels

To further refine our search we can be more specific by adding more labels to our query. Three labels should be enough to get a single result, so try this one:

demo_api_request_duration_seconds_count{method="POST", path="/api/foo", status="500"}

Basic PromQL - Results multiple labels

This final query is specific enough after filtering with three labels to narrow our search results down to one final answer:

Basic PromQL - Possible matching operators

When filtering our queries up to now, we've only used the equality operator (=). There are more operators to help you with filtering:

=: Equals
!=: Not equal
=~: Regular expression match
!~: Regular expression non-match
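As a quick sketch of how these operators combine, the queries below filter the same demo metric by its method and status labels. The status values are taken from the earlier examples (200 and 500); any other values in your data are not assumed here. Run each query separately.

```promql
# GET requests that did not return status 200.
demo_api_request_duration_seconds_count{method="GET", status!="200"}

# Any request whose status starts with 5 (for example 500).
demo_api_request_duration_seconds_count{status=~"5.."}

# Everything except statuses starting with 2.
demo_api_request_duration_seconds_count{status!~"2.."}
```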

Basic PromQL - Filtering with non-equals

Revisiting the first selection query we did with a single label filter, we can apply non-equals to select results for this metric where the label METHOD is not POST:

demo_api_request_duration_seconds_count{method!="POST"}

Basic PromQL - Results non-equal label

The results of our non-equal query:

Basic PromQL - Explore more selecting labels

This is the time for you to now explore queries using equals and non-equals with a variety of metric names. Remember you can use auto-completion to explore the available labels.

Basic PromQL - Filtering with regular expressions

Now let's try and select using just a regular expression and leave out any specific metric name to see what metrics are using our API in a PATH label. You need curly brackets to query only a single label, or a comma separated list of labels:

{path=~"api"}

Basic PromQL - First attempt regexp

The results of our first attempt are disappointing. The reason it's empty is because Prometheus matches regexp against the full string instead of a partial string. We have no PATH labels with only the word api in them. In the next slide we can try again:

Basic PromQL - Filtering with regular expressions

We want to select any metric with a PATH label that contains the string api, so let's make sure we match everything that exists before and after the api string. Prometheus already anchors the pattern to the start and end of the label value, so no explicit anchor (^) is needed. A dot (.) matches any single character and the star (*) repeats it zero or more times, so we put .* before api to cover any leading characters and another .* after it for any remaining characters:

{path=~".*api.*"}
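Because a regular expression match always has to cover the whole label value, it can help to spell out the full value you expect. The sketch below uses the /api/foo and /api/bar paths seen earlier in this lab; run each query separately.

```promql
# Paths that start with /api/ followed by anything.
{path=~"/api/.*"}

# Exactly /api/foo or /api/bar.
{path=~"/api/(foo|bar)"}

# Anchors are implied, so this is the same query as {path=~"api"} and matches nothing here.
{path=~"^api$"}
```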

Basic PromQL - Filtering results for regexp

This looks much better: a long list of results (from multiple metric names, if you scroll down and inspect the selection results), and they all contain the string api embedded in the PATH label:

Basic PromQL - Danger with regular expressions

Now we get the bright idea to try out the regular expression non-match operator (!~) by adjusting our select query from looking for api in the PATH label to looking for all metrics that don't have the string foo in their PATH label, like this:

{path!~".*foo.*"}

Basic PromQL - Safety valve for regexp

We end up getting an error message like this and might think we did something wrong. We didn't, other than set up a query that is so wide in matching metrics that it would put massive load on your Prometheus instance. A non-match also selects every series that has no PATH label at all, so this selector could return almost everything Prometheus has stored. The fine engineers behind this project put a safety stop in place that rejects any selector in which every matcher can match an empty label value, which plays out in an error message:
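One way around the safety stop is to add a matcher that cannot match an empty value, which bounds the selection. A minimal sketch, using the services job label from this workshop; run each accepted query separately.

```promql
# Rejected: every matcher here can also match an empty label value.
# {path!~".*foo.*"}

# Accepted: job="services" cannot match an empty value, so the selector is bounded.
{job="services", path!~".*foo.*"}

# Accepted: path=~".+" requires a non-empty path label.
{path=~".+", path!~".*foo.*"}
```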

Basic PromQL - Explore more regexp

Now is the time for you to explore queries with your own regexps, trying to select results and seeing what works as you learn about these types of queries.

Basic PromQL - Instant vectors explained

Up to now you've only done queries selecting the single latest value for all series found, known as an INSTANT VECTOR. Some functions require a range of values and not just a single value. Such a selection is known as a RANGE VECTOR and has a duration specifier at the end in the form [<number><unit>], for example [5m]. Let's try this one to select all user cpu usage samples over the last minute:

demo_cpu_usage_seconds_total{mode="user"}[1m]

Valid durations are:

 ms - milliseconds
 s - seconds
 m - minutes
 h - hours
 d - days
 w - weeks
 y - years
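Units can also be combined into one duration, ordered from the largest unit to the smallest. A small sketch on the same CPU metric; both queries select the same range, so run either one.

```promql
# One and a half hours, written as a compound duration.
demo_cpu_usage_seconds_total{mode="user"}[1h30m]

# The same range expressed in a single unit.
demo_cpu_usage_seconds_total{mode="user"}[90m]
```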

Basic PromQL - Range vectors in action

The user cpu usage over a range vector of one minute:

Basic PromQL - Explore more range vectors

This is the time for you to now explore queries and try out various ranges to see what the resulting queries might be. Also try applying them to different metric names.

Basic PromQL - Time-shifting data selection

One issue that's going to come up is how to select data in the past, so that you can compare it to current data. This is known as time-shifting, where you run your selection query but want it to look in a time frame in the past. To do this you append offset and a duration, such as looking at user cpu usage over one minute back one hour ago:

demo_cpu_usage_seconds_total{mode="user"}[1m] offset 1h
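To compare past data with current data in a single expression, an offset selector can be subtracted from the un-shifted one. This is a minimal sketch using instant vectors rather than a range. It shows how much the counter has grown in the last hour and ignores any counter resets in between.

```promql
# Current value minus the value one hour ago, per series.
demo_cpu_usage_seconds_total{mode="user"}
-
demo_cpu_usage_seconds_total{mode="user"} offset 1h
```

As with the query above, shorten the offset if your environment has not been running for an hour yet.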

Basic PromQL - Results time-shifting selection

The user cpu usage over a range vector of one minute looking back to one hour ago (noting that if you get a "query returned no data" response, you might not have data collection going back that long, so just shorten the offset duration based on how long your environment has been running, such as 10m shown here):

Basic PromQL - Searching in the future

Time-shifting offsets can be a negative number, which causes selection queries to look into the future for data, relative to their current evaluation timestamp. These negative offsets are not a common use case and should be avoided unless you are sure you have a special reason for them. For this reason, they are not covered in this workshop.

Basic PromQL - Graphic visualization of metrics

Up to now you've just been exploring simple selections of counter metrics, which produced a table view like this from our first label-filtered query:

demo_api_request_duration_seconds_count{method="POST"}

Basic PromQL - Selecting counters for graphs

This is not a useful metric to visualize in a graph (select the graph tab to view), as it's an absolute value that only increases over time:

What we are more interested in is HOW FAST a count increases over time, so let's explore that in the next slide...

Basic PromQL - Visualizing using rate function

The function of choice for calculating how fast a counter changes over time is rate(). It calculates the per-second average rate of increase of a counter over a given period of time. In our example below, we are going to check the rate over a 5 minute window:

rate(demo_api_request_duration_seconds_count{method="POST"}[5m])

Basic PromQL - Graphing the rate function

The graph shows a more interesting view of the metric performance over time using the rate function. Note you might have to adjust the length of your viewing window, here shown at 2 hours, depending on your time of scraping metrics to see what is shown here with the changes over time:

Basic PromQL - Inside the rate function

The inner workings of the rate function are important to understanding what you are viewing in the resulting graphic visualizations:

counter metrics reset to 0 when the scraped process restarts
rate (during any window) assumes any counter decrease is a reset
rate adjusts all following samples in the window it's measuring to compensate
rate extrapolates from the first and last samples inside the window out to the actual window edges
because of this, rate can report non-integer results even for counters that only increase by integers
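The reset handling described above can be followed with a small worked example, written here as comments. The sample values are hypothetical and the extrapolation step is left out. The resets() function that follows is a way to see how many resets Prometheus detected for each series, using the services job label from this workshop.

```promql
# Hypothetical counter samples inside one window: 10, 14, 3, 7
# The drop from 14 to 3 is treated as a reset, so the pre-reset value is
# added to the samples that follow: 10, 14, 17, 21
# The increase used for the rate is therefore 21 - 10 = 11, not 7 - 10 = -3.

# Count how many resets each series saw in the last hour.
resets(demo_api_request_duration_seconds_count{job="services"}[1h])
```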

Basic PromQL - Visualizing inside rate function

This graphic shows how the rate function works over its window and how it deals with resets of the scraped process:

Basic PromQL - Visualizing using irate function

If we want to zoom in on a rate function, we can increase the resolution of our graph by using the irate() function. It looks only at the last two samples within a given window and calculates an INSTANTANEOUS per-second rate from them. The provided window only tells irate how far to look back for those two samples, so it reacts faster to counter changes. Let's see what our counter looks like in a higher resolution graph:

irate(demo_api_request_duration_seconds_count{method="POST"}[5m])

Basic PromQL - Graphing the irate function

The graph shows more deviations over the time frame given, uncovering short dips in the rate that weren't visible before. (Note that the rate function gives a smoother graph and is recommended for use in alerting rules that should not fire on short spikes in rate):

Basic PromQL - Visualizing using increase function

The rate and irate functions both report a per-second rate. If we want to query the total increase over a given time window, we need to use the increase() function:

increase(demo_api_request_duration_seconds_count{method="POST"}[30m])
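The relationship between the two functions can be sketched directly: increase() over a window gives the same value as rate() over that window multiplied by the number of seconds in it. Run each query separately and compare the results.

```promql
# Total increase over 30 minutes.
increase(demo_api_request_duration_seconds_count{method="POST"}[30m])

# The same value: per-second rate multiplied by the 1800 seconds in the window.
rate(demo_api_request_duration_seconds_count{method="POST"}[30m]) * 30 * 60
```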

Basic PromQL - Graphing the increase function

The graph shows the total increase over our 30 minute time window for this counter:

Intermezzo - Note about rate and friends

This family of functions needs at least two (2) samples from within the specified window or it will not be able to return any output. The standard is to set window sizes to be four times (4x) the scrape interval to ensure reliable output even when failures happen, such as shown below:
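As a sketch of the four times rule, the window below is sized for an assumed scrape interval. The 15s and 1m intervals are assumptions for illustration, so check the scrape_interval in your own Prometheus configuration before picking a window.

```promql
# Assumed scrape interval of 15s: 4 x 15s = 1m is the smallest recommended window.
rate(demo_api_request_duration_seconds_count{job="services"}[1m])

# Assumed scrape interval of 1m: the window grows to 4m.
rate(demo_api_request_duration_seconds_count{job="services"}[4m])
```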

Basic PromQL - Selecting gauges for graphs

The rate, irate, and increase functions only help visualize counter metrics, because they treat every decrease in value as a reset and only output non-negative values. When we want to track values that can decrease as well as increase, such as temperature, we use gauge metrics. There are two functions we can use to help visualize gauge metrics, the deriv() and delta() functions.

Basic PromQL - Visualizing using deriv function

To track our services api requests over the last 15 minutes, both increases and decreases, we can use the deriv() function. You might need to play with the measurement window to catch the collected data, based on how long you have had your instance running for this workshop. Here you see 15 minutes as an example:

deriv(demo_api_request_duration_seconds_count{job="services"}[15m])

Basic PromQL - Graphing the deriv function

The graph shows what our gauge metric has captured over the 15 minute window used in this displayed example output:

Basic PromQL - Visualizing using delta function

To track our services api requests over the last 15 minutes, sampling only the first and last values in the given time window, we can use the delta() function. You might need to play with the measurement window to catch the collected data, based on how long you have had your instance running for this workshop. Here you see 15 minutes as an example:

delta(demo_api_request_duration_seconds_count{job="services"}[15m])
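The Prometheus documentation recommends deriv() and delta() for gauges only, and demo_api_request_duration_seconds_count is the count series of a histogram, which behaves like a counter. As a hedged alternative, the sketch below applies both functions to demo_api_http_requests_in_progress, which the metrics endpoint lists as a gauge. It assumes this metric carries the same job="services" label as the other demo metrics.

```promql
# Per-second slope of in-progress requests, fitted over the last 15 minutes.
deriv(demo_api_http_requests_in_progress{job="services"}[15m])

# Difference between the end and the start of the same 15 minute window.
delta(demo_api_http_requests_in_progress{job="services"}[15m])
```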

Basic PromQL - Graphing the delta function

The graph shows what our gauge metric has captured over the 15 minute window used in this displayed example output. Note that the smooth graph is now more jagged, as each point is based on just two data points (first and last in the time window):

Basic PromQL - Visualizing the future

Let's have some fun and look at visualizing how our gauge metric will look one hour in the future. This sort of query is useful when, for example, building an alert to tell you when your memory is about to fill up in the next hour. The function predict_linear() in the following query will try to predict what the memory usage will be in one hour, based on its development in the last 15 minutes:

predict_linear(demo_memory_usage_bytes{job="services"}[15m], 3600)
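To turn the prediction into the kind of alert condition mentioned above, a comparison can be added at the end. This is only a sketch: the 1 GiB limit is a hypothetical threshold, not a value from the services demo.

```promql
# Returns a series only when memory is predicted to exceed 1 GiB within the hour.
# The 1 GiB limit is a hypothetical threshold for illustration.
predict_linear(demo_memory_usage_bytes{job="services"}[15m], 3600) > 1024 * 1024 * 1024
```

An empty result means the prediction stays under the threshold, which is the healthy case.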

Basic PromQL - Graphing the future

The graph shows what might happen with our memory usage for the services demo job based on 15 minutes of historical data:

Basic PromQL - Visualize highly dimensional data

Up to now you've been visualizing time series data that is highly dimensional. That means you have the ability to drill down into more and more detail, such as using one, two, and then three labels to narrow your search of the data.

Now we are going to look at how you can aggregate over all these dimensions (labels for example) to get a less detailed view. To do this you'll be using aggregation functions sum, avg, min, and max. There are many more, see the aggregation operators documentation. Note that these operators do not aggregate over time, but across multiple series at each point in time.

Basic PromQL - Looking first at all dimensions

When we first look at the selection of metric data, the TABLE view shows all the dimensions (labels) we can use to narrow our search. Next we are going to aggregate across all these dimensions, over multiple series, at each point in time:

demo_api_request_duration_seconds_count{job="services"}

Basic PromQL - Visualizing using sum function

The function sum() first needs the individual per-second rates, each calculated over a 5 minute period of time for every dimension, and then adds them up across all those captured series into a single total:

sum(
  # individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

Basic PromQL - Graphing the sum function

The graph shows the results of our sum function across highly dimensional metrics:

Basic PromQL - Visualizing using avg function

The function avg() first needs the individual per-second rates, each calculated over a 5 minute period of time for every dimension, and then averages them across all those captured series:

avg(
  # individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

Basic PromQL - Graphing the avg function

The graph shows the results of averaging across our highly dimensional metric query:

Basic PromQL - Visualizing using min function

The function min() first needs the individual per-second rates, each calculated over a 5 minute period of time for every dimension, and then returns the smallest of them across all those captured series:

min(
  # individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

Basic PromQL - Graphing the min function

The graph shows the results of looking at minimums across our highly dimensional metric query:

Basic PromQL - Visualizing using max function

The function max() first needs the individual per-second rates, each calculated over a 5 minute period of time for every dimension, and then returns the largest of them across all those captured series:

max(
  # individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

Basic PromQL - Graphing the max function

The graph shows the results of looking at maximums across our highly dimensional metric query:

Basic PromQL - Visualizing using without function

The problem with aggregation functions is that they work across so many dimensions, which, as you can imagine, is resource intensive. It's always advisable to use these functions on partial sets of dimensional data, for example by excluding labels you do not need. The without() clause gives us the power to exclude some dimensional data in the same query we've been using:

sum without(method, status) (
  # individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

Basic PromQL - Graphing the without function

The first graph shows the dimensions being summed in the second, but without the method or status labels. The dimensions being queried have been cut by 50%:

Basic PromQL - Visualizing using by function

Another approach is to use inclusion with the by() clause. This gives us the power to aggregate while keeping only the listed dimensional data:

sum by(instance, job, path) (
  # individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

Basic PromQL - Graphing the by function

The first graph shows the dimensions being summed in the second, but the second includes only three labels. The dimensions being queried are only those you specified:

Intermezzo - without() over by()

There is a great article covering the details on why you should use without over by, see RobustPerception on by() vs. without()

The reality is that, as your metrics gain new labels over time, your expression results automatically include any newly added labels if you use without().

You won't have to revisit or edit aggregation expressions.
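The difference can be sketched with the two queries from the previous slides. With today's labels (instance, job, method, path, status) they return the same series. The region label mentioned in the comments is hypothetical and only illustrates what would happen if a new label appeared later.

```promql
# Keeps every label except method and status, including labels added later.
sum without(method, status) (
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)

# Keeps only instance, job, and path; a future label such as region
# (hypothetical) would be silently aggregated away.
sum by(instance, job, path) (
  rate(demo_api_request_duration_seconds_count{job="services"}[5m])
)
```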

Basic PromQL - The rest of aggregation story

Beyond those covered in this workshop, Prometheus supports more aggregators (see the documentation):

stddev(): calculates the standard deviation of all values within an aggregated group.
stdvar(): calculates the standard variance of all values within an aggregated group.
count(): calculates the total number of series within an aggregated group.
count_values(): calculates number of elements with the same sample value.
bottomk(k, ...): calculates the smallest k elements by sample value.
topk(k, ...): calculates the largest k elements by sample value.
quantile(φ, ...): calculates the φ-quantile (0 ≤ φ ≤ 1) over dimensions.
group(...): just group by labels and set the sample value to 1.
limitk(k, ...): samples k elements.
limit_ratio(r, ...): samples elements with approximately r ratio if r > 0, and the complement of such samples if r = -(1.0 - r).
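A few of these aggregators can be tried on the same demo metric used throughout this lab. This is a minimal sketch; the choice of 3 for k and 0.9 for the quantile are arbitrary values for illustration. Run each query separately.

```promql
# The three series with the highest request rate.
topk(3, rate(demo_api_request_duration_seconds_count{job="services"}[5m]))

# Number of series per path.
count by(path) (demo_api_request_duration_seconds_count{job="services"})

# 90th percentile of the per-series request rates at each point in time.
quantile(0.9, rate(demo_api_request_duration_seconds_count{job="services"}[5m]))
```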

Basic PromQL - Doing simple math

You can do any arithmetic you like with the query language, which is fun for numbers, so try a few of these to see all the supported operators in action (the first one is shown in example below):

(2 + 2 / 2) * 3
((4 - 2 / 2) * 3) * 2^2
(5 % 3)

Basic PromQL - Time series interesting math

It becomes much more interesting when you start applying math operators to time series queries for human-readable output. Take for example the first query below, which returns its results in bytes. In the second query, with a bit of applied math, dividing the results by 1024 twice turns these results into human-readable MBs:

demo_batch_last_run_processed_bytes{job="services"}

demo_batch_last_run_processed_bytes{job="services"} / 1024 / 1024

Basic PromQL - Interesting math results

The results of the first query in bytes followed by the massaged results in MBs:

Basic PromQL - The power of binary operations

A very powerful feature is that the query language supports binary operations between whole sets of time series. If we take the rate of the metric demo_api_request_duration_seconds_sum over 5 minutes, we track the time spent per second for each combination of labels (instance, job, method, path, status). If we divide that by the rate of the metric demo_api_request_duration_seconds_count, which counts the requests for the same set of dimensions (again, the labels), we get the average request duration over the last 5 minutes, broken out by each dimension (by each label):

rate(demo_api_request_duration_seconds_sum{job="services"}[5m])
/
rate(demo_api_request_duration_seconds_count{job="services"}[5m])
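If a breakdown by every label is too detailed, both sides can be aggregated first and then divided. The sketch below, one possible variation, gives the average request duration per path. Summing before dividing weights each series by its request count, which averaging the per-series results would not do.

```promql
# Average request duration per path: aggregate each side first, then divide.
sum by(path) (rate(demo_api_request_duration_seconds_sum{job="services"}[5m]))
/
sum by(path) (rate(demo_api_request_duration_seconds_count{job="services"}[5m]))
```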

Basic PromQL - Results binary operation

The results of a binary operation, average request duration broken out by each dimension:

Basic PromQL - Mismatched dimensions (labels)

In many cases the metrics you query will not have the same dimensions. When one side has more labels than the other, you have to tell the divide operator how to match the series. In the below example, the demo_cpu_usage_seconds_total metric has an additional mode label dimension. To calculate per-mode CPU usage divided by the number of cores, finding a per-core usage value of 0 to 1, you allow many series on the left (one per mode) to match a single series on the right with the group_left modifier. You also need to exclude the mode label from matching by explicitly listing only the mutually existing labels with the on modifier:

rate(demo_cpu_usage_seconds_total{job="services"}[5m])
/
on(job, instance) group_left demo_num_cpus{job="services"}

Basic PromQL - Results mismatch dimensions

The results of your binary operation, finding a per-core usage value of 0 to 1:

Basic PromQL - Other options to try

You can approach the previous problem from the other direction, using the ignoring() modifier to exclude the mode label dimension from matching. It does not apply here, but if the extra dimensions are on the right instead of the left side of the operation, you can use the group_right modifier instead. Let's try this:

rate(demo_cpu_usage_seconds_total{job="services"}[5m])
/
ignoring(mode) group_left demo_num_cpus{job="services"}
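A third option, sketched here, is to aggregate the extra dimension away before dividing, so that both sides carry the same labels and no matching modifier is needed. It gives one overall usage value per core instead of one per mode. The mode!="idle" filter assumes the services demo exposes an idle mode; if it does not, the filter simply has no effect.

```promql
# Overall non-idle CPU usage per core. Summing away mode leaves only the labels
# that demo_num_cpus also has, so no on(), ignoring(), or group_left is needed.
# The mode!="idle" filter assumes the demo service exposes an idle mode.
sum without(mode) (
  rate(demo_cpu_usage_seconds_total{job="services", mode!="idle"}[5m])
)
/
demo_num_cpus{job="services"}
```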

Basic PromQL - Results other options

The results of your binary operation, finding a per-core usage value of 0 to 1 using ignoring:

Lab completed - Results

Next up, exploring advanced queries...

Contact - are there any questions?

Eric D. Schabell
Director Evangelism
Contact: @ericschabell (@fosstodon.org) or https://www.schabell.org

Up next in workshop...

Lab 5 - Using Advanced Queries

Lab 4 - Exploring Basic Queries