Lab 3 - Introduction to the Query Language
Lab Goal
This lab introduces the Prometheus Query Language (PromQL) and sets up a demo project
to provide more realistic data for querying.
Prometheus - What query language?
Prometheus needs a query language so that users can query the stored metrics data, gain
ad-hoc insights, build visualizations and dashboards from that data, and report (alert)
when incoming data indicates that systems are not performing as desired.
This language is called PromQL and provides an open standard, unified way of selecting,
aggregating, transforming, and computing on the collected time series data. Note that
PromQL provides only READ access to the collected metrics data; Prometheus offers a
different path for WRITE access.
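To give you a first feel for the language, a PromQL query can be as simple as a metric
selector with label matchers. The metric below is a common node_exporter counter, used
here purely as an illustration (your own targets may expose different names):
# Select the idle CPU counter series from a node_exporter target
node_cpu_seconds_total{mode="idle"}
# Turn the raw counter into a per-second rate over the last 5 minutes
rate(node_cpu_seconds_total{mode="idle"}[5m])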
PromQL - Proving PromQL compliance?
PromQL is an open standard in that it is widely integrated by many vendors into their
products, which raises the question: how can I be sure that a product is 100% compatible
with the real open source PromQL found in the Prometheus project?
To answer this question, the PromQL Compliance Tester was added to the larger
Prometheus compliance project.
Follow the documentation and you can test any vendor you like, or you can browse one of the
formatted results published online.
PromQL - Compliance testing results
PromQL - Prometheus architecture
Remember the overview architecture of Prometheus as it was presented in the first introduction?
The next slide will expose the Prometheus query engine:
PromQL - Query engine architecture
Taking a closer look at the Prometheus internals, we find that time series data (metrics)
are scraped from configured targets and stored in the TSDB. An internal PromQL engine
supports our ability to query that data. All queries are read-only. The query engine
supports both internal and external queries. Let's take a look at some rule terminology
before we dig any further:
Intermezzo - Defining queries and rules
Before we get too deep into PromQL, let's take a closer look at the rule terminology
you'll encounter. First, a query and a rule:
- Query - a PromQL query is not like SQL (SELECT * FROM ...), but consists of nested
functions, with each inner function returning its data to the next outer function (see
the sketch after this list).
- Rule - a configured query used to gather data and evaluate it, either as a
recording rule or an alerting rule.
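A quick sketch of that nesting, using a metric name assumed purely for illustration:
# innermost: select the raw counter series
# rate(...): turn each counter into a per-second rate over the last 5 minutes
# sum by (job): combine those rates into one series per job
sum by (job) (rate(http_requests_total[5m]))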
Intermezzo - Recording and alerting rules
Next, the recording rule and alerting rule, essential to more complex actions:
- Recording rule - used to pre-compute frequently used or computationally expensive
expressions and save the results for faster execution of later queries. Useful for
queries used in dashboards (refreshed often); see the sketch after this list.
- Alerting rule - defines an alert condition based on a PromQL expression; when it
fires, notifications are sent to external services.
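A minimal sketch of a recording rule file, with the rule and metric names chosen purely
for illustration (they are not part of this workshop's setup):
groups:
  - name: example-recording-rules
    rules:
      # Pre-compute the per-job request rate so dashboards can read the cheaper
      # job:http_requests:rate5m series instead of re-running the full query.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))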
Intermezzo - Aggregation and filtering
Finally, a look at aggregation and filtering, both important for optimizing query
execution and for trimming excessive, unused metrics:
- Aggregation - using operators that combine the elements of a single query result,
producing a new result with fewer elements by combining values (sum, min, max, avg, ...);
see the examples after this list.
- Filtering - the act of removing metrics from a query result by exclusion,
aggregation, or applying language functions to reduce the results.
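Here are both ideas as short PromQL sketches, again with assumed metric and label names:
# Aggregation: collapse all per-instance series into a single value per job
sum by (job) (rate(http_requests_total[5m]))
# Filtering: the label matchers keep only matching series, and the comparison
# drops any result at or below the threshold
rate(http_requests_total{job="api", status=~"5.."}[5m]) > 0.1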
PromQL - Prometheus internal queries
Now let's look at how internal queries run in Prometheus. Recording and alerting rules
are executed on a regular schedule to calculate rule results, such as whether an alert
needs to fire. As you configure new rules, these activities happen automatically:
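How often those rules run is controlled in the main Prometheus configuration; a minimal
sketch, assuming a hypothetical rule file named rules.yml alongside prometheus.yml:
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often recording and alerting rules are evaluated
rule_files:
  - rules.yml                 # hypothetical file containing your rule groups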
PromQL - The external queries
Queries can be sent to Prometheus externally using the Prometheus API (HTTP).
External users, user interfaces (UIs), and dashboards all query Prometheus metrics this
way using PromQL. This is also how the built-in Prometheus web console runs its queries:
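For reference, the two query endpoints look like this, assuming Prometheus on its default
port 9090 and with the PromQL expression passed (URL-encoded) in the query parameter:
# Instant query: evaluate the expression at a single point in time
http://localhost:9090/api/v1/query?query=up
# Range query: evaluate it over a time window at a given step resolution
http://localhost:9090/api/v1/query_range?query=up&start=<start>&end=<end>&step=15s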
PromQL - Exploring a few use cases
While there are many use cases that PromQL can support, it's possible to group them into
a few more general ones. These are common in your daily observability work, and we'll
explore each one in more detail:
- Ad-hoc querying
- Dashboards
- Alerting
- Automation
PromQL use cases - Ad-hoc querying
This use case is about running live queries against the collected time series data.
Imagine you are getting alerts while on-call at your organization: you open the dashboard,
and the pre-configured display gives you some hints as to the issue, but you want to dig
into specific data points. That's when you write your own ad-hoc query and execute it to
view the data in a graph:
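Such an ad-hoc query might look like the following sketch, narrowing a metric down to one
suspect instance; the metric, job, and instance values are assumptions for illustration:
# 5xx error rate for a single suspect instance over the last 5 minutes
sum(rate(http_requests_total{job="api", instance="10.0.0.5:8080", status=~"5.."}[5m]))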
PromQL use cases - Dashboard queries
This use case is where you create a layout of queries in what is known as a dashboard.
You design the layout of metrics, gauges, and charts you want to display for a specific
user viewing aspects of your systems. PromQL queries are used to collect the data; here a
query is embedded in a dashboard view using the Perses project (you'll learn about
dashboards later in this workshop):
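Each panel in such a dashboard typically wraps a single PromQL expression; a couple of
sketches, with metric names assumed for illustration:
# A gauge panel: current memory usage per instance (hypothetical metric name)
demo_memory_usage_bytes
# A chart panel: request rate per service over the last 5 minutes
sum by (service) (rate(http_requests_total[5m]))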
PromQL use cases - Alerting queries
The use of queries to watch your collected data for possible alerts is another use case. Prometheus
generates alerts based on queries such as this one looking for hardware failure:
groups:
  - name: Hardware alerts
    rules:
      - alert: Node down
        expr: up{job="node_exporter"} == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          title: Node {{ $labels.instance }} is down
          description: No scrape {{ $labels.job }} on {{ $labels.instance }}.
PromQL use cases - Dispatching alerts
To make these alerts useful, you might want to dispatch them to Slack, PagerDuty, or some
other notification mechanism. Here is an example of what Slack might look like when you
dispatch an alert notification:
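The actual dispatching is handled by Alertmanager rather than Prometheus itself; a minimal
sketch of a Slack receiver, assuming a hypothetical webhook URL and channel name:
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # hypothetical webhook URL
        channel: '#alerts'                                     # hypothetical channel name
        send_resolved: true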
PromQL use cases - Query automation
When you are automating your processes you can run PromQL queries against Prometheus collected
data and make choices based on the results. A few examples you might consider:
- In a CI/CD pipeline, inspecting a deployment stage's health before full deployment (see the sketch after this list).
- Kicking off a remediation process when a system alerts to a deteriorated state.
- Autoscaling to provision more infrastructure when increased load is detected.
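A sketch of the kind of expression such automation might evaluate, for example gating a
canary rollout on its error ratio; all metric and label names here are assumptions:
# Returns a result only when the canary's 5xx error ratio exceeds 5%,
# which an automation step could treat as "do not promote"
  sum(rate(http_requests_total{job="api", track="canary", status=~"5.."}[5m]))
/
  sum(rate(http_requests_total{job="api", track="canary"}[5m]))
> 0.05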
Services demo - Query architecture
That's enough theory about queries for now; let's look at installing and running a
services demo project (source: with thanks to this repository) that will allow you to
query somewhat realistic scraped services time series data. The architecture is simple:
Services demo - Metrics being generated
The services demo architecture shows the layout, but what are these services providing
for our Prometheus instance to collect metrics from? The demo exports synthetic metrics
(specifically designed metrics) about our simulated services; here are a few examples,
with sample queries sketched after this list:
- HTTP API server exposing request counts and latencies
- Periodic batch job exposing timestamp and number of processed bytes
- Metrics: CPU usage, memory usage, disk size, disk usage, and more
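Once Prometheus is scraping the demo, you'll be able to write queries like the following
sketches against it; the metric names shown are hypothetical stand-ins, so swap in the
real names once you can browse them in the Prometheus UI:
# Per-second API request rate over the last 5 minutes (hypothetical metric name)
rate(demo_api_http_requests_total[5m])
# 95th percentile request latency from a histogram (hypothetical metric name)
histogram_quantile(0.95, sum by (le) (rate(demo_api_request_duration_seconds_bucket[5m])))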
Options for installing services demo
There are several ways to install the services demo locally, so please click on the option you want
to use to continue with this workshop: