Lab 5 - Understanding backpressure
Lab Goal
This lab explores backpressure in telemetry pipelines and how to address it with Fluent Bit.
Intermezzo - Jumping to the solution
If you are exploring Fluent Bit as an architect and want to jump straight to the solution in
action, we've included the configuration files in the easy install project's support directory
(see the previous lab on installing from source). Instead of creating all the configurations
shown in this lab, you'll find them ready to use in the fluentbit-install-demo
root directory:
$ ls -l support/configs-lab-5/
-rw-r--r--@ 1 erics staff 166 Jul 31 13:12 Buildfile
-rw-r--r-- 1 erics staff 1437 Jul 31 14:02 workshop-fb.yaml
Backpressure - The basic problem
The purpose of our telemetry pipelines is to collect events, parse, optionally filter, optionally
buffer, route, and deliver them to predefined destinations. Fluent Bit is set up by default to
put events into memory, but what happens if that memory is not able to hold the flow of events
coming into the pipeline?
This problem is known as backpressure and leads to high memory consumption in the Fluent Bit
service. Other causes include network failures, latency, or unresponsive third-party services,
resulting in delays or failures to process data fast enough while new incoming data continues
to arrive. In high-load environments with backpressure, there's a risk of memory usage growing
until the hosting operating system terminates the Fluent Bit process. This is known as an
Out of Memory (OOM) error.
Let's configure an example pipeline and run it in a constrained environment, causing
backpressure and ending with the container failing with an OOM error...
Backpressure - Configuring OOM inputs
As we are going to cause catastrophic failures in our Fluent Bit pipelines in this lab,
all examples are shown using containers (Podman). It is assumed you are familiar with
container tooling such as Podman or Docker.
We begin configuring our telemetry pipeline in the INPUT phase with a simple dummy plugin
generating a large number of entries to flood our pipeline. Add this to a new workshop-fb.yaml
configuration file as follows:
# This file is our workshop Fluent Bit configuration.
#
service:
  flush: 1
  log_level: info

pipeline:
  # This entry generates a large number of success messages for the workshop.
  inputs:
    - name: dummy
      tag: big.data
      copies: 15000
      dummy: '{"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah"}'
Backpressure - Configuring OOM outputs
Now ensure the output section of our workshop-fb.yaml configuration file, following the inputs
section, is as follows:
  # This entry directs all tags (it matches any we encounter)
  # to print to standard output, which is our console.
  #
  outputs:
    - name: stdout
      match: '*'
      format: json_lines
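Optionally, if you installed Fluent Bit from source in the previous lab and have fluent-bit on
your path, you can sanity-check the configuration syntax before containerizing it. A dry run
validates the configuration and exits without starting the pipeline:
$ fluent-bit --dry-run --config workshop-fb.yaml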
Backpressure - Testing this pipeline (container)
Let's now test our configuration by running it in a container image. The first thing needed is
to create a file called Buildfile. This is going to be used to build a new container image and
insert our configuration file. Note this file needs to be in the same directory as our
configuration file; otherwise, adjust the file paths:
FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
Backpressure - Building this pipeline (container)
Now we'll build a new container image, naming it with a version tag, using the Buildfile and
assuming you are in the same directory:
$ podman build -t workshop-fb:v8 -f Buildfile
STEP 1/3: FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
STEP 2/3: COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
STEP 3/3: CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
COMMIT workshop-fb:v8
Successfully tagged localhost/workshop-fb:v8
ecacdf79820429d0fb10696138cd03803224c9acfe8946cf4aa317f1f179646a
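If you are using Docker instead of Podman, an equivalent build command should work, noting that
Docker requires the build context path at the end:
$ docker build -t workshop-fb:v8 -f Buildfile .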
Backpressure - Running this pipeline (container)
Now we'll run our new container image:
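Something like the following should work, noting that we name the container fb-oom so the stats
and inspect commands on the following slides can reference it (adjust the flags to your
container tooling):
$ podman run --rm -it --name fb-oom workshop-fb:v8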
Backpressure - Console output this pipeline (container)
The console output should look something like this, noting that we've cut out the ASCII logo
shown at startup. This runs until you exit with CTRL-C, but before we do that we need to get
some information about the memory settings so we can create an OOM experience. Leave this
running and go to the next slide to explore the container stats:
...
[2024/04/16 10:14:32] [ info] [input:dummy:dummy.0] initializing
[2024/04/16 10:14:32] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2024/04/16 10:14:32] [ info] [sp] stream processor started
[2024/04/16 10:14:32] [ info] [output:stdout:stdout.0] worker #0 started
[0] big.data: [[1713262473.231406588, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[1] big.data: [[1713262473.232578175, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[2] big.data: [[1713262473.232581509, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[3] big.data: [[1713262473.232583009, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[4] big.data: [[1713262473.232584217, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[5] big.data: [[1713262473.232585425, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[6] big.data: [[1713262473.232586550, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[7] big.data: [[1713262473.232587967, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[8] big.data: [[1713262473.232589134, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[9] big.data: [[1713262473.232590425, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
...
Backpressure - Container stats inspection
We need to find the memory usage of this pipeline, so while it's running we need to check the
container stats for our current memory numbers (MEM USAGE and LIMIT):
$ podman stats fb-oom
ID MEM USAGE / LIMIT
a9a25abc042a 9.925MB / 2.045GB
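If you prefer a one-shot snapshot over the live updating view, podman stats also supports
disabling streaming:
$ podman stats --no-stream fb-oom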
Backpressure - Simulating backpressure
If we run our pipeline in a container configured with constrained memory, in our case around an
11MB limit (adjust on your machine until you see it run a bit of logging before failing), we'll
see the pipeline run for a bit and then fail due to overloading (OOM):
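A command along these lines should trigger the failure; we leave off the --rm flag so the
stopped container remains available for inspection in a moment (the exact memory value may vary
on your machine):
$ podman run -it --memory=11MB --name fb-oom workshop-fb:v8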
Backpressure - Console output this pipeline (container)
The console output shows that the pipeline ran for a bit, in our case below to event number 1124,
before it hit the OOM limit of our container environment (11MB). We can validate this by
inspecting the container on the next slide:
...
{"date":1722510052.224602,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224602,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224606,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224607,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224607,"message":"true 200 success",
...
Backpressure - Validating backpressure OOM
Below we find the container id of our last failed container and inspect it for an OOM failure
to validate that our backpressure worked. The following command shows that the kernel killed
our container due to an OOM error:
$ podman inspect fb-oom | grep OOM
"OOMKilled": true,
Backpressure - Catastrophic failure prevention
What we've seen is that when a channel floods with too many events to process, our pipeline
instance fails. From that point onwards we are unable to collect, process, or deliver any
more events.
Our first try at fixing this problem is to ensure that our input plugin is not flooded with
more events than it can handle. We can prevent this backpressure scenario by setting a memory
limit on the input plugin with the configuration property mem_buf_limit, which limits the
amount of event data the plugin is allowed to buffer. Let's try this...
Backpressure - Memory buffer limit
The configuration of our telemetry pipeline in the INPUT phase needs a slight adjustment:
add mem_buf_limit as shown, set to 2MB to ensure we hit that limit while ingesting events:
...
pipeline:
  inputs:
    - name: dummy
      tag: big.data
      dummy: '{"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah"}'
      copies: 15000
      mem_buf_limit: 2MB
...
Backpressure - Building limited container
Now we'll build a new container image, naming it with a new version tag, using the Buildfile
and assuming you are in the same directory:
$ podman build -t workshop-fb:v9 -f Buildfile
STEP 1/3: FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
STEP 2/3: COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
STEP 3/3: CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
COMMIT workshop-fb:v9
Successfully tagged localhost/workshop-fb:v9
885ea0f09e493e9323671c6b9ffe962ed2857dcffb45225791b8bb736a424e45
Backpressure - Running limited container
Now we'll run our new container image and search for the amount of memory needed to trigger the
pausing of processing without breaking the container; in our case it was around 18MB (try
increasing the memory on your machine until you get the results shown on the next slide):
$ podman run --rm -it --memory=18MB workshop-fb:v9
Backpressure - Console output limited container
The console output should look something like this, running until you exit with CTRL-C. Before
we do that, we see that after a certain amount of time the input plugin pauses and then resumes
as its buffer fills and drains. This is highlighted below:
...
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
[2024/08/01 11:06:41] [ info] [input] resume dummy.0
[2024/08/01 11:06:41] [ info] [input] dummy.0 resume (mem buf overlimit)
[2024/08/01 11:06:42] [ warn] [input] dummy.0 paused (mem buf overlimit)
[2024/08/01 11:06:42] [ info] [input] pausing dummy.0
{"date":1722510401.216399,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216414,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216416,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
...
Intermezzo - How it works part 1
The mem_buf_limit property only applies when buffering in memory. Behind the scenes, our
previous example behaves as follows (using fictional data amounts):
- mem_buf_limit is set to 2MB
- input plugin tries to append 1.25MB
- engine routes 1.25MB of data to the output plugin
- output plugin backend is blocking delivery for some reason
- engine scheduler will retry delivery after 10 seconds
- input plugin tries to append the next 1MB
Here the engine allows appending the last 1MB into memory, now buffering 2.25MB of data and
exceeding the limit we set. This is a permissive limit, meaning a single write to memory is
allowed past the limit, but it triggers the following:
- blocks the input plugin from appending more data
- notifies the input plugin by invoking its pause notification
Intermezzo - How it works part 2
While the input plugin is paused, the engine will not append more data from that plugin. The
input plugin manages its own state and decides what to do when paused. When the engine can
finally deliver the initial 1.25MB of data (or gives up retrying), that amount of memory is
released, causing the following:
- when the 1.25MB of data is delivered, the internal memory counter is updated
- the internal memory counter now shows 1MB remaining in memory
- with 1MB < 2MB, the engine checks the input plugin state
- if the plugin is paused, it sends a resume notification
- the input plugin resumes and can start appending more data
Backpressure - Memory limiting failures
As you can imagine, this memory buffer limiting is not quite the solution to the backpressure
issues we are dealing with. While it does prevent the pipeline container from failing completely
due to high memory usage, because it pauses ingesting new records, it also potentially loses
data during those pauses as the input plugin clears its buffers. Once the buffers are cleared,
the ingestion of new records resumes.
In the next lab, we'll see how to achieve both data safety and memory safety by configuring a
better buffering solution with Fluent Bit.
Lab completed - Results
...
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
[2024/08/01 11:06:41] [ info] [input] resume dummy.0
[2024/08/01 11:06:41] [ info] [input] dummy.0 resume (mem buf overlimit)
[2024/08/01 11:06:42] [ warn] [input] dummy.0 paused (mem buf overlimit)
[2024/08/01 11:06:42] [ info] [input] pausing dummy.0
{"date":1722510401.216399,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216414,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216416,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
...
Next up, avoiding telemetry data loss...
Contact - are there any questions?