Lab 5 - Understanding backpressure
Lab Goal
This lab explores backpressure in telemetry pipelines and how to address it with Fluent Bit.
Intermezzo - Jumping to the solution
If you are exploring Fluent Bit as an architect and want to jump straight to the solution in
action, we've included the configuration files in the easy install project's support directory
(see the previous lab on installing from source). Instead of creating all the configurations
shown in this lab, you'll find them ready to use in the fluentbit-install-demo
root directory:
$ ls -l support/configs-lab-5/
-rw-r--r--@ 1 erics staff 166 Jul 31 13:12 Buildfile
-rw-r--r-- 1 erics staff 1437 Jul 31 14:02 workshop-fb.yaml
Backpressure - The basic problem
The purpose of our telemetry pipelines is to collect events, parse, optionally filter, optionally
buffer, route, and deliver them to predefined destinations. Fluent Bit is set up by default to
put events into memory, but what happens if that memory is not able to hold the flow of events
coming into the pipeline?
This problem is known as backpressure and leads to high memory consumption in the Fluent Bit
service. Other causes include network failures, latency, or unresponsive third-party services,
resulting in delays or failures to process data fast enough while new incoming data continues
to arrive. In high-load environments with backpressure, there's a risk of memory usage growing
until the hosting operating system terminates the Fluent Bit process. This is known as an
Out of Memory (OOM) error.
Let's configure an example pipeline and run it in a constrained environment, causing
backpressure and ending with the container failing with an OOM error...
Backpressure - Configuring OOM inputs
As we are going to cause catastrophic failures in our Fluent Bit pipelines in this lab,
all examples are shown using containers (Podman). It is assumed you are familiar with
container tooling such as Podman or Docker.
We begin configuring our telemetry pipeline in the INPUT phase with a simple dummy plugin
generating a large number of entries to flood our pipeline. Add this to a new workshop-fb.yaml
configuration file as follows:
# This file is our workshop Fluent Bit configuration.
#
service:
  flush: 1
  log_level: info

pipeline:
  # This entry generates a large number of success messages for the workshop.
  inputs:
    - name: dummy
      tag: big.data
      copies: 15000
      dummy: '{"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah"}'
Backpressure - Configuring OOM outputs
Now ensure the output section of our workshop-fb.yaml configuration file, following the inputs
section, is as follows:
  # This entry directs all tags (it matches any we encounter)
  # to print to standard output, which is our console.
  #
  outputs:
    - name: stdout
      match: '*'
      format: json_lines
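Optionally, if you installed Fluent Bit from source in the previous lab and have fluent-bit on
your path, you can sanity-check the configuration syntax before containerizing it. A dry run
validates the configuration and exits without starting the pipeline:
$ fluent-bit --dry-run --config workshop-fb.yaml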
Backpressure - Testing this pipeline (container)
Let's now test our configuration by running it in a container image. The first thing needed is
to create a file called Buildfile. This is going to be used to build a new container image and
insert our configuration file. Note this file needs to be in the same directory as our
configuration file; otherwise, adjust the file paths:
FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
Backpressure - Building this pipeline (container)
Now we'll build a new container image, naming it with a version tag, using the Buildfile and
assuming you are in the same directory:
$ podman build -t workshop-fb:v8 -f Buildfile
STEP 1/3: FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
STEP 2/3: COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
STEP 3/3: CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
COMMIT workshop-fb:v8
Successfully tagged localhost/workshop-fb:v8
ecacdf79820429d0fb10696138cd03803224c9acfe8946cf4aa317f1f179646a
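If you are using Docker instead of Podman, an equivalent build command should work, noting that
Docker requires the build context path at the end:
$ docker build -t workshop-fb:v8 -f Buildfile .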
Backpressure - Running this pipeline (container)
Now we'll run our new container image:
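Something like the following should work, noting that we name the container fb-oom so the stats
and inspect commands on the following slides can reference it (adjust the flags to your
container tooling):
$ podman run --rm -it --name fb-oom workshop-fb:v8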
Backpressure - Console output this pipeline (container)
The console output should look something like this, noting that we've cut out the ASCII logo
shown at startup. This runs until you exit with CTRL-C, but before we do that we need to get
some information about the memory settings so we can create an OOM experience. Leave this
running and go to the next slide to explore the container stats:
...
[2024/04/16 10:14:32] [ info] [input:dummy:dummy.0] initializing
[2024/04/16 10:14:32] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2024/04/16 10:14:32] [ info] [sp] stream processor started
[2024/04/16 10:14:32] [ info] [output:stdout:stdout.0] worker #0 started
[0] big.data: [[1713262473.231406588, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[1] big.data: [[1713262473.232578175, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[2] big.data: [[1713262473.232581509, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[3] big.data: [[1713262473.232583009, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[4] big.data: [[1713262473.232584217, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[5] big.data: [[1713262473.232585425, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[6] big.data: [[1713262473.232586550, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[7] big.data: [[1713262473.232587967, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[8] big.data: [[1713262473.232589134, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[9] big.data: [[1713262473.232590425, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
...
Backpressure - Container stats inspection
We need to find the memory usage of this pipeline, so while it's running we need to check the
container stats for our current memory numbers (MEM USAGE and LIMIT):
$ podman stats fb-oom
ID MEM USAGE / LIMIT
a9a25abc042a 9.925MB / 2.045GB
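If you prefer a one-shot snapshot over the live updating view, podman stats also supports
disabling streaming:
$ podman stats --no-stream fb-oom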
Backpressure - Simulating backpressure
If we run our pipeline in a container configured with constrained memory, in our case around an
11MB limit (adjust on your machine until you see it run a bit of logging before failing), we'll
see the pipeline run for a bit and then fail due to overloading (OOM):
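A command along these lines should trigger the failure; we leave off the --rm flag so the
stopped container remains available for inspection in a moment (the exact memory value may vary
on your machine):
$ podman run -it --memory=11MB --name fb-oom workshop-fb:v8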
Backpressure - Console output this pipeline (container)
The console output shows that the pipeline ran for a bit, in our case below to event number 1124,
before it hit the OOM limit of our container environment (11MB). We can validate this by
inspecting the container on the next slide:
...
{"date":1722510052.224602,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224602,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224606,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224607,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510052.224607,"message":"true 200 success",
...
Backpressure - Validating backpressure OOM
Below we find the container id of our last failed container and inspect it for an OOM failure
to validate that our backpressure worked. The following command shows that the kernel killed
our container due to an OOM error:
$ podman inspect fb-oom | grep OOM
"OOMKilled": true,
Backpressure - Catastrophic failure prevention
What we've seen is that when a channel floods with too many events to process, our pipeline
instance fails. From that point onwards we are unable to collect, process, or deliver any
more events.
Our first try at fixing this problem is to ensure that our input plugin is not flooded with
more events than it can handle. We can prevent this backpressure scenario by setting a memory
limit on the input plugin with the configuration property mem_buf_limit, which limits the
amount of event data the plugin is allowed to buffer. Let's try this...
Backpressure - Memory buffer limit
The configuration of our telemetry pipeline in the INPUT phase needs a slight adjustment:
add mem_buf_limit as shown, set to 2MB to ensure we hit that limit while ingesting events:
...
pipeline:
  inputs:
    - name: dummy
      tag: big.data
      dummy: '{"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah"}'
      copies: 15000
      mem_buf_limit: 2MB
...
Backpressure - Building limited container
Now we'll build a new container image, naming it with a new version tag, using the Buildfile
and assuming you are in the same directory:
$ podman build -t workshop-fb:v9 -f Buildfile
STEP 1/3: FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
STEP 2/3: COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
STEP 3/3: CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
COMMIT workshop-fb:v9
Successfully tagged localhost/workshop-fb:v9
885ea0f09e493e9323671c6b9ffe962ed2857dcffb45225791b8bb736a424e45
Backpressure - Running limited container
Now we'll run our new container image and search for the amount of memory needed to trigger the
pausing of processing without breaking the container; in our case it was around 18MB (try
increasing the memory on your machine until you get the results shown on the next slide):
$ podman run --rm -it --memory=18MB workshop-fb:v9
Backpressure - Console output limited container
The console output should look something like this, running until you exit with CTRL-C. Before
we do that, we see that after a certain amount of time the input plugin pauses and then resumes
as its buffer fills and drains. This is highlighted below:
...
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
[2024/08/01 11:06:41] [ info] [input] resume dummy.0
[2024/08/01 11:06:41] [ info] [input] dummy.0 resume (mem buf overlimit)
[2024/08/01 11:06:42] [ warn] [input] dummy.0 paused (mem buf overlimit)
[2024/08/01 11:06:42] [ info] [input] pausing dummy.0
{"date":1722510401.216399,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216414,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216416,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
...
Intermezzo - How it works part 1
The mem_buf_limit property only applies when buffering in memory. Behind the scenes, our
previous example behaves as follows (using fictional data amounts):
- mem_buf_limit is set to 2MB
- input plugin tries to append 1.25MB
- engine routes 1.25MB of data to the output plugin
- output plugin backend is blocking delivery for some reason
- engine scheduler will retry delivery after 10 seconds
- input plugin tries to append the next 1MB
Here the engine allows appending the last 1MB into memory, now buffering 2.25MB of data and
exceeding the limit we set. This is a permissive limit, meaning a single write to memory is
allowed past the limit, but it triggers the following:
- blocks the input plugin from appending more data
- notifies the input plugin by invoking its pause notification
Intermezzo - How it works part 2
While the input plugin is paused, the engine will not append more data from that plugin. The
input plugin manages its own state and decides what to do when paused. When the engine can
finally deliver the initial 1.25MB of data (or gives up retrying), that amount of memory is
released, causing the following:
- when the 1.25MB of data is delivered, the internal memory counter is updated
- the internal memory counter now shows 1MB remaining in memory
- with 1MB < 2MB, the engine checks the input plugin state
- if the plugin is paused, it sends a resume notification
- the input plugin resumes and can start appending more data
Backpressure - Memory limiting failures
As you can imagine, this memory buffer limiting is not quite the solution to the backpressure
issues we are dealing with. While it does prevent the pipeline container from failing completely
due to high memory usage, because it pauses ingesting new records, it also potentially loses
data during those pauses as the input plugin clears its buffers. Once the buffers are cleared,
the ingestion of new records resumes.
In the next lab, we'll see how to achieve both data safety and memory safety by configuring a
better buffering solution with Fluent Bit.
Lab completed - Results
...
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510400.223572,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
[2024/08/01 11:06:41] [ info] [input] resume dummy.0
[2024/08/01 11:06:41] [ info] [input] dummy.0 resume (mem buf overlimit)
[2024/08/01 11:06:42] [ warn] [input] dummy.0 paused (mem buf overlimit)
[2024/08/01 11:06:42] [ info] [input] pausing dummy.0
{"date":1722510401.216399,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216414,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510401.216416,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
...
Next up, avoiding telemetry data loss...
Contact - are there any questions?