Lab 6 - Avoiding telemetry data loss
Lab Goal
This lab explores how to avoid losing telemetry data in your telemetry pipeline with
Fluent Bit.
Intermezzo - Jumping to the solution
If you happen to be exploring Fluent Bit as an architect and want to jump straight to the solution in
action, we've included the configuration files in the easy install project from the source
install support directory (see the previous installing from source lab). Instead of creating all
the configurations as shown in this lab, you'll find them ready to use in the fluentbit-install-demo
root directory:
$ ls -l support/configs-lab-6/
-rw-r--r--@ 1 erics staff 166 Jul 31 13:12 Buildfile
-rw-r--r-- 1 erics staff 1437 Jul 31 14:02 workshop-fb.yaml
Data loss - The problem statement
In the previous lab we explored how input plugins can hit their ingestion limits when our
telemetry pipelines scale beyond the memory available for the default in-memory buffering of our
events.
We also saw that we can limit the size of our input plugin buffers to prevent our pipeline from
failing with out-of-memory errors, but that pausing ingestion can also lead to data loss if
clearing the input buffers takes too long.
In this lab, we'll explore another buffering solution that Fluent Bit offers to ensure data and
memory safety at scale: configuring filesystem buffering.
Data loss - The solution background
Let's explore how the Fluent Bit engine processes the data that input plugins emit. When an input
plugin emits events, the engine groups them into a Chunk. A Chunk is roughly 2MB in size. By
default, the engine places each Chunk only in memory.
We saw that limiting the in-memory buffer size did not solve the problem, so we are looking at
modifying this default behavior of placing Chunks only in memory. This is done by changing the
property storage.type from the default Memory to Filesystem.
It's important to understand that Memory and Filesystem buffering mechanisms are not mutually
exclusive. By enabling filesystem buffering for our input plugin, we automatically get both
performance and data safety.
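As a quick illustration (the full workshop configuration appears later in this lab), switching an
input plugin from the default in-memory buffering to filesystem buffering is a single property on
that input in the YAML configuration format used throughout this lab:

pipeline:
  inputs:
    - name: dummy
      tag: big.data
      storage.type: filesystem   # the default is memory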
Intermezzo - Filesystem configuration tips
By changing our buffering from memory to filesystem with the property storage.type filesystem,
the settings for mem_buf_limit are ignored. Instead, we need to use the property
storage.max_chunks_up to control the size of our memory buffer. Shockingly, when using the
default settings the property storage.pause_on_chunks_overlimit is set to off, so the input
plugins do not pause; instead they switch to buffering into the filesystem only. We can control
the amount of disk space used with storage.total_limit_size.
If the property storage.pause_on_chunks_overlimit is set to on, then the filesystem buffering
mechanism behaves just like the mem_buf_limit scenario demonstrated in the previous lab.
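To make these tips concrete, here is a hedged sketch (not one of the workshop files) showing where
each of these properties lives: storage.max_chunks_up is a service-level setting,
storage.pause_on_chunks_overlimit belongs to an input plugin, and storage.total_limit_size is set
per output; the 500M value is just an illustrative assumption:

service:
  storage.path: /tmp/fluentbit-storage
  storage.max_chunks_up: 5

pipeline:
  inputs:
    - name: dummy
      tag: big.data
      storage.type: filesystem
      # default: do not pause the plugin, spill extra chunks to the filesystem
      storage.pause_on_chunks_overlimit: off

  outputs:
    - name: stdout
      match: '*'
      # cap the disk space used by chunks queued for this output
      storage.total_limit_size: 500M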
Data loss - Configuration inputs
We will be using a backpressure (stressed) configuration for our Fluent Bit pipelines in this
lab. All examples are shown using containers (Podman); it is assumed you are familiar with
container tooling such as Podman or Docker.
We begin with a configuration file workshop-fb.yaml containing the INPUT phase: a simple dummy
plugin generating a large number of entries to flood our pipeline (note that the previous
mem_buf_limit fix is commented out):
# This file is our workshop Fluent Bit configuration.
#
service:
  flush: 1
  log_level: info

pipeline:

  # This entry generates a large amount of success messages for the workshop.
  inputs:
    - name: dummy
      tag: big.data
      copies: 15000
      dummy: {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah"}
      #mem_buf_limit 2MB
...
Data loss - Configuring outputs
Now ensure the output section of our configuration file workshop-fb.yaml, following the inputs
section, is as follows:
  # This entry directs all tags (it matches any we encounter)
  # to print to standard output, which is our console.
  #
  outputs:
    - name: stdout
      match: '*'
      format: json_lines
Data loss - Testing this stressed pipeline (container)
Let's now try testing our configuration by running it using a container image. The first thing
needed is to create a file called Buildfile. This will be used to build a new container image and
insert our configuration file. Note that this file needs to be in the same directory as our
configuration file; otherwise, adjust the file path names:
FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
Data loss - Building stressed image (container)
Now we'll build a new container image, naming it with a version tag, as follows using the
Buildfile and assuming you are in the same directory:
$ podman build -t workshop-fb:v10 -f Buildfile
STEP 1/3: FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
STEP 2/3: COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
STEP 3/3: CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
COMMIT workshop-fb:v10
Successfully tagged localhost/workshop-fb:v10
6e6ef0d3e4647242d465cf5c08bd7ebfbbd6bdaac40bf90737d46e0be794b060
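If you're using Docker instead of Podman, an equivalent build command should look like this (note
that Docker requires the build context directory as the final argument):

$ docker build -t workshop-fb:v10 -f Buildfile .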
Data loss - Running this stressed pipeline (container)
If we run our pipeline in a container configured with constricted memory, in our case around an
11MB limit (adjust this number for your machine as needed), we'll see the pipeline run for a bit
and then fail due to overloading (OOM):
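The exact run command is not shown on this slide; based on the 11MB limit above, the image we just
built, and the container name inspected on the validation slide, it would look something like the
following (leaving off --rm so the stopped container can still be inspected afterwards):

$ podman run --memory 11MB --name fbv8 workshop-fb:v10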
Data loss - Console output stressed pipeline (container)
The console output shows that the pipeline ran for a bit, in our case as shown below, before it
hit the OOM limit of our container environment (11MB). We can validate this by inspecting the
container on the next slide:
...
{"date":1722510405.222825,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510405.222826,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510405.222826,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722510405.222827, <<<< CONTAINER KILLED WITH OOM HERE
Data loss - Validating stressed == OOM
Below we inspect our container for an OOM failure to validate that our backpressure scenario
worked. The following command shows that the kernel killed our container due to an OOM error:
$ podman inspect fbv8 | grep OOM
"OOMKilled": true,
Data loss - Prevention with filesystem buffering
What we've seen is that when a channel floods with too many events to process, our pipeline
instance fails. From that point onwards we are unable to collect, process, or deliver any more
events.
Having already tried in a previous lab to manage this with mem_buf_limit settings, we've seen
that this is not the real fix either. To prevent data loss we need to enable filesystem
buffering, so that overloading the memory buffer means events are buffered in the filesystem
until there is memory free to process them. Let's try this...
Data loss - Filesystem buffering
The configuration of our telemetry pipeline in the INPUT phase needs a slight adjustment: we add
storage.type: filesystem to our input plugin, along with a few SERVICE section attributes to
enable it, as shown:
service:
  flush: 1
  log_level: info
  storage.path: /tmp/fluentbit-storage
  storage.sync: normal
  storage.checksum: off
  storage.max_chunks_up: 5

pipeline:
  inputs:
    - name: dummy
      tag: big.data
      copies: 15000
      dummy: {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah"}
      storage.type: filesystem
...
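Optionally, if you have a local fluent-bit binary available (for example from the earlier
installing from source lab), you can sanity-check the configuration before building the image;
the --dry-run flag parses the configuration and exits without starting the pipeline:

$ fluent-bit -c workshop-fb.yaml --dry-run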
Data loss - Notes about SERVICE section
The properties that might need some explanation are:
- storage.path - putting filesystem buffering in the tmp filesystem
- storage.sync - using normal and turning off checksum processing
- storage.max_chunks_up - set to ~10MB, amount of allowed memory for events
Data loss - Testing filesystem pipeline (container)
Now we'll build a new container image, naming it with a version tag, as follows using the
Buildfile and assuming you are in the same directory:
$ podman build -t workshop-fb:v11 -f Buildfile
STEP 1/3: FROM cr.fluentbit.io/fluent/fluent-bit:3.1.4
STEP 2/3: COPY ./workshop-fb.yaml /fluent-bit/etc/workshop-fb.yaml
STEP 3/3: CMD [ "fluent-bit", "-c", "/fluent-bit/etc/workshop-fb.yaml"]
COMMIT workshop-fb:v11
Successfully tagged localhost/workshop-fb:v11
d78206548aaaa855c25a9ef2596200c241a2e13c99fe74203f7604493a76fc53
Data loss - Running filesystem pipeline (container)
If we run our pipeline in a container configured with constricted memory (allow a slightly larger
value due to the memory needed for mounting the filesystem), in our case an 11MB limit as shown in
the command below, then we'll see the pipeline running without failure. The -v ./:/tmp volume
mount maps the container's /tmp directory, where storage.path points, onto the host directory we
start from, which is what lets us inspect the buffer files later. See the next slide to validate
that filesystem buffering is working:
$ podman run --rm -v ./:/tmp --memory 11MB --name fbv11 workshop-fb:v11
Data loss - Console output filesystem pipeline (container)
The console output shows that the pipeline runs until we stop it with CTRL-C, with events
rolling by as shown below. We can now validate the filesystem buffering by looking at the
filesystem as shown on the next slide:
...
{"date":1722514136.218682,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722514136.218682,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
{"date":1722514136.218683,"message":"true 200 success","big_data":"blah blah blah blah blah blah blah blah blah blah blah blah"}
...
Data loss - Validating filesystem buffering (container)
Check the filesystem from the directory where you started your container. While the pipeline is
running with memory restrictions, it will be using the filesystem to store events until memory is
free to process them. If you view the contents of the file before stopping your pipeline, you'll
see a messy message format stored inside (cleaned up for you here):
$ ls -l ./fluentbit-storage/dummy.0/1-1716558042.211576161.flb
-rw------- 1 username groupname 1.4M May 24 15:40 1-1716558042.211576161.flb
$ cat fluentbit-storage/dummy.0/1-1716558042.211576161.flb
??wbig.data???fP??
?????message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
?p???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
߲???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
?F???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
?d???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
...
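Since the chunk files are stored in Fluent Bit's internal binary format, the raw cat output above
is hard to read; as an optional alternative, the standard strings utility pulls out just the
readable portions:

$ strings fluentbit-storage/dummy.0/1-1716558042.211576161.flb | head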
Data loss - Thoughts on filesystem solution
This solution is the way to deal with backpressure and other issues that might flood your
telemetry pipeline and cause it to crash. It's worth noting that using a filesystem to buffer the
events also introduces the limits of the filesystem being used.
It's important to understand that just as memory can run out, so too can the filesystem storage
reach its limits. It's best to have a plan to address any possible filesystem challenges when
using this solution, but this is outside the scope of this workshop.
Data loss solution for pipelines completed!
Lab completed - Results
$ cat fluentbit-storage/dummy.0/1-1716558042.211576161.flb
??wbig.data???fP??
?????message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
?p???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
߲???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
?F???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
?d???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
...
Next up, avoiding telemetry data loss...
Contact - are there any questions?