The `file` source is one of Vector's oldest and most widely used components.
While it began with a relatively simple interface and implementation, it has
collected a large number of knobs, warts, and general issues over time. To
ensure that this important source continues to serve users well, we should
invest effort into making it easier to use, extend, and debug.

This RFC will cover a general overhaul of the `file` source internals and
user-facing config. The focus will be on migrating the existing implementation
to a better structure, not rewriting from scratch.

There are three pages of issues tagged `source: file` in the Vector repo. Below
I've tried to collect some of the more notable ones and categorize them:
- File source retains files on disk (#763)
  - Slowness due to many files
  - Need sequential instead of fair reads
  - Unclear why file handles were open
  - Unclear that files were not done being processed
- Checksums are confusing (#828)
  - Checksums don't work with small files
- `start_at_beginning` is confusing (#1020)
  - Config behavior does not match user expectations
- Only warn about small files not empties (#1065)
  - Checksums don't work with small files
  - Poor observability
-
  - Correctness
- Periodically read a non-file file (#1198)
  - New mode
- Clean up file checkpoint files (#1427)
  - Checkpointing
- File source slows down with many completed files (#1466)
  - Slowness due to many files
  - Read scheduling
- Add path-based fingerprinting (#1948)
  - Checksums don't work with small files
  - inode fingerprinting is fraught
- Log permissions issues (#2420)
  - Poor observability
- Stop after reading files (backfill mode) (#3216)
  - New mode
- Tailing limited to single core (#3379)
  - Performance
- Not releasing descriptors due to falling behind (#3440)
  - Need fair reads instead of sequential reads
  - Unclear why file handles were open
  - Unclear that files were not done being processed
- `ignore_older` is confusing (#3567)
  - Config behavior does not match user expectations
- Ignore dangling symlinks (optionally?) (#3662)
  - Tweak for specific use case
- Make it easier to reuse internally (#4048)
- Batch mode (#4271)
  - New mode
- Tail only new data (#4382)
  - Config behavior does not match user expectations
- Slow with millions of files on EBS (#4434)
  - Slowness due to many files
  - Need sequential instead of fair reads
  - New mode
The goal of this RFC is not to fix all of these issues in one fell swoop, but to suggest a reorganization that renders the relevant behaviors as orthogonal as possible and enables addressing them easily and independently.
In order to achieve our goals, we need to break down the mostly-monolithic
`file` source into smaller components that can be understood, configured, and
extended in isolation, and then assembled smoothly into a cohesive unit.

To start, let's examine the current top-level structure of the `file` source.
This doesn't cover every interesting thing we want to address, but it's useful
context. In very rough pseudocode, it does the following:
```rust
let checkpoints = load_checkpoints();

// Find files we're configured to watch
let file_list = look_for_files();
// Prioritize files that exist on startup
sort(file_list);

loop {
    // Do these things occasionally to avoid burning CPU
    if its_time() {
        checkpoints.persist();
        let current_file_list = look_for_files();
        reconcile(&mut file_list, current_file_list);
    }

    for file in file_list {
        // Don't check inactive files as often
        if !should_read(file) {
            continue;
        }

        // Try to read new data from the file
        while let Some(line) = file.read_line() {
            output.push(line);
            // But not an infinite amount at one time
            if limit_reached { break }
        }

        // If configured, rm files when we're done with them
        maybe_rm(file);

        // Either continue to read the next file, or break to start back at the
        // beginning of the (prioritized) list
        if should_not_read_next_file() {
            break;
        }
    }

    // Drop handles of deleted files that we've finished reading
    unwatch_dead(&mut file_list);

    // Send the gathered data downstream
    emit(output);

    // If we're not seeing any new data, back off to avoid burning CPU
    maybe_backoff();

    // If Vector is shutting down, stop processing
    maybe_shutdown();
}
```
Excepting observability and performance, the primary user-facing concerns in the issues above are roughly as follows:
- Read scheduling
- File identity
- File starting point
These concerns are a good guide for determining which parts of the implementation tend to change together. Our first goal is to isolate those areas of change from one another and construct subcomponents that effectively contain complexity. With those seams in place, it should become simpler to improve the internals of each.
Based on the user-level concerns we collected from the list of issues, let's break down what each respective subcomponent should look like.
In the pseudocode above, read scheduling is controlled by `sort`, `should_read`,
`limit_reached`, `should_not_read_next_file`, and `maybe_backoff`. A component
would need to answer the following questions:
- In what order should I read the available files?
- Should I finish one file before moving on to the next?
- Should I back off reads to this file?
- Should I back off reads to all files (i.e. sleep)?
- How long should I spend working on a single file?
These questions would then be answered by a combination of configuration, file metadata, and gathered statistics.
There are (at least) two approaches we could take to building this component. The most obvious would be to implement a struct that contains configuration and exposes methods very much like the ones mentioned above. This would maintain the current structure of one main read loop with multiple points of control. An alternative would be to introduce a trait representing the logic of the read loop. This may lead to some duplication, but would allow simpler separation of different use cases and require readers of the code to understand the interaction between fewer subtle points of control.
I would propose that we begin with the simpler method of consolidating logic from the existing structure. This involves fewer design decisions, and we may find, with the other simplifications we're planning to make, that further simplifying the read loop is not worthwhile. That being said, once related features (like the batch mode discussed later) are complete, we can reevaluate this decision.
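To make that concrete, here is a minimal sketch of the struct-based approach, with roughly one method per question above. All names, fields, and heuristics are hypothetical, intended only to illustrate the shape of the component:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical stand-ins for state the source already tracks per file.
struct WatchedFile {
    modified: SystemTime,
    recently_active: bool,
}

struct ReadStats {
    bytes_read_this_pass: u64,
}

/// Sketch of a consolidated scheduler: one place that owns the config and
/// statistics behind `sort`, `should_read`, `limit_reached`, and friends.
struct ReadScheduler {
    oldest_first: bool,
    max_read_bytes: u64,
    backoff: Duration,
}

impl ReadScheduler {
    /// In what order should the available files be read?
    fn prioritize(&self, files: &mut [WatchedFile]) {
        if self.oldest_first {
            files.sort_by_key(|f| f.modified);
        }
    }

    /// Should reads to this file be backed off for now?
    fn should_read(&self, file: &WatchedFile) -> bool {
        // e.g. poll inactive files less often
        file.recently_active
    }

    /// How long should we spend on a single file before moving on?
    fn read_budget(&self) -> u64 {
        self.max_read_bytes
    }

    /// Should reads to all files back off (i.e. should we sleep)?
    fn maybe_backoff(&self, stats: &ReadStats) -> Option<Duration> {
        (stats.bytes_read_this_pass == 0).then(|| self.backoff)
    }
}
```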
Identity is an area where we already have the beginning of modularity with the
`Fingerprinter` abstraction, but it's incomplete. In our pseudocode, identity is
involved in both `look_for_files` and `reconcile`. It's not just a way to figure
out a magic identifier, but also the logic to update our list of watched files
based on those identifiers. It needs to answer the following:
- Given a visible path, does it contain a file I've seen before?
- If I have seen it before, has it been renamed?
- If I have seen it before, am I now seeing it in multiple places?
- If I'm seeing duplicates, how should I choose which to follow?
This logic is mostly in one place in the current implementation, so it should
not be terribly difficult to extract it. The use of `Fingerprinter` should
likely become purely internal to the new component.
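For illustration, the component's interface could look something like the following sketch, with fingerprints reduced to an opaque value for brevity. All names and variants here are hypothetical:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Hypothetical sketch of the identity component; the real `Fingerprinter`
// details would live behind this interface.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Fingerprint(u64);

enum Identified {
    /// A file we've never seen before.
    New(Fingerprint),
    /// A known file, exactly where we left it.
    Known(Fingerprint),
    /// A known file, now visible at a different path (i.e. renamed).
    Renamed(Fingerprint, PathBuf),
    /// A known file, already being followed at another (still existing)
    /// path; the caller must choose which duplicate to follow.
    Duplicate(Fingerprint, PathBuf),
}

struct FileIdentifier {
    // fingerprint -> the path we are currently following it under
    watched: HashMap<Fingerprint, PathBuf>,
}

impl FileIdentifier {
    /// Given a visible path and its fingerprint, decide whether it contains
    /// a file we've seen before and how the watch list should be updated.
    fn identify(&mut self, path: &Path, fp: Fingerprint) -> Identified {
        let prior = self.watched.get(&fp).cloned();
        match prior {
            None => {
                self.watched.insert(fp, path.to_path_buf());
                Identified::New(fp)
            }
            Some(known) if known.as_path() == path => Identified::Known(fp),
            Some(known) if known.exists() => Identified::Duplicate(fp, known),
            Some(old) => {
                self.watched.insert(fp, path.to_path_buf());
                Identified::Renamed(fp, old)
            }
        }
    }
}
```

The important property is that callers see only the resulting decision; how the fingerprint was computed stays an internal detail.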
In addition to simply consolidating the logic, we can expand and make the use of
`Fingerprinter` more intelligent. We currently have three ways it can work:
- Checksum (usually reliable but frustrating for small files)
- Device and inode (simple and works with small files, but doesn't handle edge cases well)
- First line checksum (solid for intended use case but not yet general)
I'd first propose that we drop device and inode fingerprinting and add path-based fingerprinting in its place. This gives users the option to do the simplest possible thing for use cases that don't need to worry about traditional rotation.
Next, I suggest we unify the two checksumming strategies. Neither is perfectly general, but we should be able to come up with a simple algorithm that combines the best of both. As a prerequisite, we should unify the read path such that these checksums handle compressed files correctly (discussed later). With that in place, the algorithm can look something like the following:
- Read up to `max_line_length` bytes from the file, starting at `ignored_header_bytes`
- Return no fingerprint if there is no newline in the returned bytes
- Otherwise, return the checksum of the bytes up to the first newline
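In code, a rough sketch of that algorithm might look like the following, where `checksum` is a placeholder for whatever CRC the implementation settles on and the read is assumed to go through the unified (decompressing) read path:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Rough sketch of the unified fingerprinting algorithm described above.
fn fingerprint(
    file: &mut File,
    ignored_header_bytes: u64,
    max_line_length: u64,
) -> io::Result<Option<u64>> {
    file.seek(SeekFrom::Start(ignored_header_bytes))?;

    // Read at most `max_line_length` bytes.
    let mut buf = Vec::new();
    file.by_ref().take(max_line_length).read_to_end(&mut buf)?;

    // No newline in the window: no fingerprint yet (the file is too small).
    Ok(buf
        .iter()
        .position(|&b| b == b'\n')
        .map(|i| checksum(&buf[..i])))
}

// Placeholder; the real implementation would use a proper CRC.
fn checksum(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(0, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}
```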
This should give a good balance between usability and flexibility for the default strategy. As we implement it, we should evolve the current representation of fingerprints to one that maintains information about how it was determined. This will allow us more flexibility to evolve and/or combine strategies in the future.
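One way to achieve that is a self-describing fingerprint enum (variant names are illustrative), so that checkpoints record which strategy produced each value and can be migrated between strategies:

```rust
use std::path::PathBuf;

// Sketch of a self-describing fingerprint representation.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Fingerprint {
    /// The unified first-line checksum strategy described above.
    FirstLineChecksum(u64),
    /// Path-based strategy for use cases without traditional rotation.
    Path(PathBuf),
    // Legacy strategies, kept so existing checkpoints can be migrated:
    Checksum(u64),
    DevInode(u64, u64),
}
```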
This logic is one layer below what's represented in the pseudocode above, but
there is just as much complexity. When we `look_for_files` and build watchers
for them, we need to make a decision about where to start reading in that file.
That decision should be based on any stored checkpoint, file metadata (e.g.
mtime), whether the file was found at startup or runtime, and how the source is
configured.
Since this decision really only happens in one place, the challenge is more about providing an understandable config UI than designing the right interface. This should be driven by real world use cases. For example:
- Ignoring existing checkpoints
- Starting at the beginning or end of existing files, optionally taking into account factors like mtime
- Starting at the beginning or end of files added while we're watching (this can be tricky)
- Ordering which of the above concerns take precedence
I would suggest a config like the following:

- `ignore_checkpoints = true|false` (still write, but don't read, existing checkpoints)
- `read_from = beginning|end` (where to start if there's no checkpoint or checkpoints are being ignored)
- `skip_older_than = duration` (relevant when `read_from = beginning`; seek to the end based on mtime)
- Always start at the beginning of files added while we're watching (it's hard to tell a `mv` from a create-and-write, so don't rely on seeing an empty file first to get all the data)
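As a hypothetical example of how these options might read together in a Vector config (names as proposed above, values purely illustrative):

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

# Still write checkpoints, but don't resume from existing ones.
ignore_checkpoints = true
# With no (usable) checkpoint, read existing files from the beginning...
read_from = "beginning"
# ...but seek to the end of any file whose mtime is older than this (seconds).
skip_older_than = 86400
```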
If we adopt this, we can also implement a solution where we don't hold open file
handles for files that match `skip_older_than`, since this has caused some
issues. Naively, we need the open handle to attempt reads in case new data is
written to the file, but we could also implement that as a new state of
`FileWatcher` that stashes the initial size and periodically checks the file
metadata (mtime and size) to see if it should start reading at what was the end.
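A sketch of that new state, with hypothetical names: the dormant watcher keeps only the metadata it last saw, and reopens the file at what was the end if the size or mtime changes:

```rust
use std::fs::{self, File};
use std::io::{self, Seek, SeekFrom};
use std::path::Path;
use std::time::SystemTime;

// Sketch of a watcher state for `skip_older_than` files.
enum WatcherState {
    /// Actively reading from an open handle.
    Open(File),
    /// No open handle; remember where the end was when we went dormant.
    Dormant { len: u64, mtime: SystemTime },
}

impl WatcherState {
    /// Periodically called for dormant files: if the size or mtime changed,
    /// reopen and start reading at what used to be the end of the file.
    fn maybe_wake(&mut self, path: &Path) -> io::Result<()> {
        if let WatcherState::Dormant { len, mtime } = *self {
            let meta = fs::metadata(path)?;
            if meta.len() != len || meta.modified()? != mtime {
                let mut file = File::open(path)?;
                file.seek(SeekFrom::Start(len))?;
                *self = WatcherState::Open(file);
            }
        }
        Ok(())
    }
}
```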
In addition to extracting these various subcomponents, there are some other relatively simple changes we can make to help address the issues we discussed.
We currently use the outdated `glob_minimum_cooldown` config option to determine
how often to perform both of these tasks (path discovery and checkpoint
persistence). We should switch them to their own independent config options and
allow them to be disabled (e.g. a batch use case does not need to continuously
look for new files).
We should also move them into their own periodic tasks, outside of the main read loop. This should give us better performance and help avoid bad behavior in situations where either is expensive (e.g. discovering millions of files on EBS).
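For example, discovery could become a small independent task that feeds newly found paths back to the read loop over a channel. A sketch assuming a tokio runtime, with `look_for_files` standing in for the existing glob logic:

```rust
use std::path::PathBuf;
use std::time::Duration;
use tokio::sync::mpsc::Sender;

// Stand-in for the existing glob-based discovery.
fn look_for_files() -> Vec<PathBuf> {
    Vec::new()
}

// Path discovery as its own task, on its own interval; passing `None`
// disables discovery entirely (e.g. for a batch use case).
async fn discovery_task(interval: Option<Duration>, found: Sender<PathBuf>) {
    let interval = match interval {
        Some(interval) => interval,
        None => return,
    };
    let mut ticker = tokio::time::interval(interval);
    loop {
        ticker.tick().await;
        for path in look_for_files() {
            if found.send(path).await.is_err() {
                return; // the main read loop has shut down
            }
        }
    }
}
```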
With most of the other concerns pulled out, we can afford to adjust the main read loop to allow for some new capabilities. The most interesting is concurrent reads, but it will require a bit of experimentation before we're able to determine if it's worthwhile. There are a few possible approaches:
- Dispatch reads to an explicit threadpool
- Spawn a limited number of blocking tokio tasks
- Implement something with `io_uring`
The first two both introduce questions of sizing, as well as whether the underlying file system can serve concurrent access in a way that actually adds performance. We would need to test a wide variety of scenarios to evaluate the best path for either.
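As a sketch of the second option (sizing, names, and error handling purely illustrative), a semaphore can bound how many blocking reads are in flight at once:

```rust
use std::path::PathBuf;
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task;

// Sketch only: read a whole file on the blocking pool, with the number of
// concurrent reads bounded by a shared semaphore.
async fn bounded_read(
    permits: Arc<Semaphore>,
    path: PathBuf,
) -> std::io::Result<Vec<u8>> {
    let _permit = permits
        .acquire_owned()
        .await
        .expect("semaphore closed");
    task::spawn_blocking(move || std::fs::read(&path))
        .await
        .expect("blocking read task panicked")
}
```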
Much more interesting, but also limited, would be to build on top of `io_uring`.
Given that it's only available in modern Linux, it could not be the only
implementation. But since modern Linux is the majority of Vector usage, it could
be enough to cover the most demanding use cases. This approach would let us
avoid any questions of thread counts and defer concurrency to the kernel, which
is much better equipped to make use of the available hardware.
Given these options, I would currently propose that we wait. There are other
changes in this RFC that should impact performance positively, and we'll have
a better view of the potential cost/benefit of these approaches once those have
landed. When we reach that phase, I would suggest starting with the tokio
filesystem interface, as future improvements like `io_uring` are likely to
better match the async interface.
As it stands, both `Fingerprinter` and `FileWatcher` instances read data from
files, one to checksum that data and the other to return lines. The problem is
that only `FileWatcher` handles compression, so our fingerprints can get
confused if a file is rotated and compressed.
To address that, we should evolve `FileWatcher` into a more general wrapper
struct for file handles. This will allow us to encapsulate all direct file
access within the struct, where we can more easily ensure that concerns like
compression are handled uniformly.
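A sketch of that wrapper, with hypothetical names and gzip handled via the `flate2` crate: both fingerprinting and line reading consume the same decompressed stream, so the two can never disagree about a file's contents:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Read};

use flate2::read::MultiGzDecoder;

// Sketch of a general-purpose file wrapper; all direct reads go through
// one (optionally decompressing) stream.
struct FileHandle {
    reader: BufReader<Box<dyn Read>>,
}

impl FileHandle {
    fn new(file: File, gzipped: bool) -> Self {
        let inner: Box<dyn Read> = if gzipped {
            Box::new(MultiGzDecoder::new(file))
        } else {
            Box::new(file)
        };
        FileHandle { reader: BufReader::new(inner) }
    }

    /// Used by both the fingerprinter and the read loop, so fingerprints
    /// see the same bytes that line reading does.
    fn read_line(&mut self, buf: &mut String) -> io::Result<usize> {
        self.reader.read_line(buf)
    }
}
```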
In addition to our current shutdown logic, we should add a configuration option and corresponding conditional for exiting the source once all files have reached EOF. This, along with disabled file discovery, would neatly and simply implement the oft-requested batch mode.
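The conditional itself could be as simple as the following sketch (names hypothetical), checked at the bottom of the read loop:

```rust
// Hypothetical types standing in for the source's config and watch list.
struct Config { exit_after_eof: bool }
struct Watched { at_eof: bool }

// With discovery disabled, once every watched file has reached EOF there
// is nothing left to do, and the source can shut itself down.
fn should_exit(config: &Config, files: &[Watched]) -> bool {
    config.exit_after_eof && files.iter().all(|f| f.at_eof)
}
```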
There's no grand design here, but we should do the work to go through and address all of the relevant issues. Some examples:
- Logging when we see that a file is deleted but need to keep it open
- Not logging noise around small or empty files
- Optionally silencing errors due to dangling symlinks
- Exposing which files are being read and our progress through them
For the last item, it would be particularly helpful to evolve how we store checkpoints. Instead of the strange filename-based system we have now, we should migrate towards a JSON file-based approach as laid out in #1779.
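As a rough illustration (field names hypothetical; see #1779 for the actual proposal), such a file might look like the following, with self-describing fingerprints making strategy migration possible:

```json
{
  "version": "1",
  "checkpoints": [
    { "fingerprint": { "first_line_checksum": 1234567890 }, "position": 4096 },
    { "fingerprint": { "dev_inode": [2049, 131842] }, "position": 0 }
  ]
}
```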
- Rename `ignore_older` to `skip_older_than`
- Replace `start_at_beginning` with `ignore_checkpoints` and `read_from` (see earlier section)
- Replace `oldest_first` with `mode = tail|tail_sequential|batch`
- Rename `glob_minimum_cooldown` to `discovery_interval` and disable it when `mode = batch`
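Put together, a hypothetical before/after migration might look like this (values purely illustrative):

```toml
# Before
[sources.logs]
type = "file"
include = ["/var/log/*.log"]
start_at_beginning = true
ignore_older = 3600
oldest_first = true
glob_minimum_cooldown = 1000

# After
[sources.logs]
type = "file"
include = ["/var/log/*.log"]
read_from = "beginning"
skip_older_than = 3600
mode = "tail_sequential"
discovery_interval = 1000
```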
The file source has dramatically overgrown its original design and is causing pain for both users and developers of Vector. It warrants spending some time and effort to improve both usability and maintainability. Otherwise, we risk losing a large and increasing amount of our time to its maintenance and user support.
By focusing on modularity and simple improvements over a rewrite, we will be able to maintain the good parts of the source's long history (accumulated bug fixes). This also reduces the risk of introducing new bugs that would be inevitable with a more aggressive rewrite.
Modularity also positions us well for the future by encouraging a split between user-facing config and implementation-level config. This matches well with our config composition RFC and should enable simple config facades for specific file-based use cases.
This is a backwards incompatible change to the config of one of our most widely used sources. To change the interface would likely inconvenience a large number of current users. There is also always a risk that bugs are introduced as part of refactoring, even though our approach is designed to minimize that risk.
One alternative would be to rewrite the source from scratch. While this would likely result in more maintainable code, it would risk losing much of the accumulated knowledge present in the existing implementation. It would also likely take much longer and present a more difficult transition plan than the evolution proposed here.
Another alternative would be to leave the implementation largely alone and focus solely on improving our documentation and config UI. While this would likely yield strong benefits, it would leave a number of important issues unaddressed and do little for our ability to extend the source in the future.
- Should we change the user-facing config at the same time as the implementation, or split the two?
- How should we handle transitioning users to the new interface?
The work with the highest payoff relative to investment is likely around checksumming, so I would suggest we attack that first:
- Migrate `FileWatcher` to a general-purpose file wrapper with fingerprinting
- Rework checkpoint persistence to allow differentiating and migrating between fingerprint types
- Combine the checksum and first-line checksum fingerprinting strategies
- Add a path-based fingerprinting strategy
- Deprecate the device/inode fingerprinting strategy
The next most valuable is removing extraneous work from the read loop and hopefully improving performance in some edge cases considerably:
- Move path discovery to its own task and interval
- Move checkpoint persistence to its own task and interval
With the relatively low hanging fruit taken care of, we can move on to the more general reorganization tasks that set the stage for configuration improvements:
- Extract scheduler component
- Extract file identity component (depends on `FileWatcher` work)
- Extract file starting point component (depends on `FileWatcher` work)
Followed by the new and improved configuration itself:
- Implement `version = 2` of the file source config with a deprecation warning
- Rename `ignore_older` to `skip_older_than`
- Replace `start_at_beginning` with `ignore_checkpoints` and `read_from`
- Replace `oldest_first` with `mode = tail|tail_sequential`
- Rename `glob_minimum_cooldown` to `discovery_interval`
Which in turn enable the new `batch` mode to be implemented and exposed:
- Implement batch mode shutdown conditions and the new config `mode`
Any time after the new file wrapper work is done, we can improve it to stop holding unnecessary file handles:
- Add a state to the file wrapper in which files that have been idle for a certain period of time are tracked without an open handle
And finally, or in parallel with any of the above, go through and do the work to smooth out observability warts:
- Log when file is no longer findable but watcher isn't dead
- Don't log on empty files
- Add option to disable logging on dangling symlinks