
Sync output to disk before removing source #151

Closed
wants to merge 2 commits into from

Conversation

sebastianas
Contributor

Synchronize the created output to disk before removing the original input. This lowers the risk of losing both the source and the destination if a crash happens shortly afterwards.

Add a function which returns the directory part of the filename. This is
either the name up to the last delimiter such as '/', or simply '.' if it
is the current directory.
The function reuses the inner parts of has_dir_sep() to find the
delimiter. The inner parts are moved to get_dir_sep(), which returns the
pointer, while has_dir_sep() still returns the bool.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
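The helper described in this commit message could be sketched roughly as follows. This is an illustrative sketch only: the names mirror the description above, but the actual patch may differ in detail (for example, xz also handles non-'/' separators on some platforms).

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch of the helpers described above; the real patch
 * factors the search out of has_dir_sep() into a pointer-returning
 * get_dir_sep(). */

/* Return a pointer to the last directory separator in str, or NULL
 * if there is none. */
static const char *get_dir_sep(const char *str)
{
	return strrchr(str, '/');
}

static int has_dir_sep(const char *str)
{
	return get_dir_sep(str) != NULL;
}

/* Copy the directory part of filename into buf: everything before the
 * last separator, "/" for files in the root directory, or "." when
 * there is no separator at all (current directory). */
static void dir_part(const char *filename, char *buf, size_t bufsize)
{
	const char *sep = get_dir_sep(filename);

	if (sep == NULL) {
		snprintf(buf, bufsize, ".");
	} else if (sep == filename) {
		snprintf(buf, bufsize, "/");
	} else {
		size_t len = (size_t)(sep - filename);
		if (len >= bufsize)
			len = bufsize - 1;
		memcpy(buf, filename, len);
		buf[len] = '\0';
	}
}
```

The directory name is what the sync code needs: to make the new file durable, the containing directory must be fsync'd, and for that it has to be opened by name.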
If the file is compressed (or decompressed) then the original file is
deleted by default after the written file is closed. All this is fine.

If the machine crashes then it is possible that the newly written file
has not yet hit the disk while the previous delete operation might have.
In that case neither the original file nor the written file is
available. To avoid this scenario, sync the file, followed by the directory
(to ensure that the file is part of the directory).

This may cause the `xz' command to take a bit longer because it now
additionally waits until the data has been written to disk.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
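On POSIX systems, the ordering described in this commit message can be sketched as below. This is a minimal illustration, not the actual patch; error handling is reduced to return codes, and the function name is made up.

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the crash-safe ordering described above: flush the newly
 * written file to disk, then sync its containing directory so that the
 * new directory entry is durable before the source file is unlinked. */
static int sync_file_and_dir(int fd, const char *dir_path)
{
	if (fsync(fd) != 0)
		return -1;

	int dir_fd = open(dir_path, O_RDONLY);
	if (dir_fd == -1)
		return -1;

	int ret = fsync(dir_fd);
	close(dir_fd);
	return ret;
}
```

Only after this returns successfully is it safe to unlink the original input file: both the data and the directory entry of the output have then reached stable storage.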
@Larhzu
Member

Larhzu commented Nov 25, 2024

The Debian bug 814089 has an example why this can matter. Sebastian and I have discussed this both recently and in 2016. There are three related xz-devel mailing list messages from 2016 as well.

I tested fsync at some point before XZ Utils 5.0.0, likely in 2009. Back then it was on rotating media. When compressing a large number of tiny files, the performance was horrible. Obviously no one cares about performance if it leads to data loss. However, some use cases are on files which can be easily re-created, and then the lack of fsync doesn't matter but performance does.

gzip 1.7 (2016) added the --synchronous option, which enables fsync usage. This way users can get the safer behavior when they need it. However, it might be that users don't learn about the option before an accident has already happened.

Adding gzip's --synchronous would be a conservative change in the sense that it wouldn't make current use cases slower. Opinions are welcome. Once we know what to do, we can focus on making the code do that. The current PR is fine for testing the performance impact on GNU/Linux.

Larhzu added a commit that referenced this pull request Dec 27, 2024
xz's default behavior is to delete the input file after successful
compression or decompression (unless writing to standard output).
If the system crashes soon after the deletion, it is possible that
the newly written file has not yet hit the disk while the previous
delete operation might have. In that case neither the original file
nor the written file is available.

The --synchronous option makes xz call fsync() on the file and possibly
the directory where the file was created. A similar option was added to
GNU gzip 1.7 in 2016. There are some differences in behavior:

  - When writing to standard output and processing multiple input files,
    xz calls fsync() after every file while gzip does so only after all
    files have been processed.

  - This has no effect on "xz --list". xz doesn't sync standard output
    in --list mode but gzip does.

Portability notes:

  - <libgen.h> and dirname() should be available on all POSIX systems,
    and aren't needed on non-POSIX systems.

  - fsync() is available on all POSIX systems. The directory syncing
    could be changed to fdatasync() although at least on ext4 it
    doesn't seem to make a performance difference in xz's usage.
    fdatasync() would need a build system check to support (old)
    special cases, for example, MINIX 3.3.0 doesn't have fdatasync()
    and Solaris 10 needs -lrt.

  - On native Windows, _commit() is used to replace fsync(). Directory
    syncing isn't done and shouldn't be needed. (In Cygwin, fsync() on
    directories is a no-op.) It is known that syncing will fail if
    writing to stdout and stdout isn't a regular file.

  - DJGPP has fsync() for files. ;-)

Co-authored-by: Sebastian Andrzej Siewior <[email protected]>
Link: https://bugs.debian.org/814089
Link: https://www.mail-archive.com/[email protected]/msg00282.html
Closes: #151
Closes: #159
@Larhzu
Member

Larhzu commented Dec 27, 2024

@sebastianas, I don't expect anyone to comment on this here. I feel the number of complaints will be smaller if xz does what gzip did, that is, add --synchronous. It's possible to add it in XZ_DEFAULTS if one wants it always enabled.

I created #159 to add --synchronous. It would be nice if you could check if it seems OK. I'm aware that a few choices I made are against the advice I gave you via email.

@sebastianas
Contributor Author

sebastianas commented Jan 3, 2025 via email

@Larhzu
Member

Larhzu commented Jan 4, 2025

Thanks for the comments! Part of me feels like you do, that is, the default behavior should be safe. So far I got one comment on IRC and that agreed with you. :-)

The critical case is using fsync before deleting the source file. If this is the default behavior and opening the containing directory fails, I wonder if the error from open should be ignored. A directory with write and search permission but no read permission is unusual, but it's still a bit odd to fail (de)compression if the directory cannot be opened. This is only a problem because Linux lacks O_SEARCH, which POSIX defines. Maybe such directories are unusual enough that it's OK to make xz fail when syncing is enabled. I know I said in an email that ignoring the error could make sense but I might be changing my mind to make this feature more robust.

It's also a question if --flush-timeout should sync by default. The feature was requested for some kind of log files where one wanted to be able to decompress everything received so far with a short delay. It depends on the specific use case if syncing after every flush is desired or not. It's such a special option that it perhaps is enough to document that one can enable syncing separately for it. Typically it's used when writing to stdout and then one needs to create the file and sync the directory before starting xz (like explained in the man page addition in my PR).

So, use fsync by default if not writing to stdout and --keep wasn't used (or implied). This is what your patch does. :-) An option is needed to disable the behavior because there are use cases where one compresses lots of small files which are easy to recreate so fsync isn't needed.

Should there be an option to enable fsync even when the input file isn't deleted? If --flush-timeout doesn't sync by default, that alone sounds like a reason to add such an option. But does it make sense in any other situation?

  • If xz is writing to stdout which is a regular file, the user can sync it after xz has finished. The --flush-timeout option is the only exception to this.

  • With xz --keep foo one can also sync after xz has finished.

Am I missing something? If not, perhaps --flush-timeout should sync by default. Then all we need is --no-sync instead of --synchronous=<auto|always|never>. I may try to update my code to something like that.

Getting feedback about topics like this is difficult, so I appreciate all comments. Thanks!

Larhzu added a commit that referenced this pull request Jan 4, 2025
xz's default behavior is to delete the input file after successful
compression or decompression (unless writing to standard output).
If the system crashes soon after the deletion, it is possible that
the newly written file has not yet hit the disk while the previous
delete operation might have. In that case neither the original file
nor the written file is available.

Call fsync() on the file. On POSIX systems, sync also the directory
where the file was created.

Add a new option --no-sync which disables fsync() usage. It can avoid
a (possibly significant) performance penalty when processing many
small files. It's fine to use --no-sync when one knows that the files
are easy to recreate or restore after a system crash.

Using fsync() after every flush initiated by --flush-timeout was
considered. It wasn't implemented at least for now.

  - --flush-timeout is typically used when writing to stdout. If stdout
    is a file, xz cannot (portably) sync the directory of the file.
    One would need to create the output file first, sync the directory,
    and then run xz with fsync() enabled.

  - If xz --flush-timeout output goes to a file, it's possible to use
    a separate script to sync the file, for example, once per minute
    while telling xz to flush more frequently.

  - Not supporting syncing with --flush-timeout was simpler.

Portability notes:

  - On systems that lack O_SEARCH (like Linux), "xz dir/file" will now
    fail if "dir" cannot be opened for reading. If "dir" still has
    write and search permissions (like d-wx------ in "ls -l"),
    previously xz would have been able to compress "dir/file" still.
    Now it only works if using --no-sync (or --keep or --stdout).

  - <libgen.h> and dirname() should be available on all POSIX systems,
    and aren't needed on non-POSIX systems.

  - fsync() is available on all POSIX systems. The directory syncing
    could be changed to fdatasync() although at least on ext4 it
    doesn't seem to make a performance difference in xz's usage.
    fdatasync() would need a build system check to support (old)
    special cases, for example, MINIX 3.3.0 doesn't have fdatasync()
    and Solaris 10 needs -lrt.

  - On native Windows, _commit() is used to replace fsync(). Directory
    syncing isn't done and shouldn't be needed. (In Cygwin, fsync() on
    directories is a no-op.)

  - DJGPP has fsync() for files. ;-)

Using fsync() was considered somewhere around 2009 and again in 2016 but
those times the idea was rejected. For comparison, GNU gzip 1.7 (2016)
added the option --synchronous which enables fsync().

Co-authored-by: Sebastian Andrzej Siewior <[email protected]>
Fixes: https://bugs.debian.org/814089
Link: https://www.mail-archive.com/[email protected]/msg00282.html
Closes: #151
@Larhzu
Member

Larhzu commented Jan 4, 2025

I created a new branch synchronous2. I tried adding syncing to --flush-timeout but the conditions become ugly and I'm not even sure if that use case needs syncing. So I will skip that for now and focus on sorting out the main issue.

@Larhzu
Member

Larhzu commented Jan 4, 2025

I know that the position of the filename in the error messages might be archaic but if the order is changed, then all messages should be revised for consistency. I suppose it's not worth doing.

@sebastianas
Contributor Author

sebastianas commented Jan 4, 2025 via email

@Larhzu
Member

Larhzu commented Jan 5, 2025 via email

@sebastianas
Contributor Author

sebastianas commented Jan 5, 2025 via email

@Larhzu Larhzu closed this in 2a9e91d Jan 5, 2025
@Larhzu
Member

Larhzu commented Jan 5, 2025 via email

@Larhzu
Member

Larhzu commented Jan 6, 2025 via email
