Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

f3read crash #211

Open
axet opened this issue Aug 4, 2023 · 9 comments
Open

f3read crash #211

axet opened this issue Aug 4, 2023 · 9 comments

Comments

@axet
Copy link

axet commented Aug 4, 2023

Creating file 52.h2w ... OK!                           
Creating file 53.h2w ... OK!                           
Creating file 54.h2w ... OK!                           
Creating file 55.h2w ... Write failure: Input/output error

WARNING:
The write error above may be due to your memory card overheating
under constant, maximum write rate. You can test this hypothesis
touching your memory card. If it is hot, you can try f3write
again, once your card has cooled down, using parameter --max-write-rate=2048
to limit the maximum write rate to 2MB/s, or another suitable rate.

Creating file 56.h2w ... f3write: Can't create file /mnt/56.h2w: Input/output error
F3 read 8.0
Copyright (C) 2010 Digirati Internet LTDA.
This is free software; see the source for copying conditions.

                  SECTORS      ok/corrupted/changed/overwritten
Validating file 1.h2w ... f3read: f3read.c:260: validate_file: Assertion `!fdatasync(fd)' failed.
./bin/flashtest: line 72: 23262 Aborted                 (core dumped) ${STDBUF[@]} f3read ${PP[@]} "$M"
@AltraMayor
Copy link
Owner

What's your operating system, @axet?

If I write a pull request for you to test, would you know how to apply the pull request to your local copy of F3, compile it, and test it?

@axet
Copy link
Author

axet commented Aug 19, 2023

Debian 12. I can test and apply patches. But it is hard to reproduce. It only happens on bad SD cards, and I already dispose all of them.

AltraMayor added a commit that referenced this issue Sep 4, 2023
Errors coming from fdatasync() are unexpected, but according to
issue #211 they happen.
This patch just reports the error and general guidance to users.
@AltraMayor
Copy link
Owner

I've added code to the repository to report an error message when this problem comes up. It's an expected error, so I'll watch for what the error messages look like to find a possible root cause and eventually improve the error message.

@axet
Copy link
Author

axet commented Sep 4, 2023

What is the purpose to call fdatasync on readonly file descriptor? Can this call just be removed?

@AltraMayor
Copy link
Owner

fdatasync() is called to satisfy a kernel demand to call posix_fadvise() next; see the manual for details.

One could remove fdatasync() and posix_fadvise() from f3read since they are there to guide the operating system on how to manage the associated file system cache. By calling these functions, though, f3read obtains a better read-speed measurement, and minimizes the change of reading from the file system cache instead of directly from the drive.

In my experience, removing those functions likely won't avoid the underlying problem, but just going to force the kernel to raise the error at another system call.

@axet
Copy link
Author

axet commented Sep 4, 2023

Dirty pages from man page are pages can only occurs when writing file, you working it read only. You wrote the file, closed it. No unwritten cache left. Then open it read-only. No dirty pages are possible here.

But I think you right - it will crash later on probably due to posix_fadvise() call. Yet I think if we can continue, we shall continue testing next files, do not exit(saved_errno) on first error.

At this point we know flash drive has corrupted metadata (few sectors at start of the drive with filesystem metadata) and probably files list is partially corrupted. But we still have 99% surface to check.

@AltraMayor
Copy link
Owner

Dirty pages from man page are pages can only occurs when writing file, you working it read only. You wrote the file, closed it. No unwritten cache left. Then open it read-only. No dirty pages are possible here.

Most users run f3read immediately after running f3write. The kernel doesn't have to flush its file system when f3write ends, so f3read advises the kernel to flush the cache associated with the file that's going to be read and to remove the file from the cache, so the reads come directly from the drive instead of the cache.

But I think you right - it will crash later on probably due to posix_fadvise() call. Yet I think if we can continue, we shall continue testing next files, do not exit(saved_errno) on first error.

Continuing assumes that something else will work, but errors can compound. Especially in cases like this one in which there shouldn't be an error to start with. I expect that we'll learn enough about this error with time, so we'll find the root cause. The overheating error that f3write reports was diagnosticated in a similar way.

At this point we know flash drive has corrupted metadata (few sectors at start of the drive with filesystem metadata) and probably files list is partially corrupted. But we still have 99% surface to check.

If you want to investigate your fake drive, take a look at f3brew; it doesn't rely on a file system to work.

@axet
Copy link
Author

axet commented Sep 5, 2023

fclose man page explicitly saying: "Any unwritten buffered data are flushed to the OS. Any unread buffered data are discarded. ". You do not even need to call fsync before fclose for that reason, for data get written to the drive. It is safe to call fclose on file descriptor to make all data written. Even if you crash your OS immediately after fclose, all data should be fine (but not guaranteed).

fdatasync working with specific fd, freshly opened, I do not think kernel allow to work / flush cache pages from different application / file descriptor, even if here some dirty pages left. Most likely it will do nothing.

I think what causing the crash is last access attribute, which have been written during fdatasync call (I know it not suppose to do that). So it is not important, and we shall continue.

I'm using f3brew at first all the time. Question is, is the f3read suppose to handle errors better, assuming we are working with broken / bad hardware or not. I think it should assume bad hardware and not stop on simple errors. I would add more error handling like skipping files and showing read errors. Instead of stopping the app.

EDIT: note about fdatasync attributes, and f3read error stability

@AltraMayor
Copy link
Owner

fclose man page explicitly saying: "Any unwritten buffered data are flushed to the OS. Any unread buffered data are discarded. ". You do not even need to call fsync before fclose for that reason, for data get written to the drive. It is safe to call fclose on file descriptor to make all data written. Even if you crash your OS immediately after fclose, all data should be fine (but not guaranteed).

fdatasync working with specific fd, freshly opened, I do not think kernel allow to work / flush cache pages from different application / file descriptor, even if here some dirty pages left. Most likely it will do nothing.

I think what causing the crash is last access attribute, which have been written during fdatasync call (I know it not suppose to do that). So it is not important, and we shall continue.

Your hypothesis makes sense, but I'm not changing the code without further evidence that it would be okay to remove fdatasync(). Right now, we don't know if your hypothesis explains the problem at hand. If your hypothesis does not explain the problem, we'd need to find out why fdatasync() returns an error if it does nothing.

I'm using f3brew at first all the time. Question is, is the f3read suppose to handle errors better, assuming we are working with broken / bad hardware or not. I think it should assume bad hardware and not stop on simple errors. I would add more error handling like skipping files and showing read errors. Instead of stopping the app.

All code in F3 does a good job handling errors. I'm reluctant to quickly generalize how to handle new errors because fake drives and dying drives do behave in non-logical ways. As evidence accumulates on a given error, the path forward becomes clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants