Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rocksdb v7.0.4 #55

Open
wants to merge 34 commits into
base: main
Choose a base branch
from
Open

Rocksdb v7.0.4 #55

wants to merge 34 commits into from

Conversation

matthewvon
Copy link
Member

Temp PR for getting original rocksdb v7.0.4 tagged for download

ajkr and others added 30 commits February 20, 2022 16:40
Summary: Pull Request resolved: facebook/rocksdb#9609

Reviewed By: riversand963

Differential Revision: D34370309

Pulled By: ajkr

fbshipit-source-id: 5fc9306439aefa4b2d61d847534ea6758c30b6a5
Summary:
We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by facebook/rocksdb#9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem.

Pull Request resolved: facebook/rocksdb#9637

Reviewed By: riversand963

Differential Revision: D34502233

Pulled By: ajkr

fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07
Summary:
In crash test with fault injection, we were seeing stack traces like the following:

```
facebook/rocksdb#3 0x00007f75f763c533 in __GI___assert_fail (assertion=assertion@entry=0x1c5b2a0 "end_offset >= start_offset", file=file@entry=0x1c580a0 "table/block_based/block_based_table_reader.cc", line=line@entry=3245,
function=function@entry=0x1c60e60 "virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller)") at assert.c:101
facebook/rocksdb#4 0x00000000010ea9b4 in rocksdb::BlockBasedTable::ApproximateSize (this=<optimized out>, start=..., end=..., caller=<optimized out>) at table/block_based/block_based_table_reader.cc:3224
facebook/rocksdb#5 0x0000000000be61fb in rocksdb::TableCache::ApproximateSize (this=0x60f0000161b0, start=..., end=..., fd=..., caller=caller@entry=rocksdb::kCompaction, internal_comparator=..., prefix_extractor=...) at db/table_cache.cc:719
facebook/rocksdb#6 0x0000000000c3eaec in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, v=<optimized out>, f=..., start=..., end=..., caller=<optimized out>) at ./db/version_set.h:850
facebook/rocksdb#7 0x0000000000c6ebc3 in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, options=..., v=v@entry=0x621000047500, start=..., end=..., start_level=start_level@entry=0, end_level=<optimized out>, caller=<optimized out>)
at db/version_set.cc:5657
facebook/rocksdb#8 0x000000000166e894 in rocksdb::CompactionJob::GenSubcompactionBoundaries (this=<optimized out>) at ./include/rocksdb/options.h:1869
facebook/rocksdb#9 0x000000000168c526 in rocksdb::CompactionJob::Prepare (this=this@entry=0x7f75f3ffcf00) at db/compaction/compaction_job.cc:546
```

The problem occurred in `ApproximateSize()` when the index `Seek()` for the first `ApproximateDataOffsetOf()` encountered an I/O error, while the second `Seek()` did not. In the old code that scenario caused `start_offset == data_size` , thus it was easy to trip the assertion that `end_offset >= start_offset`.

The fix is to set `start_offset == 0` when the first index `Seek()` fails, and `end_offset == data_size` when the second index `Seek()` fails. I doubt these give an "on average correct" answer for how this function is used, but I/O errors in index seeks are hopefully rare, it looked consistent with what was already there, and it was easier to calculate.

Pull Request resolved: facebook/rocksdb#9615

Test Plan:
run the repro command for a while and stopped seeing coredumps -

```
$ while !  ./db_stress --block_size=128 --cache_size=32768 --clear_column_family_one_in=0 --column_families=1 --continuous_verification_interval=0 --db=/dev/shm/rocksdb_crashtest --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --index_type=2 --iterpercent=10  --kill_random_test=18887 --max_key=1000000 --max_bytes_for_level_base=2048576 --nooverwritepercent=1 --open_files=-1 --open_read_fault_one_in=32 --ops_per_thread=1000000 --prefixpercent=5 --read_fault_one_in=0 --readpercent=45 --reopen=0 --skip_verifydb=1 --subcompactions=2 --target_file_size_base=524288 --test_batches_snapshots=0 --value_size_mult=32 --write_buffer_size=524288 --writepercent=35  ; do : ; done
```

Reviewed By: pdillinger

Differential Revision: D34383069

Pulled By: ajkr

fbshipit-source-id: fac26c3b20ea962e75387515ba5f2724dc48719f
Summary:
PR facebook/rocksdb#9557 introduced a race condition between manual compaction
foreground thread and background compaction thread.
This PR adds the ability to really unschedule manual compaction from
thread-pool queue by differentiate tag name for manual compaction and
other tasks.
Also fix an issue that db `close()` didn't cancel the manual compaction thread.

Pull Request resolved: facebook/rocksdb#9625

Test Plan: unittest not hang

Reviewed By: ajkr

Differential Revision: D34410811

Pulled By: jay-zhuang

fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
Summary:
**Context:**
As part of facebook/rocksdb#6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](facebook/rocksdb@e66199d#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396). Before re-enabling file deletion, it `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `version_` in the recovery process.

However, this is not necessarily true due to `s = error_handler_.ClearBGError();` happening before that assertion can unblock some foreground thread by [`EventHelpers::NotifyOnErrorRecoveryEnd()`](https://github.com/facebook/rocksdb/blob/3122cb435875d720fc3d23a48eb7c0fa89d869aa/db/error_handler.cc#L552-L553) as part of the `ClearBGError()`. That foreground thread can do whatever it wants including closing/reopening the db and clean up that same `versions_`.

As a consequence,  `assert(versions_->io_status().ok());`, will access `io_status()` of a nullptr and test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok()); ` gets excuted in the background thread, then we can see error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up foreground, which I think is right before we LOG `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - As the context,  the orignal facebook/rocksdb#3997  introducing `DBImpl::Resume()` calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` while the later facebook/rocksdb#6949 distances these two calls a bit.
   - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);` at least syntax-wise since these two functions are orthogonal. And it also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which basically is resetting some state variables.
- In addition, to preserve the previous behavior of  facebook/rocksdb#6949 where status of re-enabling file deletion is not taken account into the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s`
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` more clear, I separated it into its own if-block.

Pull Request resolved: facebook/rocksdb#9496

Test Plan:
- Manually reproduce the assertion failure in`DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting sleep like below so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e, testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
facebook/rocksdb#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
facebook/rocksdb#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
facebook/rocksdb#3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
Summary:
The plain data length may not be big enough if the compression actually expands data. So use deflateBound() to get the upper limit on the compressed output before deflate().

Pull Request resolved: facebook/rocksdb#9572

Reviewed By: riversand963

Differential Revision: D34326475

Pulled By: ajkr

fbshipit-source-id: 4b679cb7a83a62782a127785b4d5eb9aa4646449
Summary:
This bug affects use cases that meet the following conditions
- (has only the default column family or disables WAL) and
- has at least one event listener
- atomic flush is NOT affected.

If the above conditions meet, then RocksDB can release the db mutex before picking all the
existing memtables to flush. In the meantime, a snapshot can be created and db's sequence
number can still be incremented. The upcoming flush will ignore this snapshot.
A later read using this snapshot can return incorrect result.

To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid
creating snapshots during this interval.

Pull Request resolved: facebook/rocksdb#9648

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D34555456

Pulled By: riversand963

fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
Summary:
Fix g++ -march=native detection and reenable s390x in travis

This PR fixes s390x assembler messages:
```
Error: invalid switch -march=z14
Error: unrecognized option -march=z14
```

The s390x travis build was failing with gcc-7 because the assembler on
ubuntu 16.04 is too old to recognize the z14 model so it doesn't work
with -march=native on a z14 machine. It fixes the check for the
-march=native flag so that the assembler will get called and correctly
fail on ubuntu 16.04 which will cause the build to fall back to
-march=z196 which works.

The other changes are needed so builds work more consistently on
s390x:

1. Set make parallelism to 1 for s390x: The default was 4 previously
but I saw frequent internal compiler errors on travis probably due to
low resources. The `platform_dependent` job works more consistently
but is roughly 10 minutes slower although it varies.
2. Remove status_checked jobs, as we are relying on CircleCI for
these now and do not really need platform coverage on them.

Fixes facebook/rocksdb#9524

Pull Request resolved: facebook/rocksdb#9631

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D34553989

Pulled By: pdillinger

fbshipit-source-id: a6e3a7276446721c4c0bebc4ed217c2ca2b53f11
Summary:
Addresses the problems described in facebook/rocksdb#9254 (comment) and facebook/rocksdb#9254 (comment) that have blocked a RocksJava release

**NOTE** Also needs to be ported to 6.29.fb branch.

Pull Request resolved: facebook/rocksdb#9662

Reviewed By: ajkr

Differential Revision: D34689200

Pulled By: pdillinger

fbshipit-source-id: c62fe34c54f05be5a00ee1daec8ec7454baa5eb8
Summary:
Fixes facebook/rocksdb#9560. Only use popcnt intrinsic when HAVE_SSE42 is set. Also avoid setting it based on compiler test in portable builds because such test will pass on MSVC even without proper arch flags (ref: https://devblogs.microsoft.com/oldnewthing/20201026-00/?p=104397).

Pull Request resolved: facebook/rocksdb#9680

Test Plan: verified the combinations of -DPORTABLE and -DFORCE_SSE42 produce expected compiler flags on Linux. Verified MSVC build using PORTABLE=1 (in CircleCI) does not set HAVE_SSE42.

Reviewed By: pdillinger

Differential Revision: D34739033

Pulled By: ajkr

fbshipit-source-id: d10456f3392945fc3e59430a1777840f7b60b276
Summary:
Timer crash when multiple DB instances doing heavy DB open and close
operations concurrently. Which is caused by adding a timer task with
smaller timestamp than the current running task. Fix it by moving the
getting new task timestamp part within timer mutex protection.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled

Pull Request resolved: facebook/rocksdb#9656

Reviewed By: ajkr

Differential Revision: D34626296

Pulled By: jay-zhuang

fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
Summary:
facebook/rocksdb#9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.

Pull Request resolved: facebook/rocksdb#9659

Reviewed By: ajkr

Differential Revision: D34651820

Pulled By: jay-zhuang

fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
Summary:
The recommendation can be found e. g. [here](https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ns-winioctl-storage_property_query).

While `<windows.h>` transitively includes `<winioctl.h>` by default, this can be switched off by `/DWIN32_LEAN_AND_MEAN` which forces the user to include-what-you-use.

Pull Request resolved: facebook/rocksdb#9612

Reviewed By: riversand963

Differential Revision: D34845629

Pulled By: ajkr

fbshipit-source-id: 1ef9273074e3d84864c6833a7de6eb9df81e29d9
…(#9622)

Summary:
There was a mistake that incorrectly cast SstPartitionerFactory (missed shared pointer). It worked for database (correct cast), but not for family. Trying to set it in family has caused Access violation.

I have also added test and improved it. Older version was passing even without sst partitioner which is weird, because on Level1 we had two SST files with same key "aaaa1". I was not sure if it is a new feature and changed it to overlaping keys "aaaa0" - "aaaa2" overlaps "aaaa1".

Pull Request resolved: facebook/rocksdb#9622

Reviewed By: ajkr

Differential Revision: D34871968

Pulled By: pdillinger

fbshipit-source-id: a08009766da49fc198692a610e8beb19caf737e6
…(#9258)

Summary:
fix facebook/rocksdb#9255

Pull Request resolved: facebook/rocksdb#9258

Reviewed By: pdillinger

Differential Revision: D34879684

Pulled By: ajkr

fbshipit-source-id: 5934f4b7524dc27ecdf1430e0456a0fc02958fc7
Summary:
In the original code, the value of `NO_FILE_OPENS` corresponding to the Ticker item will be increased regardless of whether the file is successfully opened or not. Even counts are repeated, which can lead to skewed counts.

Pull Request resolved: facebook/rocksdb#9677

Reviewed By: jay-zhuang

Differential Revision: D34725733

Pulled By: ajkr

fbshipit-source-id: 841234ed03802c0105fd2107d82a740265ead576
Summary:
Bloom filters generated by pre-7.0 releases are not read by
7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name()
in facebook/rocksdb#9590. This can severely impact read performance and read I/O on
upgrade or downgrade with existing DB, but not data correctness.

To fix, we go back using the old, unified name in SST metadata but (for
a while anyway) recognize the aliases that could be generated by early
7.0.x releases. This unfortunately requires a public API change to avoid
interfering with all the good changes from facebook/rocksdb#9590, but the API change
only affects users with custom FilterPolicy, which should be very few.

Pull Request resolved: facebook/rocksdb#9736

Test Plan:
manual

Generate DBs with
```
./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
```
and similar. Compare with
```
for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
```

Results:
```
Testing 6.29 on 6.29:
readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
Testing 6.29 on 7.0:
readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
Testing 6.29 on fixed:
readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
Testing 7.0 on 6.29:
readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
Testing 7.0 on 7.0:
readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
Testing 7.0 on fixed:
readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
Testing fixed on 6.29:
readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
Testing fixed on 7.0:
readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
Testing fixed on fixed:
readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
```

Just paying attention to order of magnitude of ops/sec (short test
durations, lots of noise), it's clear that with the fix we can read <= 6.29
& >= 7.0 at full speed, where neither 6.29 nor 7.0 can on both. And 6.29
release can properly read fixed DB at full speed.

Reviewed By: siying, ajkr

Differential Revision: D35057844

Pulled By: pdillinger

fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050
Summary: Also clarified in HISTORY.md that this patch release breaks
binary compatibility (because of FilterPolicy vtable)
Summary:
```
conversion from 'size_t' to 'uLong', possible loss of data
```

Fix facebook/rocksdb#9688

Pull Request resolved: facebook/rocksdb#9699

Reviewed By: riversand963

Differential Revision: D34901116

Pulled By: jay-zhuang

fbshipit-source-id: 969148a7a8c023449bd85055a1f0eec71d0a9b3f
Summary:
In facebook/rocksdb#9659, when `DisableManualCompaction()` is issued, the foreground
manual compaction thread does not have to wait background compaction
thread to finish. Which could be a problem that the user re-enable
manual compaction with `EnableManualCompaction()`, it may re-enable the
BG compaction which supposed be cancelled.
This patch makes the FG compaction wait on
`manual_compaction_state.done`, which either be set by BG compaction or
Unschedule callback. Then when FG manual compaction thread returns, it
should not have BG compaction running. So shared_ptr is no longer needed
for `manual_compaction_state`.

Pull Request resolved: facebook/rocksdb#9694

Test Plan: a StressTest and unittest

Reviewed By: ajkr

Differential Revision: D34885472

Pulled By: jay-zhuang

fbshipit-source-id: e6476175b43e8c59cd49f5c09241036a0716c274
Summary:
SMB mounts do not support hard links. The ENOTSUP error code is
returned, which should be interpreted by PosixFileSystem as
IOStatus::NotSupported().

Pull Request resolved: facebook/rocksdb#9657

Reviewed By: mrambacher

Differential Revision: D34634783

Pulled By: anand1976

fbshipit-source-id: 0d57f5b2e6118e4c20e9ed1a293327428c3aecac
…9686)

Summary:
Pull Request resolved: facebook/rocksdb#9686

According to https://www.cplusplus.com/reference/deque/deque/back/,
"
The container is accessed (neither the const nor the non-const versions modify the container).
The last element is potentially accessed or modified by the caller. Concurrently accessing or modifying other elements is safe.
"

Also according to https://www.cplusplus.com/reference/deque/deque/pop_front/,
"
The container is modified.
The first element is modified. Concurrently accessing or modifying other elements is safe (although see iterator validity above).
"
In RocksDB, we never pop the last element of `DBImpl::alive_log_files_`. We have been
exploiting this fact and the above two properties when ensuring correctness when
`DBImpl::alive_log_files_` may be accessed concurrently. Specifically, it can be accessed
in the write path when db mutex is released. Sometimes, the log_mute_ is held. It can also be accessed in `FindObsoleteFiles()`
when db mutex is always held. It can also be accessed
during recovery when db mutex is also held.
Given the fact that we never pop the last element of alive_log_files_, we currently do not
acquire additional locks when accessing it in `WriteToWAL()` as follows
```
alive_log_files_.back().AddSize(log_entry.size());
```

This is problematic.

Check source code of deque.h
```
  back() _GLIBCXX_NOEXCEPT
  {
__glibcxx_requires_nonempty();
...
  }

  pop_front() _GLIBCXX_NOEXCEPT
  {
...
  if (this->_M_impl._M_start._M_cur
      != this->_M_impl._M_start._M_last - 1)
    {
      ...
      ++this->_M_impl._M_start._M_cur;
    }
  ...
  }
```

`back()` will actually call `__glibcxx_requires_nonempty()` first.
If `__glibcxx_requires_nonempty()` is enabled and not an empty macro,
it will call `empty()`
```
bool empty() {
return this->_M_impl._M_finish == this->_M_impl._M_start;
}
```
You can see that it will access `this->_M_impl._M_start`, racing with `pop_front()`.
Therefore, TSAN will actually catch the bug in this case.

To be able to use TSAN on our library and unit tests, we should always coordinate
concurrent accesses to STL containers properly.

We need to pass information about db mutex and log mutex into `WriteToWAL()`, otherwise
it's impossible to know which mutex to acquire inside the function.

To fix this, we can catch the tail of `alive_log_files_` by reference, so that we do not have to call `back()` in `WriteToWAL()`.

Reviewed By: pdillinger

Differential Revision: D34780309

fbshipit-source-id: 1def9821f0c437f2736c6a26445d75890377889b
Summary:
PR9686 makes `WriteToWAL()` call `assert(...!=rend())` while not holding
db mutex or log mutex. Another thread may concurrently call
`pop_front()`, causing race condition.
To fix, assert only if mutex is held.

Pull Request resolved: facebook/rocksdb#9698

Test Plan: COMPILE_WITH_TSAN=1 make check

Reviewed By: jay-zhuang

Differential Revision: D34898535

Pulled By: riversand963

fbshipit-source-id: 1ddfa5bf1b6ae8d409cab6ff6e1b5321c6803da9
…n opening Posix WritableFile (#9685)

Summary:
Pull Request resolved: facebook/rocksdb#9685

Our TSAN reports a race condition as follows when running test
```
gtest-parallel -r 100 ./external_sst_file_test --gtest_filter=ExternalSSTFileTest.MultiThreaded
```
leads to the following

```
WARNING: ThreadSanitizer: data race (pid=2683148)
  Write of size 1 at 0x556fede63340 by thread T7:
    #0 rocksdb::(anonymous namespace)::PosixFileSystem::OpenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, bool, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/fs_posix.cc:334 (external_sst_file_test+0xb61ac4)
    #1 rocksdb::(anonymous namespace)::PosixFileSystem::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/fs_posix.cc:382 (external_sst_file_test+0xb5ba96)
    #2 rocksdb::CompositeEnv::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::WritableFile, std::default_delete<rocksdb::WritableFile> >*, rocksdb::EnvOptions const&) internal_repo_rocksdb/repo/env/composite_env.cc:334 (external_sst_file_test+0xa6ab7f)
    #3 rocksdb::EnvWrapper::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::WritableFile, std::default_delete<rocksdb::WritableFile> >*, rocksdb::EnvOptions const&) internal_repo_rocksdb/repo/include/rocksdb/env.h:1428 (external_sst_file_test+0x561f3e)
Previous read of size 1 at 0x556fede63340 by thread T4:
    #0 rocksdb::(anonymous namespace)::PosixFileSystem::OpenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, bool, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/fs_posix.cc:328 (external_sst_file_test+0xb61a70)
    #1 rocksdb::(anonymous namespace)::PosixFileSystem::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator
...
```

Fix by making sure the following block gets executed only once:
```
      if (!checkedDiskForMmap_) {
        // this will be executed once in the program's lifetime.
        // do not use mmapWrite on non ext-3/xfs/tmpfs systems.
        if (!SupportsFastAllocate(fname)) {
          forceMmapOff_ = true;
        }
        checkedDiskForMmap_ = true;
      }
```

Reviewed By: pdillinger

Differential Revision: D34780308

fbshipit-source-id: b761f66b24c8b5b8389d86ea371c8542b8d869d5
Summary:
On CircleCI MacOS instances, we have been seeing the following assertion error:
```
Assertion failed: (alive_log_files_tail_ == alive_log_files_.rbegin()), function WriteToWAL, file /Users/distiller/project/db/db_impl/db_impl_write.cc, line 1213.
Received signal 6 (Abort trap: 6)
#0   0x1
facebook/rocksdb#1   abort (in libsystem_c.dylib) + 120
facebook/rocksdb#2   err (in libsystem_c.dylib) + 0
facebook/rocksdb#3   rocksdb::DBImpl::WriteToWAL(rocksdb::WriteBatch const&, rocksdb::log::Writer*, unsigned long long*, unsigned long long*, rocksdb::Env::IOPriority, bool, bool) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1213)
facebook/rocksdb#4   rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long long*, bool, bool, unsigned long long) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1251)
facebook/rocksdb#5   rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long long*, unsigned long long, bool, unsigned long long*, unsigned long, rocksdb::PreReleaseCallback*) (in librocksdb.7.0.0.dylib) (db_impl_	rite.cc:421)
facebook/rocksdb#6   rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:109)
facebook/rocksdb#7   rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:2159)
facebook/rocksdb#8   rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:37)
facebook/rocksdb#9   rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db.h:382)
facebook/rocksdb#10  rocksdb::DBBasicTestWithTimestampPrefixSeek_IterateWithPrefix_Test::TestBody() (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:2926)
facebook/rocksdb#11  void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899)
facebook/rocksdb#12  void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935)
facebook/rocksdb#13  testing::Test::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:3980)
facebook/rocksdb#14  testing::TestInfo::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4153)
facebook/rocksdb#15  testing::TestCase::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4266)
facebook/rocksdb#16  testing::internal::UnitTestImpl::RunAllTests() (in db_with_timestamp_basic_test) (gtest-all.cc:6632)
facebook/rocksdb#17  bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899)
facebook/rocksdb#18  bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935)
facebook/rocksdb#19  testing::UnitTest::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:6242)
facebook/rocksdb#20  RUN_ALL_TESTS() (in db_with_timestamp_basic_test) (gtest.h:22110)
facebook/rocksdb#21  main (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:3150)
facebook/rocksdb#22  start (in libdyld.dylib) + 1
```

It's likely caused by concurrent, unprotected access to the deque, even though `back()` is never popped,
and we are comparing `rbegin()` with a cached `riterator`. To be safe, do the comparison only if we have mutex.

Pull Request resolved: facebook/rocksdb#9717

Test Plan:
One example
Ssh to one CircleCI MacOS instance.
```
gtest-parallel -r 1000 -w 8 ./db_test --gtest_filter=DBTest.FlushesInParallelWithCompactRange
```

Reviewed By: pdillinger

Differential Revision: D34990696

Pulled By: riversand963

fbshipit-source-id: 62dd48ae6fedbda53d0a64d73de9b948b4c26eee
…… (#9471)

Summary:
c api release snapshot will core dump when enable inplace_update_support and create snapshot

Pull Request resolved: facebook/rocksdb#9471

Reviewed By: akankshamahajan15

Differential Revision: D34965103

Pulled By: riversand963

fbshipit-source-id: c3aeeb9ea7126c2eda1466102794fecf57b6ab77
Summary:
Multiplier here should be 1e6 to get microseconds.

Pull Request resolved: facebook/rocksdb#9695

Reviewed By: ajkr

Differential Revision: D34897086

Pulled By: jay-zhuang

fbshipit-source-id: 9c1d0811ea740ba0a007edc2da199edbd000b88b
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: facebook/rocksdb#9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
pdillinger and others added 4 commits March 29, 2022 12:47
Summary:
Although ColumnFamilySet comments say that DB mutex can be
freed during iteration, as long as you hold a ref while releasing DB
mutex, this is not quite true because UnrefAndTryDelete might delete cfd
right before it is needed to get ->next_ for the next iteration of the
loop.

This change solves the problem by making a wrapper class that makes such
iteration easier while handling the tricky details of UnrefAndTryDelete
on the previous cfd only after getting next_ in operator++.

FreeDeadColumnFamilies should already have been obsolete; this removes
it for good. Similarly, ColumnFamilySet::iterator doesn't need to check
for cfd with 0 refs, because those are immediately deleted.

Pull Request resolved: facebook/rocksdb#9730

Test Plan:
was reported with ASAN on unit tests like
DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching

Reviewed By: ltamasi

Differential Revision: D35038143

Pulled By: pdillinger

fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9
Summary:
After commit [d642c60](facebook/rocksdb@d642c60), the stats `READ_BLOCK_COMPACTION_MICROS` cannot record any compaction read duration, and it always report zero.

This PR targets to distinguish `READ_BLOCK_COMPACTION_MICROS` with `READ_BLOCK_GET_MICROS` so that `READ_BLOCK_COMPACTION_MICROS` could record the correct stats.

Pull Request resolved: facebook/rocksdb#9722

Reviewed By: ajkr

Differential Revision: D35021870

Pulled By: jay-zhuang

fbshipit-source-id: f1a804994265e51465de64c2a08f2e0eeb6fc5a3
Summary:
Certain STLs use raw pointers and ADL does not work for them.

Pull Request resolved: facebook/rocksdb#9608

Reviewed By: ajkr

Differential Revision: D34583012

Pulled By: riversand963

fbshipit-source-id: 7de6bbc8a080c3e7243ce0d758fe83f1663168aa
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.