Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-35049: Improve adaptive hash index scalability #3562

Draft
wants to merge 4 commits into
base: 10.6
Choose a base branch
from

Conversation

dr-m
Copy link
Contributor

@dr-m dr-m commented Oct 1, 2024

  • The Jira issue number for this PR is: MDEV-35049

Description

MEM_HEAP_BTR_SEARCH: Remove. Let us handle this special type of mem_heap_t allocations in the only compilation unit, btr0sea.cc, specifically, btr_search_sys_t::partition::insert(), which replaces the function ha_insert_for_fold().

mem_block_info_t::ahi_block: Replaces free_block. This caches one buffer page for use in adaptive hash index allocations. This is protected by btr_search_sys_t::partition::latch. It only is Atomic_relaxed because btr_search_free_space() is following a pattern of test, lock, and test.

btr_search_check_free_space(): Protect the ahi_block with a shared AHI partition latch instead of an exclusive one. We must recheck btr_search_enabled after acquiring the latch in order to avoid a race condition with btr_search_disable(). Using a shared latch instead of an exclusive one should reduce contention with btr_search_guess_on_hash() and other operations when running with innodb_adaptive_hash_index=ON.

Release Notes

The performance of innodb_adaptive_hash_index with larger numbers of threads was improved at high concurrency.

How can this PR be tested?

./mtr --parallel=auto --big-test --force --mysqld=--loose-innodb-adaptive-hash-index=ON --max-test-fail=0

I tested this with and without cmake -DWITH_INNODB_AHI=OFF.

Using the innodb-hashtest.sh I still see some waits for an AHI partition latch. Here are all functions that exceed 1% of perf record samples:

  27,97%  mariadbd  mariadbd             [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, mtr_t*)
  21,10%  mariadbd  mariadbd             [.] ssux_lock_impl<true>::rd_wait()
   2,93%  mariadbd  mariadbd             [.] btr_search_sys_t::partition::insert(unsigned long, unsigned char const*)
   2,89%  mariadbd  mariadbd             [.] row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)
   2,17%  mariadbd  mariadbd             [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
   1,92%  mariadbd  [kernel.kallsyms]    [k] psi_group_change
   1,75%  mariadbd  [kernel.kallsyms]    [k] try_to_wake_up
   1,47%  mariadbd  mariadbd             [.] btr_search_drop_page_hash_index(buf_block_t*, bool)
   1,36%  mariadbd  mariadbd             [.] ssux_lock_impl<true>::wr_wait(unsigned int)
   1,34%  mariadbd  mariadbd             [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
   1,31%  mariadbd  mariadbd             [.] ha_delete_hash_node(hash_table_t*, mem_block_info_t*, ha_node_t*)
   1,31%  mariadbd  mariadbd             [.] Row_sel_get_clust_rec_for_mysql::operator()(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned short**, mem_block_info_t**, dtuple_t**, mtr_t*)
   1,22%  mariadbd  mariadbd             [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
   1,16%  mariadbd  mariadbd             [.] evaluate_join_record(JOIN*, st_join_table*, int)
   1,15%  mariadbd  mariadbd             [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned short const*, unsigned long, unsigned long*)
   1,06%  mariadbd  [vdso]               [.] __vdso_gettimeofday

It should be noted that conflicts are inevitable under this workload, because ha_delete_hash_node() is covered by an exclusive AHI partition latch.

I repeated the experiment with an additional patch that disables the spin loops in this subsystem:

diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
index 9fb5bb4597a..f8c19922729 100644
--- a/storage/innobase/btr/btr0cur.cc
+++ b/storage/innobase/btr/btr0cur.cc
@@ -2620,7 +2620,7 @@ btr_cur_optimistic_insert(
 		ut_ad(flags == BTR_NO_LOCKING_FLAG);
 	} else if (index->table->is_temporary()) {
 	} else {
-		srw_spin_lock* ahi_latch = btr_search_sys.get_latch(*index);
+		auto ahi_latch = btr_search_sys.get_latch(*index);
 		if (!reorg && cursor->flag == BTR_CUR_HASH) {
 			btr_search_update_hash_node_on_insert(
 				cursor, ahi_latch);
@@ -3402,7 +3402,7 @@ btr_cur_update_in_place(
 
 #ifdef BTR_CUR_HASH_ADAPT
 	{
-		srw_spin_lock* ahi_latch = block->index
+		auto ahi_latch = block->index
 			? btr_search_sys.get_latch(*index) : NULL;
 		if (ahi_latch) {
 			/* TO DO: Can we skip this if none of the fields
diff --git a/storage/innobase/btr/btr0sea.cc b/storage/innobase/btr/btr0sea.cc
index 968ca1993d2..4fd63a4c31e 100644
--- a/storage/innobase/btr/btr0sea.cc
+++ b/storage/innobase/btr/btr0sea.cc
@@ -34,6 +34,8 @@ Created 2/17/1996 Heikki Tuuri
 #include "btr0btr.h"
 #include "srv0mon.h"
 
+#define srw_spin_lock srw_lock
+
 /** Is search system enabled.
 Search system is protected by array of latches. */
 char		btr_search_enabled;
diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc
index be721a51f3c..6368ec62909 100644
--- a/storage/innobase/handler/handler0alter.cc
+++ b/storage/innobase/handler/handler0alter.cc
@@ -5898,7 +5898,7 @@ static bool innobase_instant_try(
 #ifdef BTR_CUR_HASH_ADAPT
 	/* Acquire the ahi latch to avoid a race condition
 	between ahi access and instant alter table */
-	srw_spin_lock* ahi_latch = btr_search_sys.get_latch(*index);
+	auto ahi_latch = btr_search_sys.get_latch(*index);
 	ahi_latch->wr_lock(SRW_LOCK_CALL);
 #endif /* BTR_CUR_HASH_ADAPT */
 	const bool metadata_changed = ctx->instant_column();
diff --git a/storage/innobase/include/btr0sea.h b/storage/innobase/include/btr0sea.h
index a8d746bf8ce..8feb892395a 100644
--- a/storage/innobase/include/btr0sea.h
+++ b/storage/innobase/include/btr0sea.h
@@ -36,6 +36,8 @@ Created 2/17/1996 Heikki Tuuri
 extern mysql_pfs_key_t btr_search_latch_key;
 #endif /* UNIV_PFS_RWLOCK */
 
+#define srw_spin_lock srw_lock
+
 #define btr_search_sys_create() btr_search_sys.create()
 #define btr_search_sys_free() btr_search_sys.free()
 
@@ -354,6 +356,7 @@ again set this much timeout. This is to reduce contention. */
 #define BTR_SEA_TIMEOUT			10000
 #endif /* BTR_CUR_HASH_ADAPT */
 
+#undef srw_spin_lock
 #include "btr0sea.inl"
 
 #endif

With this additional patch, the spin loop is gone:

  28,01%  mariadbd  mariadbd             [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, mtr_t*)
  19,94%  mariadbd  mariadbd             [.] btr_search_sys_t::partition::insert(unsigned long, unsigned char const*)
  15,99%  mariadbd  mariadbd             [.] btr_search_drop_page_hash_index(buf_block_t*, bool)
  15,03%  mariadbd  mariadbd             [.] ha_delete_hash_node(hash_table_t*, mem_block_info_t*, ha_node_t*)
   1,25%  mariadbd  mariadbd             [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
   1,23%  mariadbd  mariadbd             [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
   0,88%  mariadbd  mariadbd             [.] row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)
   0,86%  mariadbd  mariadbd             [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
   0,78%  mariadbd  libc.so.6            [.] __memcmp_avx2_movbe
   0,66%  mariadbd  mariadbd             [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned short const*, unsigned long, unsigned long*)
   0,52%  mariadbd  mariadbd             [.] cmp_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long)
   0,52%  mariadbd  mariadbd             [.] mtr_memo_slot_t::release() const
   0,47%  mariadbd  mariadbd             [.] trx_purge(unsigned long, unsigned long)
   0,45%  mariadbd  mariadbd             [.] ssux_lock_impl<false>::rd_wait()
   0,44%  mariadbd  mariadbd             [.] btr_search_check_guess(btr_cur_t*, bool, dtuple_t const*, unsigned long) [clone .constprop.0]
   0,42%  mariadbd  mariadbd             [.] mtr_t::memset(buf_block_t const*, unsigned long, unsigned long, unsigned char)
   0,41%  mariadbd  [vdso]               [.] __vdso_gettimeofday

The ssux_lock_impl<false>::rd_wait() above will involve context switches due to futex system calls.

This needs to be tested with a larger workload that originally reproduced the scalability issue.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This should be directly applicable to 10.5, but the patch is currently is based on the 10.6 branch. For 10.5, we might also limit the change to btr_search_check_free_space_in_heap().

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m self-assigned this Oct 1, 2024
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@dr-m dr-m marked this pull request as ready for review October 1, 2024 13:27
@dr-m dr-m marked this pull request as draft October 1, 2024 13:34
@dr-m
Copy link
Contributor Author

dr-m commented Oct 1, 2024

Sorry, this needs some more work:

#17 0x00007f78d3e42592 in __assert_fail (assertion=0x5617ede15ee8 "(table->freed_indexes).count == 0", file=0x5617ede15da8 "/mariadb/10.6/storage/innobase/dict/dict0mem.cc", line=0xcf, function=0x5617ede15ea0 "void dict_mem_table_free(dict_table_t*)") at ./assert/assert.c:103
#18 0x00005617ed7cc4fc in dict_mem_table_free (table=0x7f7894290990) at /mariadb/10.6/storage/innobase/dict/dict0mem.cc:207
#19 0x00005617ed5e127f in row_import_cleanup (prebuilt=0x7f789428d480, err=DB_SUCCESS, fts_table=0x7f7894290990) at /mariadb/10.6/storage/innobase/row/row0import.cc:2277
#20 0x00005617ed5e9fde in row_import_for_mysql (table=0x7f7894290990, prebuilt=0x7f789428d480) at /mariadb/10.6/storage/innobase/row/row0import.cc:4991
#21 0x00005617ed428669 in ha_innobase::discard_or_import_tablespace (this=0x7f7894293d40, discard=0x0) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13403

However, this should not prevent any performance testing.

@dr-m dr-m marked this pull request as ready for review October 1, 2024 15:28
@dr-m
Copy link
Contributor Author

dr-m commented Oct 1, 2024

Thanks to @montywi for pointing out my mistake: all updates of ahi_block need to use exchange() or compare_exchange_strong() or similar. With this fixed in f431dc9, the following completed without errors on my laptop:

./mtr --mem --parallel=5 --suite=innodb --mysqld=--loose-innodb-adaptive-hash-index=on

buf_block_t *block= buf_block_alloc();
auto part= btr_search_sys.get_part(*index);

part->latch.wr_lock(SRW_LOCK_CALL);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a closer thought, we may need rd_lock(SRW_LOCK_CALL) here in order to prevent a race condition with mem_heap_free(part->heap) or mem_heap_empty(part->heap) (which I think should be covered by an exclusive latch). I am not sure if there can be such calls, so I must check this, and add a source code comment, with or without the latch acquisition, as appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We indeed must check btr_search_enabled while holding part->latch in order to avoid reintroducing a race condition with a concurrent btr_search_disable(), which had been fixed in ad2bf11.

MEM_HEAP_BTR_SEARCH: Remove. Let us handle this special type of
mem_heap_t allocations in the only compilation unit, btr0sea.cc.

mem_block_info_t::ahi_block: Replaces free_block. This caches one
buffer page for use in adaptive hash index allocations. This is
protected by btr_search_sys_t::partition::latch. It only is
Atomic_relaxed because btr_search_free_space() is following a
pattern of test, lock, and test.

btr_search_check_free_space(): Protect the ahi_block with a
shared AHI partition latch. We must recheck btr_search_enabled after
acquiring the latch in order to avoid a race condition with
btr_search_disable(). Using a shared latch instead of an exclusive one
should reduce contention with btr_search_guess_on_hash() and other
operations when running with innodb_adaptive_hash_index=ON.

This has been tested by running the regression test suite
with the adaptive hash index enabled:
./mtr --mysqld=--loose-innodb-adaptive-hash-index=ON
@dr-m
Copy link
Contributor Author

dr-m commented Oct 2, 2024

Sorry, this needs some more work:

#17 0x00007f78d3e42592 in __assert_fail (assertion=0x5617ede15ee8 "(table->freed_indexes).count == 0", file=0x5617ede15da8 "/mariadb/10.6/storage/innobase/dict/dict0mem.cc", line=0xcf, function=0x5617ede15ea0 "void dict_mem_table_free(dict_table_t*)") at ./assert/assert.c:103
#18 0x00005617ed7cc4fc in dict_mem_table_free (table=0x7f7894290990) at /mariadb/10.6/storage/innobase/dict/dict0mem.cc:207
#19 0x00005617ed5e127f in row_import_cleanup (prebuilt=0x7f789428d480, err=DB_SUCCESS, fts_table=0x7f7894290990) at /mariadb/10.6/storage/innobase/row/row0import.cc:2277
#20 0x00005617ed5e9fde in row_import_for_mysql (table=0x7f7894290990, prebuilt=0x7f789428d480) at /mariadb/10.6/storage/innobase/row/row0import.cc:4991
#21 0x00005617ed428669 in ha_innobase::discard_or_import_tablespace (this=0x7f7894293d40, discard=0x0) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13403

However, this should not prevent any performance testing.

This turned out to be an independent bug, which I fixed in cc70ca7.

btr_search_sys_t::erase(): Replaces ha_search_and_delete_if_found().
If we use futex-based read-write locks, start with a shared latch on
the partition and upgrade to an exclusive latch after searching.

ha_node_t: Define in the only compilation unit that uses this.
@dr-m
Copy link
Contributor Author

dr-m commented Oct 2, 2024

ee91743 switches from srw_spin_lock to srw_lock without spin loops, and 62dca98 implements a further optimization to btr_search_sys_t::partition::erase(), which is renamed from ha_search_and_delete_if_found().

Remove the lock upgrade logic; it may cause deadlocks.
@dr-m
Copy link
Contributor Author

dr-m commented Oct 2, 2024

994a740 reverts another attempt to use lock upgrade again. It would lead to occasional deadlocks when running ./mtr --mysqld=--loose-innodb-adaptive-hash-index=ON.

Comment on lines 811 to 818
for (ulint i = 0; i < btr_ahi_parts && btr_search_enabled; ++i) {
const auto part= &btr_search_sys.parts[i];
part->latch.rd_lock(SRW_LOCK_CALL);
ut_ad(part->heap->type == MEM_HEAP_FOR_BTR_SEARCH);
fprintf(file, "Hash table size " ULINTPF
", node heap has " ULINTPF " buffer(s)\n",
part->table.n_cells,
part->heap->base.count - !part->heap->free_block);
part->heap->base.count - !part->heap->ahi_block);
part->latch.rd_unlock();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If btr_search_enabled had been cleared between our check and the acquisition of part->latch, we’d be dereferencing part->heap=nullptr here.

A possible fix could be to ensure that part->heap will always remain allocated. In that way, we would not have to check for btr_search_enabled here at all, and btr_search_check_free_space_in_heap() could safely use atomic memory access instead of acquiring part->latch.

@dr-m dr-m marked this pull request as draft October 15, 2024 15:05
Comment on lines 281 to 285
/** Get an adaptive hash index partition */
partition *get_part(index_id_t id, ulint space_id) const
partition *get_part(index_id_t id, ulint space_id) const noexcept
{
return parts + ut_fold_ulint_pair(ulint(id), space_id) % btr_ahi_parts;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had better be just

    return parts + ulint(id) % btr_ahi_parts;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants