[RFC] iostream: extended read_exactly2 interface with alignment #5

Open
wants to merge 2 commits into
base: ceph-octopus-19.06.0-45-g7744693c

@cyx1231st cyx1231st commented Jun 5, 2019

See https://gist.github.com/cyx1231st/57727c8aa6c98ed48a8b06d64b7923d7

This PR introduces a less intrusive way to implement read with alignment.

Considerations:

  • Less intrusive: it only adds code to the existing seastar; all existing interfaces remain fully functional and unaffected.
  • For posix_data_source_impl, a system call is assumed to be much more expensive than a memory copy, so prefetch is disabled only if:
    • The read is not allowed to be scattered because of the alignment requirement.
    • And the read size is large enough to make an exclusive syscall worthwhile.
      (This idea is very similar to the current async-msgr, which always prefetches if the read is not large enough and copies the prefetched data to the out buffer p, see https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L235-L271)
  • For native_data_source_impl, the current implementation does its best to verify whether the memory alignment is already good, and triggers a user-space-to-user-space copy only if required.
  • The remaining *_data_source_impl classes are simply kept compatible; we do not currently use them.
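
As a rough illustration of the native_data_source_impl point above (verify the alignment first, copy only if required), here is a minimal standalone sketch. It is not the code in this PR: `is_aligned` and `ensure_aligned` are hypothetical names, and the caller is assumed to supply an already aligned destination buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if `p` sits on an `alignment`-byte boundary.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: returns `src` untouched when it is already aligned;
// otherwise copies `size` bytes into the caller-provided aligned buffer
// `dst` and returns `dst`.
inline const char* ensure_aligned(const char* src, std::size_t size,
                                  char* dst, std::size_t alignment) {
    if (is_aligned(src, alignment)) {
        return src;               // zero copy: the data can be used in place
    }
    std::memcpy(dst, src, size);  // user-space to user-space copy
    return dst;
}
```

The copy is paid only on the misaligned path, which is the behaviour the bullet describes.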

The code is functioning now, but still needs further evaluation of performance impacts.

@cyx1231st cyx1231st changed the title [RFC] iostream: extended read_exactly2 interface with alignment and padding [RFC] iostream: extended read_exactly2 interface with alignment Jun 10, 2019
@cyx1231st
Member Author

Update: the writer already pads on the wire in v2, so the reader does not need to do the same.

// can work with user provided buffer pointer with less copy.
return get().then([buf, size] (auto read_buf) mutable {
auto len_needs_copy = std::min(read_buf.size(), size);
std::copy(read_buf.get_write(), read_buf.get_write() + len_needs_copy, buf);

This is contrary to DPDK, where the network stack is responsible for providing the memory. Delegating this responsibility to the user translates directly into an obligatory memcpy, regardless of the contiguity imposed by returning a temporary_buffer instance.

@@ -214,6 +230,8 @@ public:
input_stream(input_stream&&) = default;
input_stream& operator=(input_stream&&) = default;
future<temporary_buffer<CharType>> read_exactly(size_t n);
static constexpr uint16_t DEFAULT_ALIGNMENT = alignof(void*);
future<temporary_buffer<CharType>> read_exactly2(size_t n, uint16_t alignment = DEFAULT_ALIGNMENT);

I'm afraid that using read_exactly2() to read big chunks will impose memcpy for DPDK because of the contiguity requirement. The new method returns a temporary_buffer, which means only one data pointer and one data size.
If we expect fragmented payloads from DPDK, we should expect a lot of memcpy from read_exactly2().
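
To make the contiguity concern concrete, the sketch below uses plain `std::vector` as a stand-in for both the fragments and the single-pointer result; it does not use Seastar types, and `fragment` and `coalesce` are hypothetical names. It shows that returning one contiguous buffer from N fragments means copying every payload byte once.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

using fragment = std::vector<char>;  // stand-in for one received packet fragment

// Builds one contiguous buffer out of `frags` and reports in `bytes_copied`
// how many bytes had to be copied to do so.
inline std::vector<char> coalesce(const std::vector<fragment>& frags,
                                  std::size_t& bytes_copied) {
    std::size_t total = 0;
    for (const auto& f : frags) {
        total += f.size();
    }
    std::vector<char> out(total);
    std::size_t off = 0;
    bytes_copied = 0;
    for (const auto& f : frags) {
        if (!f.empty()) {
            std::memcpy(out.data() + off, f.data(), f.size());
        }
        off += f.size();
        bytes_copied += f.size();
    }
    return out;
}
```

With a fragmented return type the fragments could be handed over as they are, and `bytes_copied` would stay at zero.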

Member Author

No, if crimson-OSD supports fragmented payloads (such as SPDK), it should explicitly instruct the messenger to use ceph::net::Socket::read(size) instead of ceph::net::Socket::read_exactly(size, alignment), because ceph::net::Socket::read(size) returns an internally fragmented bufferlist as expected. IMO it would be better renamed to ceph::net::Socket::read_fragmented(size).

Also, the current ceph::net::Socket::read(size) is already optimal for both DPDK stack and POSIX stack if OSD-side supports fragmented DATA payload:

  • for DPDK: it's zero copy.
  • for POSIX: it's zero copy in user-space, and also minimizes syscalls.

If the OSD doesn't support fragmented payloads itself (such as kernel), ceph::net::Socket::read_exactly(size) still needs to be used to build up big chunks of aligned payload, regardless of whether the messenger is using the Native or POSIX stack.

My point is that whether to use fragmented or aligned payloads should be decided by the OSD, not by the seastar framework. It's our (the framework user's) specific requirement.

Also, the current ceph::net::Socket::read(size) is already optimal for both DPDK stack and POSIX stack if OSD-side supports fragmented DATA payload:

  • for DPDK: it's zero copy.
  • for POSIX: it's zero copy in user-space, and also minimizes syscalls.

I disagree with that. For the POSIX stack the SGL will be terribly fragmented and many syscalls will be issued because of the small, 8 KB-long prefetch buffer. For instance, reading a 4 MB payload requires 4096 KB / 8 KB = 512 fragments and also 512 syscalls.
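
The arithmetic can be written down as a small helper. This is only a sketch: `reads_needed` is a hypothetical name, and it assumes every read fills the prefetch buffer completely, so the result is a lower bound on the real number of fragments and syscalls.

```cpp
#include <cstddef>

// Number of buffer-sized reads needed to cover `payload_bytes`, rounding up.
// Assumes each read returns a full `buf_bytes`.
constexpr std::size_t reads_needed(std::size_t payload_bytes,
                                   std::size_t buf_bytes) {
    return (payload_bytes + buf_bytes - 1) / buf_bytes;
}

// The figure from the comment: 4 MB through an 8 KB buffer.
static_assert(reads_needed(4096 * 1024, 8 * 1024) == 512,
              "4096 KB / 8 KB should be 512 reads");
```

Under the same assumption, a 1 MB buffer would bring the count for a 4 MB payload down to 4.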

Member Author

@cyx1231st cyx1231st Jun 11, 2019

because of the small, 8 KB-long prefetch buffer

I think it's a separate issue, and there is already another PR addressing it (#4). My analysis (#4 (comment)) shows that messenger performance is much better with larger chunks (1 MB), as expected. But I still don't know why rados bench disagreed (from kefu).

@@ -325,6 +325,21 @@ posix_data_source_impl::get() {
});
}

future<size_t, temporary_buffer<char>>
posix_data_source_impl::get_direct(char* buf, size_t size) {
if (size > _buf_size / 2) {

Yeah, in the 1 syscall/msg testing, 4096 currently looks like a reasonable threshold for prefetching.

Member Author

Yep, it is the same strategy implemented in the current async-messenger.
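
The threshold in the hunk above (`size > _buf_size / 2`) can be isolated as a pure function for discussion. This is a sketch, not the PR's code: `read_path` and `choose_read_path` are hypothetical names, and the 8 KB buffer size in the usage note is the prefetch buffer size mentioned elsewhere in this review.

```cpp
#include <cstddef>

enum class read_path {
    direct,            // one exclusive syscall straight into the user buffer
    prefetch_and_copy  // read into the prefetch buffer, then copy out
};

// Mirrors the condition in get_direct(): reads larger than half the
// prefetch buffer bypass it; smaller reads go through it.
constexpr read_path choose_read_path(std::size_t size, std::size_t buf_size) {
    return size > buf_size / 2 ? read_path::direct
                               : read_path::prefetch_and_copy;
}
```

With an 8 KB prefetch buffer, a 4096-byte read still goes through the prefetch path and anything larger is read directly, which matches the 4096 threshold discussed here.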

posix_data_source_impl::get_direct(char* buf, size_t size) {
if (size > _buf_size / 2) {
// this was a large read, we don't prefetch
return _fd->read_some(buf, size).then([] (auto read_size) {

OK, the POSIX stack would serve large chunks with fewer syscalls and with limited-but-still-present memcpy. Two factors contribute:

  • the mechanism is not aware (and really shouldn't be – this is Seastar ;-) of where in the stream the boundaries of a Ceph frame's parts (preamble, segments, epilogue) are. As a consequence, the prefetch buffer may already contain a portion of the data we're interested in. If so, we must memmove it to the single output buffer, as
  • read_exactly2 can return only contiguous memory.

The probability of hitting the memcpy is quite large, I bet. The impact depends on chunk size. Copying up to 8k isn't a big deal for 4M but can be meaningful for 16k.
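
To put rough numbers on that, the sketch below computes the worst-case share of a chunk that has to be copied when up to one prefetch buffer's worth of it is already buffered. `worst_case_copy_fraction` is a hypothetical helper, and `chunk` is assumed to be non-zero.

```cpp
#include <cstddef>

// Worst-case fraction of `chunk` bytes that must be copied when at most
// `prefetch` bytes of the chunk already sit in the prefetch buffer.
constexpr double worst_case_copy_fraction(std::size_t chunk,
                                          std::size_t prefetch) {
    return static_cast<double>(prefetch < chunk ? prefetch : chunk) /
           static_cast<double>(chunk);
}
```

For the two sizes mentioned, 8k out of 4M is 8192 / 4194304, about 0.2% of the chunk, while 8k out of 16k is 50%.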

Member Author

@cyx1231st cyx1231st Jun 11, 2019

That is under the assumption that a syscall is slower than a memcpy; in other words, for smaller chunks, prefetch with memcpy is faster than exclusive syscalls.

I'm not 100% sure about this, and that's why I'm working on improving perf_crimson_msgr to get more accurate & informative results.

auto len_needs_copy = std::min(available(), n);
std::copy(_buf.get(), _buf.get() + len_needs_copy, out.get_write());
_buf.trim_front(len_needs_copy);
if (len_needs_copy == n) {

Hmm, maybe we don't need this special if. We might consider unifying with the one in ::read_exactly_part_direct():

    if (completed == n) {
        return make_ready_future<tmp_buf>(std::move(out));
    }

});
} else {
// read with prefetch, but with extra memory copy,
// because we prefer less system calls.

What about io_uring? Currently it's reasonable to do a lot of extra work just to lower the number of syscalls. However, io_uring is intended to lower the costs of communication between kernel and user-space, and thus I would expect it will move the threshold much lower.

Member Author

@cyx1231st cyx1231st Jun 11, 2019

For io_uring I believe we need another concrete data_source_impl class, not posix_data_source_impl. And io_uring also needs a new poller, right?

@@ -325,6 +325,21 @@ posix_data_source_impl::get() {
});
}

future<size_t, temporary_buffer<char>>
posix_data_source_impl::get_direct(char* buf, size_t size) {
if (size > _buf_size / 2) {

Is ::get_direct() a good place for such logic? Maybe moving it up would be preferable? I'm afraid the name is currently a little misleading, as the _direct part is actually conditional.

Member Author

I don't think I can move it up, because this is a special case to reduce syscalls for posix sockets; it does not generalize to other concrete data_source_impl classes such as native_data_source_impl.

// can work with user provided buffer pointer with less copy.
return get().then([buf, size] (auto read_buf) mutable {
auto len_needs_copy = std::min(read_buf.size(), size);
std::copy(read_buf.get_write(), read_buf.get_write() + len_needs_copy, buf);

Maybe get() instead of get_write()?

@cyx1231st
Member Author

Early evaluation shows that performance is similar with smaller chunks (256 B, 4 K), but much faster with larger chunks (64 K, 1 M) because of reduced memcpy.

I'm working on improving perf_crimson_msgr to get more accurate reports...

@cyx1231st cyx1231st changed the base branch from ceph-octopus to ceph-octopus-19.06.0-45-g7744693c July 1, 2019 08:18
For posix-stack: minimize system-calls with prefetch, and minimize
unnecessary memory copies.

For native-stack: minimize unnecessary memory copies.

TODO: compatible but may not be optimal
- tls_connected_socket_impl
- file_data_source_impl
- loopback_data_source_impl
- packet_data_source

Signed-off-by: Yingxin <[email protected]>
cyx1231st pushed a commit to cyx1231st/seastar that referenced this pull request Jan 2, 2020
This reverts commit 33406cf. It
introduces memory leaks:

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7fb773b389d7 in operator new(unsigned long) (/lib64/libasan.so.5+0x10f9d7)
    #1 0x108f0d4 in seastar::reactor::poller::~poller() ../src/core/reactor.cc:2879
    #2 0x11c1e59 in std::experimental::fundamentals_v1::_Optional_base<seastar::reactor::poller, true>::~_Optional_base() /usr/include/c++/9/experimental/optional:288
    #3 0x118f2d7 in std::experimental::fundamentals_v1::optional<seastar::reactor::poller>::~optional() /usr/include/c++/9/experimental/optional:491
    #4 0x108c5a5 in seastar::reactor::run() ../src/core/reactor.cc:2587
    #5 0xf1a822 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) ../src/core/app-template.cc:199
    #6 0xf1885d in seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) ../src/core/app-template.cc:115
    #7 0xeb2735 in operator() ../src/testing/test_runner.cc:72
    #8 0xebb342 in _M_invoke /usr/include/c++/9/bits/std_function.h:300
    #9 0xf3d8b0 in std::function<void ()>::operator()() const /usr/include/c++/9/bits/std_function.h:690
    #10 0x1034c72 in seastar::posix_thread::start_routine(void*) ../src/core/posix.cc:52
    #11 0x7fb7738804e1 in start_thread /usr/src/debug/glibc-2.30-13-g919af705ee/nptl/pthread_create.c:479

Reported-by: Rafael Avila de Espindola <[email protected]>
tchaikov pushed a commit that referenced this pull request May 13, 2021
…o_with

Fixes failures in debug mode:
```
$ build/debug/tests/unit/closeable_test -l all -t deferred_close_test
WARNING: debug mode. Not for benchmarking or production
random-seed=3064133628
Running 1 test case...
Entering test module "../../tests/unit/closeable_test.cc"
../../tests/unit/closeable_test.cc(0): Entering test case "deferred_close_test"
../../src/testing/seastar_test.cc(43): info: check true has passed
==9449==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
terminate called after throwing an instance of 'seastar::broken_promise'
  what():  broken promise
==9449==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7fbf1f49f000; bottom 0x7fbf40971000; size: 0xffffffffdeb2e000 (-558702592)
False positive error reports may follow
For details see google/sanitizers#189
=================================================================
==9449==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:356 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0)
    #0 0x7fbf45f39d0b  (/lib64/libasan.so.6+0xb3d0b)
    #1 0x7fbf45f57d4e  (/lib64/libasan.so.6+0xd1d4e)
    #2 0x7fbf45f3e724  (/lib64/libasan.so.6+0xb8724)
    #3 0x7fbf45eb3e5b  (/lib64/libasan.so.6+0x2de5b)
    #4 0x7fbf45eb51e8  (/lib64/libasan.so.6+0x2f1e8)
    #5 0x7fbf45eb7694  (/lib64/libasan.so.6+0x31694)
    #6 0x7fbf45f39398  (/lib64/libasan.so.6+0xb3398)
    #7 0x7fbf45f3a00b in __asan_report_load8 (/lib64/libasan.so.6+0xb400b)
    #8 0xfe6d52 in bool __gnu_cxx::operator!=<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > >(__gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&, __gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&) /usr/include/c++/10/bits/stl_iterator.h:1116
    #9 0xfe615c in dl_iterate_phdr ../../src/core/exception_hacks.cc:121
    #10 0x7fbf44bd1810 in _Unwind_Find_FDE (/lib64/libgcc_s.so.1+0x13810)
    #11 0x7fbf44bcd897  (/lib64/libgcc_s.so.1+0xf897)
    #12 0x7fbf44bcea5f  (/lib64/libgcc_s.so.1+0x10a5f)
    #13 0x7fbf44bcefd8 in _Unwind_RaiseException (/lib64/libgcc_s.so.1+0x10fd8)
    #14 0xfe6281 in _Unwind_RaiseException ../../src/core/exception_hacks.cc:148
    #15 0x7fbf457364bb in __cxa_throw (/lib64/libstdc++.so.6+0xaa4bb)
    #16 0x7fbf45e10a21  (/lib64/libboost_unit_test_framework.so.1.73.0+0x1aa21)
    #17 0x7fbf45e20fe0 in boost::execution_monitor::execute(boost::function<int ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2afe0)
    #18 0x7fbf45e21094 in boost::execution_monitor::vexecute(boost::function<void ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2b094)
    #19 0x7fbf45e43921 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d921)
    #20 0x7fbf45e5eae1  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68ae1)
    #21 0x7fbf45e5ed31  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68d31)
    #22 0x7fbf45e2e547 in boost::unit_test::framework::run(unsigned long, bool) (/lib64/libboost_unit_test_framework.so.1.73.0+0x38547)
    #23 0x7fbf45e43618 in boost::unit_test::unit_test_main(bool (*)(), int, char**) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d618)
    #24 0x44798d in seastar::testing::entry_point(int, char**) ../../src/testing/entry_point.cc:77
    #25 0x4134b5 in main ../../include/seastar/testing/seastar_test.hh:65
    #26 0x7fbf44a1b1e1 in __libc_start_main (/lib64/libc.so.6+0x281e1)
    #27 0x4133dd in _start (/home/bhalevy/dev/seastar/build/debug/tests/unit/closeable_test+0x4133dd)
```

Signed-off-by: Benny Halevy <[email protected]>
Message-Id: <[email protected]>
tchaikov pushed a commit that referenced this pull request Nov 21, 2022
When we enable the sanitizer, we get the following error while running
iotune:

==86505==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4096 byte(s) in 1 object(s) allocated from:
    #0 0x5701b8 in aligned_alloc (/home/syuu/seastar.2/build/sanitize/apps/iotune/iotune+0x5701b8) (BuildId: 411f9852d64ed8982d5b33d02489b5932d92b8b7)
    #1 0x6d0813 in seastar::filesystem_has_good_aio_support(seastar::basic_sstring<char, unsigned int, 15u, true>, bool) /home/syuu/seastar.2/src/core/fsqual.cc:74:16
    #2 0x5bcd0d in main::$_0::operator()() const::'lambda'()::operator()() const /home/syuu/seastar.2/apps/iotune/iotune.cc:742:21
    #3 0x5bb1f1 in seastar::future<int> seastar::futurize<int>::apply<main::$_0::operator()() const::'lambda'()>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&) /home/syuu/seastar.2/include/seastar/core/future.hh:2118:28
    #4 0x5bb039 in seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()::operator()() const /home/syuu/seastar.2/include/seastar/core/thread.hh:258:13
    #5 0x5bb039 in seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) /home/syuu/seastar.2/include/seastar/util/noncopyable_function.hh:124:20
    #6 0x8e0a77 in seastar::thread_context::main() /home/syuu/seastar.2/src/core/thread.cc:299:9
    #7 0x7f30ff8547bf  (/lib64/libc.so.6+0x547bf) (BuildId: 85c438f4ff93e21675ff174371c9c583dca00b2c)

SUMMARY: AddressSanitizer: 4096 byte(s) leaked in 1 allocation(s).

This is because we don't free the buffer allocated in filesystem_has_good_aio_support(); we should free it to avoid this error.

And this is needed to test Scylla machine image with debug mode binary,
since it tries to run iotune with the sanitizer and fails.
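
One way to make sure such a buffer is always released is to tie it to a scope. The sketch below is an illustration only and is not necessarily how the actual fix was written: `free_deleter` and `make_aligned_buffer` are hypothetical names, and `std::aligned_alloc` requires `size` to be a multiple of `alignment`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Deleter that releases memory obtained from std::aligned_alloc.
struct free_deleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using aligned_buffer = std::unique_ptr<char, free_deleter>;

// Allocates `size` bytes aligned to `alignment`. The buffer is freed
// automatically when the returned object goes out of scope.
inline aligned_buffer make_aligned_buffer(std::size_t alignment,
                                          std::size_t size) {
    return aligned_buffer(
        static_cast<char*>(std::aligned_alloc(alignment, size)));
}
```

A 4096-byte buffer like the one in the report would then be created with `make_aligned_buffer(4096, 4096)` and freed on every exit path, including early returns.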

Closes scylladb#1284
Matan-B pushed a commit that referenced this pull request Jul 7, 2024
in main(), we create an instance of `http_server_control` using
new, but we never destroy it. this is identified by ASan

```
==2190125==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x55e21cf487bd in operator new(unsigned long) /home/kefu/dev/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:86:3
    #1 0x55e21cf6cf31 in main::$_0::operator()() const::'lambda'()::operator()() const /home/kefu/dev/seastar/apps/httpd/main.cc:121:27
    #2 0x55e21cf6b4cc in int std::__invoke_impl<int, main::$_0::operator()() const::'lambda'()>(std::__invoke_other, main::$_0::operator()() const::'lambda'()&&) /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61:14
    #3 0x55e21cf6b46c in std::__invoke_result<main::$_0::operator()() const::'lambda'()>::type std::__invoke<main::$_0::operator()() const::'lambda'()>(main::$_0::operator()() const::'lambda'()&&) /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:96:14
    #4 0x55e21cf6b410 in decltype(auto) std::__apply_impl<main::$_0::operator()() const::'lambda'(), std::tuple<>>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:2921:14
    #5 0x55e21cf6b3b2 in decltype(auto) std::apply<main::$_0::operator()() const::'lambda'(), std::tuple<>>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&) /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:2936:14
    #6 0x55e21cf6b283 in seastar::future<int> seastar::futurize<int>::apply<main::$_0::operator()() const::'lambda'()>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&) /home/kefu/dev/seastar/include/seastar/core/future.hh:2005:28
    #7 0x55e21cf6b043 in seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()::operator()() const /home/kefu/dev/seastar/include/seastar/core/thread.hh:260:13
    #8 0x55e21cf6ae74 in seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) /home/kefu/dev/seastar/include/seastar/util/noncopyable_function.hh:129:20
    #9 0x7f5d757a0fb3 in seastar::noncopyable_function<void ()>::operator()() const /home/kefu/dev/seastar/include/seastar/util/noncopyable_function.hh:215:16
    #10 0x7f5d75ef5611 in seastar::thread_context::main() /home/kefu/dev/seastar/src/core/thread.cc:311:9
    #11 0x7f5d75ef50eb in seastar::thread_context::s_main(int, int) /home/kefu/dev/seastar/src/core/thread.cc:287:43
    #12 0x7f5d72f8a18f  (/lib64/libc.so.6+0x5a18f) (BuildId: b098f1c75a76548bb230d8f551eae07a2aeccf06)
```

so, in this change, let's hold it using a smart pointer, so we
can destroy it when it leaves the lexical scope.
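
The idea can be sketched in isolation as follows. `server_control` is a hypothetical stand-in for `http_server_control` that only counts live instances, and `run_scoped` is likewise made up for illustration; neither is the code from this change.

```cpp
#include <memory>

// Hypothetical stand-in for http_server_control; it only tracks how many
// instances are alive so the effect of scoping is observable.
struct server_control {
    explicit server_control(int& live) : _live(live) { ++_live; }
    ~server_control() { --_live; }
    server_control(const server_control&) = delete;
    server_control& operator=(const server_control&) = delete;
private:
    int& _live;
};

// Returns the number of live instances seen while the server is in scope.
inline int run_scoped(int& live) {
    auto server = std::make_unique<server_control>(live);
    return live;  // `server` is destroyed after the return value is copied
}
```

With a raw `new` and no matching `delete`, the counter would stay at 1 after the call, which corresponds to the leak ASan reports.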

Signed-off-by: Kefu Chai <[email protected]>

Closes scylladb#2224