
CpuBoundWork#CpuBoundWork(): don't spin on atomic int to acquire slot #9990

Open · wants to merge 6 commits into master from re-think-cpuboundwork-implementation-9988
Conversation


@Al2Klimov Al2Klimov commented Feb 7, 2024

Spinning on the atomic int is inefficient and involves unfair scheduling. The latter implies
possible bad surprises regarding waiting durations on busy nodes. Instead,
use AsioConditionVariable#Wait() if there are no free slots. It's notified
by another request's CpuBoundWork#~CpuBoundWork() once that work has finished.

fixes #9988

Also, the current implementation is a spin-lock. 🙈 #10117 (comment)
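
For orientation, a minimal sketch of the acquisition path this PR switches to, condensed from the instrumented diff further down in this thread (IoEngine::Get() and the way the semaphore struct is reached here are assumptions for illustration only):

// Sketch only, not the exact patch.
CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)
{
	auto& ioEngine (IoEngine::Get()); // assumed accessor
	auto& sem (ioEngine.m_CpuBoundSemaphore);

	std::unique_lock<std::mutex> lock (sem.Mutex);

	if (sem.FreeSlots) {
		--sem.FreeSlots; // fast path: take a free slot and return immediately
		return;
	}

	// Slow path: enqueue a condition variable and suspend this coroutine.
	// CpuBoundWork#~CpuBoundWork() of a finished request pops the queue front
	// and Set()s it, so waiters are woken in FIFO order instead of spinning.
	AsioConditionVariable cv (ioEngine.GetIoContext());
	sem.Waiting.emplace(&cv);
	lock.unlock();

	cv.Wait(yc);
}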

@Al2Klimov Al2Klimov self-assigned this Feb 7, 2024
@cla-bot cla-bot bot added the cla/signed label Feb 7, 2024
@icinga-probot icinga-probot bot added area/api REST API area/distributed Distributed monitoring (master, satellites, clients) core/quality Improve code, libraries, algorithms, inline docs ref/IP labels Feb 7, 2024
@Al2Klimov (Member Author)

Low load test

[2024-02-07 12:16:44 +0100] information/ApiListener: New client connection from [::ffff:127.0.0.1]:51732 (no client certificate)
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/HttpServerConnection: Request: GET /v1/objects/services/3d722963ae43!4272 (from [::ffff:127.0.0.1]:51732), user: root, agent: curl/8.4.0, status: Not Found).
[2024-02-07 12:16:44 +0100] information/HttpServerConnection: HTTP client disconnected (from [::ffff:127.0.0.1]:51732)
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12

Just a few increments/decrements. 👍

@Al2Klimov (Member Author)

High load test

If I literally DoS Icinga with https://github.com/Al2Klimov/i2all.tf/tree/master/i2dos, I get a few of these:

[2024-02-07 12:19:37 +0100] information/CpuBoundWork: Handing over one used slot, free: 0 => 0

After I stop that program and fire one curl as in my low load test above, I get the same picture: still 12 free slots. 👍

Logs

--- lib/base/io-engine.cpp
+++ lib/base/io-engine.cpp
@@ -24,6 +24,7 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)
        std::unique_lock<std::mutex> lock (sem.Mutex);

        if (sem.FreeSlots) {
+               Log(LogInformation, "CpuBoundWork") << "Using one free slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots - 1u;
                --sem.FreeSlots;
                return;
        }
@@ -32,7 +33,9 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)

        sem.Waiting.emplace(&cv);
        lock.unlock();
+       Log(LogInformation, "CpuBoundWork") << "Waiting...";
        cv.Wait(yc);
+       Log(LogInformation, "CpuBoundWork") << "Waited!";
 }

 void CpuBoundWork::Done()
@@ -42,8 +45,10 @@ void CpuBoundWork::Done()
                std::unique_lock<std::mutex> lock (sem.Mutex);

                if (sem.Waiting.empty()) {
+                       Log(LogInformation, "CpuBoundWork") << "Releasing one used slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots + 1u;
                        ++sem.FreeSlots;
                } else {
+                       Log(LogInformation, "CpuBoundWork") << "Handing over one used slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots;
                        sem.Waiting.front()->Set();
                        sem.Waiting.pop();
                }

@Al2Klimov Al2Klimov removed their assignment Feb 7, 2024
@Al2Klimov Al2Klimov marked this pull request as ready for review February 7, 2024 11:24
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from c11989f to 8d24525 Compare February 9, 2024 13:03
Comment on lines 36 to 39
try {
	cv->Wait(yc);
} catch (...) {
	Done();
Contributor

Why is Done() called here? Wouldn't this release a slot that was never acquired?

Member Author

One of the pillars of the whole logic:

A regular CpuBoundWork#Done() from CpuBoundWork#~CpuBoundWork() calls AsioConditionVariable#Set() and expects it to successfully finish AsioConditionVariable#Wait() and CpuBoundWork#CpuBoundWork(). The latter implies a later CpuBoundWork#~CpuBoundWork() which again calls CpuBoundWork#Done(). But if AsioConditionVariable#Wait() throws, I can call CpuBoundWork#Done() now or never.
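
In other words, a sketch of the hand-off being described, pieced together from the diff above (the rethrow at the end is an assumption, not quoted from the patch):

// Done() transfers the slot directly to the first waiter instead of
// incrementing FreeSlots, so the waiter is expected to resume and later
// call Done() itself.
if (sem.Waiting.empty()) {
	++sem.FreeSlots;            // nobody waits: the slot becomes free again
} else {
	sem.Waiting.front()->Set(); // hand the slot over; FreeSlots stays unchanged
	sem.Waiting.pop();
}

// Waiter side: if Wait() throws after the slot was (or may have been)
// handed over, calling Done() here is the only chance to pass it on.
try {
	cv->Wait(yc);
} catch (...) {
	Done();
	throw; // assumed: let the exception continue unwinding
}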

Contributor

But still, this means that more coroutines can simultaneously acquire CpuBoundMutex than what is permitted by ioEngine.m_CpuBoundSemaphore.

Member Author

... technically speaking and in an edge case where probably all of them are purged anyway.

@julianbrost (Contributor)

Is the way boost::asio::deadline_timer is used by AsioConditionVariable here actually safe? Its documentation says that shared objects are not thread-safe.

@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 8d24525 to bf74280 Compare February 20, 2024 12:21
		IoEngine::YieldCurrentCoroutine(yc);
		continue;
	}

	AsioConditionVariable cv (ioEngine.GetIoContext());
Contributor

If you put the condition variable on the stack, you can't simply call Done(), which could wake any other coroutine: if that happened, there would be dangling pointers left in the queue. If you instead retracted the queue entry from the coroutine, that wouldn't be a problem, and as an extra benefit you wouldn't create extra slots out of thin air.
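
A minimal sketch of that retraction idea, using the gotSlot/pos names that appear in the later revision discussed below (the lock handling and the rethrow are assumptions):

try {
	cv->Wait(yc);
} catch (...) {
	std::unique_lock<std::mutex> lock (sem.Mutex);

	if (gotSlot) {
		// The slot was already handed over to us: release it properly.
		lock.unlock();
		Done();
	} else {
		// Still queued: retract our own entry so nobody Set()s a dangling
		// condition variable and no slot is released that was never acquired.
		sem.Waiting.erase(pos);
	}

	throw;
}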

@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from bf74280 to 9062934 Compare February 21, 2024 11:13
Comment on lines +32 to +33
bool gotSlot = false;
auto pos (sem.Waiting.insert(sem.Waiting.end(), IoEngine::CpuBoundQueueItem{&strand, cv, &gotSlot}));
Member

I don't understand why you're using a boolean pointer here! Why not just use a simple bool type instead?

Comment on lines +37 to +39
try {
	cv->Wait(yc);
} catch (...) {
Member

What are you trying to catch here? AsioConditionVariable#Wait() asynchronously waits with the non-throwing form of Asio async_wait().

Member Author

I mainly catch forced_unwind.
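
For context, a hedged side note (not from the thread): destroying a suspended Boost stackful coroutine typically resumes it by throwing boost::coroutines::detail::forced_unwind, even when the wrapped async_wait() uses the error_code form, so a broad catch around Wait() is how cleanup gets a chance to run. A toy illustration of the shape involved:

try {
	cv->Wait(yc); // may throw forced_unwind if the coroutine is being torn down
} catch (...) {
	// ... clean up the queue entry / slot here (see the surrounding snippets) ...
	throw; // forced_unwind must reach the coroutine machinery again
}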

} catch (...) {
	std::unique_lock<std::mutex> lock (sem.Mutex);

	if (gotSlot) {
Member

You can just use pos->GotSlot instead here and don't need to keep track of a bool type.

Member Author

Items get moved out of sem.Waiting, which invalidates pos. gotSlot tells me whether pos is still valid or not.

@yhabteab (Member), Feb 22, 2024

cppreference says this:

Adding, removing and moving the elements within the list or across several lists does not invalidate the iterators or references. An iterator is invalidated only when the corresponding element is deleted.

I would simply use a pointer to CpuBoundQueueItem for the queue instead then.

IoEngine::CpuBoundQueueItem item{&strand, cv, false};
auto pos (sem.Waiting.emplace(sem.Waiting.end(), &item));

Member Author

only when the corresponding element is deleted.

That's exactly what gotSlot tells me.

continue;
*next.GotSlot = true;
sem.Waiting.pop_front();
boost::asio::post(*next.Strand, SetAsioCV(std::move(next.CV)));
Member

I would just use something like this instead and drop the intermediate class SetAsioCV entirely.

boost::asio::post(*next.Strand, [cv = std::move(next.CV)]() { cv->Set(); });

IoBoundWorkSlot#~IoBoundWorkSlot() will wait for a free semaphore slot
which will be almost immediately released by CpuBoundWork#~CpuBoundWork().
Just releasing the already acquired slot via CpuBoundWork#Done()
is more efficient.

This is inefficient and involves unfair scheduling. The latter implies
possible bad surprises regarding waiting durations on busy nodes. Instead,
use AsioConditionVariable#Wait() if there are no free slots. It's notified
by others' CpuBoundWork#~CpuBoundWork() once finished.
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from a00262f to 26ef66e Compare September 27, 2024 09:51
@Al2Klimov (Member Author)

In addition, v2.14.2 could theoretically misbehave once the free-slot count "temporarily" falls noticeably below zero. For example, three requestors each perform ioEngine.m_CpuBoundSemaphore.fetch_sub(1) while it's zero (0 - 3 x 1 = -3). Now requestor A realizes that it's not allowed to take a slot and adds 1 back (-2). So does requestor B (-1). But A subtracts again (-2) before C also adds 1 back (-1). And so on.

https://github.com/Icinga/icinga2/blob/v2.14.2/lib/base/io-engine.cpp#L24-L31

So that spinlock blocks not only CPU time, but also slots from legitimate requestors. The father of all spinlocks, so to speak. 🙈 #10117 (comment)
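
Roughly, the linked v2.14.2 code has this shape (a paraphrased sketch, not a verbatim quote of that file):

// v2.14.2-style acquisition: spin on the atomic counter until a slot is won.
for (;;) {
	auto availableSlots (ioEngine.m_CpuBoundSemaphore.fetch_sub(1));

	if (availableSlots < 1) {
		// Lost the race: undo the decrement and try again later. Between the
		// fetch_sub() and this fetch_add() the counter stays negative, which is
		// exactly the window in which other requestors get pushed further below
		// zero and legitimate slots remain blocked, as described above.
		ioEngine.m_CpuBoundSemaphore.fetch_add(1);
		IoEngine::YieldCurrentCoroutine(yc);
		continue;
	}

	break;
}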

Linked issue: Re-think CpuBoundWork implementation and usage (#9988)