-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45266: [C++][Acero] Fix the running tasks count of Scheduler when get error tasks in multi-threads #45268
base: main
Are you sure you want to change the base?
Conversation
|
Great! Also would it easy to add a test for this case? |
Actually, I think it is hard to add some tests for this case. Because It is a multi-threads case. If we wanna add a test about it, we need to make sure the Abort-operation's happen time and the task which in this time should be return error-status. |
I'm thinking this might not be very necessary for the following two reasons:
What do you think? @wuzhoupei |
I agree some of you. @zanmato1984
|
Hi @wuzhoupei , thank you for the further explanation. Yes you are right, the problem of abort continuation not being called exists if the task count is not correct. And by looking at the code, the task group continuation being called after task meets error seems to be by design allowed - it will first examine the somewhat internal error state to exist early. I think we can proceed with this PR, I'll put my review comment on the code. Meanwhile, it will be very helpful to have a test cast about this. I can help if you have trouble on that. |
Thanks the review from @zanmato1984. |
Hi @wuzhoupei , I've committed two test cases that can stably reproduce the issue of abort continuation not being invoked after aborting. Along with them are some more fixes necessary. Will you take a look? Thanks. |
// Mark the current and remaining picked tasks as finished | ||
for (size_t j = i; j < tasks.size(); ++j) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for the serial execution path to correctly set the task count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So PostExecuteTask need execute the task even if failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is pretty much the whole point of this change.
@@ -413,6 +421,8 @@ void TaskSchedulerImpl::Abort(AbortContinuationImpl impl) { | |||
all_finished = false; | |||
task_group.state_ = TaskGroupState::ALL_TASKS_STARTED; | |||
} | |||
} else if (task_group.state_ == TaskGroupState::ALL_TASKS_STARTED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for allowing the Abort
to be called in a task itself, otherwise the abort continuation will be called twice.
I suppose the Abort
function is not necessarily to be called inside a task body to trigger the issue. In other words, it could happen when Abort
is called in a timing that is subtle enough.
// PostExecuteTask must be called later if any error ocurres during task execution | ||
// (including ScheduleMore), so we preserve the status. | ||
auto status = [&]() { | ||
RETURN_NOT_OK(ScheduleMore(thread_id, 1)); | ||
return ExecuteTask(thread_id, group_id, task_id, &task_group_finished); | ||
}(); | ||
|
||
if (!status.ok()) { | ||
task_group_finished = PostExecuteTask(thread_id, group_id); | ||
} | ||
|
||
if (task_group_finished) { | ||
bool all_task_groups_finished = false; | ||
return OnTaskGroupFinished(thread_id, group_id, &all_task_groups_finished); | ||
RETURN_NOT_OK( | ||
OnTaskGroupFinished(thread_id, group_id, &all_task_groups_finished)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some refinement on top of your original change, PTAL.
Rationale for this change
When the TaskGroup should be canceled, it will move the number which not-start to finished to avoid do them(in
TaskSchedulerImpl::Abort
). But this is one operation that happens in multi-threads. At the same time, maybe some task start to running and happen some error. Then they will return the bad status.But the tasks are running for Scheduler, they will just return bad status and not change the running_task count. Because the code uses
RETURN_NOT_OK
.What changes are included in this PR?
For any task, what status weather it returns, it will change the running_count before return.
Are these changes tested?
No. It is too hard to build ut.
Are there any user-facing changes?
No. But I am very shocked at hasn't this happened to anyone?