[YUNIKORN-2895] Don't add duplicated allocation to node when the allo… #976

zhuqi-lucas · 2024-10-01T03:43:07Z

…cation already allocated before.

While revisiting the new update-allocation logic, I noticed that a duplicate allocation can be added to a node if the allocation was already allocated: we add the allocation to the node again and do not revert it.

Note:

We already have unwind logic in the other call path, for placeholder allocation:

yunikorn-core/pkg/scheduler/objects/application.go

Line 1285 in fe067b7

_ = node.RemoveAllocation(reqFit.GetAllocationKey())

What type of PR is it?

Todos

- Task

What is the Jira issue?

Open an issue on Jira https://issues.apache.org/jira/browse/YUNIKORN-2895
Put link here, and add [YUNIKORN-Jira number] in PR title, eg. [YUNIKORN-2] Gang scheduling interface parameters

How should this be tested?

Screenshots (if appropriate)

Questions:

- The license files need an update.
- There are breaking changes for older versions.
- It needs documentation.


codecov · 2024-10-01T03:44:55Z

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 80.99%. Comparing base (fe067b7) to head (5d1f2ba).
Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/scheduler/objects/application.go	0.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #976      +/-   ##
==========================================
- Coverage   81.03%   80.99%   -0.04%     
==========================================
  Files          97       97              
  Lines       12523    12529       +6     
==========================================
  Hits        10148    10148              
- Misses       2104     2110       +6     
  Partials      271      271


ryankert01 · 2024-10-01T07:27:49Z

Good catch!

craigcondit

This will break horribly. If allocateAsk() fails, there is nothing to unwind as the allocation result won't be returned and the final allocation will never happen. Your logic will remove something that was not added in the first place.
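One way to read this objection is as a general hazard of unconditional unwinding, sketched below in a toy model. Everything here (`toyNode`, `unwindAlways`, `unwindOwned`) is hypothetical and written only for illustration; none of it is yunikorn-core code, and it does not claim to reproduce the actual `allocateAsk()` call path.

```go
package main

import "errors"

// toyNode is a hypothetical stand-in that tracks allocation keys.
// It is NOT the yunikorn-core node type.
type toyNode struct {
	allocations map[string]bool
}

// add records key and reports whether this call actually added it.
func (n *toyNode) add(key string) bool {
	if n.allocations[key] {
		return false // already present, nothing added by this call
	}
	n.allocations[key] = true
	return true
}

// remove drops key from the node, whoever added it.
func (n *toyNode) remove(key string) {
	delete(n.allocations, key)
}

var errStepFailed = errors.New("later step failed")

// unwindAlways removes key whenever the later step fails, even if this
// attempt never added it. That can delete state owned by an earlier,
// successful allocation.
func unwindAlways(n *toyNode, key string, step func() error) error {
	n.add(key)
	err := step()
	if err != nil {
		n.remove(key)
	}
	return err
}

// unwindOwned removes key only when this attempt added it.
func unwindOwned(n *toyNode, key string, step func() error) error {
	added := n.add(key)
	err := step()
	if err != nil && added {
		n.remove(key)
	}
	return err
}
```

In this toy, if the key is already on the node and the later step fails, `unwindAlways` leaves the node without the key while `unwindOwned` leaves it in place.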

zhuqi-lucas · 2024-10-01T12:47:57Z

Thanks @craigcondit for the review.
Our handling of a failed allocation is inconsistent. Here is the call that does unwind, so I assumed the path without an unwind may be wrong:

yunikorn-core/pkg/scheduler/objects/application.go

Line 1285 in fe067b7

_ = node.RemoveAllocation(reqFit.GetAllocationKey())

Both cases seem to call TryNodeAllocation, so I am confused.

craigcondit · 2024-10-01T13:24:03Z

If you're sure this is correct, build a set of unit tests that prove it. Demonstrate that, without this patch, the failure case leaves the system in an inconsistent state. Then show that it operates correctly with the patch. There are no tests in this PR at all.

wilfred-s · 2024-10-02T01:19:48Z

-1 on this approach. I think the issue is way more basic. I opened a discussion on the jira as I see multiple issues in the way we now track allocations and asks.

zhuqi-lucas · 2024-10-02T01:58:39Z

Thanks @wilfred-s @craigcondit, I will try to add unit tests, check the comments, and look into the root cause of those issues.

craigcondit · 2024-10-10T16:35:10Z

The allocateAsk() call can only fail if the ask was previously allocated. This is checked for in all code paths that call allocateAsk() earlier, and always with the application lock held. Therefore, if we ever hit the error case here, we have a serious BUG elsewhere.
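The argument above can be sketched in a toy model: if the "already allocated" check and the allocating call sit inside the same critical section, the error branch is unreachable unless some other code mutates the state without the lock. The names `toyAsk`, `toyApp`, and `tryAllocate` are hypothetical and only illustrate the reasoning; the `allocateAsk` method here is a simplified stand-in, not the yunikorn-core implementation.

```go
package main

import (
	"errors"
	"sync"
)

// toyAsk is a hypothetical stand-in for a pending request.
type toyAsk struct {
	key       string
	allocated bool
}

// toyApp guards all of its asks with a single lock.
type toyApp struct {
	sync.Mutex
	asks map[string]*toyAsk
}

var errAlreadyAllocated = errors.New("ask already allocated")

// allocateAsk marks the ask as allocated. Its only failure mode is an ask
// that was allocated before. The caller must hold the lock.
func (a *toyApp) allocateAsk(r *toyAsk) error {
	if r.allocated {
		return errAlreadyAllocated
	}
	r.allocated = true
	return nil
}

// tryAllocate checks the allocated flag and calls allocateAsk under the
// same lock, so the flag cannot change between the check and the call.
func (a *toyApp) tryAllocate(key string) (bool, error) {
	a.Lock()
	defer a.Unlock()
	r, ok := a.asks[key]
	if !ok || r.allocated {
		return false, nil // skipped before allocateAsk is ever called
	}
	if err := a.allocateAsk(r); err != nil {
		return false, err // reaching this would indicate a bug elsewhere
	}
	return true, nil
}
```

In this sketch a second `tryAllocate` for the same key is skipped by the earlier check and returns no error, which is why an error from `allocateAsk` would point to a problem in some other code path rather than something to unwind here.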

craigcondit · 2024-10-10T19:52:02Z

Closing this after determining that the original issue is not a problem.

[YUNIKORN-2895] Don't add duplicated allocation to node when the allo…

5d1f2ba

…cation already allocated before.

zhuqi-lucas requested review from wilfred-s, pbacsko and craigcondit and removed request for pbacsko October 1, 2024 03:43

craigcondit requested changes Oct 1, 2024

craigcondit assigned zhuqi-lucas Oct 10, 2024

craigcondit closed this Oct 10, 2024