Jg ironic rebalance #5

Open
wants to merge 5 commits into stackhpc/train

Conversation

JohnGarbutt (Member)

Cherry-pick of this onto stable/train: https://review.opendev.org/#/c/695189/6

Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphan compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/

Co-Authored-By: Matt Riedemann <[email protected]>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
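
A minimal sketch of the eviction idea described above, assuming a cache keyed
by node UUID (the function and variable names are hypothetical, not nova's
actual resource tracker API):

    def remove_deleted_compute_nodes(cached_nodes, db_node_uuids):
        """Remove cache entries with no matching compute node in the database.

        cached_nodes: dict mapping node UUID -> cached compute node object
        db_node_uuids: set of node UUIDs that currently exist in the database
        """
        stale = set(cached_nodes) - set(db_node_uuids)
        for uuid in stale:
            del cached_nodes[uuid]
        return stale


    # Example: the record for "node-1" was deleted by another host, so its
    # cache entry is dropped and the node can be recreated on the next
    # update_available_resource pass.
    cache = {"node-1": object(), "node-2": object()}
    removed = remove_deleted_compute_nodes(cache, db_node_uuids={"node-2"})
    print("evicted from cache:", removed)   # {'node-1'}
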
There is a race condition in nova-compute with the ironic virt driver
as nodes get rebalanced. It can lead to compute nodes being removed in
the DB and not repopulated. Ultimately this prevents these nodes from
being scheduled to.

The issue being addressed here is that if a compute node is deleted by a
host which thinks it is an orphan, then the resource provider for that
node might also be deleted. The compute host that owns the node might
not recreate the resource provider if it exists in the provider tree
cache.

This change fixes the issue by clearing resource providers from the
provider tree cache for which a compute node entry does not exist. Then,
when the available resource for the node is updated, the resource
providers are not found in the cache and get recreated in placement.

Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
Related-Bug: #1853009
Related-Bug: #1841481
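
The same eviction idea, sketched for a toy provider tree rather than nova's
actual ProviderTree class; root providers stand in for compute nodes, and
child providers for nested providers beneath them:

    def evict_stale_providers(provider_parents, compute_node_uuids):
        """provider_parents: dict of provider UUID -> parent UUID (None for roots).

        Root providers correspond to compute nodes.  Evict roots whose compute
        node no longer exists, plus any child providers nested under them.
        """
        stale = {
            uuid for uuid, parent in provider_parents.items()
            if parent is None and uuid not in compute_node_uuids
        }
        # Walk down and collect descendants of the stale roots.
        changed = True
        while changed:
            changed = False
            for uuid, parent in provider_parents.items():
                if parent in stale and uuid not in stale:
                    stale.add(uuid)
                    changed = True
        for uuid in stale:
            del provider_parents[uuid]
        return stale


    tree = {"cn-1": None, "cn-1-pgpu": "cn-1", "cn-2": None}
    # cn-1's compute node is gone, so cn-1 and its child are evicted.
    print(evict_stale_providers(tree, compute_node_uuids={"cn-2"}))
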
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resource in
the compute manager. When the list of compute nodes is queried, the host
may find a compute node that it does not expect to be managing, i.e. an
orphan. Between that time and deleting the orphan, the real owner of the
compute node takes ownership of it (in the resource tracker). However,
the node is still deleted, as the first host is unaware of the ownership
change.

This change prevents the race by filtering on the host when deleting a
compute node. If another compute host has taken ownership of the node,
it will have updated the host field, so the delete matches no record.
The first host detects this via the ComputeHostNotFound exception and
avoids deleting the node's resource provider.

Closes-Bug: #1853009
Related-Bug: #1841481
Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
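
A hedged sketch of the host-filtered delete using the standard library's
sqlite3 module (nova itself goes through SQLAlchemy; the table and exception
here are simplified stand-ins):

    import sqlite3

    class ComputeHostNotFound(Exception):
        """Raised when the node is no longer owned by the deleting host."""

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE compute_nodes (uuid TEXT PRIMARY KEY, host TEXT)")
    conn.execute("INSERT INTO compute_nodes VALUES ('node-1', 'host2')")

    def compute_node_delete(conn, node_uuid, host):
        # Deleting on (uuid AND host) means a delete issued by a host that
        # has lost ownership matches zero rows, and the caller can detect
        # that and skip cleaning up the resource provider.
        cur = conn.execute(
            "DELETE FROM compute_nodes WHERE uuid = ? AND host = ?",
            (node_uuid, host))
        if cur.rowcount == 0:
            raise ComputeHostNotFound(
                f"{node_uuid} is not owned by {host}; skipping delete")
        conn.commit()

    # host1 takes ownership before host2 gets around to deleting its "orphan".
    conn.execute("UPDATE compute_nodes SET host = 'host1' WHERE uuid = 'node-1'")

    try:
        compute_node_delete(conn, "node-1", "host2")   # stale owner
    except ComputeHostNotFound as exc:
        print("delete skipped:", exc)

    compute_node_delete(conn, "node-1", "host1")       # real owner succeeds
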
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function that keeps the transaction scoped to the insert, so the
transaction has finished before _compute_node_get_and_update_deleted is
called.

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
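
A toy illustration of the transaction-scoping fix, using a context manager as
a stand-in for pick_context_manager_writer (not nova's oslo.db code); the
point is that the duplicate-handling path only runs after the first
transaction has fully exited:

    from contextlib import contextmanager

    @contextmanager
    def writer_transaction():
        """Stand-in for a pick_context_manager_writer transaction scope."""
        print("BEGIN")
        try:
            yield
            print("COMMIT")
        except Exception:
            print("ROLLBACK")
            raise

    class DBDuplicateEntry(Exception):
        pass

    def _compute_node_create(values, existing):
        """Insert inside its own transaction; fails on a duplicate UUID."""
        with writer_transaction():
            if values["uuid"] in existing:
                raise DBDuplicateEntry(values["uuid"])
            existing[values["uuid"]] = values

    def compute_node_create(values, existing):
        """Undecorated wrapper: the retry runs after the first transaction ends."""
        try:
            _compute_node_create(values, existing)
        except DBDuplicateEntry:
            # The first transaction has fully rolled back by now, so the
            # restore-the-soft-deleted-record path can safely begin a new one.
            with writer_transaction():
                existing[values["uuid"]]["deleted"] = False
        return existing[values["uuid"]]

    nodes = {"node-1": {"uuid": "node-1", "deleted": True}}
    compute_node_create({"uuid": "node-1", "deleted": False}, nodes)
    print(nodes["node-1"])   # the soft-deleted record has been restored
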