Add private pool and keepalive test to VMM, device tree updates #740
base: main
Conversation
Functionally it seems fine, but I'd like to get others' thoughts on exposing such a raw protocol detail in high-level APIs. It doesn't sit well with me; I think we probably need to abstract it a bit more.
@@ -12,6 +13,7 @@ use mesh::rpc::RpcSend;
pub async fn service_underhill(
    vm_send: &mesh::Sender<VmRpc>,
    send: &mesh::Sender<GuestEmulationRequest>,
    flags: SaveGuestVtl2StateFlags,
To me, this feels like a layering/abstraction violation: it seems odd to specify the raw GET protocol in such a high-level function. What are other people's thoughts on this?
Hmmm, let me think; it didn't feel that way to me...
At least to me, these properties seem device-specific. We generally try to abstract things at the VMM level so that we don't even refer to Hyper-V-specific concepts, so it seems really wrong to use a device-specific protocol crate here.
I.e., should we add some capabilities field to wherever GuestEmulationRequest lives?
How do we anticipate this struct changing in the future? Currently it only has the one bit, are we thinking we'll be adding many more? What might they be?
Or should we be exposing something more like an nvme_keepalive: bool here? Or maybe a unified device_keepalive that will also control MANA someday? Do we anticipate any scenarios where one will be kept alive and the other won't?
This is a weird function, since from what I understand it's only really used in OpenVMM and petri, right?
I wouldn't be so sure that this is a layering violation in and of itself.
I'd be more concerned if we saw these sorts of concepts leaking into various "core" OpenVMM and OpenHCL codepaths, but I don't think this helper falls under that bucket.
Our vision is that more bits will be added in the future. For now we could certainly declare a bool, but having such a specific NVMe-related bool feels less intuitive than a bitmask. Also, work has already started on MANA keepalive, so more changes are coming.
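The bitmask approach described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual OpenVMM definition: the type name mirrors `SaveGuestVtl2StateFlags` from the diff, but the bit layout and builder methods are assumptions.

```rust
/// Hypothetical sketch of a bitmask-style flags type. New capabilities
/// (e.g. a MANA keepalive bit) can be added later without changing any
/// function signatures that accept this type.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq)]
pub struct SaveGuestVtl2StateFlags(u64);

impl SaveGuestVtl2StateFlags {
    // Bit position is an assumption for illustration only.
    const NVME_KEEPALIVE: u64 = 1 << 0;
    // Future bits would be added here, e.g. MANA_KEEPALIVE = 1 << 1.

    pub fn new() -> Self {
        Self(0)
    }

    /// Builder-style setter, matching the `with_enable_nvme_keepalive`
    /// call seen in the test snippet below.
    pub fn with_enable_nvme_keepalive(self, enable: bool) -> Self {
        if enable {
            Self(self.0 | Self::NVME_KEEPALIVE)
        } else {
            Self(self.0 & !Self::NVME_KEEPALIVE)
        }
    }

    pub fn enable_nvme_keepalive(&self) -> bool {
        self.0 & Self::NVME_KEEPALIVE != 0
    }
}
```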
@@ -494,6 +495,12 @@ fn build_device_tree(
        .end_node()?;
    }

// Indicate that NVMe keep-alive feature is supported by this VMM.
Don't we want this to be configurable, or do we want to always advertise that we can do this?
It follows the current Windows-side implementation. This particular property is always present starting with a specific OS version; other properties are configurable.
Keep-alive is actually triggered only when both this property exists and capabilities_flags allows it.
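The two-condition gating described here could be sketched as below. All names and the capability bit position are hypothetical; the real check lives in OpenHCL's servicing path, not in a free function like this.

```rust
/// Hypothetical sketch: NVMe keep-alive only engages when the device
/// tree advertises VMM support *and* the host's capability flags allow
/// it. Either one alone is not sufficient.
fn nvme_keepalive_enabled(dt_advertises_support: bool, capabilities_flags: u64) -> bool {
    // Bit position is an assumption for illustration only.
    const CAP_NVME_KEEPALIVE: u64 = 1 << 0;
    dt_advertises_support && (capabilities_flags & CAP_NVME_KEEPALIVE != 0)
}
```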
I think we want the ability to test with hosts that do not support keepalive, so that we can make sure the old teardown/restart flow also works, since we have hosts that will operate in that mode too. It seems like this should be configurable?
If we want some sort of live-migration downgrade test then I guess it should be configurable.
The original meaning of this property is to tell VTL2 that this specific version of the VMM (OpenVMM) does support keepalive.
E.g. it is also unconditionally hardcoded in the AH2025 worker process, but doesn't exist in anything older than AH2025.
@@ -172,7 +173,11 @@ impl PetriVmOpenVmm {
    );
    petri_vm_fn!(
        /// Restarts OpenHCL.
        pub async fn restart_openhcl(&mut self, new_openhcl: ArtifactHandle<impl petri_artifacts_common::tags::IsOpenhclIgvm>) -> anyhow::Result<()>
        pub async fn restart_openhcl(
Is it just me, or is there a weird spacing/line ending here?
This is a macro, so the clippy/fmt task ignores it. After adding another parameter the line would be too long. @smalis-msft, what do you think?
Ahh right, it's in a macro... I think GitHub is just rendering it weirdly for me then.
@@ -12,6 +12,7 @@ use anyhow::Context;
use async_trait::async_trait;
use futures::FutureExt;
use futures_concurrency::future::Race;
use get_protocol::SaveGuestVtl2StateFlags;
Again, this seems like an abstraction/layering violation.
Would you prefer test-specific definitions similar to declare_artifacts / LATEST_LINUX_DIRECT_TEST_X64, for example?
Leaking this protocol type into the user-facing petri API certainly seems less palatable to me vs. the other use case in the hvlite_helper crate...
I would suggest having restart_openhcl accept a more petri-specific OpenhclServicingFlags, or something like that, rather than exposing this low-level protocol type directly.
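The suggested petri-level wrapper might look something like this sketch. `OpenhclServicingFlags` is the name proposed in the comment, but the fields and the conversion are assumptions; in the real code the conversion would target the protocol crate's flags type rather than a bare u64.

```rust
/// Hypothetical petri-facing flags: test authors set plain bools, and
/// only this conversion layer knows about the GET wire representation.
#[derive(Copy, Clone, Debug, Default)]
pub struct OpenhclServicingFlags {
    /// Keep the NVMe driver (and its private pool) alive across servicing.
    pub enable_nvme_keepalive: bool,
}

impl OpenhclServicingFlags {
    /// Lower the high-level flags to a raw wire value. The bit position
    /// is an assumption for illustration only.
    pub fn into_wire(self) -> u64 {
        let mut raw = 0u64;
        if self.enable_nvme_keepalive {
            raw |= 1 << 0;
        }
        raw
    }
}
```

With this shape, tests and petri callers never import the get_protocol crate directly.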
openhcl_servicing_core(
    config,
    LATEST_LINUX_DIRECT_TEST_X64,
    SaveGuestVtl2StateFlags::new().with_enable_nvme_keepalive(false),
No need to specify false; that's what new() does.
@@ -160,7 +160,7 @@ pub mod ged {
    /// Wait for VTL2 to start VTL0.
    WaitForVtl0Start(Rpc<(), Result<(), Vtl0StartError>>),
    /// Save VTL2 state.
    SaveGuestVtl2State(Rpc<(), Result<(), SaveRestoreError>>),
    SaveGuestVtl2State(Rpc<u64, Result<(), SaveRestoreError>>),
Please define a proper struct (bitfield struct?) for this payload, vs. sending a raw (and presently undocumented) u64 payload.
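A documented payload type for the RPC, as requested above, could be sketched like this. The type name, field, and bit layout are all hypothetical; the point is only that the u64 in `Rpc<u64, ...>` gains a self-describing definition with explicit conversions.

```rust
/// Hypothetical documented payload for SaveGuestVtl2State, replacing a
/// bare u64. Each bit gets a named, documented field.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct GuestServicingFlags {
    /// Keep the NVMe driver alive across the save/restore cycle.
    pub nvme_keepalive: bool,
}

impl GuestServicingFlags {
    /// Decode from the raw wire representation (bit 0 is an assumption).
    pub fn from_bits(bits: u64) -> Self {
        Self {
            nvme_keepalive: bits & 1 != 0,
        }
    }

    /// Encode to the raw wire representation.
    pub fn into_bits(self) -> u64 {
        self.nvme_keepalive as u64
    }
}
```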
Start bringing up missing coverage for private pool and NVMe keepalive in the VMM test suite.
This is not a complete end-to-end test yet, but it brings the necessary infrastructure changes.
Update device tree properties to sync with Windows host changes.