Uhyve is very slow on AMD Machines #790

jounathaen · 2024-11-12T17:37:26Z

Example run:

https://github.com/hermit-os/kernel/actions/runs/11801739625/job/32875789502

Takes ~2:30 Min with 1 core
Takes ~11:00 Min with 4 cores

n0toose · 2024-11-20T17:38:48Z

Uploaded the job-longs.txt for posterity and because GitHub requires you to login to view the logs:

n0toose · 2024-11-21T21:35:27Z

After removing the following line on my system (with its fair share of environment variables), I managed to gain a performance boost of 10 seconds (before, it was around ~14) with 20k-27k less hypercalls (depending on the log level).

This probably warrants some further investigation:

uhyve/src/vm.rs

Line 323 in d4bc7da

.envs(env::vars())

mkroening · 2024-11-22T07:39:11Z

In release mode? This would be most surprising. 🤔

I'll have a look at the kernel to see if anything jumps on me.

mkroening · 2024-11-22T09:26:09Z

Statistics from running a release-profile rusty_demo on hermit-os/kernel@430da84:

With env in FDT:

VM exits:                       total     cpu0 
  Hypercall(Exit):                  1        1 
  Hypercall(FileWrite):           385      385 
  Hypercall(Uart):              21692    21692 
  PCIRead:                       1043     1043 
  PCIWrite:                      1067     1067 
CPU runtimes:
  cpu 0: 1.127073035s

Without env in FDT:

Run statistics:
VM exits:                       total     cpu0 
  Hypercall(Exit):                  1        1 
  Hypercall(FileWrite):           150      150 
  Hypercall(Uart):               3252     3252 
  PCIRead:                       1043     1043 
  PCIWrite:                      1067     1067 
CPU runtimes:
  cpu 0: 534.278001ms

(@jounathaen it would be nice if the statistics would be sorted automatically, also there is a trailing space)

These numbers do make sense, since the output with the environment is a 21,000 characters longer than without, and we currently do one hypercall per character if I am not mistaken.

This does not explain the slowdown in CI, though, since the CI spends most of the time with number crunching and not with UART output.

Edit: for comparison, QEMU with KVM for rusty_demo without environment variables takes this long:

________________________________________________________
Executed in    1.33 secs      fish           external
   usr time    0.92 millis  916.00 micros    0.00 millis
   sys time  117.51 millis  424.00 micros  117.09 millis

jounathaen · 2024-11-22T09:40:30Z

We definitely have to implement the buffered serial hypercall in Hermit.

(@jounathaen it would be nice if the statistics would be sorted automatically, also there is a trailing space)

By which measure do you want to sort them? I think alphabetically makes most sense to have the best comparability.

mkroening · 2024-11-22T12:22:36Z

By which measure do you want to sort them? I think alphabetically makes most sense to have the best comparability.

Yeah, alphabetically is what I was thinking of. I had to sort them manually, the order was non-deterministic.

mkroening · 2024-12-11T14:13:45Z

Firecracker is much faster on the same GitHub-hosted runner (dev profile):

Uhyve:

Measured time for 100ms sleep: 184.914ms

Calculating Pi (Sequential): 469.843ms

Calculating Pi (Parallel):   2.292218s

Matrix multiplication
seq row-major:	64x64 matrix: 6.1972733 s
seq z-order:	64x64 matrix: 5.092721 s
par z-order:	64x64 matrix: 5.624933 s
par strassen:	64x64 matrix: 5.5266623 s
speedup: 0.91x

Laplace iterations
1000 iterations: 170.188916s (residual: 0.0000000000000000000044790090308916814)

Firecracker:

Measured time for 100ms sleep: 100.147ms

Calculating Pi (Sequential): 836µs

Calculating Pi (Parallel):   3.06ms

Matrix multiplication
seq row-major:	64x64 matrix: 0.011902 s
seq z-order:	64x64 matrix: 0.009583 s
par z-order:	64x64 matrix: 0.009726 s
par strassen:	64x64 matrix: 0.009326 s
speedup: 0.99x

Laplace iterations
1000 iterations: 117.193ms (residual: 0.0000000000000000000044790090308916814)

Uhyve SMP:

Measured time for 100ms sleep: 206.062ms

Calculating Pi (Sequential): 534.138ms

Calculating Pi (Parallel):   5.323512s

Matrix multiplication
seq row-major:	64x64 matrix: 6.868419 s
seq z-order:	64x64 matrix: 5.602098 s
par z-order:	64x64 matrix: 3.658734 s
par strassen:	64x64 matrix: 4.7791476 s
speedup: 1.53x

Laplace iterations
1000 iterations: 863.561143s (residual: 0.0000000000000000000044790090308916814)

Firecracker SMP:

Measured time for 100ms sleep: 100.176ms

Calculating Pi (Sequential): 836µs

Calculating Pi (Parallel):   6.463ms

Matrix multiplication
seq row-major:	64x64 matrix: 0.011897 s
seq z-order:	64x64 matrix: 0.009555 s
par z-order:	64x64 matrix: 0.007463 s
par strassen:	64x64 matrix: 0.007992 s
speedup: 1.28x

Laplace iterations
1000 iterations: 1.385644s (residual: 0.0000000000000000000044790090308916814)

Maybe our interrupt handling is way slower?

For the record, these are the KVM capabilities, but I have not looked into whether they might be relevant here.

jounathaen · 2025-01-10T10:14:43Z

Ok, we found out that all systems with bad performance are AMD systems. Furthermore, it seems to be related to the FPU and/or interrupts, as the hello_world demo apparently does not suffer any slowdown. A profiling of a slow run indicates some issues with the APIC, but that is very vague.

mkroening · 2025-01-10T10:32:12Z

This is the difference in reported CPUID features when running on the same CPU with both QEMU and Uhyve for reference:

--- cpuid-qemu  2025-01-09 22:36:28.624696853 +0100
+++ cpuid-uhyve 2025-01-09 22:35:57.689579100 +0100
@@ -14,8 +14,8 @@
             stepping_id: 1,
             brand_index: 0,
             cflush_cache_line_size: 8,
-            initial_local_apic_id: 0,
-            max_logical_processor_ids: 0,
+            initial_local_apic_id: 98,
+            max_logical_processor_ids: 64,
             edx_ecx: SSE3 | PCLMULQDQ | SSSE3 | FMA | CMPXCHG16B | PCID | SSE41 | SSE42 | X2APIC | MOVBE | POPCNT | TSC_DEADLINE | AESNI | XSAVE | AVX | F16C | RDRAND | HYPERVISOR | FPU | VME | DE | PSE | TSC | MSR | PAE | MCE | CX8 | APIC | SEP | MTRR | PGE | MCA | CMOV | PAT | PSE36 | CLFSH | MMX | FXSR | SSE | SSE2,
         },
     ),
@@ -28,13 +28,13 @@
                 level: 1,
                 is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 2,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
-                associativity: 2,
-                sets: 512,
-                is_write_back_invalidate: true,
+                associativity: 8,
+                sets: 64,
+                is_write_back_invalidate: false,
                 is_inclusive: false,
                 has_complex_indexing: false,
             },
@@ -43,29 +43,29 @@
                 level: 1,
                 is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 2,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
-                associativity: 2,
-                sets: 512,
-                is_write_back_invalidate: true,
+                associativity: 8,
+                sets: 64,
+                is_write_back_invalidate: false,
                 is_inclusive: false,
                 has_complex_indexing: false,
             },
             CacheParameter {
                 cache_type: Unified,
                 level: 2,
-                is_self_initializing: false,
+                is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 2,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
-                associativity: 16,
-                sets: 512,
+                associativity: 8,
+                sets: 2048,
                 is_write_back_invalidate: false,
-                is_inclusive: false,
+                is_inclusive: true,
                 has_complex_indexing: false,
             },
             CacheParameter {
@@ -73,15 +73,15 @@
                 level: 3,
                 is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 8,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
                 associativity: 16,
-                sets: 16384,
-                is_write_back_invalidate: false,
-                is_inclusive: true,
-                has_complex_indexing: true,
+                sets: 32768,
+                is_write_back_invalidate: true,
+                is_inclusive: false,
+                has_complex_indexing: false,
             },
         ],
     ),
@@ -89,8 +89,8 @@
         MonitorMwaitInfo {
             smallest_monitor_line: 0,
             largest_monitor_line: 0,
-            extensions_supported: true,
-            interrupts_as_break_event: true,
+            extensions_supported: false,
+            interrupts_as_break_event: false,
             supported_c0_states: 0,
             supported_c1_states: 0,
             supported_c2_states: 0,
@@ -135,22 +135,7 @@
     direct_cache_access_info: None,
     performance_monitoring_info: None,
     extended_topology_info: Some(
-        [
-            ExtendedTopologyLevel {
-                processors: 1,
-                number: 0,
-                type: SMT,
-                x2apic_id: 0,
-                next_apic_id: 0,
-            },
-            ExtendedTopologyLevel {
-                processors: 1,
-                number: 1,
-                type: Core,
-                x2apic_id: 0,
-                next_apic_id: 0,
-            },
-        ],
+        [],
     ),
     extended_state_info: Some(
         ExtendedStateInfo {
@@ -231,31 +216,31 @@
     extended_processor_and_feature_identifiers: Some(
         ExtendedProcessorFeatureIdentifiers {
             extended_signature: 10555153,
-            pkg_type: 0,
+            pkg_type: 4,
             brand_id: 0,
-            ecx_features: LAHF_SAHF | CMP_LEGACY | SVM | ALTMOVCR8 | LZCNT | SSE4A | MISALIGNSSE | PREFETCHW | OSVW | PERFCTREXT,
+            ecx_features: LAHF_SAHF | CMP_LEGACY | SVM | ALTMOVCR8 | LZCNT | SSE4A | MISALIGNSSE | PREFETCHW | OSVW | TOPEXT | PERFCTREXT,
             edx_features: SYSCALL_SYSRET | EXECUTE_DISABLE | MMXEXT | FFXSR | GIB_PAGES | RDTSCP | I64BIT_MODE,
         },
     ),
     processor_brand_string: Some(
         ProcessorBrandString {
-            as_str: "AMD EPYC 9354P 32-Core Processor",
+            as_str: "uhyve - unikernel hypervisor",
         },
     ),
     l1_cache_and_tlb_info: Some(
         L1CacheTlbInfo {
-            eax: 33489407,
-            ebx: 33489407,
-            ecx: 1073873216,
-            edx: 1073873216,
+            eax: 4282974016,
+            ebx: 4282974016,
+            ecx: 537395520,
+            edx: 537395520,
         },
     ),
     l2_l3_cache_and_tlb_info: Some(
         L2And3CacheTlbInfo {
-            eax: 0,
-            ebx: 1107313152,
-            ecx: 33587520,
-            edx: 8421696,
+            eax: 1543512576,
+            ebx: 1811956224,
+            ecx: 67133760,
+            edx: 134254912,
         },
     ),
     advanced_power_mgmt_info: Some(
@@ -263,12 +248,12 @@
             _eax: 0,
             ebx: (empty),
             ecx: 0,
-            edx: (empty),
+            edx: INVTSC,
         },
     ),
     processor_capacity_feature_info: Some(
         ProcessorCapacityAndFeatureInfo {
-            physical_address_bits: 48,
+            physical_address_bits: 52,
             linear_address_bits: 57,
             guest_physical_address_bits: 0,
             has_cl_zero: true,
@@ -282,9 +267,9 @@
             has_unsupported_efer_lmsle: false,
             has_invlpgb_nested: false,
             perf_tsc_size: 40,
-            apic_id_size: 0,
-            maximum_logical_processors: 1,
-            num_phys_threads: 1,
+            apic_id_size: 7,
+            maximum_logical_processors: 128,
+            num_phys_threads: 64,
             invlpgb_max_pages: 0,
             max_rdpru_id: 0,
         },
@@ -292,22 +277,22 @@
     svm_info: Some(
         SvmFeatures {
             eax: 1,
-            ebx: 16,
+            ebx: 8,
             _ecx: 0,
             edx: NP | LBR_VIRT | NRIPS | TSC_RATE_MSR | VMCB_CLEAN | PAUSE_FILTER | PAUSE_FILTER_THRESHOLD | VMSAVE_VIRT | VGIF,
         },
     ),
     tlb_1gb_page_info: Some(
         Tlb1gbPageInfo {
-            eax: 0,
-            ebx: 0,
+            eax: 4031311936,
+            ebx: 4030726144,
             _ecx: 0,
             _edx: 0,
         },
     ),
     performance_optimization_info: Some(
         PerformanceOptimizationInfo {
-            eax: (empty),
+            eax: MOVU | FP256,
             _ebx: 0,
             _ecx: 0,
             _edx: 0,
@@ -324,10 +309,10 @@
     ),
     memory_encryption_info: Some(
         MemoryEncryptionInfo {
-            eax: (empty),
-            ebx: 0,
-            ecx: 0,
-            edx: 0,
+            eax: SEV | SEV_ES | HWENFCACHECOH,
+            ebx: 51,
+            ecx: 1006,
+            edx: 11,
         },
     ),
 }

mkroening · 2025-01-10T10:44:26Z

We can retrieve which CPUID features KVM is emulating instead of providing natively with (KVM_GET_EMULATED_CPUID):

    let cpuid = KVM.get_emulated_cpuid(KVM_MAX_CPUID_ENTRIES).unwrap();
    dbg!(cpuid.as_slice());

On an affected system, this gives:

[src/linux/x86_64/kvm_cpu.rs:146:3] cpuid_entries = [
    kvm_cpuid_entry2 {
        function: 0,
        index: 0,
        flags: 0,
        eax: 7,
        ebx: 0,
        ecx: 0,
        edx: 0,
        padding: [
            0,
            0,
            0,
        ],
    },
    kvm_cpuid_entry2 {
        function: 1,
        index: 0,
        flags: 0,
        eax: 0,
        ebx: 0,
        ecx: 1 << 22,
        edx: 0,
        padding: [
            0,
            0,
            0,
        ],
    },
    kvm_cpuid_entry2 {
        function: 7,
        index: 0,
        flags: 1,
        eax: 0,
        ebx: 0,
        ecx: 1 << 22,
        edx: 0,
        padding: [
            0,
            0,
            0,
        ],
    },
]

On an unaffected system, the result is the same, unfortunately.

Still, we can decode these using Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-L:

This entry says that 7 is the Maximum Input Value for Basic CPUID Information.
This entry corresponds to the MOVBE instruction (bit 21 would be x2apic 👀).
This entry corresponds to RDPID and IA32_TSC_AUX are available if 1.

jounathaen changed the title ~~Uhyve is very slow in Githubs CI~~ Uhyve is very slow ~~in Githubs CI~~ on AMD Machines Jan 10, 2025

jounathaen changed the title ~~Uhyve is very slow ~~in Githubs CI~~ on AMD Machines~~ Uhyve is very slow on AMD Machines Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uhyve is very slow on AMD Machines #790

Uhyve is very slow on AMD Machines #790

jounathaen commented Nov 12, 2024

n0toose commented Nov 20, 2024

n0toose commented Nov 21, 2024

mkroening commented Nov 22, 2024

mkroening commented Nov 22, 2024 •

edited

Loading

jounathaen commented Nov 22, 2024

mkroening commented Nov 22, 2024

mkroening commented Dec 11, 2024

jounathaen commented Jan 10, 2025

mkroening commented Jan 10, 2025

mkroening commented Jan 10, 2025

Uhyve is very slow on AMD Machines #790

Uhyve is very slow on AMD Machines #790

Comments

jounathaen commented Nov 12, 2024

n0toose commented Nov 20, 2024

n0toose commented Nov 21, 2024

mkroening commented Nov 22, 2024

mkroening commented Nov 22, 2024 • edited Loading

jounathaen commented Nov 22, 2024

mkroening commented Nov 22, 2024

mkroening commented Dec 11, 2024

jounathaen commented Jan 10, 2025

mkroening commented Jan 10, 2025

mkroening commented Jan 10, 2025

mkroening commented Nov 22, 2024 •

edited

Loading