Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uhyve is very slow on AMD Machines #790

Open
jounathaen opened this issue Nov 12, 2024 · 10 comments
Open

Uhyve is very slow on AMD Machines #790

jounathaen opened this issue Nov 12, 2024 · 10 comments

Comments

@jounathaen
Copy link
Member

Example run:

https://github.com/hermit-os/kernel/actions/runs/11801739625/job/32875789502

  • Takes ~2:30 Min with 1 core
  • Takes ~11:00 Min with 4 cores
@n0toose
Copy link
Member

n0toose commented Nov 20, 2024

Uploaded the job-longs.txt for posterity and because GitHub requires you to login to view the logs:

logs_30798294207.zip
job-logs.txt

@n0toose
Copy link
Member

n0toose commented Nov 21, 2024

After removing the following line on my system (with its fair share of environment variables), I managed to gain a performance boost of 10 seconds (before, it was around ~14) with 20k-27k less hypercalls (depending on the log level).

This probably warrants some further investigation:

uhyve/src/vm.rs

Line 323 in d4bc7da

.envs(env::vars())

@mkroening
Copy link
Member

In release mode? This would be most surprising. 🤔

I'll have a look at the kernel to see if anything jumps on me.

@mkroening
Copy link
Member

mkroening commented Nov 22, 2024

Statistics from running a release-profile rusty_demo on hermit-os/kernel@430da84:

With env in FDT:

VM exits:                       total     cpu0 
  Hypercall(Exit):                  1        1 
  Hypercall(FileWrite):           385      385 
  Hypercall(Uart):              21692    21692 
  PCIRead:                       1043     1043 
  PCIWrite:                      1067     1067 
CPU runtimes:
  cpu 0: 1.127073035s

Without env in FDT:

Run statistics:
VM exits:                       total     cpu0 
  Hypercall(Exit):                  1        1 
  Hypercall(FileWrite):           150      150 
  Hypercall(Uart):               3252     3252 
  PCIRead:                       1043     1043 
  PCIWrite:                      1067     1067 
CPU runtimes:
  cpu 0: 534.278001ms

(@jounathaen it would be nice if the statistics would be sorted automatically, also there is a trailing space)

These numbers do make sense, since the output with the environment is a 21,000 characters longer than without, and we currently do one hypercall per character if I am not mistaken.

This does not explain the slowdown in CI, though, since the CI spends most of the time with number crunching and not with UART output.

Edit: for comparison, QEMU with KVM for rusty_demo without environment variables takes this long:

________________________________________________________
Executed in    1.33 secs      fish           external
   usr time    0.92 millis  916.00 micros    0.00 millis
   sys time  117.51 millis  424.00 micros  117.09 millis

@jounathaen
Copy link
Member Author

We definitely have to implement the buffered serial hypercall in Hermit.

(@jounathaen it would be nice if the statistics would be sorted automatically, also there is a trailing space)

By which measure do you want to sort them? I think alphabetically makes most sense to have the best comparability.

@mkroening
Copy link
Member

By which measure do you want to sort them? I think alphabetically makes most sense to have the best comparability.

Yeah, alphabetically is what I was thinking of. I had to sort them manually, the order was non-deterministic.

@mkroening
Copy link
Member

Firecracker is much faster on the same GitHub-hosted runner (dev profile):

Uhyve:

Measured time for 100ms sleep: 184.914ms

Calculating Pi (Sequential): 469.843ms

Calculating Pi (Parallel):   2.292218s

Matrix multiplication
seq row-major:	64x64 matrix: 6.1972733 s
seq z-order:	64x64 matrix: 5.092721 s
par z-order:	64x64 matrix: 5.624933 s
par strassen:	64x64 matrix: 5.5266623 s
speedup: 0.91x

Laplace iterations
1000 iterations: 170.188916s (residual: 0.0000000000000000000044790090308916814)

Firecracker:

Measured time for 100ms sleep: 100.147ms

Calculating Pi (Sequential): 836µs

Calculating Pi (Parallel):   3.06ms

Matrix multiplication
seq row-major:	64x64 matrix: 0.011902 s
seq z-order:	64x64 matrix: 0.009583 s
par z-order:	64x64 matrix: 0.009726 s
par strassen:	64x64 matrix: 0.009326 s
speedup: 0.99x

Laplace iterations
1000 iterations: 117.193ms (residual: 0.0000000000000000000044790090308916814)

Uhyve SMP:

Measured time for 100ms sleep: 206.062ms

Calculating Pi (Sequential): 534.138ms

Calculating Pi (Parallel):   5.323512s

Matrix multiplication
seq row-major:	64x64 matrix: 6.868419 s
seq z-order:	64x64 matrix: 5.602098 s
par z-order:	64x64 matrix: 3.658734 s
par strassen:	64x64 matrix: 4.7791476 s
speedup: 1.53x

Laplace iterations
1000 iterations: 863.561143s (residual: 0.0000000000000000000044790090308916814)

Firecracker SMP:

Measured time for 100ms sleep: 100.176ms

Calculating Pi (Sequential): 836µs

Calculating Pi (Parallel):   6.463ms

Matrix multiplication
seq row-major:	64x64 matrix: 0.011897 s
seq z-order:	64x64 matrix: 0.009555 s
par z-order:	64x64 matrix: 0.007463 s
par strassen:	64x64 matrix: 0.007992 s
speedup: 1.28x

Laplace iterations
1000 iterations: 1.385644s (residual: 0.0000000000000000000044790090308916814)

Maybe our interrupt handling is way slower?

For the record, these are the KVM capabilities, but I have not looked into whether they might be relevant here.

@jounathaen
Copy link
Member Author

Ok, we found out that all systems with bad performance are AMD systems. Furthermore, it seems to be related to the FPU and/or interrupts, as the hello_world demo apparently does not suffer any slowdown. A profiling of a slow run indicates some issues with the APIC, but that is very vague.

@jounathaen jounathaen changed the title Uhyve is very slow in Githubs CI Uhyve is very slow ~~in Githubs CI~~ on AMD Machines Jan 10, 2025
@jounathaen jounathaen changed the title Uhyve is very slow ~~in Githubs CI~~ on AMD Machines Uhyve is very slow on AMD Machines Jan 10, 2025
@mkroening
Copy link
Member

This is the difference in reported CPUID features when running on the same CPU with both QEMU and Uhyve for reference:

--- cpuid-qemu  2025-01-09 22:36:28.624696853 +0100
+++ cpuid-uhyve 2025-01-09 22:35:57.689579100 +0100
@@ -14,8 +14,8 @@
             stepping_id: 1,
             brand_index: 0,
             cflush_cache_line_size: 8,
-            initial_local_apic_id: 0,
-            max_logical_processor_ids: 0,
+            initial_local_apic_id: 98,
+            max_logical_processor_ids: 64,
             edx_ecx: SSE3 | PCLMULQDQ | SSSE3 | FMA | CMPXCHG16B | PCID | SSE41 | SSE42 | X2APIC | MOVBE | POPCNT | TSC_DEADLINE | AESNI | XSAVE | AVX | F16C | RDRAND | HYPERVISOR | FPU | VME | DE | PSE | TSC | MSR | PAE | MCE | CX8 | APIC | SEP | MTRR | PGE | MCA | CMOV | PAT | PSE36 | CLFSH | MMX | FXSR | SSE | SSE2,
         },
     ),
@@ -28,13 +28,13 @@
                 level: 1,
                 is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 2,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
-                associativity: 2,
-                sets: 512,
-                is_write_back_invalidate: true,
+                associativity: 8,
+                sets: 64,
+                is_write_back_invalidate: false,
                 is_inclusive: false,
                 has_complex_indexing: false,
             },
@@ -43,29 +43,29 @@
                 level: 1,
                 is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 2,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
-                associativity: 2,
-                sets: 512,
-                is_write_back_invalidate: true,
+                associativity: 8,
+                sets: 64,
+                is_write_back_invalidate: false,
                 is_inclusive: false,
                 has_complex_indexing: false,
             },
             CacheParameter {
                 cache_type: Unified,
                 level: 2,
-                is_self_initializing: false,
+                is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 2,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
-                associativity: 16,
-                sets: 512,
+                associativity: 8,
+                sets: 2048,
                 is_write_back_invalidate: false,
-                is_inclusive: false,
+                is_inclusive: true,
                 has_complex_indexing: false,
             },
             CacheParameter {
@@ -73,15 +73,15 @@
                 level: 3,
                 is_self_initializing: true,
                 is_fully_associative: false,
-                max_cores_for_cache: 1,
+                max_cores_for_cache: 8,
                 max_cores_for_package: 1,
                 coherency_line_size: 64,
                 physical_line_partitions: 1,
                 associativity: 16,
-                sets: 16384,
-                is_write_back_invalidate: false,
-                is_inclusive: true,
-                has_complex_indexing: true,
+                sets: 32768,
+                is_write_back_invalidate: true,
+                is_inclusive: false,
+                has_complex_indexing: false,
             },
         ],
     ),
@@ -89,8 +89,8 @@
         MonitorMwaitInfo {
             smallest_monitor_line: 0,
             largest_monitor_line: 0,
-            extensions_supported: true,
-            interrupts_as_break_event: true,
+            extensions_supported: false,
+            interrupts_as_break_event: false,
             supported_c0_states: 0,
             supported_c1_states: 0,
             supported_c2_states: 0,
@@ -135,22 +135,7 @@
     direct_cache_access_info: None,
     performance_monitoring_info: None,
     extended_topology_info: Some(
-        [
-            ExtendedTopologyLevel {
-                processors: 1,
-                number: 0,
-                type: SMT,
-                x2apic_id: 0,
-                next_apic_id: 0,
-            },
-            ExtendedTopologyLevel {
-                processors: 1,
-                number: 1,
-                type: Core,
-                x2apic_id: 0,
-                next_apic_id: 0,
-            },
-        ],
+        [],
     ),
     extended_state_info: Some(
         ExtendedStateInfo {
@@ -231,31 +216,31 @@
     extended_processor_and_feature_identifiers: Some(
         ExtendedProcessorFeatureIdentifiers {
             extended_signature: 10555153,
-            pkg_type: 0,
+            pkg_type: 4,
             brand_id: 0,
-            ecx_features: LAHF_SAHF | CMP_LEGACY | SVM | ALTMOVCR8 | LZCNT | SSE4A | MISALIGNSSE | PREFETCHW | OSVW | PERFCTREXT,
+            ecx_features: LAHF_SAHF | CMP_LEGACY | SVM | ALTMOVCR8 | LZCNT | SSE4A | MISALIGNSSE | PREFETCHW | OSVW | TOPEXT | PERFCTREXT,
             edx_features: SYSCALL_SYSRET | EXECUTE_DISABLE | MMXEXT | FFXSR | GIB_PAGES | RDTSCP | I64BIT_MODE,
         },
     ),
     processor_brand_string: Some(
         ProcessorBrandString {
-            as_str: "AMD EPYC 9354P 32-Core Processor",
+            as_str: "uhyve - unikernel hypervisor",
         },
     ),
     l1_cache_and_tlb_info: Some(
         L1CacheTlbInfo {
-            eax: 33489407,
-            ebx: 33489407,
-            ecx: 1073873216,
-            edx: 1073873216,
+            eax: 4282974016,
+            ebx: 4282974016,
+            ecx: 537395520,
+            edx: 537395520,
         },
     ),
     l2_l3_cache_and_tlb_info: Some(
         L2And3CacheTlbInfo {
-            eax: 0,
-            ebx: 1107313152,
-            ecx: 33587520,
-            edx: 8421696,
+            eax: 1543512576,
+            ebx: 1811956224,
+            ecx: 67133760,
+            edx: 134254912,
         },
     ),
     advanced_power_mgmt_info: Some(
@@ -263,12 +248,12 @@
             _eax: 0,
             ebx: (empty),
             ecx: 0,
-            edx: (empty),
+            edx: INVTSC,
         },
     ),
     processor_capacity_feature_info: Some(
         ProcessorCapacityAndFeatureInfo {
-            physical_address_bits: 48,
+            physical_address_bits: 52,
             linear_address_bits: 57,
             guest_physical_address_bits: 0,
             has_cl_zero: true,
@@ -282,9 +267,9 @@
             has_unsupported_efer_lmsle: false,
             has_invlpgb_nested: false,
             perf_tsc_size: 40,
-            apic_id_size: 0,
-            maximum_logical_processors: 1,
-            num_phys_threads: 1,
+            apic_id_size: 7,
+            maximum_logical_processors: 128,
+            num_phys_threads: 64,
             invlpgb_max_pages: 0,
             max_rdpru_id: 0,
         },
@@ -292,22 +277,22 @@
     svm_info: Some(
         SvmFeatures {
             eax: 1,
-            ebx: 16,
+            ebx: 8,
             _ecx: 0,
             edx: NP | LBR_VIRT | NRIPS | TSC_RATE_MSR | VMCB_CLEAN | PAUSE_FILTER | PAUSE_FILTER_THRESHOLD | VMSAVE_VIRT | VGIF,
         },
     ),
     tlb_1gb_page_info: Some(
         Tlb1gbPageInfo {
-            eax: 0,
-            ebx: 0,
+            eax: 4031311936,
+            ebx: 4030726144,
             _ecx: 0,
             _edx: 0,
         },
     ),
     performance_optimization_info: Some(
         PerformanceOptimizationInfo {
-            eax: (empty),
+            eax: MOVU | FP256,
             _ebx: 0,
             _ecx: 0,
             _edx: 0,
@@ -324,10 +309,10 @@
     ),
     memory_encryption_info: Some(
         MemoryEncryptionInfo {
-            eax: (empty),
-            ebx: 0,
-            ecx: 0,
-            edx: 0,
+            eax: SEV | SEV_ES | HWENFCACHECOH,
+            ebx: 51,
+            ecx: 1006,
+            edx: 11,
         },
     ),
 }

@mkroening
Copy link
Member

We can retrieve which CPUID features KVM is emulating instead of providing natively with (KVM_GET_EMULATED_CPUID):

    let cpuid = KVM.get_emulated_cpuid(KVM_MAX_CPUID_ENTRIES).unwrap();
    dbg!(cpuid.as_slice());

On an affected system, this gives:

[src/linux/x86_64/kvm_cpu.rs:146:3] cpuid_entries = [
    kvm_cpuid_entry2 {
        function: 0,
        index: 0,
        flags: 0,
        eax: 7,
        ebx: 0,
        ecx: 0,
        edx: 0,
        padding: [
            0,
            0,
            0,
        ],
    },
    kvm_cpuid_entry2 {
        function: 1,
        index: 0,
        flags: 0,
        eax: 0,
        ebx: 0,
        ecx: 1 << 22,
        edx: 0,
        padding: [
            0,
            0,
            0,
        ],
    },
    kvm_cpuid_entry2 {
        function: 7,
        index: 0,
        flags: 1,
        eax: 0,
        ebx: 0,
        ecx: 1 << 22,
        edx: 0,
        padding: [
            0,
            0,
            0,
        ],
    },
]

On an unaffected system, the result is the same, unfortunately.

Still, we can decode these using Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-L:

  1. This entry says that 7 is the Maximum Input Value for Basic CPUID Information.
  2. This entry corresponds to the MOVBE instruction (bit 21 would be x2apic 👀).
  3. This entry corresponds to RDPID and IA32_TSC_AUX are available if 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants