Performance of components build with cargo-component #2980

Open
radu-matei opened this issue Jan 14, 2025 · 3 comments

@radu-matei
Member

In the application linked in the repo above, I have two HTTP components:

  • /rust -- a Rust HTTP handler that directly uses a Rust dependency to run a face detection algorithm on the request body
  • /... -- this is also a Rust HTTP handler, but imports a Wasm component through component dependencies and executes the same logic

I am seeing a significant performance difference when sending the same request to those two handlers:

# calling the HTTP handler that runs the process directly
$ time curl http://localhost:3000/rust --data-binary @grace_hopper.jpg
0.01s user 0.01s system 0% cpu 2.204 total

# calling the HTTP handler that uses component dependencies
$ time curl http://localhost:3000 --data-binary @grace_hopper.jpg
0.01s user 0.01s system 0% cpu 5.537 total

The performance difference here is pretty big (roughly 2.5x: 5.5 s vs. 2.2 s total); any thoughts on what might be going on?

Thanks!

@radu-matei
Member Author

cc @alexcrichton, who's been doing some digging here.

(TL;DR: the main issue comes from the optimization level of the component itself, which was left at the default from the cargo-component template: https://github.com/radu-matei/spin-yolo/blob/92910318f050a818a4106df9e5acf40adbafb4f8/lib/Cargo.toml#L16)

@lann changed the title from "Performance of request going through component dependency" to "Performance of components build with cargo-component" on Jan 15, 2025

@alexcrichton
Contributor

Ah yes, the findings I've got so far are:

  • The original major performance difference is due to the component-dependencies version being built with opt-level = "s" while the "direct-rust" version uses opt-level = "3". As @radu-matei mentioned, that comes from cargo component new defaulting to opt-level = "s".
  • To some extent this is still pretty far off from native performance, so some more performance gaps I've found are:
    • Spin enables epoch interruption by default (e.g. for timeouts and time-slicing), and that adds significant overhead on AArch64 macOS M2 (which I think @radu-matei is on). I measured less relative overhead on an x64 machine I have.
    • Wasmtime has Spectre mitigations for table accesses on by default, which hurts AArch64 performance far more than x64.
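
For reference, the opt-level finding in the first bullet corresponds to a profile setting along these lines in the component's Cargo.toml. This is only a minimal sketch: it assumes the setting lives under [profile.release] (the thread only confirms the opt-level value itself), and any other keys the template generates are left out.

```toml
# lib/Cargo.toml (the component pulled in as a dependency)
# Assumption: the template puts this under [profile.release].
[profile.release]
# Template default reported in this thread: opt-level = "s" (optimize for size).
# Optimizing for speed instead matches the "direct-rust" handler:
opt-level = 3
```

After changing this, the component has to be rebuilt in release mode for the new optimization level to take effect.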

If epochs and Spectre mitigations are disabled, then the performance is relatively close to native, or about what you might expect from wasm's overhead. There's still some subpar instruction selection in Cranelift that may be possible to improve on the x64 side of things. Additionally, the library in use here, tract-linalg, has wasm SIMD optimizations but doesn't use the relaxed-simd proposal, namely the f32x4.relaxed_madd instruction. I briefly tried that locally and didn't get much speedup, though, so I might be wrong about which kernel is being used in that file.
