Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interleave benchmark iterations from different processes #143

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

fitzgen
Copy link
Member

@fitzgen fitzgen commented Jun 4, 2021

This commit makes it so that, rather than spawning a process for a Wasm
benchmark and engine pair and running all iterations for that process
immediately, we now spawn a bunch of processes and then run one iteration from a
random one of them at a time. This interleaves benchmark iterations, not just
processes.

This means that we need to have communication back and forth between the parent
process and the child processes. Before, we had fork/join style subprocesses
where we just spawned a child and waited for it to finish before starting the
next subprocess and there was no communication between the parent and child
before the child was done executing. In order to interleave benchmark
iterations, we need to spawn many processes, have them wait on the parent, and
have the parent tell one subprocess at a time to run one iteration and then wait
again. The way that this is done is with a very simple stdin and
stdout-based protocol. All children are waiting to read a newline on their
stdin before they run an iteration. The parent chooses a child, writes a
newline and then waits for the child to finish by reading a newline from the
child's stdout. When the child finishes its iteration, it writes a newline to
its stdout. When all iterations are complete, then the child writes its
results as JSON to stdout.

Fixes #139

Even when we are only using one process for all samples. This will make
everything easier when we parent<-->child communication protocol changes to
support interleaving iterations from different processes as part of bytecodealliance#139.
This commit makes it so that, rather than spawning a process for a Wasm
benchmark and engine pair and running all iterations for that process
immediately, we now spawn a bunch of processes and then run one iteration from a
random one of them at a time. This interleaves benchmark iterations, not just
processes.

This means that we need to have communication back and forth between the parent
process and the child processes. Before, we had fork/join style subprocesses
where we just spawned a child and waited for it to finish before starting the
next subprocess and there was no communication between the parent and child
before the child was done executing. In order to interleave benchmark
iterations, we need to spawn many processes, have them wait on the parent, and
have the parent tell one subprocess at a time to run one iteration and then wait
again. The way that this is done is with a very simple `stdin` and
`stdout`-based protocol. All children are waiting to read a newline on their
`stdin` before they run an iteration. The parent chooses a child, writes a
newline and then waits for the child to finish by reading a newline from the
child's `stdout`. When the child finishes its iteration, it writes a newline to
its `stdout`. When all iterations are complete, then the child writes its
results as JSON to `stdout`.

Fixes bytecodealliance#139
@fitzgen fitzgen requested review from cfallin and abrown June 4, 2021 20:41
Not sure how things were working before...
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice, thanks!

Just to make sure I understand the synchronization fully: child processes' iterations are fully mutually exclusive, because the parent only unblocks one at a time, and waits for the baton back before unblocking the next, yes? But: the first iteration(s) of the main loop might run while other child processes are still starting up, if startup is slow for some reason; in other words, we have the dependency graph:

Parent
  |            (spawn N children)
  |\_________________________________________
  |        |                |                |   ...
  |      __|____            |
  |     |startup|        ___|___
  | (go)|       |       |startup|
  |\__  |_______|       |       |
  |   |    |            |       |
  |   | ___|_____       |       |
  |   >| iter 0  |      |_______|
  |    |_________|          |
  |  ___/  |                |
  | /(done)|                |
  |<       |                |
  |        |                |
  | (go)   |                |
  |\_______|_________       |
  |        |         \  ____|____
  |        |          >| iter 0  |
  |        |           |_________|
  | _______|____________/   |
  |/ (done)|                |
  |<       |                |
  :        :                :
  :        :                :

and the startup work (process initialization, etc.) of the second child is concurrent with (has no synchronizing edge with respect to) the 0th iteration of the first child.

I think the fix is pretty simple -- just write a byte from each child to say "I'm started up and ready", and wait for such a byte from each child before commencing the main loop in the parent. Thoughts?

@fitzgen
Copy link
Member Author

fitzgen commented Jun 4, 2021

Good eye @cfallin! I'll fix that up in one sec.

@fitzgen fitzgen requested a review from cfallin June 4, 2021 22:23
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@fitzgen
Copy link
Member Author

fitzgen commented Jun 4, 2021

Looks like the tests are hanging on windows. I wonder if \n isn't flushing stdout? I'll try explicitly flushing.

Copy link
Collaborator

@abrown abrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked at this a few times over the last few days and there are couple of concerns that I wasn't able to articulate, hence my lack of review. But here goes:

  • this PR adds quite a bit of complexity but I'm not yet convinced that the benefits are worth it. I'm not saying that interleaving iterations isn't a good idea or that the benefits aren't there, but rather that I would like to see how much more accurate the results are with this enabled compared to without it being enabled. Is there some way to see that?
  • Also, to mitigate the complexity, I wonder if this functionality shouldn't be encapsulated somehow (queues + a process pool, e.g.) so that it is easier to reason about separately from the "run a benchmark" logic. It might also make it easier to run things with and without it turned on to see if it is effective (see first point).

@fitzgen, what do you think? I don't want to block merging this, so go ahead if you believe in the change. But could you consider what I'm trying to get at above?

@fitzgen fitzgen force-pushed the interleave-benchmark-iterations branch from f2be469 to 726979e Compare June 14, 2021 20:17
@fitzgen
Copy link
Member Author

fitzgen commented Jun 14, 2021

I would like to see how much more accurate the results are with this enabled compared to without it being enabled. Is there some way to see that?

I would expect that we would see less correlation between iteration # and the measurement's count (e.g count of cycles or count of nanoseconds) as well as process # and the measurement's count. We should, in theory, be able to measure this with a chi-squared test, but I haven't been able to get that working in R myself.

Also, to mitigate the complexity, I wonder if this functionality shouldn't be encapsulated somehow (queues + a process pool, e.g.) so that it is easier to reason about separately from the "run a benchmark" logic. It might also make it easier to run things with and without it turned on to see if it is effective (see first point).

I've pulled the wait/notify logic of writing/reading to/from stdio out into helper methods, but I'm not sure how to simplify further without introducing a bunch more complexity/infrastructure than what we have here. Right now it is a minimal protocol over pipes, and doing something like a process pool is a whole lot more moving parts than what we have now. If you have concrete suggestions, I'm definitely open to them, because I agree it would be nice to factor this out a bit more!

@fitzgen
Copy link
Member Author

fitzgen commented Jun 21, 2021

@abrown did you have any specific suggestions for how to change this PR before we merge it?

@abrown
Copy link
Collaborator

abrown commented Jun 21, 2021

What about something like the following:

trait BenchmarkExecutionStrategy {
  fn execute(&self) -> Result<()>;
}
struct SameProcess(...); // Maybe wrap up `BenchmarkCommand` here? Or an intermediate struct?
struct MultiProcess(...);
struct MultiProcessInterleaved(...);

Then in BenchmarkCommand::execute pick the appropriate strategy and call execute on that. Perhaps the BenchmarkCommand flags get passed down into the implementations as you have done for Child but maybe it's cleaner to create an intermediate struct that represents the "engine-wasm-iterations-etc." tuple needed to actually run the benchmark.

Here's another option:

struct BenchmarkExecution { engine, wasm, iterations, ... }
struct BenchmarkResults { /* there is probably already a type for this */ }
trait BenchmarkExecutionQueue {
  async fn submit(&mut self, work: BenchmarkExecution) -> Result<BenchmarkResults>;
}
struct MultiProcess(...);
struct MultiProcessInterleaved(..

@fitzgen
Copy link
Member Author

fitzgen commented Jun 22, 2021

trait BenchmarkExecutionStrategy

...

trait BenchmarkExecutionQueue

This PR simplifies the current set up, where we either are doing single process benchmarking or multiprocess, so that we are always doing multiprocess even if it is just one process. So there would only ever be a single implementation of this trait, which seems like the trait wouldn't be carrying its weight, and would actually be making things more complicated / have more pieces.

Regarding factoring out the parent into something analogous to the Child struct: with child processes, we have reified state we can encapsulate: the child process object and its stdin and stdout objects. With the parent, that state isn't something we can encapsulate and control access to because it is ambient in the whole process: any bit of code can add a println!. This could be an argument for using named pipes instead of stdin/stdout but then we get into portability issues and would need some kind of solution for Windows, and we are once again talking about a lot of new infrastructure and moving parts.

@fitzgen
Copy link
Member Author

fitzgen commented Feb 13, 2023

Some folks noticed the lack of this behavior in their benchmarks (i.e. that the same benchmark would run 10 times in a row) and brought it up to me as a potential issue and source of measurement bias.

Would be great to land this almost two years later! 😬

@abrown did you find my last comment convincing for why a trait would be overkill here? If so, I can rebase this and we can finally merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Interleave benchmark iterations, not just processes
3 participants