
Introduce the duration provider pattern #86

Conversation

@yuki24 (Contributor) commented Apr 24, 2024

First off, thanks so much for crafting such a fantastic gem! I’m a big fan of how flatware handles the forking and dRuby communication—it really makes a difference.

Background
We’ve started using flatware in one of our projects and hit a snag with the RSpec example status persistence file. It got massive (over 500MB, because the suite has more than 80,000 examples), and loading it was slowing things down almost as much as running the tests themselves!

We tried a workaround with parallel_tests' ParallelTests::RSpec::RuntimeLogger, which trimmed the data down to just 500KB (mostly because it only tracks the execution time for each file, rather than for each example). We also had to monkey-patch the JobBuilder a bit to make it work, but it felt like a hack and might be a pain to maintain.

What This PR is about

So, I'm proposing a new "Duration Provider" pattern in this pull request. It would let us and other users customize how duration calculations are done without messing too much with JobBuilder. This should make things more flexible and cleaner overall.

Here’s a peek at the custom provider we whipped up:

# lib/flatware/rspec/duration_providers/parallel_tests_runtime_provider.rb
module Flatware
  module RSpec
    module DurationProviders
      class ParallelTestsRuntimeProvider
        attr_reader :test_runtime_file

        def initialize(test_runtime_file: ENV.fetch('TEST_RUNTIME'))
          @test_runtime_file = test_runtime_file
        end

        def seconds_per_file
          # The parallel_tests runtime log stores one "path/to/spec.rb:seconds"
          # line per file.
          @seconds_per_file ||= File.readlines(test_runtime_file).to_h do |line|
            file, seconds = line.chomp.split(':', 2)
            [file, seconds.to_f]
          end
        end
      end
    end
  end
end

We then use this command to run the tests:

$ # The bin/flatware file is also our own executable that `require`s the provider above:
$ bin/flatware rspec -d parallel_tests_runtime ...
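For context, here is a minimal sketch of the registry the `-d` flag could resolve provider names through. Only `DurationProviders.lookup` appears in this PR; the `register` method and the error handling are illustrative assumptions, not the actual implementation:

```ruby
# Hypothetical registry sketch; only DurationProviders.lookup is part of
# this PR, register and the ArgumentError are illustrative.
module Flatware
  module RSpec
    module DurationProviders
      @providers = {}

      # Map a CLI name (e.g. "parallel_tests_runtime") to a provider class.
      def self.register(name, provider_class)
        @providers[name.to_s] = provider_class
      end

      # Resolve the name passed via `-d` back to the registered class.
      def self.lookup(name)
        @providers.fetch(name.to_s) do
          raise ArgumentError, "unknown duration provider: #{name}"
        end
      end
    end
  end
end
```

With ParallelTestsRuntimeProvider registered under `parallel_tests_runtime`, the command above could resolve it through `DurationProviders.lookup`.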

I think this could really streamline how we manage custom durations and keep things neat. Let me know what you all think.

/cc @zzak @kei-s

@yuki24 yuki24 force-pushed the feature/introduce-duration-provider-pattern branch 2 times, most recently from 9324ab6 to f88f64f Compare April 24, 2024 13:07
@yuki24 yuki24 force-pushed the feature/introduce-duration-provider-pattern branch from f88f64f to ec406db Compare April 24, 2024 13:17
@@ -27,6 +27,8 @@ Naming/FileName:
- lib/flatware-*.rb
Style/Documentation:
Enabled: false
Metrics/AbcSize:
Max: 20
@yuki24 (Contributor, Author) Apr 24, 2024

This change added a bit of extra AbcSize, so RuboCop started failing:

duration_provider = RSpec::DurationProviders.lookup(options['duration-provider'])
jobs = RSpec.extract_jobs_from_args rspec_args, workers: workers, duration_provider: duration_provider

I looked for a meaningful refactoring opportunity here, but even a few deliberate changes (e.g. just moving lines around, or extracting the code into a different file) didn't get the RuboCop check passing. I don't think I can come up with meaningfully refactored code under this low AbcSize setting, so I just increased the limit.

I know people have opinions about this, but this is becoming a bit of a barrier to new contributors...

@briandunn (Owner)

@yuki24 Thanks for your work on this! Curious: have you measured timing for that giant suite with and without time-balanced jobs? I added the feature because other parallel runners had it and it seemed fun to write, but I haven't found a suite in the wild where time balancing has a consistent speedup effect.

If so, I'd also be interested in removing the use of example_status_persistence_file_path and adding a Flatware.config option to set a path for a parallel_tests-style time persister - one that only records per-file run time. I think if you do get a consistent speedup then it's probably related to your suite size, and at that size any flatware user who would benefit from time balancing would also benefit from more efficient persistence. Make sense?

Plus, there are issues with example_status_persistence_file_path when using some strategies to speed up boot time. It's hacky to get that config field to load without loading the whole suite. The parent process needs that value to build jobs, but if we fork the parent process (to avoid booting Rails again n times), each worker is already configured to run the whole suite. So you have to fight that configuration to make the worker run only its job.

So time balancing just doesn't work if you have a custom boot. And I have consistently found, so far, that a custom boot saves more time than time balancing.

If this was a config option like Flatware.config.time_persistence_path or something we could sidestep all of that and have both custom boots and time balancing.
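To make that idea concrete, a per-file time persister along those lines could be as small as this sketch (the class name, method names, and the config option are all hypothetical - nothing here is flatware's actual API):

```ruby
# Hypothetical parallel_tests-style persister: records one "path:seconds"
# line per spec file instead of per-example status rows.
class FileTimePersister
  def initialize(path)
    @path = path
  end

  # Write the per-file runtimes gathered after a run.
  def record(seconds_per_file)
    lines = seconds_per_file.map { |file, seconds| "#{file}:#{seconds}" }
    File.write(@path, lines.join("\n"))
  end

  # Read them back to build time-balanced jobs on the next run.
  def load
    return {} unless File.exist?(@path)
    File.readlines(@path).to_h do |line|
      file, _, seconds = line.chomp.rpartition(':')
      [file, seconds.to_f]
    end
  end
end
```

Because it never touches example-level state, a persister like this stays tiny even for very large suites, and the parent process can read it without booting RSpec.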

In short, since you're the first user I've spoken to who finds the time balancing feature useful, I'd rather see it changed to meet your needs than introduce an adapter to support what is, in my experience, a pretty useless version.

Is that something you'd be interested in building?

@yuki24 (Contributor, Author) commented May 7, 2024

@briandunn Thanks for your thoughtful reply!

Time balancing has definitely been useful on not just one but a handful of relatively large projects I have worked on in the past. 80k is probably the largest, but even at 10-20k examples, evenly distributing tests has always cut 1-7 minutes of test execution time, depending on how long the larger spec files took. I would guess that ~1k specs may not be large enough for time balancing to pay off.

The problem is, RSpec can only distribute tests by file, not by example. So if a large spec file that takes e.g. 5min happened to run at the end of the suite in one of the workers, only a single worker (or a CPU) would be busy while the rest of the CPUs would be doing nothing for the duration of the last execution. Even with time balancing, we have consistently observed test imbalance, resulting in 1-2min of extra execution time.

To solve all this, we have also implemented a centralized queue on top of flatware, since it already has a dRuby server and it was a snap to add. In the centralized queue, tests are split into the following groups and randomized within each group:

  1. Untimed files for which there is no time information available.
  2. The slow files that may take more than 10s. 10s is the 75th percentile for our test suite, so it could vary from suite to suite.
  3. The rest of the spec files, which are fast.

By executing groups 1 through 3 in order, we have been able to eliminate even that last bit of imbalance, which was around 1-2 minutes. Of course, this comes with a few caveats: for example, spec runs are not guaranteed to be as order-independent as with complete randomization. However, we think this is a reasonable trade-off, given that shaving off a minute, with 90 distributed workers executing 80k examples, is really, really difficult.
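For illustration, the grouping we described can be sketched roughly as follows (the method name, signature, and default threshold are ours for this example, not part of flatware):

```ruby
# Rough sketch of the queue ordering described above: untimed files first,
# then slow files (at or above the threshold), then the fast rest, with
# each group shuffled internally.
def ordered_spec_files(files, seconds_per_file, threshold: 10.0, rng: Random.new)
  untimed, timed = files.partition { |file| seconds_per_file[file].nil? }
  slow, fast = timed.partition { |file| seconds_per_file[file] >= threshold }
  [untimed, slow, fast].flat_map { |group| group.shuffle(random: rng) }
end
```

Untimed files go first because their cost is unknown; slow files go next so a long file never starts near the end of the run.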

In summary, we wanted to do more than just introducing a new pattern. I know this is a lot to digest, so I would like to learn more about what you think about this. Any feedback is appreciated.

@yuki24 (Contributor, Author) commented May 13, 2024

Closing this PR, as I think this new pattern wouldn't be useful for either of us. We will experiment more within our app and get back (or start a new conversation) once we feel good about our experiments. Thank you!

@yuki24 yuki24 closed this May 13, 2024
@briandunn (Owner) commented May 13, 2024

I did not know time balancing offered such gains. That's awesome!

The original design for flatware held a queue in a "Ventilator". When a worker finished one Job it would request another off the queue. The queue would serve jobs longest-first, the hope being that workers would be self-balancing. This worked well, but I didn't carry it over when switching from ZMQ to DRb because it:

  1. required quite a bit more fighting with RSpec, which at the time was much simpler to use if each worker had its full workload at once; and

  2. added another child process to the topology, which inherently increased complexity.

I think the responsibility could be moved to the Sink, which would relieve the second issue. And I haven't tried to convince RSpec to keep its World in place and run different specs against it in a while; that may be simpler now.
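As a rough illustration of that original longest-first serving (class and method names here are assumed, not flatware's actual Ventilator API):

```ruby
# Sketch of a self-balancing, longest-first job queue: files with unknown
# durations are served first, then files by descending estimated runtime,
# so a slow file never lands at the tail of a worker's run.
class LongestFirstQueue
  Job = Struct.new(:file, :seconds)

  def initialize(jobs)
    @jobs = jobs.sort_by { |job| -(job.seconds || Float::INFINITY) }
  end

  # Each worker calls this whenever it finishes its current job;
  # returns nil when the queue is drained.
  def pop
    @jobs.shift
  end
end
```

Because workers pull one job at a time, a worker that drew a slow file simply pulls fewer jobs overall, which is what makes the scheme self-balancing.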

I would be interested in reintroducing this strategy if it can be done reliably. @yuki24 Do you think such an approach would be helpful in your use case?

@yuki24 (Contributor, Author) commented Jun 17, 2024

@briandunn Apologies for the delay. I did not know that there was a name for what we implemented! We are still experimenting with this implementation, but overall we are more than happy to give back to the community, and we will get back to you as soon as we feel comfortable with it.

(1) is definitely an issue, as we had to hack around RSpec core. We have a working PoC in a private repo, but we aren't extremely happy with the hack for forcefully resetting World each time the runner loads a spec file.

I think the responsibility could be moved to the Sink, and that could relieve the second issue.

This is exactly what we did. It does bring in extra complexity, but personally that is a much smaller concern to me, as the added complexity is small enough to be quite manageable.
