Collecting outputs from running in parallel is very slow #19

Open
benkrikler opened this issue May 16, 2019 · 11 comments
Labels: originally gitlab, Performance Improvement

Comments

@benkrikler
Member

Imported from gitlab issue 19

It's been noticed by several people that when a large number of events and files are processed using a parallel mode (batch system, local multiprocessing, etc) the individual tasks run quickly, but the final collecting step can take a long time. It would be good to understand why this is and accelerate the steps as much as possible; certainly merging many pandas dataframes shouldn't take too long.
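As a point of reference for why the merge itself should be cheap, here is a minimal sketch (not fast_carpenter's actual collector code; the file layout and index columns are assumptions) contrasting a pairwise fold, which re-copies the accumulated result on every step, with a single vectorised reduction:

```python
# Minimal sketch, not fast_carpenter's collector: why folding partial results
# together one at a time can be much slower than a single concat + groupby.
import glob
import pandas as pd

# Hypothetical per-task outputs: binned counts written by each parallel job.
paths = glob.glob("outputs/task_*.csv")  # assumed layout, for illustration only

# Slow pattern: fold the partial results together one pair at a time.
merged = None
for path in paths:
    df = pd.read_csv(path, index_col=["dataset", "bin"])
    merged = df if merged is None else merged.add(df, fill_value=0)

# Usually much faster: load everything, then reduce in one vectorised pass.
frames = [pd.read_csv(p, index_col=["dataset", "bin"]) for p in paths]
merged_fast = pd.concat(frames).groupby(level=["dataset", "bin"]).sum()
```

The pairwise fold grows roughly quadratically in cost with the number of task outputs, whereas the single concat/groupby pass touches each partial result once, which is one reason to look closely at how the partial results are being combined.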

@benkrikler benkrikler self-assigned this May 16, 2019
@benkrikler benkrikler added the originally gitlab and Performance Improvement labels May 16, 2019
@benkrikler
Member Author

On 2019-02-14 Benjamin Krikler (bkrikler) wrote:

@vmilosev you are one of the people that has mentioned this to me.

@benkrikler
Member Author

On 2019-03-29 Benjamin Krikler (bkrikler) wrote:

mentioned in merge request gitlab:!43

@benkrikler
Member Author

On 2019-04-08 Benjamin Krikler (bkrikler) wrote:

I've just noticed / remembered that the collector for cut flows uses very similar code to what the old binned dataframe stage code was doing. Since that stage was accelerated quite a bit by the rewrite, it would be worth adopting the same approach for the cut-flow stage collector. It's not obvious that this is a bottleneck at this point, but it could still be helpful. We also need to increase the amount of testing of the cut-flow collector, since at this point it is poorly covered.

@benkrikler
Member Author

On 2019-04-17 Benjamin Krikler (bkrikler) wrote:

mentioned in merge request gitlab:!44

@benkrikler
Member Author

According to @vukasinmilosevic, the above two merge requests actually do seem to have solved the slowness he was seeing. I'm going to close this for now, but we'll keep an eye on the situation and can re-open if it becomes an issue again.

@rob-tay
Contributor

rob-tay commented Jun 11, 2019

I'm finding that the collecting of outputs can be pretty slow. For ~100 datasets, it can take 5+ minutes to merge all the dataframes.

@benkrikler
Member Author

Ah poop. That's still better than it was when this was opened, but that's obviously not very good, so I'll reopen this.

@rob-tay can you give any extra details? Which systems are you using, what sorts of steps are you doing with the data, what size dataframes (number of bins)/ cut-flows (total number of cuts) are you using, etc?

@benkrikler benkrikler reopened this Jun 11, 2019
@asnaylor
Collaborator

asnaylor commented Jul 8, 2019

I have found a similar issue when running over a large dataset in --mode htcondor. I ran `fast_carpenter data.yml config.yml --mode htcondor --outdir condor_test` on ~4400 ROOT files and it took 490 mins in total, but all of the condor jobs had finished after 30 mins, so it took 430 mins to get the output and merge the ~4440 jobs. I was binning 7 variables and I had two cuts.

@benkrikler
Member Author

The most recent insight into this is that the poor performance has to do with the number of bins being used. @asnaylor, correct me if I'm wrong, but for your comment above, were you actually binning your values?

I plan to implement a parallelised merge step which takes advantage of multiple cores or a batch cluster, if available. This should improve things considerably. However, it's awkward to achieve this with the AlphaTwirl backend, so it's waiting on us adding Parsl support and/or Coffea executors.
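For concreteness, here is a rough sketch of what such a parallel merge could look like, reducing the partial dataframes in a tree with a process pool from the standard library. This is illustrative only and not tied to the AlphaTwirl, Parsl or Coffea backends mentioned above; `merge_pair` and `tree_merge` are hypothetical names.

```python
# Hedged sketch of a parallel (tree-style) merge of binned-count DataFrames.
from concurrent.futures import ProcessPoolExecutor
import pandas as pd


def merge_pair(pair):
    """Add two binned-count DataFrames, aligning on their index."""
    left, right = pair
    return left.add(right, fill_value=0)


def tree_merge(frames, max_workers=4):
    """Pairwise-reduce a list of DataFrames until a single one remains."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        while len(frames) > 1:
            pairs = list(zip(frames[0::2], frames[1::2]))
            leftover = frames[len(pairs) * 2:]  # odd element carries over
            frames = list(pool.map(merge_pair, pairs)) + leftover
    return frames[0]
```

With n task outputs this needs only about log2(n) sequential rounds, and each round's pairwise merges can run on separate cores.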

@asnaylor
Collaborator

Yeah @benkrikler, I'm pretty sure I didn't actually bin the values.

@kreczko kreczko self-assigned this Dec 10, 2020
@kreczko
Contributor

kreczko commented Dec 10, 2020

BTW: with boost-histogram now being relatively mature, we could try to change the intermediate format.
@asnaylor has tried this in his Hyper framework and it is both quicker and more memory efficient; numbers for a 1-to-1 comparison would be useful.

That said, we probably should not fixate on a particular solution. Instead, we can go the route of uproot where you can choose between multiple implementations. This way we can try new things more frequently and make it easier to contribute new implementations.

All that these implementations would need to provide is a way to read, write and merge their outputs (see the sketch after this list). So as part of this issue I would like to create two sub-issues:

  1. Change the pandas implementation into a form where we can accept drop-in replacements
  2. Add boost-histogram as the first new replacement (I have others in mind, but it is very difficult to try them atm)
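As a rough illustration (an assumption for discussion, not an agreed fast_carpenter API), a backend could be as small as a read/write/merge triple. The class and method names below are hypothetical; the boost-histogram variant relies on the fact that histograms with identical axes simply add:

```python
# Hypothetical pluggable intermediate-format interface: each backend only has
# to know how to read, write and merge its own objects.
import copy
import pickle
from typing import Any, Protocol

import boost_histogram as bh


class BinnedOutput(Protocol):
    def read(self, path: str) -> Any: ...
    def write(self, obj: Any, path: str) -> None: ...
    def merge(self, pieces: list) -> Any: ...


class BoostHistogramOutput:
    """Candidate drop-in replacement using boost-histogram objects."""

    def read(self, path: str) -> bh.Histogram:
        with open(path, "rb") as f:
            return pickle.load(f)

    def write(self, hist: bh.Histogram, path: str) -> None:
        with open(path, "wb") as f:
            pickle.dump(hist, f)

    def merge(self, pieces: list) -> bh.Histogram:
        # Histogram addition is cheap: it just sums the bin contents.
        total = copy.deepcopy(pieces[0])
        for piece in pieces[1:]:
            total += piece
        return total
```

A pandas-backed equivalent would implement the same three methods with read/write of dataframes and a concat plus groupby-sum merge, which is what would make a drop-in swap straightforward.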
