Collecting outputs from running in parallel is very slow #19

Open
benkrikler opened this issue May 16, 2019 · 11 comments
Labels: originally gitlab, Performance Improvement

Comments

@benkrikler
Member

Imported from gitlab issue 19

It's been noticed by several people that when a large number of events and files are processed using a parallel mode (batch system, local multiprocessing, etc) the individual tasks run quickly, but the final collecting step can take a long time. It would be good to understand why this is and accelerate the steps as much as possible; certainly merging many pandas dataframes shouldn't take too long.
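As a point of reference for why the merge itself should be cheap, here is a minimal sketch (not fast_carpenter's actual collector code; the file layout and index columns are assumptions) contrasting a pairwise fold, which re-copies the accumulated result on every step, with a single vectorised reduction:

```python
# Minimal sketch, not fast_carpenter's collector: why folding partial results
# together one at a time can be much slower than a single concat + groupby.
import glob
import pandas as pd

# Hypothetical per-task outputs: binned counts written by each parallel job.
paths = glob.glob("outputs/task_*.csv")  # assumed layout, for illustration only

# Slow pattern: fold the partial results together one pair at a time.
merged = None
for path in paths:
    df = pd.read_csv(path, index_col=["dataset", "bin"])
    merged = df if merged is None else merged.add(df, fill_value=0)

# Usually much faster: load everything, then reduce in one vectorised pass.
frames = [pd.read_csv(p, index_col=["dataset", "bin"]) for p in paths]
merged_fast = pd.concat(frames).groupby(level=["dataset", "bin"]).sum()
```

The pairwise fold grows roughly quadratically in cost with the number of task outputs, whereas the single concat/groupby pass touches each partial result once, which is one reason to look closely at how the partial results are being combined.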

@benkrikler benkrikler self-assigned this May 16, 2019
@benkrikler benkrikler added the originally gitlab and Performance Improvement labels May 16, 2019
@benkrikler
Member Author

On 2019-02-14 Benjamin Krikler (bkrikler) wrote:

@vmilosev you are one of the people that has mentioned this to me.

@benkrikler
Member Author

On 2019-03-29 Benjamin Krikler (bkrikler) wrote:

mentioned in merge request gitlab:!43

@benkrikler
Member Author

On 2019-04-08 Benjamin Krikler (bkrikler) wrote:

I've just noticed / remembered that the collector for cut flows uses very similar code to what the old binned dataframe stage code was doing. Since that stage was accelerated quite a bit by the rewrite, it would be worth adopting the same approach for the cut-flow stage collector. It's not obvious that this is a bottleneck at this point, but it could still be helpful. We also need to increase the amount of testing of the cut-flow collector, since at this point it is poorly covered.

@benkrikler
Member Author

On 2019-04-17 Benjamin Krikler (bkrikler) wrote:

mentioned in merge request gitlab:!44

@benkrikler
Member Author

According to @vukasinmilosevic, the above two merge requests actually do seem to have solved the slowness he was seeing. I'm going to close this for now, but we'll keep an eye on the situation and can re-open if it becomes an issue again.

@rob-tay
Contributor

rob-tay commented Jun 11, 2019

I'm finding that the collecting of outputs can be pretty slow. For ~100 datasets, it can take 5+ minutes to merge all the dataframes.

@benkrikler
Member Author

Ah poop. That's still better than it was when this was opened, but that's obviously not very good, so I'll reopen this.

@rob-tay can you give any extra details? Which systems are you using, what sorts of steps are you doing with the data, what size dataframes (number of bins)/ cut-flows (total number of cuts) are you using, etc?

@benkrikler benkrikler reopened this Jun 11, 2019
@asnaylor
Collaborator

asnaylor commented Jul 8, 2019

I have found a similar issue when running over a large dataset in --mode htcondor. I ran `fast_carpenter data.yml config.yml --mode htcondor --outdir condor_test` on ~4400 ROOT files and it took 490 mins in total, but all of the condor jobs had finished after 30 mins, so it took 430 mins to get the output and merge the ~4440 jobs. I was binning 7 variables and I had two cuts.

@benkrikler
Member Author

The most recent insight into this is that the poor performance has to do with the number of bins being used. @asnaylor, correct me if I'm wrong, but for your comment above, were you actually binning your values?

I plan to implement a parallelised merge step which takes advantage of multiple cores or a batch cluster, if available. This should improve things considerably. However, it's awkward to achieve this with the AlphaTwirl backend, so it's waiting on us adding Parsl support and/or Coffea executors.
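For concreteness, here is a rough sketch of what such a parallel merge could look like, reducing the partial dataframes in a tree with a process pool from the standard library. This is illustrative only and not tied to the AlphaTwirl, Parsl or Coffea backends mentioned above; `merge_pair` and `tree_merge` are hypothetical names.

```python
# Hedged sketch of a parallel (tree-style) merge of binned-count DataFrames.
from concurrent.futures import ProcessPoolExecutor
import pandas as pd


def merge_pair(pair):
    """Add two binned-count DataFrames, aligning on their index."""
    left, right = pair
    return left.add(right, fill_value=0)


def tree_merge(frames, max_workers=4):
    """Pairwise-reduce a list of DataFrames until a single one remains."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        while len(frames) > 1:
            pairs = list(zip(frames[0::2], frames[1::2]))
            leftover = frames[len(pairs) * 2:]  # odd element carries over
            frames = list(pool.map(merge_pair, pairs)) + leftover
    return frames[0]
```

With n task outputs this needs only about log2(n) sequential rounds, and each round's pairwise merges can run on separate cores.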

@asnaylor
Collaborator

Yeah @benkrikler, I'm pretty sure I didn't actually bin the values.

@kreczko kreczko self-assigned this Dec 10, 2020
@kreczko
Contributor

kreczko commented Dec 10, 2020

BTW: with boost-histogram now being relatively mature, we could try to change the intermediate format.
@asnaylor has tried this in his Hyper framework and it is both quicker and more memory efficient; numbers for a 1-to-1 comparison would be useful.

That said, we probably should not fixate on a particular solution. Instead, we can go the route of uproot where you can choose between multiple implementations. This way we can try new things more frequently and make it easier to contribute new implementations.

All that these implementations would need to provide is a way to read, write and merge their outputs (see the sketch after this list). So as part of this issue I would like to create two sub-issues:

  1. Change the pandas implementation into a form where we can accept drop-in replacements
  2. Add boost-histogram as the first new replacement (I have others in mind, but it is very difficult to try them atm)
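As a rough illustration (an assumption for discussion, not an agreed fast_carpenter API), a backend could be as small as a read/write/merge triple. The class and method names below are hypothetical; the boost-histogram variant relies on the fact that histograms with identical axes simply add:

```python
# Hypothetical pluggable intermediate-format interface: each backend only has
# to know how to read, write and merge its own objects.
import copy
import pickle
from typing import Any, Protocol

import boost_histogram as bh


class BinnedOutput(Protocol):
    def read(self, path: str) -> Any: ...
    def write(self, obj: Any, path: str) -> None: ...
    def merge(self, pieces: list) -> Any: ...


class BoostHistogramOutput:
    """Candidate drop-in replacement using boost-histogram objects."""

    def read(self, path: str) -> bh.Histogram:
        with open(path, "rb") as f:
            return pickle.load(f)

    def write(self, hist: bh.Histogram, path: str) -> None:
        with open(path, "wb") as f:
            pickle.dump(hist, f)

    def merge(self, pieces: list) -> bh.Histogram:
        # Histogram addition is cheap: it just sums the bin contents.
        total = copy.deepcopy(pieces[0])
        for piece in pieces[1:]:
            total += piece
        return total
```

A pandas-backed equivalent would implement the same three methods with read/write of dataframes and a concat plus groupby-sum merge, which is what would make a drop-in swap straightforward.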
