-
-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid workers waiting for training of surrogate model to finish #1170
Comments
Hi @bbudescu, However, the new RF model might still rely on the existing packages (e.g., sklearn). Since we are only a small team maintaining SMAC, we might not have enough manpower to customize a new random forest package. However, if you have any good ideas on how to implement this efficiently (in Python) or you would like to create a new PR regarding the RF replacement stuff. We are happy to help you (on how to integrate this into SMAC). |
Hi @dengdifan, Thanks for your reply above. Yes, I am aware of the #1116, however, the main point I was trying to get across in this issue was making the Now, I haven't looked into the code, but I assume this doesn't have anything to do with the choice of random forest package, but just with running RF training in a separate thread. Perhaps one of the optional suggestions I made, namely, the one where I suggested adding GPU support, might be relevant to the choice of RF implementation. |
Hi @bbudescu Seond, SMAC is implemented based on BO, an algorithm designed for optimizing expensive black-box functions, such as training a deep neural network. Usually, the expense of evaluating a function is much more significant than training the surrogate model. It is still unclear if directly starting a new configuration is better than waiting until the current surrogate model provides a new candidate. Additionally, making |
Hi @dengdifan
I've been working on something, and I just added #1178 to share it here. It's not yet tested, but it's a start, I guess.
That's quite right :). I had to do a bit of reading before finally deciding at least on what to try
Now, while looking into this I've taken a few ideas into consideration, namely, Other than the fact that I used
Of course, because of the GIL, the main optimization loop gets less cpu time, but although I haven't tested, I expect that the main thread would spend most time in the training section anyway. And I don't think the other parts in the loop are very CPU intensive anyway. Or are they? Do you know if, perhaps the inference or something takes really long?
Well, other than expanding SMAC's use cases to things like database knob tuning and other tasks that also have a low evaluation time for the target cost function, it's also an impeding the scalability of SMAC. E.g., let's say you want to train a neural net that takes, on average, 2 hours on a gpu to fit to a dataset and you need to find the optimal parameters fast. In this case you might opt to do something like renting 256 machines on aws so that you increase the chance you get a decent neural net within a tight budget like 24 hours. In this case you will face the same problem, because you'll get new results every 30 seconds anyway, not because it takes little to train neural nets, but because of the number of training processes.
If you're paying for an aws instance with a gpu you'd hate it to see it stay idle. You'd even prefer it even running a random config rather than just doing nothing.
Well, I'm not sure I understand this correctly, but if I do, then I don't think that is the case other than in the beginning. So, here's an example at the opposite end: let's say that you already finished 1,000,000 trials, and trained the RF on that data. Let's now say that a worker process finishes running the 1,000,001st trial, reports results and asks for a new configuration to try. What happens in the current setting is that the worker needs to wait for the model to retrain taking into consideration that last result, so it can benefit from the extra information added by that extra sample in the RF training set. Now, my hypothesis is that there is only a minor benefit of doing this, as I don't think the expected improvement in the cost function caused by making the suggestion based on 1,000,001 vs 1,000,000 is going to be that much different anyway. And also, when setting So, to be clear, I'm not suggesting using random configurations, but, rather, to adapt the LE: Here is a comparison between the performance of some IPC methods when sharing numpy data. |
Hi @bbudescu Now I am confusing by your two examples.
In your first example (the neural network training one), you suggest sampling a new configuration randomly:
This is exactly the case for training a RF in a separated thread. Assuming that you have two cores, one core already finishes its configuration evaluation. Since your RF is still under training (you can not expect that both RF training and configuration evaluation ends at exact the same time clock for both CPUs), you cannot expect how long it would take to finish the RF training in another core(thread). Then in that case you simply start a new configuration randomly (as suggested by your AWS example) even the RF training will end just 1 CPU clock after. Therefore, in this case, you always have the core 1 running BO+evaluation and core 2 running random evaluation.
Here you suggest to adapt the retrain_after parameters dynamically instead of sampling new configurations randomly. However, how should this value be scheduled? To which degree should we increase this value? Should we always apply this strategy or we only apply this if the number of budgets is higher than a threshold? Since we are aiming at arbitrary budgets and number of evaluations, all these need to be carefully designed to avoid performance degeneration |
Sorry, I tried to make a comparison. Conceptually, training in a different thread would be somewhat similar to when
Yes, the key word here is "even". I think even a random config is better than nothing, but what I propose is to use a trained random forest to suggest new configs, just we shouldn't care so much that it was trained on the absolute latest results, just as in the case when
Ok, so look, let's say we have 2 cores. One core always trains (because training takes longer than evaluation), and one core always evaluates the cost function (i.e., it's a worker that runs trials):
NOTE: Indeed, at step 5 above, it's also possible for thread A to start running another random trial, but this is NOT the subject of this feature request, but rather a potential further optimization similar to what you said above. NOTE 2: actually, what I think would be more relevant and more in line with what you asked (and I replied to), is that at step 4 above thread B can just start a random config without waiting for the first model to finish training. However, again, this would be just an optimization, and NOT the point of this issue and associated PR |
Hi @bbudescu, As you said, we could simply set |
Well, the problem with setting
A tradeoff is hard to find, and even the best tradeoff will be suboptimal. And, after all, I've just pushed some code to implement this feature. I know it's a potential risk, but, as far as I have come to understand, you guys are ok with at least offering some support for integration (I'm not talking about you implementing it yourself). LE: It's not done yet, i.e., there are no tests, I haven't added in a way to switch to the old behavior, and I still have a few questions about, e.g., what to do with the |
This should also be done dynamically, because, as a user, it's kinda hard to anticipate how long training will take on the current number of finished trials, how long running a trial will take, as it also depends on the hyperparameters in the suggested config, which evolve over time etc. I'm thinking there would be value in adding estimators for all of these, but I feel that just running a second thread on the main cpu when there's some new data available is easier. LE: The alternative for the user would be to do trial and error until you get the schedule right, but that beats the whole purpose of making the session faster / more efficient |
Hi @dengdifan, I think I was able to bring the code to a state in which it can be tested. As per my #1178 (comment), could this, perhaps, be solved by a compromise like merging the pull request and exposing the option of using concurrent background training as an experimental feature, at the user's (documented) risk? |
Hi @bbudescu |
Hi @dengdifan, Thanks for your response. I'm really sorry, but I still don't understand what you mean, unfortunately. Would you care to, maybe elaborate a bit further, or perhaps try to explain in a different fashion than before? Perhaps, take a look at the code I propose in PR #1178, and show me where this might happen. I mean, the behavior using my code is definitely not what I was after, so by all means, if you can provide some insight, I'm really interested. Perhaps also take into account the latest measurements and conclusions I posted in this #1178 (comment). It seemed to me like @eddiebergman's comments in the PR sounded as if the stuff I proposed seemed to him like an idea he'd be on board with. @eddiebergman, could you, maybe, please help us get on the same page? LE: I'm only guessing here, so I could be dead wrong, so please forgive me if I am. Do you, perhaps, think that I'm using Python's |
Hi @bbudescu let's still use the example of the two workers scenario. Assuming that we have two workers That at time 0:00, |
Hi @dengdifan, and Happy New Year! Thanks for taking the time to answer.
The thought has crossed my mind (actually, this was my initial idea as well, see this, and this comments I made back in November when considering implementing ASHA first #1169, before even opening the current issue), but I think it's a bit harder to implement a dynamic scheduler that looks at current developments of the acquisition function's surface in a rigorous manner. In order to make a decision on whether or not to start training or just keep sampling from the current model, one would have to somehow figure out which decision is more advantageous, and to do that, I'm thinking that one could compute something like the Expected Improvement of doing another training as an estimated benefit in a cost/benefit analysis. To establish whether it's worth it, perhaps a framework like Value of Information might prove useful. To make an estimate of the cpu load, one would have to know:
For these, we need additional models:
I guess it'd make sense to consider training a model (another random forest, or just add another output to the current RF, in addition to the one that does cost estimation) that is able to predict how long we'd have to wait for the current trial to finish, given the sampled params. Optionally, one could take into account previous trial durations as input features (in an autoregressive manner) that could be used to improve the prediction when the previous samples don't cover the sampling space very well.
Also, it'd make sense to fit another model, perhaps not something as complicated as a random forest, but maybe to just do a regression to find some optimal coefficients for known functions, since we know that the computational complexity of training an RF should look something like
Perhaps a simple regression will provide an accurate enough estimate of how long trials of configs sampled in the future will take. My conclusionAll in all, I think that this looks more like a good subject for research, rather than just optimizing the resource usage a bit, and there has to be some easier way to achieve this. I think there is more value in estimating the trial time given a certain configuration, because solvers, in general, when making the decisions on what to try next, should factor in trial duration and, perhaps even pecuniary cost to make better use of the given budget, and to increase the chances of finding a lower cost within the constraints. LE: Perhaps other, simpler heuristics can be used to maintain the CPU load above a certain threshold. E.g., if the minimal load target is, say, 90%, one could start a new training when, since the last training finished, nine times the last training's duration had passed. Also, as you can see, I sometimes tend overcomplicate things. If I've slid beyond reason, I apologize. |
Now, back to the problem at hand. From #1170 (comment)
Why does it have to be random? Since "we already have 10000 finished configurations", presumably some model must have already been trained. Why not have the acquisition function optimizer use that older model to suggests the next configuration to run a trial on? |
Another option would be to just use a fixed Ideally, the Having the user specify the schedule doesn't really work either, because he wouldn't possibly know it beforehand because trial times are dependent on params sampled by the acq fn optimizer, and he wouldn't know the training times either. So he could only figure the schedule out through trial and error, thus defeating the purpose of finding the best combo within a given budget of time and other types of costs (e.g., pecuniary). |
Also, in #1170 (comment), I had listed another potential improvement, that now I'm thinking isn't all that great:
This, I think, will have the undesired effect of having a maximum CPU load of exactly 90% as soon as the models start taking long enough to train. |
Also, take a look at this: #1178 (comment). It seems that currently the bottleneck is somewhere else altogether, because the cpu load is bad even when no model is being used, both with and without hyperband, so perhaps some thorough profiling is of higher priority right now. |
Sorry, my words might be a bit misleading, the three items that follows "either" are equivalent, one can determine any of the solutions mentioned above, one can also use the older candidate queues, but that comes back to the solution of "scheduler for retrain_after".
If even the random sampling has a lower CPU utilization, then I am afraid that the issue lies somewhere else, as in this case no surrogate model training and acquisition function optimization process is involved. |
Using retrain_after with or without a scheduler will result in suboptimal performance because all workers are still going to have to wait for training to finish, at least sometimes. Figuring out CPU load isn't straightforward and, consequently, neither is anticipating the impact of training on CPU occupancy. Using a dedicated process naturally provides the latest model that can be obtained without affecting performance. I understand the risks of multiprocessing, but, in essence, the idea behind the implementation is quite straightforward: it's just training RFs in an endless loop. |
Motivation
Here's how a cpu load graph looks like for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours an a 64 core machine (no hyperthreading,
n_workers=64
), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs).One can see that cpu utilizations decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by this time, 12.6k trials were finished in total).
Previous Discussion
I thought that another cause for this degradation in performance might be Hyperband, and thought that using ASHA (#1169) instead would help eliminate that hypothesis, however, after @eddiebergman's #1169 (comment), I understand the problem is caused by workers waiting to get another suggestion from the surrogate model
Potential solution
The text was updated successfully, but these errors were encountered: