Try dask on top of h5pyd #31
I tried a notebook example with
and also the top of the plot looks incorrect, with repeated values in rows. I don't really know what I'm doing here, so I'm likely doing something bad with cc: @mrocklin
Very cool. Some things to try:
Increasing the chunk size fixed the connection pool problems. It looks like, with the original chunk size, dask was sending thousands of HTTP requests to the server, which overwhelmed the HTTP connection pool. The data still doesn't display correctly, though. I tried lock=True; it made the code run slower, but the data was still messed up. I'll see if I can get a trace of the HTTP requests.
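To illustrate why a larger chunk size reduces the request volume: a chunked array is split into one block per chunk, and each block is typically fetched with its own slice read, so the number of reads grows with the number of chunks. The sketch below only counts blocks. The dataset shape and chunk sizes are hypothetical, since the thread does not state the real ones, and `count_chunks` is an illustrative helper, not part of dask or h5pyd.

```python
import math


def count_chunks(shape, chunks):
    """Return how many blocks result from splitting `shape` into `chunks`-sized pieces."""
    # Each dimension needs ceil(size / chunk) blocks; the total is their product.
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


# Hypothetical dataset shape and chunk sizes, chosen only for illustration.
shape = (10_000, 10_000)
small_chunks = count_chunks(shape, (100, 100))
large_chunks = count_chunks(shape, (2_500, 2_500))

print(f"100 x 100 chunks:   {small_chunks} blocks")
print(f"2500 x 2500 chunks: {large_chunks} blocks")
```

With these assumed numbers, the small chunking yields several hundred times more blocks than the large one, which is consistent with the connection pool being overwhelmed only at the original chunk size.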
Sorry, still haven't had a chance to try out Dask yet.
I would be surprised to see Dask send 1000s of concurrent connections. By default we only run as many tasks as there are logical cores on a machine. I recommend trying your service with multiple threads, perhaps using some standard library like concurrent.futures or multiprocessing.pool.ThreadPool, and seeing how it works. You might also try setting dask to run in single-threaded mode:

    import dask
    dask.set_options(get=dask.local.get_sync)

Just to set expectations, all Dask is doing here is running computations like
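The multi-threaded check suggested above could be sketched roughly as follows, using only concurrent.futures. The in-memory `DATA` list and the `read_block` function are stand-ins invented for this example; against a real service, `read_block` would instead slice an h5pyd dataset. Comparing the threaded result with the known-good data is one way to expose thread-safety problems such as repeated or misplaced rows.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a remote dataset (an assumption for illustration only).
DATA = list(range(1000))


def read_block(start, stop):
    """Hypothetical reader; with h5pyd this would be a slice of the remote dataset."""
    return DATA[start:stop]


def read_all(block_size, max_workers):
    """Read the whole dataset block by block, using a pool of threads."""
    bounds = [
        (start, min(start + block_size, len(DATA)))
        for start in range(0, len(DATA), block_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs,
        # so the blocks can be concatenated directly.
        blocks = list(pool.map(lambda b: read_block(*b), bounds))
    return [value for block in blocks for value in block]


threaded = read_all(block_size=64, max_workers=8)
print("threaded read matches source:", threaded == DATA)
```

If the threaded read differs from a single-threaded one against the real server, the problem lies in concurrent access rather than in Dask itself. As a side note, recent Dask releases select the single-threaded scheduler with `dask.config.set(scheduler="synchronous")` rather than `dask.set_options`.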
Ok - I'll try out your suggestions. My plan is to devote some time in Q1 2018 to stress testing HSDS, so this course of action will fit in nicely with that.
As an FYI, I'll be giving a talk about cloud-deployed Dask/XArray workloads at AMS on January 8th. If you make progress by then it would be interesting to discuss this as an option. https://ams.confex.com/ams/98Annual/webprogram/Paper337859.html Although, to be clear, we're not just talking about a single machine reading in this case. We're talking about several machines on the cloud reading the same dataset simultaneously.
I'll see if I can cook something up. Would it be possible for you to send me a draft of your presentation?
Once I have such a draft, sure. I'm unlikely to have anything solid before the actual presentation though. I'll be talking about Dask, XArray, and HPC/Cloud. Some topics of interest are in this GitHub repository: https://github.com/pangeo-data/pangeo/issues
Try dask on h5pyd instead of h5py to see if there are issues.