Improve parallelization of IO for readers #99

Open
mgrover1 opened this issue Mar 17, 2023 · 4 comments

@mgrover1
Collaborator

  • xradar version: 0.1.0

Description

We should take a look at how we can speed up the xarray backends, and whether there are more levels of parallelization possible.

I wonder if upstream enhancements to xarray (pydata/xarray#7437) might help with this, enabling us to plug in the IO directly and benefit from more parallelization here.

What I Did

I read the data with the following code:

import xarray as xr
import xradar as xd
import numpy as np

def fix_angle(ds):
    """
    Aligns the radar volumes
    """
    ds["time"] = ds.time.load()  # Convert time from dask to numpy

    start_ang = 0  # Set consistent start/end values
    stop_ang = 360

    # Angle resolution (hard-coded here rather than estimated from the data)
    angle_res = 0.5

    # Determine whether the radar is spinning clockwise or counterclockwise
    median_diff = ds.azimuth.diff("time").median()
    ascending = median_diff > 0
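    # note: the truth test on the next line may trigger a small eager dask compute if azimuth is chunked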
    direction = 1 if ascending else -1

    # first find exact duplicates and remove
    ds = xd.util.remove_duplicate_rays(ds)

    # second reindex according to retrieved parameters
    ds = xd.util.reindex_angle(
        ds, start_ang, stop_ang, angle_res, direction, method="nearest"
    )

    ds = ds.expand_dims("volume_time")  # Expand for volumes for concatenation

    ds["volume_time"] = [np.nanmin(ds.time.values)]

    return ds

ds = xr.open_mfdataset(
    files,
    preprocess=fix_angle,
    engine="cfradial1",
    group="sweep_0",
    concat_dim="volume_time",
    combine="nested",
    chunks={'range':250},
)

This resulted in the task graph shown below, where green is the open_dataset function.

[Screenshot: dask task graph, 2023-03-17; green = open_dataset]

The graph has quite a bit of whitespace and could use some optimization.
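For context, a rough sketch of one way to capture such a task-stream view, assuming a local dask.distributed cluster (files and fix_angle are the objects defined above; the report filename is arbitrary):

import xarray as xr
from dask.distributed import Client, performance_report

client = Client()  # local cluster; live dashboard at client.dashboard_link

with performance_report(filename="open_mfdataset-report.html"):
    ds = xr.open_mfdataset(
        files,
        preprocess=fix_angle,
        engine="cfradial1",
        group="sweep_0",
        concat_dim="volume_time",
        combine="nested",
        chunks={"range": 250},
    )
    ds.compute()  # force the IO so the tasks actually show up in the report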

@kmuehlbauer
Collaborator

I've tried to reproduce this with some of our ODIM_H5 data, with a similar outcome.

[Screenshot: dask task graph for the ODIM_H5 reproduction]

@kmuehlbauer
Collaborator

@mgrover1 I've tried to track that down now using some GAMIC source data from our BoXPol radar.

In the normal case I get the whitespace shown above in the task graph.

If I remove the additional lines in the gamic backend's open_dataset function after the call to store_entrypoint.open_dataset:

ds = store_entrypoint.open_dataset(

then the call to open_mfdataset returns without triggering any dask operations.

Only when I .load(), .compute(), or otherwise trigger a computation (e.g. plotting) are the files accessed and the data loaded and processed.

That leads to the task graphs shown below:

One timestep, single moment of 15 (time: 12, azimuth: 360, range: 750):
[Screenshot: task graph dask-gamic-01]

All timesteps, single moment of 15 (time: 12, azimuth: 360, range: 750):
[Screenshot: task graph dask-gamic-02]

Computing the whole thing:
[Screenshot: task graph dask-gamic-03]

So as a consequence we might need to make sure that no immediate dask computations are triggered before actually doing something with the data. Would it make sense to create a test repo for that?
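For illustration, a rough sketch (not the actual xradar code) of what keeping a backend's open_dataset lazy looks like: hand back the dataset from xarray's StoreBackendEntrypoint and avoid anything that forces values at open time. The class name and the _open_store helper are hypothetical.

from xarray.backends import BackendEntrypoint, StoreBackendEntrypoint


class LazyRadarBackendEntrypoint(BackendEntrypoint):  # hypothetical example
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        store = self._open_store(filename_or_obj, **kwargs)  # backend-specific, not shown
        store_entrypoint = StoreBackendEntrypoint()
        ds = store_entrypoint.open_dataset(store, drop_variables=drop_variables)
        # No .load()/.compute()/.sortby()/reindexing here -- anything that
        # pulls values would run eagerly inside open_mfdataset.
        return ds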

@mgrover1
Collaborator Author

Yeah, let's create a test repo to try this out - this is promising! We can take a look at more testing and establish some benchmarks to dig in here.

@mgrover1
Collaborator Author

Maybe xradar-benchmark?
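A first benchmark there could be as simple as timing the lazy open. A hypothetical asv-style sketch (the data path, engine choice, and class/method names are placeholders, not an existing suite):

import glob

import xarray as xr


class OpenMFDatasetSuite:
    def setup(self):
        self.files = sorted(glob.glob("data/cfradial1/*.nc"))  # placeholder path

    def time_open_lazy(self):
        # Opening should only build the dask graph, not load any data.
        xr.open_mfdataset(
            self.files,
            engine="cfradial1",
            group="sweep_0",
            concat_dim="volume_time",
            combine="nested",
            chunks={},
        )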
