
Fix NotImplementedError: xarray can't set arrays with multiple array indices to dask yet. #115

Closed
tomvothecoder opened this issue Oct 4, 2021 · 19 comments · Fixed by #132
Labels: type: bug (Inconsistencies or issues which will cause an issue or problem for users or implementors.)

tomvothecoder (Collaborator) commented Oct 4, 2021

### What versions of software are you using?

  • Package Version: latest master

### What are the steps to reproduce this issue?

```python
import xcdat

# specify chunks to open with Dask
ds = xcdat.open_dataset("path/to/file.nc", var="tas", chunks={"time": 5})

# call spatial averaging, which raises the error below
ts_n34 = ds.spatial.avg("tas", axis=["lat", "lon"], lat_bounds=(-5, 5), lon_bounds=(-170, -120))["tas"]
```

### What happens? Any logs, error output, etc?

```python
NotImplementedError: xarray can't set arrays with multiple array indices to dask yet.
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
~/Documents/Repositories/tomvothecoder/xcdat/qa/PR98/refactor.py in <module>
      24 # ---------------------
      25 # NotImplementedError: xarray can't set arrays with multiple array indices to dask yet.
----> 26 avg = ds.spatial.avg("tas", axis, lat_bounds=lat_bounds, lon_bounds=lon_bounds)

~/Documents/Repositories/tomvothecoder/xcdat/xcdat/spatial_avg.py in avg(self, data_var_name, axis, weights, lat_bounds, lon_bounds)
    137             if lon_bounds is not None:
    138                 self._validate_region_bounds("lon", lon_bounds)
--> 139             weights = self._get_weights(axis, lat_bounds, lon_bounds)
    140 
    141         self._validate_weights(data_var, axis, weights)

~/Documents/Repositories/tomvothecoder/xcdat/xcdat/spatial_avg.py in _get_weights(self, axis, lat_bounds, lon_bounds)
    270                             domain_bounds, reg_bounds
    271                         )
--> 272                     dom_bounds = self._scale_domain_to_region(dom_bounds, reg_bounds)
    273 
    274                 if dim == "lat":

~/Documents/Repositories/tomvothecoder/xcdat/xcdat/spatial_avg.py in _scale_domain_to_region(self, domain_bounds, region_bounds)
    420             # Case 1: not wrapping around prime meridian.
    421             # Adjustments for above / right of region.
--> 422             d_bounds[d_bounds[:, 0] > r_bounds[1], 0] = r_bounds[1]
    423             d_bounds[d_bounds[:, 1] > r_bounds[1], 1] = r_bounds[1]
    424             # Adjustments for below / left of region.

/opt/miniconda3/envs/xcdat_dev/lib/python3.8/site-packages/xarray/core/dataarray.py in __setitem__(self, key, value)
    765                 for k, v in self._item_key_to_dict(key).items()
    766             }
--> 767             self.variable[key] = value
    768 
    769     def __delitem__(self, key: Any) -> None:

/opt/miniconda3/envs/xcdat_dev/lib/python3.8/site-packages/xarray/core/variable.py in __setitem__(self, key, value)
    852 
    853         indexable = as_indexable(self._data)
--> 854         indexable[index_tuple] = value
    855 
    856     @property

/opt/miniconda3/envs/xcdat_dev/lib/python3.8/site-packages/xarray/core/indexing.py in __setitem__(self, key, value)
   1244                 )
   1245                 if num_non_slices > 1:
-> 1246                     raise NotImplementedError(
   1247                         "xarray can't set arrays with multiple "
   1248                         "array indices to dask yet."

NotImplementedError: xarray can't set arrays with multiple array indices to dask yet.
```

### What were you expecting to happen?
Parallelization of chunks should occur without issue.

### Any other comments?
Figure out a way to use xarray to perform the scaling.
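
One hypothetical direction (not the implemented fix): for the non-wrapping case, the clipping done by `_scale_domain_to_region` could be expressed with `DataArray.clip`, which stays lazy on dask-backed data. The bounds values below are made up for illustration.

```python
import xarray as xr

# Made-up bounds mimicking the shapes handled by _scale_domain_to_region.
d_bounds = xr.DataArray(
    [[-90.0, -45.0], [-45.0, 0.0], [0.0, 45.0], [45.0, 90.0]],
    dims=("lat", "bnds"),
).chunk({"lat": 2})
r_bounds = (-5.0, 5.0)

# Equivalent to the four boolean-index assignments for the non-wrapping case,
# but expressed as a lazy xarray operation instead of item assignment.
d_bounds = d_bounds.clip(min=r_bounds[0], max=r_bounds[1])
print(d_bounds.values)
```
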
@tomvothecoder tomvothecoder added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Oct 4, 2021
@tomvothecoder tomvothecoder added this to the v1.0.0 milestone Oct 4, 2021
tomvothecoder (Collaborator Author) commented:

Hey @pochedls, maybe you can help look into this issue if you have time?

pochedls (Collaborator) commented Oct 20, 2021

> Hey @pochedls, maybe you can help look into this issue if you have time?

@tomvothecoder - I just tried this with the latest release and I didn't get an error. I am not sure if it is automatically parallelizing, but I don't seem to be getting an error. Do you get this error with the latest release?

tomvothecoder (Collaborator Author) commented Oct 20, 2021

@pochedls I've updated the example so that region selection is performed, which raises the error.
```python
ts_n34 = ds.spatial.avg("tas", axis=["lat", "lon"], lat_bounds=(-5, 5), lon_bounds=(-170, -120))["tas"]
```

pochedls added a commit that referenced this issue Oct 21, 2021
pochedls (Collaborator) commented:

@tomvothecoder - here is a proposed fix. I wonder if there are going to be a lot of little places where we need to add a .load().

Note that I get a different error than the one specified in this issue. I got this error:

TypeError: this variable's data is stored in a dask array, which does not support item assignment. To assign to this variable, you must first load it into memory explicitly using the .load() method or accessing its .values attribute.
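
As a self-contained illustration of that kind of fix (not the actual commit; the bounds values are made up), loading the dask-backed bounds into memory before the boolean-index assignments avoids the error:

```python
import numpy as np
import xarray as xr

# Made-up dask-backed bounds, standing in for the domain bounds that
# _scale_domain_to_region manipulates.
d_bounds = xr.DataArray(
    [[-90.0, -45.0], [-45.0, 0.0], [0.0, 45.0], [45.0, 90.0]],
    dims=("lat", "bnds"),
).chunk({"lat": 2})
r_bounds = np.array([-5.0, 5.0])

# Load into memory first; without this, the assignments below fail on
# dask-backed data.
d_bounds = d_bounds.load()

d_bounds[d_bounds[:, 0] > r_bounds[1], 0] = r_bounds[1]
d_bounds[d_bounds[:, 1] > r_bounds[1], 1] = r_bounds[1]
```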

This actually can speed up the code pretty substantially.

```
# With parallelization / specifying chunks [chunks={"time": 20}]
$ time python notImplementedError.py
...
real    0m1.037s
user    0m0.824s
sys     0m0.119s

# Without parallelization
$ time python notImplementedError.py

real    0m5.090s
user    0m4.040s
sys     0m0.891s
```

tomvothecoder (Collaborator Author) commented:

> Note that I get a different error than the one specified in this issue. I got this error:
> TypeError: this variable's data is stored in a dask array, which does not support item assignment. To assign to this variable, you must first load it into memory explicitly using the .load() method or accessing its .values attribute.

I get NotImplementedError: xarray can't set arrays with multiple array indices to dask yet. when using the xcdat_test env (https://github.com/XCDAT/xcdat_test/blob/master/conda-env/test.yml), which includes xcdat=0.1.0, xarray=0.19.0, and dask=2021.9.1.

What versions are you running in your env? I am trying to reproduce your error.

> @tomvothecoder - here is a proposed fix. I wonder if there are going to be a lot of little places where we need to add a .load().

Nice to see the speed gains! Have you tried using the .values attribute instead of .load()? I wonder if the speed gains are comparable using .values since it isn't loading the data into memory. Using .values will hopefully reduce the memory footprint if it is working on "the array's data as numpy.ndarray" directly.
https://xarray.pydata.org/en/stable/generated/xarray.DataArray.values.html

Also, I don't think we need to have .load()/.values unless we're attempting to work on the Dask arrays directly (e.g., scaling domain down to region using numpy instead of xarray). The TypeError will definitely let us know where we need to place those lines of code.
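
For reference, a small sketch of how the two differ on a chunked DataArray (this is general xarray behavior, not xcdat-specific): `.values` computes the data into a NumPy array but leaves the DataArray dask-backed, while `.load()` computes in place, so `.values` does not avoid pulling the data into memory either.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0), dims="x").chunk({"x": 3})

vals = da.values        # computes into a NumPy array; `da` itself stays dask-backed
print(type(da.data))    # <class 'dask.array.core.Array'>

da.load()               # computes in place; `da` is now NumPy-backed
print(type(da.data))    # <class 'numpy.ndarray'>
```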

pochedls (Collaborator) commented Oct 21, 2021

I just re-created my environment and can reproduce the error message that you got. I'm not sure if I was using an old environment derived from dev.yml or if I had tweaked the dev environment for some reason. This commit does remove the NotImplementedError (even using the appropriate environment). Perhaps there is another way, though.

I did use .values, but this only works for DataArray objects (and we sometimes pass in an ndarray). I just updated the commit to remove .load() here. The speed gains are comparable (1.113s real).

tomvothecoder (Collaborator Author) commented:

The dev.yml was updated before the release of v0.1.0, so you may have been on an older version of xarray and/or dask. I forgot to bring it up at the meeting in case folks had an older env set up, my apologies.

I'm glad to hear that your fix still holds up with the new dependencies and with .values. You can open up a new PR whenever you're ready and I'll review it. Thanks!

tomvothecoder (Collaborator Author) commented:

Also, I'll compare .load() against .values to see if there are any major differences in memory usage.

pochedls (Collaborator) commented:

@tomvothecoder - I'm working on the PR and adding some tests. I wanted to add a test that checks that spatial averaging works for chunked data, but the fixtures dataset cannot be chunked (TypeError: 'Frozen' object is not callable). Could I find a dataset online (e.g., via THREDDS) or could we include a small netCDF file somewhere for testing?

tomvothecoder (Collaborator Author) commented Oct 25, 2021

@pochedls How did you produce that error? generate_dataset() returns a normal, chunkable Dataset object.

This should work:

```python
def setup(self):
    self.ds = generate_dataset(cf_compliant=True, has_bounds=True)

def some_test(self):
    ds = self.ds.copy()

    # Perform chunking
    ds = ds.chunk(5)
```
You can also set up your own Dataset object if the fixture isn't sufficient for the test(s). I often do this for specific test cases.
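
For example, a rough sketch of such a test (assuming the fixture's variable is named "tas" and has a "time" dimension to chunk on):

```python
import numpy as np

def test_spatial_avg_works_on_chunked_data(self):
    ds = self.ds.copy()
    expected = ds.spatial.avg("tas", axis=["lat", "lon"])["tas"]

    # Chunk the fixture dataset so the variable is dask-backed.
    ds_chunked = ds.chunk({"time": 2})
    result = ds_chunked.spatial.avg("tas", axis=["lat", "lon"])["tas"]

    np.testing.assert_allclose(result.values, expected.values)
```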

pochedls (Collaborator) commented:

@tomvothecoder - thanks - I had mistakenly thought .chunks() should work (here). I think this test is set now.

pochedls (Collaborator) commented:

In testing this, I'm not actually sure what is going on. If I revert to the beta release, I get an error (even with a single netCDF file and no chunking) when specifying lat/lon region bounds: ValueError: Cannot apply_along_axis when any iteration dimensions are 0.

If I use the proposed PR something is not working with the weights when I specify a region (I am getting non-zero weights everywhere).

I think this may be a package version issue, because in preliminary testing (pre-release) this was working, but I didn't have the updated dependencies.

tomvothecoder (Collaborator Author) commented Oct 26, 2021

I can try taking a look if you can provide me with the inputs and commands to produce the errors.

> If I revert to the beta release, I get an error (even with a single netCDF file and no chunking) when specifying lat/lon region bounds: ValueError: Cannot apply_along_axis when any iteration dimensions are 0.

It looks like a numpy error related to mismatching shapes. I'm not sure which function in the stack trace is raising this error.

A few links that might help:

> If I use the proposed PR something is not working with the weights when I specify a region (I am getting non-zero weights everywhere).

Not sure what is going on here either, but I can try debugging too.

> I think this may be a package version issue, because in preliminary testing (pre-release) this was working, but I didn't have the updated dependencies.

The unit tests should catch API breaking errors, but they still pass with the latest dev env.

It is difficult to pinpoint package versioning as the root of this regression given the number of possible factors (changing inputs for testing, edge cases we've missed, a bug in the code that wasn't revealed until now, etc.).

However, if it were package versioning related, it could be a newer numpy version in the latest dev env that has a new check in apply_along_axis() for an undesirable condition. I noticed numpy and pandas are not pinned in the dev env and are installed implicitly through xarray. This might introduce problems, but I doubt there are any API breaking errors since there are no recent major version releases of numpy. If anything, it is a good thing that it now raises this error if an undesirable condition is met.

Latest dev env

```yaml
dependencies:
    # Base
    # ==================
    - python=3.9.7
    - pip=21.2.4
    - typing_extensions=3.10.0.0 # Required to make use of Python >=3.8 backported types
    - netcdf4=1.5.7
    - xarray=0.19.0
    - cf_xarray=0.6.1
    - dask=2021.9.1
prefix: /opt/miniconda3/envs/xcdat_dev
```

Previous dev env

```yaml
dependencies:
    # Base
    # ==================
    - python=3.8.8
    - pip=21.0.1
    - typing_extensions=3.7.4 # Required to make use of Python >=3.8 backported types
    - netcdf4=1.5.6
    - xarray=0.17.0
    - cf_xarray=0.6.0
    - dask=2021.7.0
prefix: /opt/miniconda3/envs/xcdat_dev
```


tomvothecoder (Collaborator Author) commented:

> I noticed numpy and pandas are not pinned in the dev env and are installed implicitly through xarray. This might introduce problems, but I doubt there are any API breaking errors since there are no recent major version releases of numpy. If anything, it is a good thing that it now raises this error if an undesirable condition is met.

Also, we don't want to pin dependency versions in the conda recipe to avoid package conflicts. So either way, we would need to address this issue (rather than limit numpy to an older version).

pochedls (Collaborator) commented Oct 26, 2021

The original error

NotImplementedError: xarray can't set arrays with multiple array indices to dask yet.

occurs when assigning into arrays that were opened with dask (chunked) without using .values. It originally arose because dask did not support item assignment, but that was fixed in dask 2021.04.1. Since we are using dask 2021.9.1, the remaining limitation seems to be on the xarray side: xarray does not yet support this kind of assignment.
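
A quick check of dask's own item-assignment support, independent of xarray (assuming dask >= 2021.04.1):

```python
import dask.array as da

x = da.arange(10, chunks=5)
x[x < 5] = 0        # item assignment on a dask array, supported since ~2021.04
print(x.compute())  # [0 0 0 0 0 5 6 7 8 9]
```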

But we need a new approach (relative to my last commit) because using .values is not actually updating the array values:

```python
# set one of the boundaries to all ones
ds.lat_bnds.values[:, 0] = np.ones(ds.lat_bnds.shape[0])
# print results (nothing changes)
print(ds.lat_bnds[:, 0])
```

array([-90. , -89.52879 , -88.58639 , -87.64398 , -86.70157 , ...

It seems like the only option at the moment is to call .load() (along the lines of this initial attempt on this issue).

@tomvothecoder - do you see a different way of resolving this?

tomvothecoder (Collaborator Author) commented Oct 27, 2021

@pochedls

Originally, the old dev env had xarray v0.17.0, which did not support item assignment on dask-backed arrays at all. Dask was not pinned. https://github.com/pydata/xarray/blob/v0.17.0/xarray/core/indexing.py#L1378-L1385

```python
def __setitem__(self, key, value):
    raise TypeError(
        "this variable's data is stored in a dask array, "
        "which does not support item assignment. To "
        "assign to this variable, you must first load it "
        "into memory explicitly using the .load() "
        "method or accessing its .values attribute."
    )
```

The dev env now includes xarray v0.19.0 and Dask 2021.9.1.

xarray v0.19.0 includes pydata/xarray#5174, which adds support for item assignment with at most a single array index if dask_version >= "2021.04.1" (code below).
https://github.com/pydata/xarray/blob/v0.19.0/xarray/core/indexing.py#L1235-L1258

```python
def __setitem__(self, key, value):
    if dask_version >= "2021.04.1":
        if isinstance(key, BasicIndexer):
            self.array[key.tuple] = value
        elif isinstance(key, VectorizedIndexer):
            self.array.vindex[key.tuple] = value
        elif isinstance(key, OuterIndexer):
            num_non_slices = sum(
                0 if isinstance(k, slice) else 1 for k in key.tuple
            )
            if num_non_slices > 1:
                raise NotImplementedError(
                    "xarray can't set arrays with multiple "
                    "array indices to dask yet."
                )
            self.array[key.tuple] = value
    else:
        raise TypeError(
            "This variable's data is stored in a dask array, "
            "and the installed dask version does not support item "
            "assignment. To assign to this variable, you must either upgrade dask or"
            "first load the variable into memory explicitly using the .load() "
            "method or accessing its .values attribute."
        )
```

The actual root cause of the issue is the NotImplementedError: xarray can't set arrays with multiple array indices to dask yet, raised by this chunk of code:

```python
elif isinstance(key, OuterIndexer):
    num_non_slices = sum(
        0 if isinstance(k, slice) else 1 for k in key.tuple
    )
    if num_non_slices > 1:
        raise NotImplementedError(
            "xarray can't set arrays with multiple "
            "array indices to dask yet."
        )
```

I think if we don't use .load()/.values, the key is treated as an OuterIndexer and num_non_slices is > 1.
I'd have to add breakpoints to check though.
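
A small reproduction consistent with that reading, using the same assignment pattern as `_scale_domain_to_region` on dask-backed data (the bounds values are made up):

```python
import xarray as xr

d_bounds = xr.DataArray(
    [[-90.0, -45.0], [-45.0, 0.0], [0.0, 45.0], [45.0, 90.0]],
    dims=("lat", "bnds"),
).chunk({"lat": 2})

try:
    # A boolean array index plus an integer index, i.e. two non-slice indices
    # in the OuterIndexer key.
    d_bounds[d_bounds[:, 0] > 5.0, 0] = 5.0
except NotImplementedError as e:
    print(e)  # xarray can't set arrays with multiple array indices to dask yet.
```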


These are the things we've tried, which produced different results.

  1. Don't use .load()/.values
    • Results in NotImplementedError xarray can't set arrays with multiple array indices to dask yet
  2. Use .values
    • Works directly on the array, but does not actually update the array (maybe works on a copy of the array?)
    • Produces incorrect results
  3. Use .load()
    • Loads array into memory, updates array
    • May reduce performance
    • Produces correct results

So far, only option 3 (.load()) appears to work. I will do some more investigating though.

tomvothecoder (Collaborator Author) commented:

Hey @lee1043, can you push the latest changes to your demo_spatial_avg notebook (https://github.com/XCDAT/xcdat_test/blob/spatial_avg_tutorial/tutorials/demo_spatial_avg.ipynb) where you produce the error ValueError: Cannot apply_along_axis when any iteration dimensions are 0.?

Steve also received this error, and I'm thinking it is from an updated version of numpy. We'll need to address this issue since we don't pin numpy in v0.1.0. Refer to #115 (comment) for more background info.

lee1043 (Collaborator) commented Oct 27, 2021

@tomvothecoder sure, just pushed. I see my numpy version is 1.21.2.

tomvothecoder (Collaborator Author) commented:

Background info on .load() and .values:

.load()

  • "The easiest way to convert an xarray data structure from lazy Dask arrays into eager, in-memory NumPy arrays is to use the load() method."
  • Allows us to manipulate the arrays directly
  • If we don't use it, an error is raised: NotImplementedError xarray can't set arrays with multiple array indices to dask yet
  • Alternative solution is to make a copy of the data, perform the swap, then create a new DataArray using the swapped data. Probably more work and not sure if it is any faster though.

.values

  • .values is the array’s data as a numpy.ndarray.
  • Fixes NotImplementedError xarray can't set arrays with multiple array indices to dask yet (but for dask-backed arrays it returns a computed copy, so the assigned values don't actually update the underlying data)
  • Have to do .load() first, so might as well access the values directly (lon_swap[inds] instead of lon_swap.values[inds])

Debugging

1. ValueError: Cannot apply_along_axis when any iteration dimensions are 0. (numpy)

This is a standalone bug that is unrelated to chunking or dependencies.

Related Code:

```python
# in _swap_lon_axes()
elif to == 360:
    inds = np.where(lon_swap < 0)
    if type(lon_swap) == xr.core.dataarray.DataArray:
        lon_swap.values[inds] = lon_swap.values[inds] + 360
    else:
        lon_swap[inds] = lon_swap[inds] + 360
```

Cause:
np.where can return an empty set of indices. When we then attempt to access lon_swap[inds] with an empty index array, numpy raises ValueError: Cannot apply_along_axis when any iteration dimensions are 0.
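
For reference, the numpy side of this can be reproduced in isolation; `apply_along_axis` refuses to iterate over a zero-length dimension, which is presumably what the empty selection produces downstream:

```python
import numpy as np

empty = np.empty((0, 2))
np.apply_along_axis(np.sum, 1, empty)
# ValueError: Cannot apply_along_axis when any iteration dimensions are 0
```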

Solution:

```python
elif to == 360:
    inds = np.where(lon_swap < 0)
    count = len(inds[0])

    if count > 0:
        if type(lon_swap) == xr.DataArray:
            lon_swap.values[inds] = lon_swap.values[inds] + 360
        else:
            lon_swap[inds] = lon_swap[inds] + 360
else:
    raise ValueError("Only longitude axis orientation 180 or 360 is supported.")
```

2. NotImplementedError xarray can't set arrays with multiple array indices to dask yet (xarray + Dask)

This requires either calling .load() on DataArrays that are chunked, or copying the data, performing the swap, and then constructing a new DataArray from the swapped data as the return value.

Solution:

```python
lon_swap = lon.copy()

# Must convert an xarray data structure from lazy Dask arrays
# into eager, in-memory NumPy arrays before performing manipulations on
# the data. Otherwise, it raises `NotImplementedError xarray can't set
# arrays with multiple array indices to dask yet`
if type(lon_swap) == xr.DataArray:
    lon_swap.load()

if to == 180:
    indices = np.where(lon_swap > 180)
    count = len(indices[0])
    if count > 0:
        lon_swap[indices] = lon_swap[indices] - 360
elif to == 360:
    indices = np.where(lon_swap < 0)
    count = len(indices[0])
    if count > 0:
        lon_swap[indices] = lon_swap[indices] + 360
else:
    raise ValueError("Only longitude axis orientation 180 or 360 is supported.")

return lon_swap
```
