Make ArrayStore data methods more flexible (#402)
## Description

<!-- Provide a brief description of the PR's purpose here. -->

This PR modifies ArrayStore as follows:

- Introduces a tuple return type to the `retrieve` and `as_dict` methods
- Renames `as_dict` to `data()`
- Removes `as_pandas()` in favor of `data(return_type="pandas")` and
`retrieve(return_type="pandas")`
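The new `return_type` dispatch can be sketched as follows. This is a minimal standalone illustration, not the actual pyribs implementation; `retrieve_sketch` and its input arrays are hypothetical:

```python
import numpy as np

def retrieve_sketch(field_arrays, return_type="dict"):
    # Accumulate per-field arrays into a dict or a tuple, mirroring the
    # new `return_type` argument on ArrayStore.retrieve (toy version).
    if return_type == "dict":
        data = {}
    elif return_type == "tuple":
        data = []
    else:
        raise ValueError(f"Invalid return_type {return_type}.")

    for name, arr in field_arrays.items():
        if return_type == "dict":
            data[name] = arr
        else:
            data.append(arr)

    return tuple(data) if return_type == "tuple" else data

fields = {
    "objective": np.array([1.5, 6.0, 2.3]),
    "index": np.array([4, 1, 0], dtype=np.int32),
}
as_dict = retrieve_sketch(fields, "dict")    # {"objective": ..., "index": ...}
as_tuple = retrieve_sketch(fields, "tuple")  # (objective_arr, index_arr)
```

The same dispatch underlies both `data()` and `retrieve()`, which is what lets one `return_type` argument replace the separate `as_dict`/`as_pandas` methods.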

## TODO

<!-- Notable points that this PR has either accomplished or will
accomplish. -->

## Questions

<!-- Any concerns or points of confusion? -->

## Status

- [x] I have read the guidelines in [CONTRIBUTING.md](https://github.com/icaros-usc/pyribs/blob/master/CONTRIBUTING.md)
- [x] I have formatted my code using `yapf`
- [x] I have tested my code by running `pytest`
- [x] I have linted my code with `pylint`
- [x] I have added a one-line description of my change to the changelog in `HISTORY.md`
- [x] This PR is ready to go
btjanaka authored Nov 6, 2023
1 parent ae207b4 commit 80fbd3e
Showing 3 changed files with 263 additions and 168 deletions.
4 changes: 2 additions & 2 deletions HISTORY.md
@@ -10,15 +10,15 @@
({pr}`397`)
- **Backwards-incompatible:** Rename `measure_*` columns to `measures_*` in
`as_pandas` ({pr}`396`)
- Add ArrayStore data structure ({pr}`395`, {pr}`398`, {pr}`400`)
- Add ArrayStore data structure ({pr}`395`, {pr}`398`, {pr}`400`, {pr}`402`)
- Add GradientOperatorEmitter to support OMG-MEGA and OG-MAP-Elites ({pr}`348`)

#### Improvements

- Use chunk computation in CVT brute force calculation to reduce memory usage
({pr}`394`)
- Test pyribs installation in tutorials ({pr}`384`)
- Add cron job for testing installation ({pr}`389`)
- Add cron job for testing installation ({pr}`389`, {pr}`401`)
- Fix broken cross-refs in docs ({pr}`393`)

## 0.6.3
211 changes: 114 additions & 97 deletions ribs/archives/_array_store.py
@@ -1,6 +1,5 @@
"""Provides ArrayStore."""
import itertools
from collections import OrderedDict
from enum import IntEnum

import numpy as np
@@ -173,14 +172,16 @@ def occupied_list(self):
return readonly(
self._props["occupied_list"][:self._props["n_occupied"]])

-    def retrieve(self, indices, fields=None):
-        """Collects the data at the given indices.
+    def retrieve(self, indices, fields=None, return_type="dict"):
+        """Collects data at the given indices.
Args:
indices (array-like): List of indices at which to collect data.
fields (array-like of str): List of fields to include. By default,
-                all fields will be included. In addition to fields in the store,
-                "index" is also a valid field.
+                all fields will be included, with an additional "index" as the
+                last field ("index" can also be placed anywhere in this list).
+            return_type (str): Type of data to return. See the ``data`` returned
+                below.
Returns:
tuple: 2-element tuple consisting of:
@@ -189,42 +190,136 @@ def retrieve(self, indices, fields=None):
in, have an associated data entry. For instance, if ``indices`` is
``[0, 1, 2]`` and only index 2 has data, then ``occupied`` will be
``[False, False, True]``.
-            - **data**: Dict mapping from the field name to the field data at
-              the given indices. For instance, if we have an ``objective`` field
-              and request data at indices ``[4, 1, 0]``, we might get ``data``
-              that looks like ``{"index": [4, 1, 0], "objective": [1.5, 6.0,
-              2.3]}``. Observe that we also return the indices as an ``index''
-              entry in the dict. The keys in this dict can be modified using the
-              ``fields`` arg.
             Note that if a given index is not marked as occupied, it can have
             any data value associated with it. For instance, if index 1 was
-            not occupied, then the 6.0 returned above should be ignored.
+            not occupied, then the 6.0 returned in the ``dict`` example below
+            should be ignored.
- **data**: The data at the given indices. This can take the
following forms, depending on the ``return_type`` argument:
- ``return_type="dict"``: Dict mapping from the field name to the
field data at the given indices. For instance, if we have an
``objective`` field and request data at indices ``[4, 1, 0]``,
we would get ``data`` that looks like ``{"objective": [1.5, 6.0,
2.3], "index": [4, 1, 0]}``. Observe that we also return the
indices as an ``index`` entry in the dict. The keys in this dict
can be modified using the ``fields`` arg; duplicate keys will be
ignored since the dict stores unique keys.
- ``return_type="tuple"``: Tuple of arrays matching the order
given in ``fields``. For instance, if ``fields`` was
``["objective", "measures"]``, we would receive a tuple of
``(objective_arr, measures_arr)``. In this case, the results
from ``retrieve`` could be unpacked as::
occupied, (objective, measures) = store.retrieve(...)
Unlike with the ``dict`` return type, duplicate fields will show
up as duplicate entries in the tuple, e.g.,
``fields=["objective", "objective"]`` will result in two
objective arrays being returned.
By default (i.e., when ``fields=None``), the fields in the
tuple will be ordered according to the ``field_desc`` argument
in the constructor, along with ``index`` as the last field.
- ``return_type="pandas"``: A :class:`pandas.DataFrame` with the
following columns (by default):
- For fields that are scalars, a single column with the field
name. For example, ``objective`` would have a single column
called ``objective``.
- For fields that are 1D arrays, multiple columns with the name
suffixed by its index. For instance, if we have a ``measures``
field of length 10, we create 10 columns with names
``measures_0``, ``measures_1``, ..., ``measures_9``. We do not
currently support fields with >1D data.
- 1 column of integers (``np.int32``) for the index, named
``index``.
In short, the dataframe might look like this:
+-----------+------------+------+-------+
| objective | measures_0 | ... | index |
+===========+============+======+=======+
| | | ... | |
+-----------+------------+------+-------+
Like the other return types, the columns can be adjusted with
the ``fields`` parameter.
All data returned by this method will be a readonly copy, i.e., the
data will not update as the store changes.
Raises:
ValueError: Invalid field name provided.
ValueError: Invalid return_type provided.
"""
indices = np.asarray(indices, dtype=np.int32)
occupied = readonly(self._props["occupied"][indices])

-        data = {}
-        fields = (itertools.chain(["index"], self._fields)
+        if return_type in ("dict", "pandas"):
+            data = {}
+        elif return_type == "tuple":
+            data = []
+        else:
+            raise ValueError(f"Invalid return_type {return_type}.")
+
+        fields = (itertools.chain(self._fields, ["index"])
                   if fields is None else fields)
         for name in fields:
             # Collect array data.
             #
             # Note that fancy indexing with indices already creates a copy, so
             # only `indices` needs to be copied explicitly.
             if name == "index":
-                data[name] = readonly(np.copy(indices))
-                continue
-            if name not in self._fields:
+                arr = readonly(np.copy(indices))
+            elif name in self._fields:
+                arr = readonly(self._fields[name][indices])
+            else:
                 raise ValueError(f"`{name}` is not a field in this ArrayStore.")
-            data[name] = readonly(self._fields[name][indices])

# Accumulate data into the return type.
if return_type == "dict":
data[name] = arr
elif return_type == "tuple":
data.append(arr)
elif return_type == "pandas":
if len(arr.shape) == 1: # Scalar entries.
data[name] = arr
elif len(arr.shape) == 2: # 1D array entries.
for i in range(arr.shape[1]):
data[f"{name}_{i}"] = arr[:, i]
else:
raise ValueError(
f"Field `{name}` has shape {arr.shape[1:]} -- "
"cannot convert fields with shape >1D to Pandas")

# Postprocess return data.
if return_type == "tuple":
data = tuple(data)
elif return_type == "pandas":
# Data above are already copied, so no need to copy again.
data = DataFrame(data, copy=False)

return occupied, data
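The `return_type="pandas"` branch above flattens scalar fields into single columns and 1D-array fields into per-component columns. A standalone sketch of that flattening, using made-up field arrays already gathered at indices `[4, 1, 0]` (illustrative only, not the pyribs code):

```python
import numpy as np
import pandas as pd

# Hypothetical field arrays retrieved at indices [4, 1, 0].
arrays = {
    "objective": np.array([1.5, 6.0, 2.3]),     # scalar field
    "measures": np.array([[0.1, 0.2],
                          [0.3, 0.4],
                          [0.5, 0.6]]),         # 1D array field of length 2
    "index": np.array([4, 1, 0], dtype=np.int32),
}

data = {}
for name, arr in arrays.items():
    if arr.ndim == 1:    # Scalar entries -> one column with the field name.
        data[name] = arr
    elif arr.ndim == 2:  # 1D array entries -> one column per component.
        for i in range(arr.shape[1]):
            data[f"{name}_{i}"] = arr[:, i]

# The column slices above are views of already-copied arrays, so the
# DataFrame does not need to copy again.
df = pd.DataFrame(data, copy=False)
# Columns: objective, measures_0, measures_1, index
```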

def data(self, fields=None, return_type="dict"):
"""Retrieves data for all entries in the store.
Equivalent to calling :meth:`retrieve` with :attr:`occupied_list`.
Args:
    fields (array-like of str): See :meth:`retrieve`.
    return_type (str): See :meth:`retrieve`.
Returns:
    dict, tuple, or pandas.DataFrame: See ``data`` in :meth:`retrieve`.
    ``occupied`` is not returned since all indices are known to be occupied
    in this method.
"""
return self.retrieve(self.occupied_list, fields, return_type)[1]
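The duplicate-field behavior documented for `retrieve` (dict return types collapse repeated fields, tuple return types keep them) follows directly from ordinary Python semantics; a minimal sketch with a made-up `objective` array:

```python
import numpy as np

objective = np.array([1.5, 6.0, 2.3])
requested = ["objective", "objective"]  # duplicate field request

# dict accumulation: keys are unique, so the second write overwrites the first.
as_dict = {}
for name in requested:
    as_dict[name] = objective

# tuple accumulation: each requested field gets its own positional entry.
as_tuple = tuple(objective for _ in requested)
```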

def add(self, indices, new_data, extra_args, transforms):
"""Adds new data to the store at the given indices.
@@ -431,81 +526,3 @@ def from_raw_dict(d):
store._fields = fields

return store

def as_dict(self, fields=None):
"""Creates a dict containing all data entries in the store.
Equivalent to calling :meth:`retrieve` with :attr:`occupied_list`.
Args:
fields (array-like of str): See :meth:`retrieve`.
Returns:
dict: See ``data`` in :meth:`retrieve`. ``occupied`` is not returned
since all indices are known to be occupied in this method.
"""
return self.retrieve(self.occupied_list, fields)[1]

def as_pandas(self, fields=None):
"""Creates a DataFrame containing all data entries in the store.
The returned DataFrame has:
- 1 column of integers (``np.int32``) for the index, named ``index``.
- For fields that are scalars, a single column with the field name. For
example, ``objective'' would have a single column called
``objective``.
- For fields that are 1D arrays, multiple columns with the name suffixed
by its index. For instance, if we have a ``measures'' field of length
10, we create 10 columns with names ``measures_0``, ``measures_1``,
..., ``measures_9``.
- We do not currently support fields with >1D data.
In short, the dataframe might look like this:
+-------+------------+------+-----------+
| index | measures_0 | ... | objective |
+=======+============+======+===========+
| | | ... | |
+-------+------------+------+-----------+
Args:
fields (array-like of str): List of fields to include. By default,
all fields will be included. In addition to fields in the store,
"index" is also a valid field.
Returns:
pandas.DataFrame: See above.
Raises:
ValueError: Invalid field name provided.
ValueError: There is a field with >1D data.
"""
data = OrderedDict()
indices = self._props["occupied_list"][:self._props["n_occupied"]]

fields = (itertools.chain(["index"], self._fields)
if fields is None else fields)

for name in fields:
if name == "index":
data[name] = np.copy(indices)
continue

if name not in self._fields:
raise ValueError(f"`{name}` is not a field in this ArrayStore.")

arr = self._fields[name]
if len(arr.shape) == 1: # Scalar entries.
data[name] = arr[indices]
elif len(arr.shape) == 2: # 1D array entries.
arr = arr[indices]
for i in range(arr.shape[1]):
data[f"{name}_{i}"] = arr[:, i]
else:
raise ValueError(
f"Field `{name}` has shape {arr.shape[1:]} -- "
"cannot convert fields with shape >1D to Pandas")

return DataFrame(
data,
copy=False, # Fancy indexing above copies all fields, and
# indices is explicitly copied.
)