Make ArrayStore data methods more flexible (#402)
## Description

<!-- Provide a brief description of the PR's purpose here. -->

This PR modifies ArrayStore as follows:

- Introduces a tuple return type to the `retrieve` and `as_dict` methods
- Renames `as_dict` to `data()`
- Removes `as_pandas()` in favor of `data(return_type="pandas")` and
`retrieve(return_type="pandas")`
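The new `return_type` dispatch can be sketched as follows. This is a minimal standalone illustration, not the actual pyribs implementation; `retrieve_sketch` and its input arrays are hypothetical:

```python
import numpy as np

def retrieve_sketch(field_arrays, return_type="dict"):
    # Accumulate per-field arrays into a dict or a tuple, mirroring the
    # new `return_type` argument on ArrayStore.retrieve (toy version).
    if return_type == "dict":
        data = {}
    elif return_type == "tuple":
        data = []
    else:
        raise ValueError(f"Invalid return_type {return_type}.")

    for name, arr in field_arrays.items():
        if return_type == "dict":
            data[name] = arr
        else:
            data.append(arr)

    return tuple(data) if return_type == "tuple" else data

fields = {
    "objective": np.array([1.5, 6.0, 2.3]),
    "index": np.array([4, 1, 0], dtype=np.int32),
}
as_dict = retrieve_sketch(fields, "dict")    # {"objective": ..., "index": ...}
as_tuple = retrieve_sketch(fields, "tuple")  # (objective_arr, index_arr)
```

The same dispatch underlies both `data()` and `retrieve()`, which is what lets one `return_type` argument replace the separate `as_dict`/`as_pandas` methods.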

## TODO

<!-- Notable points that this PR has either accomplished or will
accomplish. -->

## Questions

<!-- Any concerns or points of confusion? -->

## Status

- [x] I have read the guidelines in [CONTRIBUTING.md](https://github.com/icaros-usc/pyribs/blob/master/CONTRIBUTING.md)
- [x] I have formatted my code using `yapf`
- [x] I have tested my code by running `pytest`
- [x] I have linted my code with `pylint`
- [x] I have added a one-line description of my change to the changelog in `HISTORY.md`
- [x] This PR is ready to go
btjanaka authored Nov 6, 2023
1 parent ae207b4 commit 80fbd3e
Showing 3 changed files with 263 additions and 168 deletions.
4 changes: 2 additions & 2 deletions HISTORY.md
@@ -10,15 +10,15 @@
({pr}`397`)
- **Backwards-incompatible:** Rename `measure_*` columns to `measures_*` in
`as_pandas` ({pr}`396`)
- Add ArrayStore data structure ({pr}`395`, {pr}`398`, {pr}`400`)
- Add ArrayStore data structure ({pr}`395`, {pr}`398`, {pr}`400`, {pr}`402`)
- Add GradientOperatorEmitter to support OMG-MEGA and OG-MAP-Elites ({pr}`348`)

#### Improvements

- Use chunk computation in CVT brute force calculation to reduce memory usage
({pr}`394`)
- Test pyribs installation in tutorials ({pr}`384`)
- Add cron job for testing installation ({pr}`389`)
- Add cron job for testing installation ({pr}`389`, {pr}`401`)
- Fix broken cross-refs in docs ({pr}`393`)

## 0.6.3
211 changes: 114 additions & 97 deletions ribs/archives/_array_store.py
@@ -1,6 +1,5 @@
"""Provides ArrayStore."""
import itertools
from collections import OrderedDict
from enum import IntEnum

import numpy as np
@@ -173,14 +172,16 @@ def occupied_list(self):
return readonly(
self._props["occupied_list"][:self._props["n_occupied"]])

-    def retrieve(self, indices, fields=None):
-        """Collects the data at the given indices.
+    def retrieve(self, indices, fields=None, return_type="dict"):
+        """Collects data at the given indices.
Args:
indices (array-like): List of indices at which to collect data.
fields (array-like of str): List of fields to include. By default,
-                all fields will be included. In addition to fields in the store,
-                "index" is also a valid field.
+                all fields will be included, with an additional "index" as the
+                last field ("index" can also be placed anywhere in this list).
+            return_type (str): Type of data to return. See the ``data`` returned
+                below.
Returns:
tuple: 2-element tuple consisting of:
@@ -189,42 +190,136 @@ def retrieve(self, indices, fields=None):
in, have an associated data entry. For instance, if ``indices`` is
``[0, 1, 2]`` and only index 2 has data, then ``occupied`` will be
``[False, False, True]``.
-            - **data**: Dict mapping from the field name to the field data at
-              the given indices. For instance, if we have an ``objective`` field
-              and request data at indices ``[4, 1, 0]``, we might get ``data``
-              that looks like ``{"index": [4, 1, 0], "objective": [1.5, 6.0,
-              2.3]}``. Observe that we also return the indices as an ``index''
-              entry in the dict. The keys in this dict can be modified using the
-              ``fields`` arg.
             Note that if a given index is not marked as occupied, it can have
             any data value associated with it. For instance, if index 1 was
-            not occupied, then the 6.0 returned above should be ignored.
+            not occupied, then the 6.0 returned in the ``dict`` example below
+            should be ignored.
- **data**: The data at the given indices. This can take the
following forms, depending on the ``return_type`` argument:
- ``return_type="dict"``: Dict mapping from the field name to the
field data at the given indices. For instance, if we have an
``objective`` field and request data at indices ``[4, 1, 0]``,
we would get ``data`` that looks like ``{"objective": [1.5, 6.0,
2.3], "index": [4, 1, 0]}``. Observe that we also return the
indices as an ``index`` entry in the dict. The keys in this dict
can be modified using the ``fields`` arg; duplicate keys will be
ignored since the dict stores unique keys.
- ``return_type="tuple"``: Tuple of arrays matching the order
given in ``fields``. For instance, if ``fields`` was
``["objective", "measures"]``, we would receive a tuple of
``(objective_arr, measures_arr)``. In this case, the results
from ``retrieve`` could be unpacked as::
occupied, (objective, measures) = store.retrieve(...)
Unlike with the ``dict`` return type, duplicate fields will show
up as duplicate entries in the tuple, e.g.,
``fields=["objective", "objective"]`` will result in two
objective arrays being returned.
By default (i.e., when ``fields=None``), the fields in the
tuple will be ordered according to the ``field_desc`` argument
in the constructor, along with ``index`` as the last field.
- ``return_type="pandas"``: A :class:`pandas.DataFrame` with the
following columns (by default):
- For fields that are scalars, a single column with the field
name. For example, ``objective`` would have a single column
called ``objective``.
- For fields that are 1D arrays, multiple columns with the name
suffixed by its index. For instance, if we have a ``measures``
field of length 10, we create 10 columns with names
``measures_0``, ``measures_1``, ..., ``measures_9``. We do not
currently support fields with >1D data.
- 1 column of integers (``np.int32``) for the index, named
``index``.
In short, the dataframe might look like this:
+-----------+------------+------+-------+
| objective | measures_0 | ... | index |
+===========+============+======+=======+
| | | ... | |
+-----------+------------+------+-------+
Like the other return types, the columns can be adjusted with
the ``fields`` parameter.
All data returned by this method will be a readonly copy, i.e., the
data will not update as the store changes.
Raises:
ValueError: Invalid field name provided.
ValueError: Invalid return_type provided.
"""
indices = np.asarray(indices, dtype=np.int32)
occupied = readonly(self._props["occupied"][indices])

-        data = {}
-        fields = (itertools.chain(["index"], self._fields)
+        if return_type in ("dict", "pandas"):
+            data = {}
+        elif return_type == "tuple":
+            data = []
+        else:
+            raise ValueError(f"Invalid return_type {return_type}.")
+
+        fields = (itertools.chain(self._fields, ["index"])
                   if fields is None else fields)
         for name in fields:
             # Collect array data.
             #
             # Note that fancy indexing with indices already creates a copy, so
             # only `indices` needs to be copied explicitly.
             if name == "index":
-                data[name] = readonly(np.copy(indices))
-                continue
-            if name not in self._fields:
+                arr = readonly(np.copy(indices))
+            elif name in self._fields:
+                arr = readonly(self._fields[name][indices])
+            else:
                 raise ValueError(f"`{name}` is not a field in this ArrayStore.")
-            data[name] = readonly(self._fields[name][indices])

# Accumulate data into the return type.
if return_type == "dict":
data[name] = arr
elif return_type == "tuple":
data.append(arr)
elif return_type == "pandas":
if len(arr.shape) == 1: # Scalar entries.
data[name] = arr
elif len(arr.shape) == 2: # 1D array entries.
for i in range(arr.shape[1]):
data[f"{name}_{i}"] = arr[:, i]
else:
raise ValueError(
f"Field `{name}` has shape {arr.shape[1:]} -- "
"cannot convert fields with shape >1D to Pandas")

# Postprocess return data.
if return_type == "tuple":
data = tuple(data)
elif return_type == "pandas":
# Data above are already copied, so no need to copy again.
data = DataFrame(data, copy=False)

return occupied, data
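The `return_type="pandas"` branch above flattens scalar fields into single columns and 1D-array fields into per-component columns. A standalone sketch of that flattening, using made-up field arrays already gathered at indices `[4, 1, 0]` (illustrative only, not the pyribs code):

```python
import numpy as np
import pandas as pd

# Hypothetical field arrays retrieved at indices [4, 1, 0].
arrays = {
    "objective": np.array([1.5, 6.0, 2.3]),     # scalar field
    "measures": np.array([[0.1, 0.2],
                          [0.3, 0.4],
                          [0.5, 0.6]]),         # 1D array field of length 2
    "index": np.array([4, 1, 0], dtype=np.int32),
}

data = {}
for name, arr in arrays.items():
    if arr.ndim == 1:    # Scalar entries -> one column with the field name.
        data[name] = arr
    elif arr.ndim == 2:  # 1D array entries -> one column per component.
        for i in range(arr.shape[1]):
            data[f"{name}_{i}"] = arr[:, i]

# The column slices above are views of already-copied arrays, so the
# DataFrame does not need to copy again.
df = pd.DataFrame(data, copy=False)
# Columns: objective, measures_0, measures_1, index
```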

def data(self, fields=None, return_type="dict"):
"""Retrieves data for all entries in the store.
Equivalent to calling :meth:`retrieve` with :attr:`occupied_list`.
Args:
    fields (array-like of str): See :meth:`retrieve`.
    return_type (str): See :meth:`retrieve`.
Returns:
    dict, tuple, or pandas.DataFrame: See ``data`` in :meth:`retrieve`.
    ``occupied`` is not returned since all indices are known to be occupied
    in this method.
"""
return self.retrieve(self.occupied_list, fields, return_type)[1]
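The duplicate-field behavior documented for `retrieve` (dict return types collapse repeated fields, tuple return types keep them) follows directly from ordinary Python semantics; a minimal sketch with a made-up `objective` array:

```python
import numpy as np

objective = np.array([1.5, 6.0, 2.3])
requested = ["objective", "objective"]  # duplicate field request

# dict accumulation: keys are unique, so the second write overwrites the first.
as_dict = {}
for name in requested:
    as_dict[name] = objective

# tuple accumulation: each requested field gets its own positional entry.
as_tuple = tuple(objective for _ in requested)
```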

def add(self, indices, new_data, extra_args, transforms):
"""Adds new data to the store at the given indices.
@@ -431,81 +526,3 @@ def from_raw_dict(d):
store._fields = fields

return store

def as_dict(self, fields=None):
"""Creates a dict containing all data entries in the store.
Equivalent to calling :meth:`retrieve` with :attr:`occupied_list`.
Args:
fields (array-like of str): See :meth:`retrieve`.
Returns:
dict: See ``data`` in :meth:`retrieve`. ``occupied`` is not returned
since all indices are known to be occupied in this method.
"""
return self.retrieve(self.occupied_list, fields)[1]

def as_pandas(self, fields=None):
"""Creates a DataFrame containing all data entries in the store.
The returned DataFrame has:
- 1 column of integers (``np.int32``) for the index, named ``index``.
- For fields that are scalars, a single column with the field name. For
example, ``objective'' would have a single column called
``objective``.
- For fields that are 1D arrays, multiple columns with the name suffixed
by its index. For instance, if we have a ``measures'' field of length
10, we create 10 columns with names ``measures_0``, ``measures_1``,
..., ``measures_9``.
- We do not currently support fields with >1D data.
In short, the dataframe might look like this:
+-------+------------+------+-----------+
| index | measures_0 | ... | objective |
+=======+============+======+===========+
| | | ... | |
+-------+------------+------+-----------+
Args:
fields (array-like of str): List of fields to include. By default,
all fields will be included. In addition to fields in the store,
"index" is also a valid field.
Returns:
pandas.DataFrame: See above.
Raises:
ValueError: Invalid field name provided.
ValueError: There is a field with >1D data.
"""
data = OrderedDict()
indices = self._props["occupied_list"][:self._props["n_occupied"]]

fields = (itertools.chain(["index"], self._fields)
if fields is None else fields)

for name in fields:
if name == "index":
data[name] = np.copy(indices)
continue

if name not in self._fields:
raise ValueError(f"`{name}` is not a field in this ArrayStore.")

arr = self._fields[name]
if len(arr.shape) == 1: # Scalar entries.
data[name] = arr[indices]
elif len(arr.shape) == 2: # 1D array entries.
arr = arr[indices]
for i in range(arr.shape[1]):
data[f"{name}_{i}"] = arr[:, i]
else:
raise ValueError(
f"Field `{name}` has shape {arr.shape[1:]} -- "
"cannot convert fields with shape >1D to Pandas")

return DataFrame(
data,
copy=False, # Fancy indexing above copies all fields, and
# indices is explicitly copied.
)