Feature 325 aggregation support (#346)

* Update README.md (#321) update required Python version to 3.10+
* Added aggregation features
* Test
* removed folders
* Added aggregation features
* Updates settings and improved folder search algorithm; Added README
* Corrected FBIAS stat fields
* Create Aggregation.rst creating file from: https://github.com/dtcenter/METcalcpy/blob/feature_325_aggregation_support/metcalcpy/pre_processing/aggregation/README.md Copied Vertical Interpolation as a template
* adding aggregation
* Rename Aggregation.rst to aggregation.rst
* first pass at cleaning up warnings
* changing to 3rd person
* issue #325 CTS data from RRFS to test aggregation
* Issue #325 added background on agg_stat.py
* issue #325 added instructions for bash and csh, added links to external references
* issue #325 fix syntax for subsection
* Issue #325 fix grammar, add instructions for importing and invoked by another script
* issue #325 more fixes to grammar for import instructions
* issue #325 added corrected instructions for running via command-line (included the path to the agg_stat.py module)
* Issue #325 modify config file to specify valid paths for input and output files.
* Issue #325 modified for User's Guide instructions
* issue #325 added reformatted data for ECNT and compatible for METcalcpy agg_stat input
* Delete test/data/rrfs_cts_reformatted.data not used for testing. Using the ECNT data instead.
* issue #325 pytest on ECNT data reformatted with METdataio METreformat and aggregation statistics calculated
* issue #325 added latest test for ECNT aggregation
* Issue #325 address pandas future warning that causes current pytests to fail. Remove pandas chaining such as: df['column_name'][index] = var_name with: df.loc[index, 'column_name'] = var_name
* Issue #325 address pandas future warning that causes current pytests to fail. Remove pandas chaining such as: df['column_name'][index] = var_name with: df.loc[index, 'column_name'] = var_name
* Issue #325 updated input data to ECNT data, corrected the explanation of expected input format for agg_stat.
* Issue #325 modify config file to use RRFS ECNT .stat data reformatted by METdataio
* issue #325 point to actual config file via literalinclude
* issue #325 replace reference to the CTS output file with ECNT
* replace pandas append with concat
* Update unit_tests.yml added test_reformatted_for_agg.py
* fixed syntax error with list
* issue #325 update test data with correctly reformatted ECNT line data
* issue #325 removed some unnecessary text

---------

Co-authored-by: VanderleiVargas-NOAA <[email protected]>
Co-authored-by: lisagoodrich <[email protected]>
dtcenter · Feb 2, 2024 · e76a606 · e76a606
1 parent 34dcfd8
Showing 24 changed files with 9,692 additions and 13 deletions.
diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
@@ -65,7 +65,8 @@ jobs:
         pytest test_validate_mv_python.py
         pytest test_future_warnings.py
         pytest test_sl1l2.py
-        coverage run -m pytest test_agg_eclv.py test_agg_stats_and_boot.py test_agg_stats_with_groups.py test_calc_difficulty_index.py test_convert_lon_indices.py test_event_equalize.py test_event_equalize_against_values.py test_lon_360_to_180.py test_statistics.py test_tost_paired.py test_utils.py test_future_warnings.py
+        pytest test_reformatted_for_agg.py
+        coverage run -m pytest test_agg_eclv.py test_agg_stats_and_boot.py test_agg_stats_with_groups.py test_calc_difficulty_index.py test_convert_lon_indices.py test_event_equalize.py test_event_equalize_against_values.py test_lon_360_to_180.py test_statistics.py test_tost_paired.py test_utils.py test_future_warnings.py test_reformatted_for_agg.py
         coverage html
         
     - name: Archive code coverage results

diff --git a/README.md b/README.md
@@ -29,6 +29,6 @@ Instructions for installing the metcalcpy package locally
 Instructions for installing the metcalcpy package from PyPI
 -----------------------------------------------------------
 
-- activate your Python 3.8.6+ conda environment
+- activate your Python 3.10+ conda environment
 - run the following from the command line:
    -  pip install metcalcpy==x.y.z  where x.y.z is the version number of interest
diff --git a/docs/Users_Guide/aggregation.rst b/docs/Users_Guide/aggregation.rst
@@ -0,0 +1,198 @@
+***********
+Aggregation
+***********
+
+Aggregation is an option that can be applied to MET stat output (in
+the appropriate format) to calculate aggregation statistics and confidence intervals.
+Input data must first be reformatted using the METdataio METreformat module to
+label all the columns with the corresponding statistic name specified in the
+`MET User's Guide <https://met.readthedocs.io/en/develop/Users_Guide/index.html>`_
+for `point-stat <https://met.readthedocs.io/en/develop/Users_Guide/point-stat.html>`_,
+`grid-stat <https://met.readthedocs.io/en/develop/Users_Guide/grid-stat.html>`_, or
+`ensemble-stat <https://met.readthedocs.io/en/develop/Users_Guide/ensemble-stat.html>`_ .stat output data.
+
+Python Requirements
+===================
+
+The third-party Python packages and the corresponding version numbers are found
+in the requirements.txt and nco_requirements.txt files:
+
+**For Non-NCO systems**:
+
+* `requirements.txt <https://github.com/dtcenter/METcalcpy/blob/develop/requirements.txt>`_
+
+**For NCO systems**:
+
+* `nco_requirements.txt <https://github.com/dtcenter/METcalcpy/blob/develop/nco_requirements.txt>`_
+
+
+Retrieve Code
+=============
+
+Refer to the `Installation Guide <https://metcalcpy.readthedocs.io/en/develop/Users_Guide/installation.html>`_
+for instructions.
+
+
+Retrieve Sample Data
+====================
+
+The sample data used for this example is located in the $METCALCPY_BASE/test directory,
+where **$METCALCPY_BASE** is the full path to the location of the METcalcpy source code
+(e.g. /User/my_dir/METcalcpy).
+The example data file is **rrfs_ecnt_for_agg.data**.
+These data were reformatted from the MET .stat output using the METdataio METreformat module.
+The reformatting step labels the columns with the corresponding statistic names, based on the MET tool
+(point-stat, grid-stat, or ensemble-stat).  The ECNT linetype of
+the MET ensemble-stat output has been reformatted to include the statistic names for all
+`ECNT <https://met.readthedocs.io/en/develop/Users_Guide/ensemble-stat.html#id2>`_ specific columns.
+
+
+Input data **must** be in this format prior to using the aggregation
+module, agg_stat.py.
+
+The example data can be copied to a working directory, or left in this directory.  The location
+of the data will be specified in the YAML configuration file.
+
+Please refer to the METdataio User's Guide for instructions on reformatting MET .stat files:
+https://metdataio.readthedocs.io/en/develop/Users_Guide/reformat_stat_data.html
+
+
+Aggregation
+===========
+
+The agg_stat module, **agg_stat.py**, is used to calculate aggregated statistics and confidence intervals.
+This module can be run as a script from the command line, or imported into another Python script.
+
+A required YAML configuration file, **config_agg_stat.yaml**, is used to define the location of
+input data and the name and location of the output file.
+
+The agg_stat module supports the ECNT linetype that is output from the MET
+**ensemble-stat** tool.
+
+The input to the agg_stat module must have the appropriate format.  The ECNT linetype must first be
+`reformatted via the METdataio METreformat module <https://metdataio.readthedocs.io/en/develop/Users_Guide/reformat_stat_data.html>`_
+by following the instructions under the **Reformatting for computing aggregation statistics with METcalcpy agg_stat**
+header.
+
+Modify the YAML configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The config_agg_stat.yaml file is required to calculate aggregation statistics. This
+configuration file is located in the $METCALCPY_BASE/metcalcpy/pre_processing/aggregation/config
+directory, where $METCALCPY_BASE is the directory in which the METcalcpy source code is
+saved (e.g. /Users/my_acct/METcalcpy). Change directory to $METCALCPY_BASE/metcalcpy/pre_processing/aggregation/config
+and modify the config_agg_stat.yaml file.
+
+1.  Specify the input and output files:
+
+.. code-block:: yaml
+
+  agg_stat_input: /path-to/test/data/rrfs_ecnt_for_agg.data
+  agg_stat_output: /path-to/ecnt_aggregated.data
+
+Replace *path-to* in the two settings above with the location where the input data
+is stored (either a working directory or the $METCALCPY_BASE/test directory). **NOTE**:
+Use the **full path** to the input and output files (no environment variables).
+
+2.  Specify the meteorological and the stat variables:
+
+.. code-block:: yaml
+
+  fcst_var_val_1:
+    TMP:
+      - ECNT_RMSE
+      - ECNT_SPREAD_PLUS_OERR
+
+3.  Specify the selected models/members:
+
+.. code-block:: yaml
+
+  series_val_1:
+    model:
+     - RRFS_GEFS_GF.SPP.SPPT
+
+4.  Specify the statistics to be aggregated. In this case, the RMSE and SPREAD_PLUS_OERR
+    statistics from the ECNT linetype of the ensemble-stat tool output are aggregated.  The aggregated
+    statistics are named ECNT_RMSE and ECNT_SPREAD_PLUS_OERR (the linetype is prepended to the original
+    statistic name):
+
+.. code-block:: yaml
+
+  list_stat_1:
+   - ECNT_RMSE
+   - ECNT_SPREAD_PLUS_OERR
+
+The full **config_agg_stat.yaml** file is shown below:
+
+
+.. literalinclude:: ../../metcalcpy/pre_processing/aggregation/config/config_agg_stat.yaml
+
+
+
+**NOTE**: Use full directory paths when specifying the location of the input file and output
+file.
+
+
+Set the Environment and PYTHONPATH
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+bash shell:
+
+.. code-block:: bash
+
+ export METCALCPY_BASE=/path-to-METcalcpy
+
+csh shell:
+
+.. code-block:: csh
+
+ setenv METCALCPY_BASE /path-to-METcalcpy
+
+
+where *path-to-METcalcpy* is the full path to where the METcalcpy source code is located
+(e.g. /User/my_dir/METcalcpy).
+
+bash shell:
+
+.. code-block:: bash
+
+ export PYTHONPATH=$METCALCPY_BASE/:$METCALCPY_BASE/metcalcpy
+
+csh shell:
+
+.. code-block:: csh
+
+ setenv PYTHONPATH $METCALCPY_BASE/:$METCALCPY_BASE/metcalcpy
+
+
+where $METCALCPY_BASE is the full path to where the METcalcpy code resides
+(e.g. /User/my_dir/METcalcpy).
+
+Run the Python script:
+^^^^^^^^^^^^^^^^^^^^^^
+
+The following are instructions for performing aggregation from the command line:
+
+.. code-block:: bash
+
+
+  python $METCALCPY_BASE/metcalcpy/agg_stat.py $METCALCPY_BASE/metcalcpy/pre_processing/aggregation/config/config_agg_stat.yaml
+
+
+This will generate the file **ecnt_aggregated.data** (from the agg_stat_output setting) which now contains the
+aggregated statistics data.
+
+
+Additionally, the agg_stat.py module can be invoked by another script or module
+by importing the package:
+
+.. code-block:: python
+
+  from metcalcpy.agg_stat import AggStat
+
+  AGG_STAT = AggStat(PARAMS)
+  AGG_STAT.calculate_stats_and_ci()
+
+where PARAMS is a dictionary containing the parameters indicating the
+location of input and output data. The structure is similar to the
+original Rscript template from which this Python implementation was derived.
+
+**NOTE**: Remember to use the same PYTHONPATH defined above to ensure that the agg_stat module is found by
+the Python import process.
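
Reviewer note: the guide above asks for full paths, with no environment variables, in the `agg_stat_input` and `agg_stat_output` settings. A minimal sketch of a pre-flight check for that rule follows. The helper `check_full_paths` is hypothetical and not part of METcalcpy; it assumes the configuration has already been loaded into a plain dictionary and that paths are POSIX-style.

```python
import os


def check_full_paths(params, keys=("agg_stat_input", "agg_stat_output")):
    """Return a list of problems with the path settings (empty if none).

    Hypothetical helper for illustration only; not part of METcalcpy.
    """
    problems = []
    for key in keys:
        value = params.get(key)
        if not value:
            problems.append(f"{key} is missing")
        elif "$" in value or value.startswith("~"):
            # Environment variables and ~ would need shell expansion,
            # which the guide says to avoid in these settings.
            problems.append(f"{key} contains an unexpanded variable: {value}")
        elif not os.path.isabs(value):
            problems.append(f"{key} is not a full path: {value}")
    return problems


# Example settings mirroring the YAML keys shown in the guide (paths are made up).
good = {
    "agg_stat_input": "/data/rrfs_ecnt_for_agg.data",
    "agg_stat_output": "/data/ecnt_aggregated.data",
}
bad = {
    "agg_stat_input": "$METCALCPY_BASE/test/rrfs_ecnt_for_agg.data",
    "agg_stat_output": "ecnt_aggregated.data",
}

print(check_full_paths(good))
print(check_full_paths(bad))
```

Running a check like this before constructing `AggStat` would surface path problems as readable messages, on the assumption that a bad path would otherwise only show up as a file error later.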
diff --git a/docs/Users_Guide/index.rst b/docs/Users_Guide/index.rst
@@ -65,6 +65,7 @@ National Center for Atmospheric Research (NCAR) is sponsored by NSF.
    installation
    vertical_interpolation
    difficulty_index
+   aggregation
    release-notes
 
 **Indices and tables**

diff --git a/metcalcpy/agg_stat.py b/metcalcpy/agg_stat.py
@@ -1101,11 +1101,12 @@ def _proceed_with_axis(self, axis="1"):
                     n_stats = 0
 
                 # save results to the output data frame
-                out_frame['fcst_var'][point_ind] = fcst_var
-                out_frame['stat_value'][point_ind] = bootstrap_results.value
-                out_frame['stat_btcl'][point_ind] = bootstrap_results.lower_bound
-                out_frame['stat_btcu'][point_ind] = bootstrap_results.upper_bound
-                out_frame['nstats'][point_ind] = n_stats
+                out_frame.loc[point_ind, 'fcst_var'] = fcst_var
+                out_frame.loc[point_ind, 'stat_value'] = bootstrap_results.value
+                out_frame.loc[point_ind, 'stat_btcl'] = bootstrap_results.lower_bound
+                out_frame.loc[point_ind, 'stat_btcu'] = bootstrap_results.upper_bound
+                out_frame.loc[point_ind, 'nstats'] = n_stats
+
 
         else:
             out_frame = pd.DataFrame()
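
Reviewer note: the hunk above replaces chained indexing (`out_frame['col'][idx] = value`) with a single `.loc` assignment. Chained indexing first selects a column and then assigns into that intermediate object, which recent pandas versions warn about and which may not update the original frame. The sketch below illustrates the `.loc` pattern on a toy frame. The helper `save_result` and the values are hypothetical; only the column names are taken from the diff.

```python
import pandas as pd


def save_result(out_frame, index, fcst_var, value, lower, upper, n_stats):
    """Write one row of results using single-step .loc assignments.

    Hypothetical helper for illustration only; not part of METcalcpy.
    """
    out_frame.loc[index, "fcst_var"] = fcst_var
    out_frame.loc[index, "stat_value"] = value
    out_frame.loc[index, "stat_btcl"] = lower
    out_frame.loc[index, "stat_btcu"] = upper
    out_frame.loc[index, "nstats"] = n_stats
    return out_frame


# Toy output frame: column names follow the diff, values are made up.
frame = pd.DataFrame(
    {
        "fcst_var": [None, None],
        "stat_value": [float("nan"), float("nan")],
        "stat_btcl": [float("nan"), float("nan")],
        "stat_btcu": [float("nan"), float("nan")],
        "nstats": [0, 0],
    }
)

# Each .loc call addresses row and column in one step, so the write
# lands on `frame` itself rather than on a temporary selection.
save_result(frame, 1, "TMP", 1.5, 1.0, 2.0, 12)
print(frame)
```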

diff --git a/metcalcpy/agg_stat_bootstrap.py b/metcalcpy/agg_stat_bootstrap.py
@@ -209,11 +209,11 @@ def _proceed_with_axis(self, axis="1"):
                         index = rows_with_mask_indy_var.index[0]
 
                         # save results to the output data frame
-                        out_frame['fcst_var'][index] = fcst_var
-                        out_frame['stat_value'][index] = bootstrap_results.value
-                        out_frame['stat_btcl'][index] = bootstrap_results.lower_bound
-                        out_frame['stat_btcu'][index] = bootstrap_results.upper_bound
-                        out_frame['nstats'][index] = n_stats
+                        out_frame.loc[index, 'fcst_var'] = fcst_var
+                        out_frame.loc[index, 'stat_value'] = bootstrap_results.value
+                        out_frame.loc[index, 'stat_btcl'] = bootstrap_results.lower_bound
+                        out_frame.loc[index, 'stat_btcu'] = bootstrap_results.upper_bound
+                        out_frame.loc[index, 'nstats'] = n_stats
         else:
             out_frame = pd.DataFrame()
         return out_frame

diff --git a/metcalcpy/pre_processing/aggregation/.gitignore b/metcalcpy/pre_processing/aggregation/.gitignore
@@ -0,0 +1,3 @@
+workdir/
+temp/
+plots/