Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggested Improvements to Python Embedding across MET #2414

Open
10 of 50 tasks
DanielAdriaansen opened this issue Jan 20, 2023 · 1 comment
Open
10 of 50 tasks

Suggested Improvements to Python Embedding across MET #2414

DanielAdriaansen opened this issue Jan 20, 2023 · 1 comment
Assignees
Labels
alert: NEED ACCOUNT KEY Need to assign an account key to this issue alert: NEED CYCLE ASSIGNMENT Need to assign to a release development cycle alert: NEED MORE DEFINITION Not yet actionable, additional definition required MET: Python Embedding priority: low Low Priority requestor: NCAR/RAL NCAR Research Applications Laboratory type: enhancement Improve something that it is currently doing
Milestone

Comments

@DanielAdriaansen
Copy link
Contributor

DanielAdriaansen commented Jan 20, 2023

Describe the Enhancement

This is a general issue with no due date to document suggested changes and/or improvements to how Python Embedding works within MET, with the primary goal of improving the user experience and ability to communicate and teach Python Embedding to users.

Time Estimate

TBD

Sub-Issues

Consider breaking the enhancement down into sub-issues.

  • Revisit the Python embedding strings PYTHON_NUMPY and PYTHON_XARRAY. I think we can generalize these more to be something like PYTHON_DATAPLANE or PYTHON_GRID and PYTHON_POINT. Then, for PYTHON_DATAPLANE (or _GRID), have MET deal with deciphering whether it is an Xarray object or a NumPy N-D array. The point data is easier, and really only comes in one way. The largest confusion in my opinion is that PYTHON_NUMPY is used for point data currently, which has nothing to do with NUMPY whatsoever. Alternatively, we could add PYTHON_POINT and maintain the two dataplane instances the way they are.
  • If --enable_python is used when building, have make_test check if the version of python specified can successfully load all the required packages from a list of required packages that is maintained somewhere. (See Enhance make_test to test Python embedding functionality Enhance make_test to test Python embedding functionality #2788)
  • Add to Appendix F or the contributors guide much more detailed information about running with and without MET_PYTHON_EXE set. Include wire diagrams to illustrate. Also document the data structures that are populated in the 3 different types of python embedding (grid, point, pairs).
  • This section in the METplus Wrappers docs: https://metplus.readthedocs.io/en/latest/Users_Guide/overview.html#metplus-components-python-requirements, is confusing because it seems that "cfgrib" is a required Python package for METplus Wrappers, but then later in chapter 4 it does not list "cfgrib", and says "Minimum Requirements To run most of the METplus wrappers, the following packages are required: dateutil (2.8)". I think there could be some documentation work in the METplus Wrappers also.
  • When attempting to generalize where Python Embedding should be put for METplus Wrappers configuration, I was able to determine the following. For gridded data, the *_INPUT_TEMPLATE conf item is set to just PYTHON_NUMPY or PYTHON_XARRAY, and then the Python script and any arguments is included in the *_VAR<n>_NAME conf item. However, for point data, the *_INPUT_TEMPLATE conf item contains both the PYTHON_NUMPY string, and another "=" followed by the Python script and any script arguments. It would be nice if these worked similarly, where for point data we could set the *_INPUT_TEMPLATE to just PYTHON_NUMPY, and then use the *_VAR<n>_NAME conf items for the Python script and any arguments to it, which would make configuring METplus wrappers for point and gridded data similar with respect to where elements of Python Embedding should be put in the configuration file. After talking to John HG, it sounds like it might not be feasible to change the way MET works -or- to change the wrappers to include both items in a single conf for gridded data like obs data due to how the MET tools can be called for gridded data. Some of our discussion:

Q: Hi John. Is there a good reason why point data works a little differently than gridded data for Python Embedding for METplus Wrappers? It looks like for point data, the *_INPUT_TEMPLATE conf item is used to specify the entire string "PYTHON_NUMPY=/my/script.py arg1 arg2", whereas for gridded data, the *_INPUT_TEMPLATE conf item simply only requires "PYTHON_NUMPY" (or _XARRAY), and then the _VAR_NAME conf entry requires the "/my/script.py arg1 arg2" part of the Python Embedding elements.
A: For gridded data, you typically define the script to be run along with arguments in the "field.name" config setting. There is no "field.name" setting for obs data really... since we supply all the point obs (consisting of all the types) in a single spot. So we replace the obs file name with the full python command to retrieve those obs from python.

  • How can we expose the value of MET_PYTHON_BIN_EXE to the user? I was able to run this command:
    seneca:~$ /d1/projects/MET/MET_regression/develop/NB20230123/MET-develop/bin/plot_data_plane PYTHON_NUMPY fcst.ps 'name="test.py";' -v 12
    which showed the value. However, it might be nice to make a symbolic link in /usr/local/met-11.0.0/bin to the python3.8 exe that was used, OR have a utility script in /usr/local/met-11.0.0/bin the user can run to echo the value of the Python install for them. This way the user can directly instantiate the same Python MET will use, and also verify whether the Python packages they need are available there or whether they will need to handle installing their own MET_PYTHON_EXE to use.
  • Add more documentation or examples of how to use point_stat with Python Embedding. David Fillmore was using Python Embedding for both fcst and obs data, with the fcst being gridded data and the obs being point data. This led to constructing a command for MET and configuring METplus for both "point" data and "gridded" data, which work differently. Maybe more documentation, but maybe this is another data point for changing how it works for gridded/point to be more similar?
  • There are some Python scripts located on this DTC webpage: https://dtcenter.org/community-code/model-evaluation-tools-met/sample-analysis-scripts. Is this the best place for them? Or should this be included in the Appendix F, and linked to from there? From John HG:
  1. MET repo/RTD (https://met.readthedocs.io/en/latest/)
  2. METplus repo/RTD (https://metplus.readthedocs.io/en/latest/)
  3. METplus-Training repo/RTD (https://metplus-training.readthedocs.io/en/latest/)
    I've been looking for more opportunities to expand METplus-Training... so perhaps that's a good landing spot.
  • Should we document what other elements are allowed in the plot_data_plane field_string argument? name = "TMP"; level = "P500"; convert(x) = K_to_C(x); censor_thresh = lt0; censor_val = -9999; set_attr_name = "Temperature"; set_attr_level = "500mb"; set_attr_valid = 20210708_12;. See here for the best resource: https://met.readthedocs.io/en/latest/Users_Guide/config_options.html#fcst.
  • Added 2/10/23: On this line:
    status = system(command.text());
    , the system is calling Python but it may not be using the environment that the user has set. For example, PYTHONPATH could be set but when that call is executed it might not know about PYTHONPATH.
  • Added while testing Bugfix: Fix the MET vx_pointdata_python library to handle MET_PYTHON_EXE for python embedding of point observations #2428 on 2/17/23: We should consider making the Python requirements for METplus the same as the Python requirements for building MET with support for Python embedding. For using MET with Python embedding, it appears based on Appendix F the modules sys, os, argparse, importlib, numpy, and netCDF4 are required. Some of those are stdlib, but numpy (and maybe netCDF4) are not. (see Clarify MET Compile Time Python requirements #2490)
  • Other open issues: Create interface for translating Xarray object attributes to MET attributes #1470, Enhance MPR Python embedding to support all MET STAT line types #2539
  • Can we re-order the debugging of the Initializing/using/running for the Python Embedding messages?
  • It seems like PYTHON_XARRAY is not needed? When I have an Xarray DataArray object in memory named "met_data", and use PYTHON_NUMPY, it works just fine. Why is this?
  • Data ordered intuitively inside a hypercube read in by Python need to be flipped before sending to MET because MET populates the 2D dataplane from upper left to lower left rather than lower left first. This seems dangerous since it is not intuitive.
  • Should we even have many examples of using MET with Python Embedding directly? Like on the DTC webpage with sample scripts and data, those are using MET directly but don't we want folks to use METplus wrappers?
  • Enhance MET Python embedding to report an error if MET_PYTHON_EXE is set but does not exist, and warn that it is using the compile time instance.
  • Decide how to support missing data values in MET for cases of compile time Python and MET_PYTHON_EXE. Met tools expect a missing data value/FillValue of -9999.0. Any other value will be treated as valid data. See here: Bugfix: Fix the fill value setting used in the write_tmp_dataplane internal Python embedding script #2525 for more info.
  • The MPR Python Embedding example assumes users are reading an MPR file generated by another MET tool, and uses Pandas read_csv() to read it. It specifies the type of data as dtype='str', thus all columns of the MPR file are cast as strings prior to sending to MET. If a user curates their own MPR data in Python (for example, massaging JSON data into the MPR line type format), they may have columns that are not string but rightfully numeric in their type (e.g. QC, FCST, OBS). Should we allow mixed data types in this case? At a minimum, perhaps we return a more informative error message than bad object type.
  • Added 5/8/2024 during Feature #2781 Convert MET NetCDF point obs to Pandas DataFrame #2877 review: add documentation of the MET python module as needed. This is mostly intended for internal use though, so it is debatable how useful documentation would be for users. It might be more for developers.
  • Added 9/26/2024: In the dictionary of grid attributes used for Python embedding for dataplanes, the type attribute appears to use CamelCase (e.g. LatLon) as opposed to all lower case (e.g. latlon). We use all lower case for grid specification strings, so why is it different for Python embedding? We should make it the same for both, or allowing case-insensitive options for Python embedding to preserve backwards compatibility. See Python embedding to read geotiff data (sentinel 3 products) METplus#2702.

Completed Items

Relevant Deadlines

NONE

Funding Source

NONE

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Select component(s)
  • Select priority
  • Select requestor(s)

Projects and Milestone

  • Select Repository and/or Organization level Project(s) or add alert: NEED PROJECT ASSIGNMENT label
  • Select Milestone as the next official version or Future Versions

Define Related Issue(s)

Consider the impact to the other METplus components.

Enhancement Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of develop.
    Branch name: feature_<Issue Number>_<Description>
  • Complete the development and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into develop.
    Pull request: feature <Issue Number> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issues
    Select: Repository level development cycle Project for the next official release
    Select: Milestone as the next official version
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Close this issue.
@DanielAdriaansen DanielAdriaansen added type: enhancement Improve something that it is currently doing priority: low Low Priority requestor: NCAR/RAL NCAR Research Applications Laboratory alert: NEED MORE DEFINITION Not yet actionable, additional definition required alert: NEED ACCOUNT KEY Need to assign an account key to this issue alert: NEED CYCLE ASSIGNMENT Need to assign to a release development cycle MET: Python Embedding labels Jan 20, 2023
@DanielAdriaansen DanielAdriaansen self-assigned this Jan 20, 2023
@georgemccabe
Copy link
Collaborator

georgemccabe commented Feb 10, 2023

os.path.abspath can be used to get the full path from a relative path. This could be used to add the directory containing the python embedding script if the user does not specify the full path. See: https://github.com/dtcenter/METplus/blob/aa9d7b3451bad720dd8a9928a735b3bf12318585/ush/run_metplus.py#L24

A potential fix (untested) would be to change this line from:
pyembed_module_name = sys.argv[2]
to:
pyembed_module_name = os.path.abspath(sys.argv[2])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alert: NEED ACCOUNT KEY Need to assign an account key to this issue alert: NEED CYCLE ASSIGNMENT Need to assign to a release development cycle alert: NEED MORE DEFINITION Not yet actionable, additional definition required MET: Python Embedding priority: low Low Priority requestor: NCAR/RAL NCAR Research Applications Laboratory type: enhancement Improve something that it is currently doing
Projects
Status: 🩺 Needs Triage
Development

No branches or pull requests

3 participants