KenSuber edited this page Feb 24, 2016 · 19 revisions

Building your first workflow

This tutorial will guide you through preparing a workflow that takes downloaded CMIP5 sea surface temperature data and outputs plots of the Niño 3.4 index. The tutorial workflow uses processing scripts from the collection available in the CWSLab climate tools repository.

This tutorial assumes that you have connected to the CWSLab virtual machine and performed the first three steps in the Getting Started section.

Where does the system write out files?

You can use this system to create a file location and structure of your choosing. However, for this tutorial we will use the default data structure based on the CMIP5 data reference syntax and path /short/$PROJECT/$USER/ where $USER is your username and $PROJECT your NCI project code.
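The default layout can be illustrated with a short sketch. This is not the system's actual path-building code; the facet order is inferred from the example output path shown in Step 4, and the project code and username below are hypothetical:

```python
import os

# Hypothetical sketch of the default DRS-style output path used in this
# tutorial. The facet order is inferred from the example path in Step 4;
# 'w35' and 'abc123' are invented placeholder values.
def drs_path(project, user, institute, model, experiment,
             frequency, realm, variable, ensemble):
    return os.path.join(
        '/short', project, user,
        'CMIP5', 'GCM', 'native',
        institute, model, experiment,
        frequency, realm, variable, ensemble)

print(drs_path('w35', 'abc123', 'INM', 'inmcm4', 'rcp85',
               'mon', 'ocean', 'tos', 'r1i1p1'))
# -> /short/w35/abc123/CMIP5/GCM/native/INM/inmcm4/rcp85/mon/ocean/tos/r1i1p1
```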

Step 1: Add the Input Dataset

The first step is to set up the input dataset. In this case, we are starting with CMIP5 data from the NCI downloaded archive. In the module select area, select the CMIP5 module (found under DataSets\GCM in the Climate and Weather Science Laboratory package) and drag it onto the workspace. You can find the module quickly by typing its name into the module search box. This module represents the entire collection of downloaded CMIP5 data at NCI.

Selecting the CMIP5 module

As the workflow stands, this dataset will include the entire CMIP5 archive! We need to add constraints to restrict the number of files down to a more reasonable level. Do this using the Constraint Builder module. This module adds restrictions to a dataset so it only includes the data that you require.

Select the Constraint Builder from the Modules panel and drag it onto the workspace. Connect its output to the input of the CMIP5 module. To add constraints, type into the constraint_string box with the Constraint Builder selected. As a first example, we will restrict our dataset to monthly 'tos' (temperature at ocean surface) data from the 'inmcm4' model, for the experiments 'rcp45' and 'rcp85'. These constraints can be added by typing the following string into the constraint_string box: experiment = rcp85, rcp45 ; variable = tos ; model = inmcm4 ; frequency = mon

Setting the constraint_string

The name of the attribute you want to constrain is followed by an equals sign, then a comma-separated list of values that the constraint can take. Different constraints are separated by semicolons. If you do not want to restrict the values an attribute can take, leave it out of the string entirely.
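The syntax described above can be sketched with a small parser. This is only an illustration of the format, not the actual CWSLab parsing code:

```python
# A minimal sketch of how a constraint_string of the form
#   attr = value1, value2 ; attr2 = value3
# could be parsed into a mapping of attribute -> allowed values.
# This is illustrative only, not the real Constraint Builder parser.
def parse_constraints(constraint_string):
    constraints = {}
    for clause in constraint_string.split(';'):
        if not clause.strip():
            continue
        attribute, _, values = clause.partition('=')
        constraints[attribute.strip()] = [v.strip() for v in values.split(',')]
    return constraints

print(parse_constraints(
    'experiment = rcp85, rcp45 ; variable = tos ; model = inmcm4 ; frequency = mon'))
# -> {'experiment': ['rcp85', 'rcp45'], 'variable': ['tos'],
#    'model': ['inmcm4'], 'frequency': ['mon']}
```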

Step 2: Add an operation to the workflow: Create Python CDAT Catalogue Files

We have now created our input dataset; the next step is to begin running processing tasks on it.

Downloaded CMIP5 data is usually split into separate files for different time slices. In our example workflow, we now need to join these individual downloaded netCDF files into a single catalogue for each 'model run'. Once we perform this operation we will have two catalogue files: one for the rcp85 experiment and one for the rcp45 experiment.

In this workflow we use a module called Merge Timeseries, which uses the Python CDAT library to create a single-file catalogue of these files. This module can be found under the Aggregation group in the Climate and Weather Science Laboratory package, or again by typing its name into the module search box.
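Conceptually, the merge step just orders the downloaded time slices by their start date before joining them. CMIP5 filenames encode the time span as YYYYMM-YYYYMM, which is what a merge can sort on; the toy sketch below illustrates only that ordering idea (the real module uses the Python CDAT library, and the filenames are invented examples):

```python
import re

# Toy illustration of the ordering step in a time-series merge. Downloaded
# CMIP5 files carry their time span in the filename (YYYYMM-YYYYMM); the
# filenames below are invented examples, and the real Merge Timeseries
# module uses CDAT rather than this sketch.
def order_time_slices(filenames):
    """Sort netCDF time-slice files by the start date in their names."""
    def start_date(name):
        match = re.search(r'_(\d{6})-\d{6}\.nc$', name)
        return match.group(1) if match else ''
    return sorted(filenames, key=start_date)

files = ['tos_Omon_inmcm4_rcp85_r1i1p1_205101-210012.nc',
         'tos_Omon_inmcm4_rcp85_r1i1p1_200601-205012.nc']
print(order_time_slices(files))
# the 2006-2050 slice now comes before the 2051-2100 slice
```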

Add the Merge Timeseries module to your workflow and connect it to the output of the CMIP5 module. Your workflow should look similar to this:

A VT workflow to merge tos data

The workflow can now be run.

Step 3: Adding the final modules

The next step is to add the remaining modules to the workflow. To the output of the Merge Timeseries add a Crop module. This module selects data between two timepoints and/or lat/lon limits.

When you drag the Crop module into the workflow you will notice that it has multiple input ports, unlike the modules you have used so far. This is because the module requires extra input from the user: 'timestart' and 'timeend' strings, and the lat/lon limits 'latnorth', 'latsouth', 'loneast' and 'lonwest'. The time parameters set the time bounds of the aggregation. In the screenshot below, the aggregation is set to begin at 2060 and end at 2080.
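What the time bounds do can be sketched in a few lines. The real Crop module operates on netCDF data; this toy version keeps only records whose year falls inside the bounds, and the data values are invented:

```python
# Toy sketch of the Crop module's time bounds: keep only records whose
# year lies between timestart and timeend (inclusive). The (year, value)
# pairs below are invented for illustration.
def crop_by_time(records, timestart, timeend):
    return [(year, value) for year, value in records
            if timestart <= year <= timeend]

series = [(2055, 24.1), (2060, 24.3), (2070, 24.8), (2080, 25.0), (2085, 25.2)]
print(crop_by_time(series, 2060, 2080))
# -> [(2060, 24.3), (2070, 24.8), (2080, 25.0)]
```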

Setting parameters on the Crop module

To complete the workflow, add a Nino3.4 module to the output of the Crop module, then add a Plot Timeseries module. The Plot Timeseries module requires you to enter the name of the variable to be plotted; for this workflow, set it to tos. Finally, add an Image Viewer module. Together these modules calculate the Niño 3.4 index from the input, plot the result and then display it in the VisTrails Spreadsheet. Your completed workflow should look much like this:

A completed Niño 3.4 workflow
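The Niño 3.4 index itself is the sea surface temperature anomaly, relative to a climatological mean, averaged over the Niño 3.4 region (5°S–5°N, 170°W–120°W). A minimal sketch of the anomaly step, using invented values and assuming the area averaging has already been done:

```python
# Minimal sketch of the anomaly step in a Nino 3.4 calculation: subtract
# the matching monthly climatology from each monthly area-mean SST.
# All values below are invented; the real module works on netCDF data.
def nino34_index(region_means, climatology):
    """Monthly area-mean SSTs minus the matching monthly climatology."""
    return [sst - climatology[month % 12]
            for month, sst in enumerate(region_means)]

climatology = [26.5] * 12                      # flat climatology, for simplicity
region_means = [26.5, 27.0, 27.5] + [26.5] * 9  # one invented year of data
print(nino34_index(region_means, climatology)[:3])
# -> [0.0, 0.5, 1.0]
```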

Click on the 'Execute' button and the workflow will execute, one module at a time. You can simulate the workflow before running it by changing the simulate_execution setting in your user configuration. If you have started VisTrails from a command line then any error or info messages will appear in the terminal. If a step in the workflow fails the module will turn red.

Step 4: Exploration

Now you can experiment by changing some parameters in the workflow. Try including other models such as ACCESS1-0 or MIROC5 by adding them to the Constraint Builder, or altering the timestart and timeend strings in the Crop module.

You can also check the metadata in the output netCDF files. If you run ncdump -h on one of the Niño 3.4 output files (found at a path like /short/$PROJECT/$USER/CMIP5/GCM/native/INM/inmcm4/rcp85/mon/ocean/tos/r1i1p1/tos_Omon_inmcm4_rcp85_r1i1p1_2060-2080_nino34_native.nc), you should see a vistrails_history metadata attribute with a record of the workflow and scripts run on that file, including their git versions where available.