index.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="generator" content="pandoc">
  <meta name="author" content="Marshall Ward and Aidan Heerdegen">
  <title>Payu</title>
  <meta name="apple-mobile-web-app-capable" content="yes">
  <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
  <link rel="stylesheet" href="./reveal.js/css/reveal.css">
  <style type="text/css">
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <link rel="stylesheet" href="./reveal.js/css/theme/smallsky.css" id="theme">
  <!-- Explicitly add zenburn for highlight support -->
  <link rel="stylesheet" href="./reveal.js/lib/css/zenburn.css" id="theme">
  <!-- Printing and PDF exports -->
  <script>
    var link = document.createElement( 'link' );
    link.rel = 'stylesheet';
    link.type = 'text/css';
    link.href = window.location.search.match( /print-pdf/gi ) ? './reveal.js/css/print/pdf.css' : './reveal.js/css/print/paper.css';
    document.getElementsByTagName( 'head' )[0].appendChild( link );
  </script>
  <!--[if lt IE 9]>
  <script src="./reveal.js/lib/js/html5shiv.js"></script>
  <![endif]-->
  <base href="https://payu-org.github.io/payucourse/index.html">
</head>
<body>
  <div class="reveal">
       <!-- style="background: url(img/bg_nci.png);
	       background-size: cover;"> -->

    <!-- 
    <header style="width: 10%; position: absolute; top: 2%; left: 2%;">
      <img src="img/nci_logo_small.png">
    </header>
    -->

    <footer style="font-size: 1pc; position: absolute; bottom: 2%; left: 2%;">
      <code>https://payu-org.github.io/payucourse/index.html</code>
    </footer>

    <div class="slides">

<section id="title-slide">
  <div class="reveal" style="text-align: right;">
    <!-- <img src="img/nci_logo.png"
         style="background: none; border: none; box-shadow: none;
         width: 30%"
         alt="NCI">
    -->
  </div>
  <h1 class="title">Payu</h1>
  <p class="subtitle" style="text-align: left;"><p>A climate model workflow manager</p></p>
  <p class="author" style="text-align: right;">Marshall Ward and Aidan Heerdegen</p>
  <p class="date" style="text-align: right;">25th May 2021</p>
</section>

<section>
<section id="what-is-payu" class="title-slide slide level1">
<h1>What is Payu?</h1>
<p>Payu is a climate model workflow manager</p>
<p>It is open source, the code is available on GitHub</p>
<p><a href="https://github.com/payu-org/payu">https://github.com/payu-org/payu</a></p>
<p>and there is full documentation</p>
<p><a href="https://payu.readthedocs.io/en/latest/">https://payu.readthedocs.io/en/latest/</a></p>
</section>
<section id="what-is-a-climate-model-workflow-manager" class="slide level2">
<h2>What is a climate model workflow manager?</h2>
<p>That means it runs your model for you. In short:</p>
<ul>
<li>Setup model run directory (<code>work</code>)</li>
<li>Run the model</li>
<li>Move outputs/restarts to <code>archive</code> directory</li>
<li>Clean up the run directory</li>
<li>Run again (if instructed to do so)</li>
</ul>
</section>
<section id="features" class="slide level2">
<h2>Features</h2>
<ul>
<li>Simple YAML based configuration</li>
<li>Full automatic version control of experimental configuration using <code>git</code></li>
<li>Hash based versioning of all model inputs and executables</li>
<li>Supports many models: MOM5, MOM6, ACCESS-OM, ACCESS-ESM, MITgcm, CICE4, CICE5, Qgcm ...</li>
<li>Driver based architecture allows support for other models to be added</li>
</ul>
</section>
<section id="etymology" class="slide level2">
<h2>Etymology</h2>
<ul>
<li><p><em>P</em>ython on v<em>AYU</em></p>
<p>Vayu was Gadi's two times antecedent and though it has passed away...</p>
<p>...Payu lives on!</p></li>
</ul>
</section>
<section id="motivation" class="slide level2">
<h2>Motivation</h2>
<aside class="notes">
<p>Not only were we using scripts for all of our models, but we also were using more and more variations of scripts for various tasks.</p>
<p>Version control had not quite caught on yet, but I don't think it would have addressed our problems anyway.</p>
</aside>
<p>Long ago, we managed many, many jobscripts.</p>
<dl>
<dt>Many kinds of scripts:</dt>
<dd><p>bash, tcsh, ksh, ...</p>
</dd>
<dt>Many different models:</dt>
<dd><p>MOM, MITgcm, Q-GCM, ACCESS, ...</p>
</dd>
</dl>
<p>Juggling and sharing scripts was becoming a problem. Many of steps were duplicated between different scripts for the various models. Payu is an attempt to generalise the task of running a model.</p>
</section>
<section class="slide level2">

<aside class="notes">
<p>Andy Hogg is probably the progenitor of Payu. As Python became more popular in the group we started using it for non-scientific tasks, Andy asked me why we don't just use Python for everything?</p>
<p>I probably took this suggestion too literally.</p>
<p>Here is a picture of Andy doing some important scientific computing.</p>
</aside>
<p><img data-src="img/andy_at_nci.png" alt="image" /></p>
<p>"Why not just do everything in Python?"</p>
</section>
<section id="prehistoric-payu" class="slide level2">
<h2>Prehistoric Payu</h2>
<aside class="notes">
<p>There were even older versions which looked like bash scripts written in Python, which are so terrible that they're not even worth sharing.</p>
</aside>
<pre class="python"><code>from payu import gold

def main():
    expt = gold(forcing=&#39;andrew&#39;)

    expt.setup()
    expt.run()
    expt.archive()
    if expt.counter &lt; expt.max_counter:
        expt.resubmit()

if __name__ == &#39;__main__&#39;:
    main()</code></pre>
</section></section>
<section>
<section id="using-payu" class="title-slide slide level1">
<h1>Using Payu</h1>
<p>Load the CMS conda environment containing payu:</p>
<pre><code>module use /g/data3/hh5/public/modules
module load conda</code></pre>
</section>
<section id="invoking-payu" class="slide level2">
<h2>Invoking payu</h2>
<p>payu has multiple subcommands. Invoking with the <code>-h</code> option will print helpful usage message</p>
<pre class="sh"><code>$ payu -h
usage: payu [-h] [--version] {archive,build,collate,ghsetup,init,list,profile,push,run,setup,sweep} ...

positional arguments:
{archive,build,collate,ghsetup,init,list,profile,push,run,setup,sweep}

optional arguments:
-h, --help            show this help message and exit
--version             show program&#39;s version number and exit</code></pre>
</section>
<section id="subcommands" class="slide level2">
<h2>Subcommands</h2>
<table>
<tbody>
<tr class="odd">
<td><code>init</code></td>
<td>Initialise laboratory directories</td>
</tr>
<tr class="even">
<td><code>setup</code></td>
<td>Initialise work directory</td>
</tr>
<tr class="odd">
<td><code>sweep</code></td>
<td>Remove ephemeral work directory and copy logs to archive</td>
</tr>
<tr class="even">
<td><code>run</code></td>
<td>Submit model to queue to run</td>
</tr>
<tr class="odd">
<td><code>collate</code></td>
<td>Join tiled outputs (and potentially restarts). Not all models</td>
</tr>
</tbody>
</table>
</section>
<section class="slide level2">

<p>Less used</p>
<table>
<tbody>
<tr class="odd">
<td></td>
<td></td>
</tr>
<tr class="even">
<td><code>list</code></td>
<td>List all supported models</td>
</tr>
<tr class="odd">
<td><code>archive</code></td>
<td>Clean up work directory and copy files to archive</td>
</tr>
<tr class="even">
<td><code>ghsetup</code></td>
<td>Setup GitHub repository</td>
</tr>
<tr class="odd">
<td><code>push</code></td>
<td>Push local repo to GitHub</td>
</tr>
</tbody>
</table>
<p>Never typically used</p>
<table>
<tbody>
<tr class="odd">
<td></td>
<td></td>
</tr>
<tr class="even">
<td><code>profile</code></td>
<td>Profile model (not typically used)</td>
</tr>
<tr class="odd">
<td><code>build</code></td>
<td>Build model executable (not currently supported)</td>
</tr>
</tbody>
</table>
</section>
<section class="slide level2">

<p>Pass <code>-h</code> to subcommands for subcommand specific help and options</p>
<pre class="sh"><code>$ payu run -h
usage: payu run [-h] [--model MODEL_TYPE] [--config CONFIG_PATH] [--initial INIT_RUN] [--nruns N_RUNS] [--laboratory LAB_PATH] [--reproduce] [--force]

Run the model experiment

optional arguments:
-h, --help            show this help message and exit
--model MODEL_TYPE, -m MODEL_TYPE
                        Model type
--config CONFIG_PATH, -c CONFIG_PATH
                        Configuration file path
--initial INIT_RUN, -i INIT_RUN
                        Starting run counter
--nruns N_RUNS, -n N_RUNS
                        Number of successive experiments ro run
--laboratory LAB_PATH, --lab LAB_PATH, -l LAB_PATH
                        The laboratory, this will over-ride the value given in config.yaml
--reproduce, --repro, -r
                        Only run if manifests are correct
--force, -f           Force run to proceed, overwriting existing directories</code></pre>
</section></section>
<section>
<section id="clone-experiment" class="title-slide slide level1">
<h1>Clone Experiment</h1>
<ul>
<li>The payu configuration file, <code>config.yaml</code>, the model configuration files and manifests are tracked directly by <code>git</code></li>
<li>When an experiment is cloned only the files tracked by <code>git</code> are copied</li>
<li>Other files can be added and changes to those files will be automatically added to the experiment repo</li>
</ul>
</section>
<section class="slide level2">

<p>Clone an existing experiment (usually in a subdirectory in <code>$HOME</code>):</p>
<pre class="sh"><code>cd $HOME
mkdir -p payu/mom
cd payu/mom
git clone https://github.com/payu-org/bowl1.git
cd bowl1</code></pre>
<p>This is the "<em>control directory</em>" for <code>bowl1</code></p>
</section>
<section id="your-experiment" class="slide level2">
<h2>Your experiment</h2>
<ul>
<li>The newly cloned experiment consists of the <code>config.yaml</code>, the config files for this model (MOM) and, optionally, file manifests</li>
</ul>
<pre class="sh"><code>bowl1/
   ├── config.yaml
   ├── data_table
   ├── diag_table
   ├── field_table
   ├── input.nml
   └── manifests
      ├── exe.yaml
      ├── input.yaml
      └── restart.yaml</code></pre>
</section>
<section id="experiment-configuration" class="slide level2">
<h2>Experiment Configuration</h2>
<pre class="yaml"><code># PBS
queue: express
ncpus: 4
walltime: 0:10:00
mem: 1GB
jobname: bowl1

# Model
model: mom
input: /g/data/hh5/tmp/mom-test/input/bowl1
exe: /g/data/hh5/tmp/mom-test/bin/mom51_solo_default

# Config
collate: False</code></pre>
</section>
<section id="run-the-experiment" class="slide level2">
<h2>Run the experiment</h2>
<ul>
<li>This job is pre-configured, so can just run it!</li>
</ul>
<pre class="sh"><code>cd bowl1
payu run</code></pre>
<ul>
<li>Model will run in <code>work/</code> (an ephemeral directory created in the laboratory just for that model run)</li>
<li>Output moved to <code>archive/</code> when run completes without error</li>
</ul>
</section>
<section id="inspecting-the-output" class="slide level2">
<h2>Inspecting the output</h2>
<aside class="notes">
<p>Not all the files exist in these locations at all time. Model output and error files will reside in control while model is running, but will be archived when a run succesfully finishes. PBS output files won't appear until job completed</p>
</aside>
<table>
<tbody>
<tr class="odd">
<td><code>mom.out</code></td>
<td>Model output</td>
</tr>
<tr class="even">
<td><code>mom.err</code></td>
<td>Model error</td>
</tr>
<tr class="odd">
<td><code>bowl1.o${jobid}</code></td>
<td>PBS (payu) output</td>
</tr>
<tr class="even">
<td><code>bowl1.e${jobid}</code></td>
<td>PBS (payu) error</td>
</tr>
<tr class="odd">
<td><code>archive/output000</code></td>
<td>Model output files</td>
</tr>
<tr class="even">
<td><code>archive/restart000</code></td>
<td>Restart (pickup) files</td>
</tr>
<tr class="odd">
<td><code>manifests</code></td>
<td>Manifest files for tracking executables and inputs</td>
</tr>
</tbody>
</table>
</section>
<section id="cleaning-up" class="slide level2">
<h2>Cleaning up</h2>
<p>To clear <code>work/</code> and save PBS logs:</p>
<pre><code>payu sweep</code></pre>
<p>Or to completely delete the experiment:</p>
<pre><code>payu sweep --hard</code></pre>
<p>This wipes output, restarts, and logs! (<code>archive/$expt</code>)</p>
</section></section>
<section>
<section id="anatomy-of-an-experiment" class="title-slide slide level1">
<h1>Anatomy of an Experiment</h1>

</section>
<section id="control-vs-laboratory" class="slide level2">
<h2>Control vs Laboratory</h2>
<dl>
<dt>Control path: <code>${HOME}/mom/bowl1</code></dt>
<dd><p>User-configured (text) input</p>
</dd>
<dt>Laboratory: <code>/scratch/$PROJECT/$USER/$MODEL/</code></dt>
<dd><p>Executables, data input, output, etc.</p>
</dd>
</dl>
<p>You "control" the laboratory externally</p>
</section>
<section id="laboratory-overview" class="slide level2">
<h2>Laboratory overview</h2>
<aside class="notes">
<p>work is where the ephemeral work directories are created. Named for the experiment name, usually the same as the directory name. archive where outputs/restarts saved in directory same as experiment name means experiment names must be unique for each laboratory, which typically means for each model codebase not typically used by payu, for convenience. bin and input for user convenience, a place where payu looks</p>
</aside>
<table>
<tbody>
<tr class="odd">
<td><code>archive</code></td>
<td>Experiment output and restarts</td>
</tr>
<tr class="even">
<td><code>bin</code></td>
<td>Model executables</td>
</tr>
<tr class="odd">
<td><code>codebase</code></td>
<td>Model Source code repository</td>
</tr>
<tr class="even">
<td><code>input</code></td>
<td>Static input files</td>
</tr>
<tr class="odd">
<td><code>work</code></td>
<td>Ongoing (or failed) experiments</td>
</tr>
</tbody>
</table>
<p>"<code>payu init -m mom</code>" will create these directories</p>
</section>
<section id="configuring-your-experiment" class="slide level2">
<h2>Configuring your experiment</h2>
<table>
<thead>
<tr class="header">
<th>Config</th>
<th>Description</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>model</code></td>
<td>Model type</td>
<td>(required!)</td>
</tr>
<tr class="even">
<td><code>name</code></td>
<td>Experiment name</td>
<td>Expt directory</td>
</tr>
<tr class="odd">
<td><code>exe</code></td>
<td>Executable</td>
<td>Set by driver</td>
</tr>
<tr class="even">
<td><code>input</code></td>
<td>Model inputs</td>
<td>-</td>
</tr>
</tbody>
</table>
<p>Paths can be absolute or relative to the lab path</p>
</section>
<section id="scheduler-configuration" class="slide level2">
<h2>Scheduler configuration</h2>
<table>
<thead>
<tr class="header">
<th>Config</th>
<th>Description</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>queue</code></td>
<td>PBS Queue</td>
<td><code>normal</code></td>
</tr>
<tr class="even">
<td><code>project</code></td>
<td>SU Account</td>
<td><code>$PROJECT</code></td>
</tr>
<tr class="odd">
<td><code>jobname</code></td>
<td>Queue job name</td>
<td><code>name</code></td>
</tr>
<tr class="even">
<td><code>walltime</code></td>
<td>Time request</td>
<td>(From PBS)</td>
</tr>
<tr class="odd">
<td><code>ncpus</code></td>
<td>CPU request</td>
<td>1</td>
</tr>
<tr class="even">
<td><code>mem</code></td>
<td>RAM request</td>
<td>Max node mem</td>
</tr>
</tbody>
</table>
<p><code>qsub_flags</code> for everything else</p>
</section>
<section id="cpu-requests" class="slide level2">
<h2>CPU requests</h2>
<p>Normally <code>ncpus</code> will increase itself to match the node, but more control is available:</p>
<table style="width:51%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 19%" />
<col style="width: 15%" />
</colgroup>
<tbody>
<tr class="odd">
<td>Config</td>
<td>Description</td>
<td>Default</td>
</tr>
<tr class="even">
<td>platform</td>
<td></td>
<td></td>
</tr>
<tr class="odd">
<td>→nodesize</td>
<td>Node CPUs</td>
<td>48</td>
</tr>
<tr class="even">
<td>→nodemem</td>
<td>Node RAM</td>
<td>192 (GB)</td>
</tr>
</tbody>
</table>
</section>
<section class="slide level2">

<p>This is currently required to use the Broadwell nodes:</p>
<pre class="yaml"><code>platform:
   nodesize: 28
   nodemem: 128</code></pre>
</section>
<section id="the-work-directory" class="slide level2">
<h2>The work directory</h2>
<p>Run the following to inspect (and test) the setup of your run:</p>
<pre class="sh"><code>payu setup</code></pre>
<p>This will create your <code>work</code> directory in the laboratory and a symbolic link to it in your control directory.</p>
<aside class="notes">
<p>This is the first step once a model run begins, but it is exposed separately to allow testing all input and executable paths are correct without having to actually run the model.</p>
<p>By default payu will not submit a job if there is an existing <code>work/</code> directory. After running <code>payu setup</code> you must run <code>payu sweep</code> or use the <code>-f</code> option to run the model <code>payu run -f</code></p>
</aside>
</section>
<section id="inside-the-work-directory" class="slide level2">
<h2>Inside the work directory</h2>
<aside class="notes">
<p>Executable linked into work: for tracking and so obvious which exe being used</p>
</aside>
<p>Inspect the symbolic link to <code>work</code> and its contents:</p>
<pre><code>work
├── config.yaml
├── data_table
├── diag_table
├── field_table
├── INPUT
│   ├── gotmturb.inp -&gt; /g/data/hh5/tmp/mom-test/input/bowl1/gotmturb.inp
│   ├── grid_spec.nc -&gt; /g/data/hh5/tmp/mom-test/input/bowl1/grid_spec.nc
│   ├── ocean_barotropic.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_barotropic.res.nc
│   ├── ocean_bih_friction.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_bih_friction.res.nc
│   ├── ocean_density.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_density.res.nc
│   ├── ocean_pot_temp.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_pot_temp.res.nc
│   ├── ocean_sbc.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_sbc.res.nc
│   ├── ocean_solo.res -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_solo.res
│   ├── ocean_temp_salt.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_temp_salt.res.nc
│   ├── ocean_thickness.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_thickness.res.nc
│   ├── ocean_tracer.res -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_tracer.res
│   ├── ocean_velocity_advection.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_velocity_advection.res.nc
│   └── ocean_velocity.res.nc -&gt; /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_velocity.res.nc
├── input.nml
├── log
├── manifests
│   ├── exe.yaml
│   ├── input.yaml
│   └── restart.yaml
├── mom51_solo_default -&gt; /g/data/hh5/tmp/mom-test/bin/mom51_solo_default
└── RESTART</code></pre>
<p>Your config files are copied, and sometimes modified. Your input data is symlinked.</p>
</section>
<section id="a-simple-configuration" class="slide level2">
<h2>A simple configuration</h2>
<p>The configuration file (<code>config.yaml</code>) uses the YAML format</p>
<pre class="yaml"><code>model: mom6
name: om4_gm_test

queue: normal
jobname: mom6_om4
walltime: 20:00
ncpus: 960
mem: 1500GB

exe: mom6_intel17
input:
    - om4_grid
    - om4_atm</code></pre>
<p>Most variables have "sensible" defaults</p>
</section>
<section id="a-more-complex-configuration" class="slide level2">
<h2>A more complex configuration</h2>
<pre class="yaml"><code># PBS configuration
queue: normal
project: fp0
walltime: 02:30:00
jobname: om2_jra55
ncpus: 1153
mem: 2000GB

#platform:
#   nodesize: 28

laboratory: /scratch/fp0/mxw900/cosima
repeat: True

collate:
    walltime: 4:00:00
    mem: 30GB
    ncpus: 4
    queue: express
    flags: -n4 -z -m -r

# Model configuration
model: access
submodels:
    - name: coupler
      model: oasis
      input: oasis_025
      ncpus: 0

    - name: atmosphere
      model: matm
      exe: matm
      #input: jra55-0.8_025
      input: /scratch/v45/mxw900/cosima/nc64
      ncpus: 1

    - name: ocean
      model: mom
      exe: mom
      input:
          - mom
          # - iaf-sw21d
      ncpus: 960

    - name: ice
      model: cice
      exe: cice_nohalo
      input: cice
      ncpus: 192

calendar:
    runtime:
        years: 0
        months: 0
        days: 30
    start:
        year: 1
        month: 1
        days: 1</code></pre>
</section></section>
<section>
<section id="feature-overview" class="title-slide slide level1">
<h1>Feature overview</h1>

</section>
<section id="multiple-runs" class="slide level2">
<h2>Multiple runs</h2>
<aside class="notes">
<p>This does not affect the total number of runs which is goverened entirely by the value passed to the <code>-n</code> option.</p>
</aside>
<ul>
<li>To do multiple runs in sequence:</li>
</ul>
<pre class=""><code>payu run -n 20</code></pre>
<ul>
<li><p>We save every output, and every 5th restart. To change the rate restart files are saved:</p>
<pre><code>restart_freq: 1</code></pre></li>
<li><p>To run the model multiple times for each submission to the queue:</p>
<pre><code>runspersub: 5</code></pre></li>
</ul>
<p>will run the model 5 times during each queued job. Can help reduce overhead spent waiting in queue</p>
</section>
<section id="path-control" class="slide level2">
<h2>Path control</h2>
<aside class="notes">
<p>Just about every path can be explicitly set, though at some point it does get a bit weird to, say, change the control path...</p>
</aside>
<p>Default paths can be set explicitly</p>
<table>
<tbody>
<tr class="odd">
<td><code>shortpath</code></td>
<td>Root ("scratch") path</td>
</tr>
<tr class="even">
<td><code>laboratory</code></td>
<td>Laboratory path</td>
</tr>
<tr class="odd">
<td><code>control</code></td>
<td>Control path</td>
</tr>
</tbody>
</table>
<p>e.g. to run under multiple project codes but keep all files in the same laboratory location specify <code>shortpath</code></p>
</section>
<section id="mpi-support" class="slide level2">
<h2>MPI support</h2>
<p>MPI support is very explicit at the moment:</p>
<aside class="notes">
<p>We rely on NCI wrapper scripts to fix the MPI module.</p>
<p>Also, these particular settings are nonsense together, they're just various examples. Usually do not set an explicit mpi module, this is done automatically by the <code>mpirun</code> script.</p>
</aside>
<pre class="yaml"><code>mpi:
    module: openmpi/2.1.1-debug
    modulepath: /home/157/mxw157/modules
    flags:
       - -mca orte_output_filename log
       - -mca pml yalla
    runcmd: map --profile mpiexec</code></pre>
<p>It is not recommended to change these without understanding and good reason</p>
</section>
<section id="userscript-support" class="slide level2">
<h2>Userscript support</h2>
<ul>
<li>Subcommands and scripts can be injected after key steps</li>
</ul>
<pre class="yaml"><code>userscripts:
   init: &#39;echo &quot;some_data&quot; &gt; input.nml&#39;
   setup: patch_inputs.py
   run: &#39;qsub postprocess.sh&#39;

postscript: sync_output_to_gdata.sh</code></pre>
<ul>
<li>These will run after the prescribed section.</li>
<li><code>postscript</code> runs when a run finishes, if the output is collated it runs after that completes.</li>
</ul>
</section>
<section id="supported-models" class="slide level2">
<h2>Supported models</h2>
<p>To see the supported models:</p>
<pre><code>payu list</code></pre>
<p>But expect some atrophy...</p>
</section>
<section id="file-tracking" class="slide level2">
<h2>File Tracking</h2>
<aside class="notes">
<p>Very early in this job, there was a "dodgy aerosol file" that had been used in some simulations, but hard/impossible to say which runs/files were affected</p>
</aside>
<ul>
<li>Track input files used for each model run</li>
<li>Completely automatic. User intervention not required</li>
<li>Reproducibly re-run previous experiment</li>
<li>Share experiments more easily as input files all specified</li>
<li>Flexibility with specifying path to input files</li>
<li>Identify all runs using specified file (possible future feature)</li>
</ul>
</section>
<section id="what-is-tracked" class="slide level2">
<h2>What is tracked?</h2>
<aside class="notes">
<p>Executables and inputs are not expected to change. Can specify a flag to either warn if they do and stop, or update manifest and continue</p>
<p>Restarts are the opposite, and by default are always expected to be different for each run, unless a flag is specified to reproduce a run, in which case any difference will flag an error and stop</p>
</aside>
<table>
<tbody>
<tr class="odd">
<td>Executables</td>
<td><code>manifests/exe.yaml</code></td>
</tr>
<tr class="even">
<td>Inputs</td>
<td><code>manifests/inputs.yaml</code></td>
</tr>
<tr class="odd">
<td>Restarts</td>
<td><code>manifests/restarts.yaml</code></td>
</tr>
</tbody>
</table>
</section>
<section id="how-is-it-tracked" class="slide level2">
<h2>How is it tracked?</h2>
<ul>
<li>Uses <a href="https://github.com/aidanheerdegen/yamanifest">yamanifest</a></li>
<li>Creates a manifest file which uses <code>YAML</code> format</li>
<li>Each file (symlink) in <code>work</code> is a dictionary key in manifest file</li>
<li>Manifests files are tracked by <code>git</code>, so the unique hash for every tracked file is associated with each run using version control</li>
</ul>
</section>
<section id="example" class="slide level2">
<h2>Example</h2>
<aside class="notes">
<p>Note there is a header and a version string, can ignore All files in work are either config files (which are tracked by git) or symbolic links to files elsewhere on filesystem Issues with getting this working has to do with enforcing this for all models - can be difficult with hardwired paths etc</p>
</aside>
<ul>
<li><code>fullpath</code> is the actual location of the file</li>
<li>The hashes uniquely identify file</li>
</ul>
<pre class="yaml"><code>format: yamanifest
version: 1.0
---
work/mom51_solo_default:
   fullpath: /g/data/hh5/tmp/mom-test/bin/mom51_solo_default
   hashes:
      binhash: 423d9cf92c887fe3145c727c4fbda4d0
      md5: 989580773079ede1109b292c2bb2ad17</code></pre>
</section>
<section id="hierachy-of-hashes" class="slide level2">
<h2>Hierachy of hashes</h2>
<aside class="notes">
<p>binhash uses datestamp and size combined with first 100MB of a file. Not guaranteed unique, but likely to detect if the file has changed</p>
</aside>
<ul>
<li>yamanifest supports multiple hashes =&gt; hierarchy of hashes</li>
<li>Unique hashes (md5, sha128) take too long on large files</li>
<li>Fast hashing to check for file changes</li>
<li>Use unique hash check when necessary</li>
<li>Running <code>payu setup</code> for experiments with large number and size of input files can be useful: precalculate expensive hashes saves time when job runs on queue</li>
</ul>
</section></section>
<section>
<section id="forking-and-sharing-experiments" class="title-slide slide level1">
<h1>Forking and sharing experiments</h1>

</section>
<section id="creating-a-new-experiment" class="slide level2">
<h2>Creating a new experiment</h2>
<aside class="notes">
<p>Sharing experiments is new, but we are working on improving this experience</p>
</aside>
<p>Let's have some <strong>FUN</strong> and increase the timestep:</p>
<pre><code>git clone bowl1 bowl2
cd bowl2</code></pre>
<p>We are in a hurry, so let's make <code>dt_ocean</code> in <code>input.nml</code> very large:</p>
<pre><code>&amp;ocean_model_nml
    dt_ocean = 86400</code></pre>
</section>
<section id="recording-your-progress" class="slide level2">
<h2>Recording your progress</h2>
<p>See changes to the run by utilising <code>git</code>:</p>
<pre><code>git log</code></pre>
<p>and responsible people <em>always</em> document their changes:</p>
<pre><code>git commit -am &quot;Testing a large timestep&quot;</code></pre>
<p>But if you're lazy then payu will commit upon completion.</p>
<p>Let's run it!</p>
</section>
<section id="failure" class="slide level2">
<h2>FAILURE</h2>
<p><img data-src="img/angry.jpg" alt="image" /></p>
<p>Your run crashed!!!</p>
</section>
<section id="inspecting-failed-jobs" class="slide level2">
<h2>Inspecting failed jobs</h2>
<p>Failed jobs retain output and error files (<code>mom.out</code>, <code>mom.err</code> in this case), and a <code>work/</code> directory</p>
<p>From <code>mom.err</code>:</p>
<pre><code>FATAL from PE    2: ==&gt;Error: time step instability detected for baroclinic gravity waves in ocean_model_mod

forrtl: error (78): process killed (SIGTERM)</code></pre>
<p>Errors are saved to <code>archive/error_logs</code> with PBS job IDs</p>
<p>(Note: Error logs can get big fast!)</p>
</section>
<section id="github-integration" class="slide level2">
<h2>GitHub integration</h2>
<aside class="notes">
<p>Here, show off git remote as well as the .ssh directory</p>
</aside>
<p>You can sync your experiment on GitHub:</p>
<pre><code>payu ghsetup
payu push</code></pre>
<p>Visit your experiment in GitHub!</p>
<p>NOTE: This will create SSH keys in your <code>$HOME/.ssh</code> directory.</p>
</section>
<section id="other-github-features" class="slide level2">
<h2>Other GitHub features</h2>
<aside class="notes">
<p>Remember to do a bit of a tech demo to mxw900-raijin here.</p>
<dl>
<dt>The unmentioned features are:</dt>
<dd><ul>
<li>name (on github)</li>
<li>username</li>
<li>sshid (ssh key path)</li>
<li>private</li>
<li>remote name (payu)</li>
</ul>
</dd>
</dl>
</aside>
<p>You should set a description for your run:</p>
<pre class="yaml"><code>description: A very fun experiment</code></pre>
<p>You can also save jobs to an organization:</p>
<pre class="yaml"><code>runlog:
   organization: mxw900-raijin</code></pre>
<p>There are a few other features here, and someday they may be documented!</p>
<p>Currently "<code>payu push</code>" is manual, but we could make it automatic.</p>
</section></section>
<section>
<section id="coupled-models" class="title-slide slide level1">
<h1>Coupled Models</h1>

</section>
<section id="coupled-configuration" class="slide level2">
<h2>Coupled configuration</h2>
<p>Yes, Payu supports coupled models!</p>
<p><a href="https://github.com/COSIMA/01deg_jra55_iaf">https://github.com/COSIMA/01deg_jra55_iaf</a></p>
<pre class="yaml"><code>model: access-om2
input: /g/data/ik11/inputs/access-om2/input_08022019/common_01deg_jra55
submodels:
    - name: atmosphere
      model: yatm
      exe: /g/data/ik11/inputs/access-om2/bin/yatm_4198e150.exe
      input:
            - /g/data/ik11/inputs/access-om2/input_08022019/yatm_01deg
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/rsds/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/rlds/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/prra/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/prsn/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/psl/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/land/day/friver/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/tas/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/huss/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/uas/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/vas/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/landIce/day/licalvf/gr/v20190429
      ncpus: 1

    - name: ocean
      model: mom
      exe: /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_e837d05d_libaccessom2_4198e150.x
      input: /g/data/ik11/inputs/access-om2/input_08022019/mom_01deg
      ncpus: 4358

    - name: ice
      model: cice5
      exe: /g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_597e4561_libaccessom2_4198e150.exe
      input: /g/data/ik11/inputs/access-om2/input_20200422/cice_01deg
      ncpus: 799</code></pre>
</section>
<section id="layout-of-a-coupled-model" class="slide level2">
<h2>Layout of a coupled model</h2>
<p>Inputs, work and output are separated by the <code>name</code>:</p>
<pre><code>libom2_1deg
|-- accessom2.nml
|-- archive-&gt;/scratch/fp0/mxw900/access-om2/archive/libom2_1deg
|-- atmosphere
|   |-- atm.nml
|   |-- checksums.txt
|   `-- forcing.json
|-- config.yaml
|-- ice
|   |-- cice_in.nml
|   |-- input_ice.nml
|   |-- input_ice_gfdl.nml
|   `-- input_ice_monin.nml
|-- namcouple
|-- ocean
|   |-- checksums.txt
|   |-- data_table
|   |-- diag_table
|   |-- field_table
|   `-- input.nml
|-- sync_output_to_gdata.sh
`-- sync_restarts_to_gdata.sh</code></pre>
<p>Common files (<code>accessom2.nml</code>, <code>namcouple</code>) are in the top directory.</p>
<p>Work directories have a similar structure.</p>
</section></section>
<section>
<section id="troubleshooting" class="title-slide slide level1">
<h1>Troubleshooting</h1>

</section>
<section id="model-crashes" class="slide level2">
<h2>Model crashes</h2>
<p>If you see this error in your PBS log:</p>
<pre><code>payu: Model exited with error code 134; aborting.&quot;</code></pre>
<p>then it means the model crashed and payu has halted execution.</p>
<p>Check your error logs to figure out the problem.</p>
</section>
<section id="various-python-errors" class="slide level2">
<h2>Various Python errors</h2>
<p>Sometimes a missing file or misconfigured experiment will cause an error in Python:</p>
<pre><code>Traceback (most recent call last):
  File &quot;/home/157/mxw157/python/payu/bin/payu&quot;, line 8, in &lt;module&gt;
    cli.parse()
  File &quot;/home/157/mxw157/python/payu/payu/cli.py&quot;, line 61, in parse
    run_cmd(**args)
  File &quot;/home/157/mxw157/python/payu/payu/subcommands/setup_cmd.py&quot;, line 18, in runcmd
    expt.setup(force_archive=force_archive)
  File &quot;/home/157/mxw157/python/payu/payu/experiment.py&quot;, line 353, in setup
    model.setup()
  File &quot;/home/157/mxw157/python/payu/payu/models/mom.py&quot;, line 69, in setup
    super(Mom, self).setup()
  File &quot;/home/157/mxw157/python/payu/payu/models/model.py&quot;, line 151, in setup
    shutil.copy(f_path, self.work_path)
  File &quot;/apps/python/2.7.6/lib/python2.7/shutil.py&quot;, line 119, in copy
    copyfile(src, dst)
  File &quot;/apps/python/2.7.6/lib/python2.7/shutil.py&quot;, line 82, in copyfile
    with open(src, &#39;rb&#39;) as fsrc:
IOError: [Errno 2] No such file or directory: &#39;/home/157/mxw157/mom/course/bowl2_course/input.nml&#39;</code></pre>
<p>Run <code>payu setup</code> and test that the experiment is set up properly.</p>
</section>
<section id="cloning-issues" class="slide level2">
<h2>Cloning issues</h2>
<p>* Some <code>config.yaml</code> contain relative paths for inputs and executables, which can make sharing difficult.</p>
<p>* Either copy (or symlink) their files into your local lab, or modify your <code>config.yaml</code> to use absolute paths.</p>
<ul>
<li>Manifests can largely overcome this issue, but there are subtleties in how to implement this</li>
</ul>
</section></section>
<section>
<section id="summary" class="title-slide slide level1">
<h1>Summary</h1>

</section>
<section id="what-does-payu-do" class="slide level2">
<h2>What does Payu do?</h2>
<ul>
<li>Common interface for many models</li>
<li>Manages your inputs, executables, outputs, and restarts</li>
<li>Tracks changes to your experiment</li>
<li>Provides hooks to configure and control your job</li>
<li>Enables sharing of configurations and inputs</li>
</ul>
</section>
<section id="special-thanks" class="slide level2">
<h2>Special Thanks</h2>
<table style="width:81%;">
<colgroup>
<col style="width: 27%" />
<col style="width: 25%" />
<col style="width: 27%" />
</colgroup>
<tbody>
<tr class="odd">
<td><img data-src="img/paul.jpg" alt="paul" /></td>
<td><img data-src="img/nic.jpeg" alt="nic" /></td>
<td><img data-src="img/aidan.jpeg" alt="aidan" /></td>
</tr>
</tbody>
</table>
<p>Paul Spence suffered through the earliest version of Payu, and helped mold it into a usable application.</p>
<p>Nic Hannah and Aidan Heerdegen have been top-level contributors and maintainers for a very long time.</p>
<p>Thank you!!</p>
</section>
<section id="future-works" class="slide level2">
<h2>Future works</h2>
<p>Work only stops when people stop asking for features.</p>
<p>Questions?</p>
<p>What would you like to see?</p>
<p>Even better, contribute!</p>
</section></section>
    </div>
  </div>

  <script src="./reveal.js/lib/js/head.min.js"></script>
  <script src="./reveal.js/js/reveal.js"></script>

  <script>

      // Full list of configuration options available at:
      // https://github.com/hakimel/reveal.js#configuration
      Reveal.initialize({
        // Display the page number of the current slide
        slideNumber: true,
        // Push each slide change to the browser history
        history: true,

        // Optional reveal.js plugins
        dependencies: [
          { src: './reveal.js/lib/js/classList.js', condition: function() { return !document.body.classList; } },
          { src: './reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: './reveal.js/plugin/highlight/highlight.js',
            async: true,
            condition: function() {
              return !!document.querySelector( 'pre code' ); },
            callback: function() {
              hljs.initHighlightingOnLoad(); }
          },
          { src: './reveal.js/plugin/notes/notes.js', async: true }
        ]
      });
    </script>
    </body>
</html>