Lay out major functionality and interface changes.
eirrgang committed Dec 31, 2018
1 parent b1f7b3f commit 94fd8b2
Showing 6 changed files with 599 additions and 19 deletions.
110 changes: 110 additions & 0 deletions docs/layers/workspec_schema_0_2.rst
@@ -0,0 +1,110 @@
=========================
Work specification schema
=========================

Goals
=====

- Serializable representation of a molecular simulation and analysis workflow.
- Simple enough to be robust to API updates and uncoupled from implementation details.
- Complete enough to unambiguously direct translation of work to API calls.
- Facilitate integration between independent but compatible implementation code in Python or C++.
- Verifiable compatibility with a given API level.
- Provide enough information to uniquely identify the "state" of deterministic inputs and outputs.

The last point warrants further discussion.

One point is that we need to be able to recover the state of an executing graph after an interruption, so we need to
be able to identify whether or not work has been partially completed and how checkpoint data matches up between nodes,
which may not all (at least initially) be on the same computing host.

The other point, not completely unrelated, is how to minimize duplicated data and computation. Due to numerical
optimizations, molecular simulations run with exactly the same inputs and parameters may not produce binary-identical
output, even though the results should be treated as scientifically equivalent. We need to be able to identify
equivalent rather than identical output. An input that draws from the results of a previous operation should be able
to verify whether valid results for an identically specified operation already exist, or in what state such an
operation is in progress.

The degree of granularity and room for optimization we pursue affects the amount of data in the work specification, its
human-readability / editability, and the amount of additional metadata that needs to be stored in association with a
Session.

If one element is added to the end of a work specification, results of the previous operations should not be invalidated.

If an element at the beginning of a work specification is added or altered, "downstream" data should be easily invalidated.

Serialization format
====================

The work specification record is valid JSON-serialized data, restricted to the Latin-1 character set and encoded as UTF-8.
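
As a minimal illustration (the record contents below are hypothetical, not a normative schema), such a record can be
produced and consumed with the Python standard library:

.. code-block:: python

    import json

    # Hypothetical work specification record; keys and values are illustrative only.
    workspec = {
        "version": "gmxapi_workspec_0_2",
        "elements": {
            "tpr_input": {"operation": "load_tpr", "params": {"file": "topol.tpr"}},
        },
    }

    # ensure_ascii=True escapes any non-ASCII characters, which conservatively
    # satisfies the Latin-1 restriction; the text is then encoded as UTF-8.
    record = json.dumps(workspec, sort_keys=True, ensure_ascii=True).encode("utf-8")

    # Round trip: deserializing recovers an equal object.
    assert json.loads(record.decode("utf-8")) == workspec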

Uniqueness
==========

Goal: results should be clearly mappable to the work specification that led to them, such that the same work could be
repeated from scratch, interrupted, restarted, et cetera, in part or in whole, and verifiably produce the same results
(which cannot be artificially attributed to a different work specification) without recomputing intermediate values
that are already available to the Context.

The entire record, as well as each individual element, has a well-defined hash that can be used to compare work for
functional equivalence.

State is not contained in the work specification, but state is attributable to a work specification.

If we could adequately normalize the UTF-8 Unicode string representation, we could checksum the full text, but this may
be more work than defining a scheme for hashing specific data or letting each operation define its own comparator.
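
As a sketch of the first of those alternatives (checksumming a normalized serialization), a well-defined hash could be
computed from a canonical JSON rendering. This is only illustrative and assumes the record has already been reduced to
basic JSON-compatible types:

.. code-block:: python

    import hashlib
    import json

    def workspec_fingerprint(record):
        """Return a deterministic fingerprint for a JSON-compatible work record.

        Sorting keys and fixing separators yields a canonical text form, so
        functionally identical records hash to the same value.
        """
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                               ensure_ascii=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    assert workspec_fingerprint({"a": 1, "b": 2}) == workspec_fingerprint({"b": 2, "a": 1})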

Question: If an input value in a workflow is changed from a verifiably consistent result to an equivalent constant of a
different "type", do we invalidate or preserve the downstream output validity? E.g. the work spec changes from
"operationB.input = operationA.output" to "operationB.input = final_value(operationA)"

The question is moot if we only consider final values when terminating execution, or if we know exactly how many
iterations of sequenced output we will have, but neither is generally true.

Maybe we can leave the answer to this question unspecified for now and prepare for implementation in either case by
recording more disambiguating information in the work specification (such as checksum of locally available files) and
recording initial, ongoing, and final state very granularly in the session metadata. It could be that this would be
an optimization that is optionally implemented by the Context.

It may be that we allow the user to decide what makes data unique. This would need to be very clearly documented, but
it could be that explicitly provided parameters always become part of the unique ID and always compare as not-equal to
unprovided/default values. Example: a ``load_tpr`` operation with a ``checksum`` parameter refers to a specific file
and immediately produces a ``final`` output, but a ``load_tpr`` operation with a missing ``checksum`` parameter
produces non-final output from whatever file is resolved for the operation at run time.
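
A hypothetical pair of element records illustrating that distinction (the field names are assumptions for
illustration, not a defined schema):

.. code-block:: python

    # With an explicit checksum, the element identifies one specific file and can
    # immediately be treated as producing "final" output.
    pinned_input = {
        "operation": "load_tpr",
        "params": {"file": "topol.tpr", "checksum": "sha256:9f2c..."},
    }

    # Without a checksum, the file is resolved at run time, so the output is
    # non-final and contributes less to the element's unique identity.
    floating_input = {
        "operation": "load_tpr",
        "params": {"file": "topol.tpr"},
    }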

It may also be that some data occurs as a "stream" that does not make an operation unique, such as log file output or
trajectory output that the user wants to accumulate regardless of the data flow scheme, while other data occurs as a
"result" that indicates a clear state transition and marks specific, uniquely produced output, such as a regular
sequence of 1000 trajectory frames over 1 ns, or a converged observable. "Results" must be mapped to the
representation of the workflow that produced them. Changing a workflow without invalidating results might be possible
for changes that do not affect the part of the workflow that fed those results, such as a change that only takes
effect after a certain point in trajectory time. Other than the intentional ambiguity that could be introduced with
the parameter semantics in the previous paragraph,

Heuristics
==========

Dependency order affects order of instantiation and the direction of binding operations at session launch.

Rules of thumb
--------------

An element cannot depend on another element that is not in the work specification.
*Caveat: we probably need a special operation just to expose the results of a different workflow.*

Dependency direction affects sequencing of Director calls when launching a session, but also may be used at some point
to manage checkpoints or data flow state checks at a higher level than the execution graph.
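
For illustration, a minimal sketch of deriving a launch (instantiation) order from dependency information, assuming
each element record names its dependencies in a ``depends`` field (an assumption made for this example; cycle
detection is omitted):

.. code-block:: python

    def launch_order(elements):
        """Topologically sort work elements so that dependencies come first."""
        order = []
        visited = set()

        def visit(name):
            if name in visited:
                return
            visited.add(name)
            # An element cannot depend on anything outside the work specification.
            for dep in elements[name].get("depends", []):
                visit(dep)
            order.append(name)

        for name in elements:
            visit(name)
        return order

    example = {
        "tpr_input": {},
        "md": {"depends": ["tpr_input"]},
        "analysis": {"depends": ["md"]},
    }
    assert launch_order(example) == ["tpr_input", "md", "analysis"]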

Middleware API
==============

Specification
-------------

.. automodule:: gmx._workspec_0_2
:members:

Helpers
-------

.. automodule:: gmx._workspec_0_2.util
:members:
198 changes: 198 additions & 0 deletions docs/release-notes.rst
@@ -0,0 +1,198 @@
====================
Release Notes: 0.0.8
====================

Feature additions
=================

Convenient access to trajectory data
------------------------------------

In addition to operationally manipulating trajectory data handles between work
elements, users can extract trajectory data output in the local environment.

Trajectory data from static sources (e.g. files) can be read with a compatible
numpy-friendly interface.
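
A hypothetical sketch of that kind of access; the ``read_trajectory`` helper and the ``position`` attribute are
illustrative assumptions, not confirmed gmxapi names:

.. code-block:: python

    import gmx
    import numpy

    # Hypothetical names: 'read_trajectory' and 'position' are assumptions used
    # only to illustrate numpy-friendly access to file-backed trajectory data.
    traj = gmx.fileio.read_trajectory("traj.trr")
    positions = numpy.asarray(traj.position)   # shape: (n_frames, n_atoms, 3)
    print(positions.mean(axis=0))              # per-atom average position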

Interfaces for accessing trajectory data will converge on compatibility with
concurrent updates to the GROMACS Trajectory Analysis Framework.
This supersedes https://gerrit.gromacs.org/c/6567/

See also `Session and client need access to trajectory step #56 <https://github.com/kassonlab/gmxapi/issues/56>`_

Override MDP options
--------------------

Parameters may be specified as part of the work graph. Specified parameters
override defaults or previously set values and become part of the unique
identifying information for data in the execution graph.

Generically, key-value entries compatible with the current MDP file format may
be provided as part of a single parameters dictionary. Future work will provide
better integration with the MDP options expression in GROMACS and allow for
better detection of equivalent work graphs.

Parameters that can be specified with their own keyword arguments can provide
constant data or can reference named outputs of gmxapi operations already in
the work graph.
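
A hedged sketch of what such a parameters dictionary might look like when attached to a simulation element (the
element structure and reference syntax are assumptions for illustration):

.. code-block:: python

    # Hypothetical simulation element carrying MDP-compatible key-value parameters.
    md_element = {
        "operation": "md",
        "depends": ["tpr_input"],
        "params": {
            # Generic MDP-style entries that override values from the run input.
            "nsteps": 500000,
            "dt": 0.002,
            # A parameter could instead reference a named output of another
            # element already in the work graph, e.g.
            # "ref-t": "thermostat_tuner.temperature"   (hypothetical syntax)
        },
    }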

Multiple simulations per work graph
-----------------------------------

gmxapi 0.0.7 required a new work graph to be launched for each simulation
operation. Updates to the WorkSpec, Context, and Session implementations now
allow multiple simulation nodes in a single work graph, not just parallel
arrays of simulations.

This functionality simultaneously

* simplifies user management of data flow
* separates the user from filesystem management

See also `multiple MD elements in a single workflow #39 <https://github.com/kassonlab/gmxapi/issues/39>`_

More flexible asynchronous work
-------------------------------

Asynchronous elements of work may be run serially, if appropriate for the
execution environment, even if the work is part of a trajectory ensemble.

Session-level data flow is distinguished from lower-level data flow to allow
operation nodes to interact between updates to the execution graph state.
This is a formalization of the distinction between (a) the plugin force-provider
interface or simulation stop signal facility and (b) data edges on the execution
graph.

Named inputs and outputs in work graph
--------------------------------------

Instead of automatic subscription between work graph nodes and dependent nodes,
operations have named inputs and outputs that can be referenced in the params
for other operations.
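
An illustrative (not normative) sketch of one operation's params referencing a named output of another operation; the
reference syntax shown is an assumption:

.. code-block:: python

    work = {
        "elements": {
            "md": {
                "operation": "md",
                "params": {},
            },
            "analysis": {
                "operation": "trajectory_analysis",
                "depends": ["md"],
                "params": {
                    # Hypothetical reference to the named 'trajectory' output
                    # of the 'md' element.
                    "input_trajectory": "md.trajectory",
                },
            },
        },
    }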

File utilities
--------------

Outside of the work graph that is dispatched to run in a session, simple tools
provide equivalent functionality to ``gmx`` command line tools to

* build or modify run-input files (like ``grompp``, ``convert-tpr``, and such)
* read file data (like ``gmx dump``)
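
A hypothetical usage sketch; the function names below are assumptions for illustration, not confirmed gmxapi 0.0.8
API:

.. code-block:: python

    import gmx

    # Hypothetical 'grompp'-like helper to build a run-input file.
    tpr = gmx.fileio.make_tpr(structure="conf.gro",
                              topology="topol.top",
                              parameters={"nsteps": 1000})

    # Hypothetical 'gmx dump'-like helper to read run-input file data.
    contents = gmx.fileio.read_tpr("topol.tpr")
    print(contents.parameters)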

Better data flow
----------------

See also `Tag artifacts #76 <https://github.com/kassonlab/gmxapi/issues/76>`_,
`place external data object #96 <https://github.com/kassonlab/gmxapi/issues/96>`_,
`reusable output node #40 <https://github.com/kassonlab/gmxapi/issues/40>`_

Procedural interface
====================

``gmx.make_input()`` generates node(s) providing source(s) of

* structure
* topology
* simulation parameters
* generic data (catch-all options or data streams not specified in the API)
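
A usage sketch: ``gmx.make_input()`` is named above, but the keyword arguments shown here are assumptions for
illustration.

.. code-block:: python

    import gmx

    # Hypothetical keyword arguments; the argument names are illustrative only.
    input_node = gmx.make_input(structure="conf.gro",
                                topology="topol.top",
                                parameters={"integrator": "md", "nsteps": 1000})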

Python object-oriented API
==========================

WorkElement objects are now views into WorkSpec work graph objects.

WorkSpec objects contain the work graph and are owned by exactly one Context
object.

Though implementation classes exist in gmx.workflow, WorkElement and WorkSpec
objects only need to implement a specified interface and do not need to be of
any specific type. These interfaces are specified as part of `workspec 2`.

See also `Add proxy access to data graph through WorkElement handles #94 <https://github.com/kassonlab/gmxapi/issues/94>`_

workspec 2
==========

See :doc:`layers/workspec_schema_0_2` and
`gmxapi_workspec_0_2 milestone <https://github.com/kassonlab/gmxapi/milestone/3>`_

See also `resolve protocol for API operation map #42 <https://github.com/kassonlab/gmxapi/issues/42>`_

C++ API
=======

The canonical gmxapi C++ API is now in the GROMACS master branch.
Pre-release and experimental features are still available through the kassonlab
GitHub fork.
The non-canonical nature of the fork is expressed by the presence of the CMake
variable ``GMXAPI_EXPERIMENTAL``.

Hierarchical object ownership
-----------------------------

gmxapi code must occur within the scope of a gmxapi::Context object lifetime.
Allocated resources are owned by a Context or by objects ultimately owned by
the Context. Work is launched in a Session, owned by the Context, which owns the
objects performing actual computation in a configured execution environment.

This means that gmxapi 0.0.8 necessarily enforces the proxy-object concept
intended for gmxapi 1.0. Client code interacts with a work graph through a
Context, and local objects are non-owning handles to resources owned and
managed by the Context.

This is also an inversion of the previous ownership model, in which ownership
of resources was shared by the objects depending on those resources and object
lifetimes were managed exclusively through reference-counting handles / smart
pointers. Consequently, a handle to the Context, Session, or other resource
owner must always be passed down into functions or shorter-lived objects that
use the resources.

See also `Context chain of responsibility <https://github.com/kassonlab/gmxapi/milestone/5>`_

Plugin development improvements
===============================

Automatic Python interface generation
-------------------------------------

The developer no longer has to explicitly write a "builder." The operation
launching protocol is managed with the help of included headers.

Users no longer manipulate gmx.workflow.WorkElement objects directly in order
to interact with a plugin. Helper functions add operations to the work graph.
Helper functions are automatically generated for plugins built on the provided
sample code.

See also `Remove boilerplate for plugin instantiation #78 <https://github.com/kassonlab/gmxapi/issues/78>`_

Templated registration of inputs and outputs
--------------------------------------------

This reduces boilerplate, improves error checking, and adds compatibility with
automatic workflow checkpointing. Input, output, and state data are managed by
the framework. Instead of writing a class to contain a plugin's functions,
developers write free functions that use a SessionResources handle to interact
with gmxapi and with data on the execution graph.

See also `clean up input parameter specification for plugins #47 <https://github.com/kassonlab/gmxapi/issues/47>`_

More templating to minimize implementation
------------------------------------------

Plugin developers no longer implement an entire class, but only the functions
they need.

More call signatures are available for MD plugin operations to allow more
intuitive implementation code.

Input, output, and state data are no longer specified as class data members, but
as resources to be managed through SessionResources.

See also `restraint potential calculator inputs are confusing #140 <https://github.com/kassonlab/gmxapi/issues/140>`_

Integrated sample code
----------------------

Sample MD plugin code is still provided as a standalone repository, but it is
also included as a ``git`` *submodule* for convenience and to allow development
documentation to be integrated with the primary ``gmx`` Python package documentation.
