Lay out major functionality and interface changes.
eirrgang committed Dec 31, 2018
1 parent b1f7b3f commit 94fd8b2
Showing 6 changed files with 599 additions and 19 deletions.
110 changes: 110 additions & 0 deletions docs/layers/workspec_schema_0_2.rst
@@ -0,0 +1,110 @@
=========================
Work specification schema
=========================

Goals
=====

- Serializable representation of a molecular simulation and analysis workflow.
- Simple enough to be robust to API updates and uncoupled from implementation details.
- Complete enough to unambiguously direct translation of work to API calls.
- Facilitate integration between independent but compatible implementation code in Python or C++.
- Verifiable compatibility with a given API level.
- Provide enough information to uniquely identify the "state" of deterministic inputs and outputs.

The last point warrants further discussion.

One point is that we need to be able to recover the state of an executing graph after an interruption, so we need to
be able to identify whether or not work has been partially completed and how checkpoint data matches up between nodes,
which may not all (at least initially) be on the same computing host.

The other point, not completely unrelated, is how to minimize duplicated data and computation. Due to numerical
optimizations, molecular simulations run with exactly the same inputs and parameters may not produce binary-identical
output, even though the results should be treated as scientifically equivalent. We need to be able to identify
equivalent rather than identical output. An input that draws from the results of a previous operation should be able
to verify whether valid results for an identically specified operation already exist, or in what state such an
operation is in progress.

The degree of granularity and room for optimization we pursue affects the amount of data in the work specification, its
human-readability / editability, and the amount of additional metadata that needs to be stored in association with a
Session.

If one element is added to the end of a work specification, results of the previous operations should not be invalidated.

If an element at the beginning of a work specification is added or altered, "downstream" data should be easily invalidated.

Serialization format
====================

The work specification record is valid JSON-serialized data, restricted to the Latin-1 character set and encoded as UTF-8.
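
As a minimal illustration (the record contents below are hypothetical, not a normative schema), such a record can be
produced and consumed with the Python standard library:

.. code-block:: python

    import json

    # Hypothetical work specification record; keys and values are illustrative only.
    workspec = {
        "version": "gmxapi_workspec_0_2",
        "elements": {
            "tpr_input": {"operation": "load_tpr", "params": {"file": "topol.tpr"}},
        },
    }

    # ensure_ascii=True escapes any non-ASCII characters, which conservatively
    # satisfies the Latin-1 restriction; the text is then encoded as UTF-8.
    record = json.dumps(workspec, sort_keys=True, ensure_ascii=True).encode("utf-8")

    # Round trip: deserializing recovers an equal object.
    assert json.loads(record.decode("utf-8")) == workspec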

Uniqueness
==========

Goal: results should be clearly mappable to the work specification that led to them, such that the same work could be
repeated from scratch, interrupted, restarted, et cetera, in part or in whole, and verifiably produce the same results
(which cannot be artificially attributed to a different work specification) without recomputing intermediate values
that are already available to the Context.

The entire record, as well as each individual element, has a well-defined hash that can be used to compare work for
functional equivalence.

State is not contained in the work specification, but state is attributable to a work specification.

If we could adequately normalize the UTF-8 Unicode string representation, we could checksum the full text, but this may
be more work than defining a scheme for hashing specific data or letting each operation define its own comparator.
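
As a sketch of the first of those alternatives (checksumming a normalized serialization), a well-defined hash could be
computed from a canonical JSON rendering. This is only illustrative and assumes the record has already been reduced to
basic JSON-compatible types:

.. code-block:: python

    import hashlib
    import json

    def workspec_fingerprint(record):
        """Return a deterministic fingerprint for a JSON-compatible work record.

        Sorting keys and fixing separators yields a canonical text form, so
        functionally identical records hash to the same value.
        """
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                               ensure_ascii=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    assert workspec_fingerprint({"a": 1, "b": 2}) == workspec_fingerprint({"b": 2, "a": 1})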

Question: If an input value in a workflow is changed from a verifiably consistent result to an equivalent constant of a
different "type", do we invalidate or preserve the downstream output validity? E.g. the work spec changes from
"operationB.input = operationA.output" to "operationB.input = final_value(operationA)"

The question is moot if we only consider final values when terminating execution, or if we know exactly how many
iterations of sequenced output we will have, but neither is generally true.

Maybe we can leave the answer to this question unspecified for now and prepare for implementation in either case by
recording more disambiguating information in the work specification (such as checksum of locally available files) and
recording initial, ongoing, and final state very granularly in the session metadata. It could be that this would be
an optimization that is optionally implemented by the Context.

It may be that we allow the user to decide what makes data unique. This would need to be very clearly documented, but
it could be that explicitly provided parameters always become part of the unique ID and always compare as not-equal to
unprovided/default values. Example: a ``load_tpr`` operation with a ``checksum`` parameter refers to a specific file
and immediately produces a ``final`` output, but a ``load_tpr`` operation with a missing ``checksum`` parameter
produces non-final output from whatever file is resolved for the operation at run time.
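
A hypothetical pair of element records illustrating that distinction (the field names are assumptions for
illustration, not a defined schema):

.. code-block:: python

    # With an explicit checksum, the element identifies one specific file and can
    # immediately be treated as producing "final" output.
    pinned_input = {
        "operation": "load_tpr",
        "params": {"file": "topol.tpr", "checksum": "sha256:9f2c..."},
    }

    # Without a checksum, the file is resolved at run time, so the output is
    # non-final and contributes less to the element's unique identity.
    floating_input = {
        "operation": "load_tpr",
        "params": {"file": "topol.tpr"},
    }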

It may also be that some data occurs as a "stream" that does not make an operation unique, such as log file output or
trajectory output that the user wants to accumulate regardless of the data flow scheme, while other data occurs as a
"result" that indicates a clear state transition and marks specific, uniquely produced output, such as a regular
sequence of 1000 trajectory frames over 1 ns, or a converged observable. "Results" must be mapped to the
representation of the workflow that produced them. Changing a workflow without invalidating results might be possible
for changes that do not affect the part of the workflow that fed those results, such as a change that only takes
effect after a certain point in trajectory time. Other than the intentional ambiguity that could be introduced with
the parameter semantics in the previous paragraph,

Heuristics
==========

Dependency order affects order of instantiation and the direction of binding operations at session launch.

Rules of thumb
--------------

An element cannot depend on another element that is not in the work specification.
*Caveat: we probably need a special operation just to expose the results of a different workflow.*

Dependency direction affects sequencing of Director calls when launching a session, but also may be used at some point
to manage checkpoints or data flow state checks at a higher level than the execution graph.
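
For illustration, a minimal sketch of deriving a launch (instantiation) order from dependency information, assuming
each element record names its dependencies in a ``depends`` field (an assumption made for this example; cycle
detection is omitted):

.. code-block:: python

    def launch_order(elements):
        """Topologically sort work elements so that dependencies come first."""
        order = []
        visited = set()

        def visit(name):
            if name in visited:
                return
            visited.add(name)
            # An element cannot depend on anything outside the work specification.
            for dep in elements[name].get("depends", []):
                visit(dep)
            order.append(name)

        for name in elements:
            visit(name)
        return order

    example = {
        "tpr_input": {},
        "md": {"depends": ["tpr_input"]},
        "analysis": {"depends": ["md"]},
    }
    assert launch_order(example) == ["tpr_input", "md", "analysis"]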

Middleware API
==============

Specification
-------------

.. automodule:: gmx._workspec_0_2
:members:

Helpers
-------

.. automodule:: gmx._workspec_0_2.util
:members:
198 changes: 198 additions & 0 deletions docs/release-notes.rst
@@ -0,0 +1,198 @@
====================
Release Notes: 0.0.8
====================

Feature additions
=================

Convenient access to trajectory data
------------------------------------

In addition to operationally manipulating trajectory data handles between work
elements, users can extract trajectory data output in the local environment.

Trajectory data from static sources (e.g. files) can be read with a compatible
numpy-friendly interface.
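
A hypothetical sketch of that kind of access; the ``read_trajectory`` helper and the ``position`` attribute are
illustrative assumptions, not confirmed gmxapi names:

.. code-block:: python

    import gmx
    import numpy

    # Hypothetical names: 'read_trajectory' and 'position' are assumptions used
    # only to illustrate numpy-friendly access to file-backed trajectory data.
    traj = gmx.fileio.read_trajectory("traj.trr")
    positions = numpy.asarray(traj.position)   # shape: (n_frames, n_atoms, 3)
    print(positions.mean(axis=0))              # per-atom average position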

Interfaces for accessing trajectory data will converge on compatibility with
concurrent updates to the GROMACS Trajectory Analysis Framework.
This supersedes https://gerrit.gromacs.org/c/6567/

See also `Session and client need access to trajectory step #56 <https://github.com/kassonlab/gmxapi/issues/56>`_

Override MDP options
--------------------

Parameters may be specified as part of the work graph. Specified parameters
override defaults or previously set values and become part of the unique
identifying information for data in the execution graph.

Generically, key-value entries compatible with the current MDP file format may
be provided as part of a single parameters dictionary. Future work will provide
better integration with the MDP options expression in GROMACS and allow for
better detection of equivalent work graphs.

Parameters that can be specified with their own keyword arguments can provide
constant data or can reference named outputs of gmxapi operations already in
the work graph.
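
A hedged sketch of what such a parameters dictionary might look like when attached to a simulation element (the
element structure and reference syntax are assumptions for illustration):

.. code-block:: python

    # Hypothetical simulation element carrying MDP-compatible key-value parameters.
    md_element = {
        "operation": "md",
        "depends": ["tpr_input"],
        "params": {
            # Generic MDP-style entries that override values from the run input.
            "nsteps": 500000,
            "dt": 0.002,
            # A parameter could instead reference a named output of another
            # element already in the work graph, e.g.
            # "ref-t": "thermostat_tuner.temperature"   (hypothetical syntax)
        },
    }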

Multiple simulations per work graph
-----------------------------------

gmxapi 0.0.7 required a new work graph to be launched for each simulation
operation. Updates to the WorkSpec, Context, and Session implementations now
allow multiple simulation nodes in a single work graph, not just parallel
arrays of simulations.

This functionality simultaneously

* simplifies user management of data flow
* separates the user from filesystem management

See also `multiple MD elements in a single workflow #39 <https://github.com/kassonlab/gmxapi/issues/39>`_

More flexible asynchronous work
-------------------------------

Asynchronous elements of work may be run serially, if appropriate for the
execution environment, even if the work is part of a trajectory ensemble.

Session-level data flow is distinguished from lower-level data flow to allow
operation nodes to interact between updates to the execution graph state.
This is a formalization of the distinction between (a) the plugin force-provider
interface or simulation stop signal facility and (b) data edges on the execution
graph.

Named inputs and outputs in work graph
--------------------------------------

Instead of automatic subscription between work graph nodes and dependent nodes,
operations have named inputs and outputs that can be referenced in the params
for other operations.
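
An illustrative (not normative) sketch of one operation's params referencing a named output of another operation; the
reference syntax shown is an assumption:

.. code-block:: python

    work = {
        "elements": {
            "md": {
                "operation": "md",
                "params": {},
            },
            "analysis": {
                "operation": "trajectory_analysis",
                "depends": ["md"],
                "params": {
                    # Hypothetical reference to the named 'trajectory' output
                    # of the 'md' element.
                    "input_trajectory": "md.trajectory",
                },
            },
        },
    }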

File utilities
--------------

Outside of the work graph that is dispatched to run in a session, simple tools
provide equivalent functionality to ``gmx`` command line tools to

* build or modify run-input files (like ``grompp``, ``convert-tpr``, and such)
* read file data (like ``gmx dump``)
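
A hypothetical usage sketch; the function names below are assumptions for illustration, not confirmed gmxapi 0.0.8
API:

.. code-block:: python

    import gmx

    # Hypothetical 'grompp'-like helper to build a run-input file.
    tpr = gmx.fileio.make_tpr(structure="conf.gro",
                              topology="topol.top",
                              parameters={"nsteps": 1000})

    # Hypothetical 'gmx dump'-like helper to read run-input file data.
    contents = gmx.fileio.read_tpr("topol.tpr")
    print(contents.parameters)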

Better data flow
----------------

See also `Tag artifacts #76 <https://github.com/kassonlab/gmxapi/issues/76>`_,
`place external data object #96 <https://github.com/kassonlab/gmxapi/issues/96>`_,
`reusable output node #40 <https://github.com/kassonlab/gmxapi/issues/40>`_

Procedural interface
====================

``gmx.make_input()`` generates node(s) providing source(s) of

* structure
* topology
* simulation parameters
* generic data (catch-all options or data streams not specified in the API)
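
A usage sketch: ``gmx.make_input()`` is named above, but the keyword arguments shown here are assumptions for
illustration.

.. code-block:: python

    import gmx

    # Hypothetical keyword arguments; the argument names are illustrative only.
    input_node = gmx.make_input(structure="conf.gro",
                                topology="topol.top",
                                parameters={"integrator": "md", "nsteps": 1000})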

Python object-oriented API
==========================

WorkElement objects are now views into WorkSpec work graph objects.

WorkSpec objects contain the work graph and are owned by exactly one Context
object.

Though implementation classes exist in gmx.workflow, WorkElement and WorkSpec
objects only need to implement a specified interface and do not need to be of
any specific type. These interfaces are specified as part of `workspec 2`.

See also `Add proxy access to data graph through WorkElement handles #94 <https://github.com/kassonlab/gmxapi/issues/94>`_

workspec 2
==========

See :doc:`layers/workspec_schema_0_2` and
`gmxapi_workspec_0_2 milestone <https://github.com/kassonlab/gmxapi/milestone/3>`_

See also `resolve protocol for API operation map #42 <https://github.com/kassonlab/gmxapi/issues/42>`_

C++ API
=======

The canonical gmxapi C++ API is now in the GROMACS master branch.
Pre-release and experimental features are still available through the kassonlab
GitHub fork.
The non-canonical nature of the fork is expressed by the presence of the CMake
variable ``GMXAPI_EXPERIMENTAL``.

Hierarchical object ownership
-----------------------------

gmxapi code must occur within the scope of a gmxapi::Context object lifetime.
Allocated resources are owned by a Context or by objects ultimately owned by
the Context. Work is launched in a Session, owned by the Context, which owns the
objects performing actual computation in a configured execution environment.

This means that gmxapi 0.0.8 necessarily enforces the proxy-object concept
intended for gmxapi 1.0. Client code interacts with a work graph through a
Context, and local objects are non-owning handles to resources owned and
managed by the Context.

This is also an inversion of the previous ownership model, in which ownership
of resources was shared by the objects depending on those resources and object
lifetimes were managed exclusively through reference-counting handles / smart
pointers. Consequently, a handle to the Context, Session, or other resource
owner must always be passed down into functions or shorter-lived objects that
use the resources.

See also `Context chain of responsibility <https://github.com/kassonlab/gmxapi/milestone/5>`_

Plugin development improvements
===============================

Automatic Python interface generation
-------------------------------------

The developer no longer has to explicitly write a "builder." The operation
launching protocol is managed with the help of included headers.

Users no longer manipulate gmx.workflow.WorkElement objects directly in order
to interact with a plugin. Helper functions add operations to the work graph.
Helper functions are automatically generated for plugins built on the provided
sample code.

See also `Remove boilerplate for plugin instantiation #78 <https://github.com/kassonlab/gmxapi/issues/78>`_

Templated registration of inputs and outputs
--------------------------------------------

This reduces boilerplate, improves error checking, and adds compatibility with
automatic workflow checkpointing. Input, output, and state data are managed by
the framework. Instead of writing a class to contain a plugin's functions,
developers write free functions that use a SessionResources handle to interact
with gmxapi and with data on the execution graph.

See also `clean up input parameter specification for plugins #47 <https://github.com/kassonlab/gmxapi/issues/47>`_

More templating to minimize implementation
------------------------------------------

Plugin developers no longer implement an entire class, but only the functions
they need.

More call signatures are available for MD plugin operations to allow more
intuitive implementation code.

Input, output, and state data are no longer specified as class data members, but
as resources to be managed through SessionResources.

See also `restraint potential calculator inputs are confusing #140 <https://github.com/kassonlab/gmxapi/issues/140>`_

Integrated sample code
----------------------

Sample MD plugin code is still provided as a standalone repository, but it is
also included as a ``git`` *submodule* for convenience and to allow development
documentation to be integrated with the primary ``gmx`` Python package documentation.
