Merge pull request #8 from phac-nml/docs

Update Documentation
phac-nml · Jan 17, 2025 · 2ace9e3
2 parents 0f943a7 + 9d21149

Showing 3 changed files with 85 additions and 110 deletions.
diff --git a/README.md b/README.md
@@ -2,30 +2,31 @@
 
 # FastMatch IRIDA Workflow
 
-This workflow takes provided JSON-formatted MLST profiles and converts them into a phylogenetic tree with associated flat cluster codes for use in [Irida Next](https://github.com/phac-nml/irida-next). The workflow also generates an interactive tree for visualization.
+This workflow takes query and reference JSON-formatted MLST profiles and reports query-reference pairs that fall within a specified distance of each other.
 
-A brief overview of the usage of this pipeline is given below. Detailed documentation can be found in the [docs/](docs/) directory.
+A brief overview of the usage of this pipeline is given below. Further documentation can be found in the [docs](docs/) directory.
 
 # Input
 
 The input to the pipeline is a standard sample sheet (passed as `--input samplesheet.csv`) that looks like:
 
-| sample  | mlst_alleles      | metadata_1 | metadata_2 | metadata_3 | metadata_4 | metadata_5 | metadata_6 | metadata_7 | metadata_8 |
-| ------- | ----------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
-| SampleA | sampleA.mlst.json | meta1      | meta2      | meta3      | meta4      | meta5      | meta6      | meta7      | meta8      |
+| sample  | fastmatch_category | mlst_alleles      | metadata_1 | metadata_2 | metadata_3 | metadata_4 | metadata_5 | metadata_6 | metadata_7 | metadata_8 |
+| ------- | ------------------ | ----------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
+| SampleA | query              | sampleA.mlst.json | meta1      | meta2      | meta3      | meta4      | meta5      | meta6      | meta7      | meta8      |
+| SampleB | reference          | sampleB.mlst.json | meta1      | meta2      | meta3      | meta4      | meta5      | meta6      | meta7      | meta8      |
 
-The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/). Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.
+Note that each sample must be defined as a `query` or `reference`. Samples designated with `query` will have their distance calculated to every sample in the sample sheet (`query` and `reference` samples), whereas `reference`-`reference` sample pairings do not have their distances calculated or reported.
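
The pairing rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the function name `reported_pairs` and the `(name, category)` tuples are hypothetical, and pairing a query with itself is an assumption that the README does not confirm.

```python
from itertools import product


def reported_pairs(samples):
    """Return the (query, other) pairs whose distances would be computed.

    `samples` is a list of (sample_name, fastmatch_category) tuples.
    Hypothetical helper for illustration only.
    """
    queries = [name for name, category in samples if category == "query"]
    everyone = [name for name, _ in samples]
    # Each query is compared against every sample in the sheet;
    # reference-reference pairs never appear because the left side is always a query.
    return list(product(queries, everyone))


sheet = [("SampleA", "query"), ("SampleB", "reference"), ("SampleC", "reference")]
pairs = reported_pairs(sheet)
```

With one query and two references, the only pairs produced have `SampleA` on the left, so no `SampleB`-`SampleC` distance is ever computed.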
 
-## IRIDA-Next Optional Input Configuration
+The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/). Details on the columns can be found in the [Full Samplesheet](docs/usage.md#full-standard-samplesheet) documentation.
 
-`fastmatchirida` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets which can contain an additional column: `sample_name`
+## IRIDA Next Optional Sample Name Configuration
+
+`fastmatchirida` accepts the [IRIDA Next](https://github.com/phac-nml/irida-next) format for samplesheets which can contain an additional column: `sample_name`
 
 `sample_name`: An **optional** column, that overrides `sample` for outputs (filenames and sample names) and reference assembly identification.
 
 `sample_name` allows more flexibility in naming output files or sample identification. Unlike `sample`, `sample_name` is not required to contain unique values. `Nextflow` requires unique sample names, and therefore in the instance of repeat `sample_names`, `sample` will be suffixed to any `sample_name`. Non-alphanumeric characters (excluding `_`,`-`,`.`) will be replaced with `"_"`.
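
A rough sketch of the renaming behaviour described above, written in Python for illustration. The helper names, the ASCII-only definition of "alphanumeric", the `_` separator between `sample_name` and `sample`, and the choice to suffix only the repeated names are all assumptions; the pipeline's actual implementation may differ.

```python
import re
from collections import Counter


def sanitize(name):
    """Replace every character other than letters, digits, `_`, `-`, `.` with `_`."""
    return re.sub(r"[^A-Za-z0-9_.\-]", "_", name)


def resolve_names(rows):
    """Return one output name per (sample, sample_name) row.

    Repeated sample_name values get the unique `sample` value appended
    (assumed format: "<sample_name>_<sample>").
    """
    cleaned = [(sample, sanitize(sample_name)) for sample, sample_name in rows]
    counts = Counter(name for _, name in cleaned)
    return [
        f"{name}_{sample}" if counts[name] > 1 else name
        for sample, name in cleaned
    ]


names = resolve_names([("S1", "my sample"), ("S2", "my sample"), ("S3", "ok-name.1")])
```

Here the two rows sharing `my sample` become distinct names, while `ok-name.1` passes through unchanged.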
 
-An [example samplesheet](../tests/data/samplesheets/samplesheet-samplename.csv) has been provided with the pipeline.
-
 # Parameters
 
 The main parameters are `--input` as defined above and `--output` for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.
@@ -34,32 +35,31 @@ The main parameters are `--input` as defined above and `--output` for specifying
 
 In order to customize metadata headers, the parameters `--metadata_1_header` through `--metadata_8_header` may be specified. These parameters are used to re-name the headers in the final metadata table from the defaults (e.g., rename `metadata_1` to `country`).
 
-## Distance Method and Thresholds
+## Distance Threshold
 
-The Genomic Address Service Clustering workflow can use two distance methods: Hamming or scaled.
+A distance threshold parameter may be used to constrain the maximum distance between sample pairs reported in the final reports. This can be accomplished by specifying `--threshold DISTANCE`, where `DISTANCE` is a non-negative integer when using Hamming distances or a float in the range [0.0, 100.0] when using scaled distances. See below for more information on these distance methods.
 
-### Hamming Distances
+## Distance Methods
 
-Hamming distances are integers representing the number of differing loci between two sequences and will range between [0, n], where `n` is the total number of loci. When using Hamming distances, you must specify `--pd_distm hamming` and provide Hamming distance thresholds as integers between [0, n]: `--gm_thresholds "10,5,0"` (10, 5, and 0 loci).
+The distance measurement used can be one of two methods: Hamming or scaled.
 
-### Scaled Distances
+### Hamming Distances
 
-Scaled distances are floats representing the percentage of differing loci between two sequences and will range between [0.0, 100.0]. When using scaled distances, you must specify `--pd_distm scaled` and provide percentages between [0.0, 100.0] as thresholds: `--gm_thresholds "50,20,0"` (50%, 20%, and 0% of loci).
+Hamming distances are integers representing the number of differing loci between two sequences and will range between [0, n], where `n` is the total number of loci. When using Hamming distances, you must specify `--pd_distm hamming`.
 
-### Thresholds
+### Scaled Distances
 
-The `--gm_thresholds` parameter is used to set thresholds for each cluster level, which in turn are used to assign cluster codes at each level. When specifying `--pd_distm hamming` and `--gm_thresholds "10,5,0"`, all sequences that have no more than 10 loci differences will be assigned the same cluster code for the first level, no more than 5 for the second level, and only sequences that have no loci differences will be assigned the same cluster code for the third level.
+Scaled distances are floats representing the percentage of differing loci between two sequences and will range between [0.0, 100.0]. When using scaled distances, you must specify `--pd_distm scaled`.
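
To make the two distance units and the threshold concrete, here is a minimal Python sketch. It is not how `profile_dists` is implemented: the function names are hypothetical, profiles are modelled as plain dictionaries of locus to allele, missing-allele handling (see `--pd_count_missing`) is ignored, and treating the threshold as inclusive is an assumption.

```python
def hamming_distance(profile_a, profile_b):
    """Count loci whose allele calls differ between two profiles."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])


def scaled_distance(profile_a, profile_b):
    """Percentage of compared loci that differ, in the range [0.0, 100.0]."""
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return 0.0
    return 100.0 * hamming_distance(profile_a, profile_b) / len(shared)


def within_threshold(distance, threshold):
    """Assumed inclusive comparison: keep pairs at or below the threshold."""
    return distance <= threshold


query = {"l1": "1", "l2": "2", "l3": "3", "l4": "4"}
reference = {"l1": "1", "l2": "5", "l3": "3", "l4": "7"}

hamming = hamming_distance(query, reference)  # 2 differing loci out of 4
scaled = scaled_distance(query, reference)    # 50.0 percent of loci differ
```

Under these assumptions the pair would be reported with `--pd_distm hamming --threshold 2`, and with `--pd_distm scaled --threshold 50`, but not with a scaled threshold of 25.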
 
 ## profile_dists
 
 The following can be used to adjust parameters for the [profile_dists][] tool.
 
-- `--pd_outfmt`: The output format for distances. For this pipeline the only valid value is _matrix_ (required by [gas mcluster][]).
-- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0.0 and 100.0. Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
-- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0 to 1.
-- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0 to 1.
+- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0.0 and 100.0. Please see the [Distance Methods](#distance-methods) section for more information.
+- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0.0 to 1.0.
+- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0.0 to 1.0.
 - `--pd_file_type`: Output format file type. One of _text_ or _parquet_.
-- `--pd_mapping_file`: A file used to map allele codes to integers for internal distance calculations. This is the same file as produced from the _profile dists_ step (the [allele_map.json](docs/output.md#profile-dists) file). Normally, this is unneeded unless you wish to override the automated process of mapping alleles to integers.
+- `--pd_mapping_file`: A file used to map allele codes to integers for internal distance calculations. Normally, this is unneeded unless you wish to override the automated process of mapping alleles to integers.
 - `--pd_skip`: Skip QA/QC steps. Can be used as a flag, `--pd_skip`, or passing a boolean, `--pd_skip true` or `--pd_skip false`.
 - `--pd_columns`: Defines the loci to keep within the analysis (default when unset is to keep all loci). Formatted as a single column file with one locus name per line. For example:
   - **Single column format**
@@ -70,17 +70,9 @@ The following can be used to adjust parameters for the [profile_dists][] tool.
     ```
 - `--pd_count_missing`: Count missing alleles as different. Can be used as a flag, `--pd_count_missing`, or passing a boolean, `--pd_count_missing true` or `--pd_count_missing false`. If true, will consider missing allele calls for the same locus between samples as a difference, increasing the distance counts.
 
-## GAS mcluster
-
-The following can be used to adjust parameters for the [gas mcluster][] tool.
-
-- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_). Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
-- `--gm_method`: The linkage method to use for clustering. Value should be one of _single_, _average_, or _complete_.
-- `--gm_delimiter`: Delimiter desired for nomenclature code. Must be alphanumeric or one of `._-`.
-
 ## Other
 
-Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schmea.json).
+Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).
 
 # Running
 
@@ -103,37 +95,25 @@ An example of the what the contents of the IRIDA Next JSON file looks like for t
     "files": {
         "global": [
             {
-                "path": "ArborView/clustered_data_arborview.html"
-            },
-            {
-                "path": "clusters/run.json"
-            },
-            {
-                "path": "clusters/tree.nwk"
-            },
-            {
-                "path": "clusters/clusters.text"
-            },
-            {
-                "path": "clusters/thresholds.json"
+                "path": "process/results.xlsx"
             },
             {
-                "path": "distances/run.json"
+                "path": "process/results.tsv"
             },
             {
-                "path": "distances/results.text"
+                "path": "distances/profile_dists.run.json"
             },
             {
-                "path": "distances/ref_profile.text"
+                "path": "distances/profile_dists.results.text"
             },
             {
-                "path": "distances/query_profile.text"
+                "path": "distances/profile_dists.ref_profile.text"
             },
             {
-                "path": "distances/allele_map.json"
+                "path": "distances/profile_dists.query_profile.text"
             },
             {
-                "path": "merged/profile.tsv"
+                "path": "distances/profile_dists.allele_map.json"
             }
         ],
         "samples": {
@@ -148,11 +128,11 @@ An example of the what the contents of the IRIDA Next JSON file looks like for t
 }
 ```
 
-Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "ArborView/clustered_data_arborview.html"` refers to a file located within `outdir/ArborView/clustered_data_arborview.html`.
+Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "process/results.xlsx"` refers to a file located within `outdir/process/results.xlsx`.
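
As an illustration of resolving these relative paths, the following Python sketch reads the compressed IRIDA Next JSON file and joins each global entry onto the output directory. The function name `global_output_paths` is hypothetical, and the sketch assumes only the `files` / `global` / `path` keys shown in the example above.

```python
import gzip
import json
from pathlib import Path


def global_output_paths(outdir):
    """Return the global output files listed in iridanext.output.json.gz.

    Paths in the JSON are relative to the output directory, so each one
    is joined onto `outdir`.
    """
    outdir = Path(outdir)
    report_path = outdir / "iridanext.output.json.gz"
    with gzip.open(report_path, "rt", encoding="utf-8") as handle:
        report = json.load(handle)
    return [outdir / entry["path"] for entry in report["files"]["global"]]
```

Calling `global_output_paths("results")` on a finished run would then yield paths such as `results/process/results.xlsx`.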
 
-Details on the individual output files can be found in the [Output documentation](docs/output.md).
+Details on the individual output files can be found in the [Output Documentation](docs/output.md).
 
-## Test profile
+## Test Profile
 
 To run with the test profile, please do:
 
@@ -170,10 +150,7 @@ License at:
 
 https://opensource.org/license/mit/
 
-Unless required by applicable law or agreed to in writing, software distributed
-under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
-CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
 
 [profile_dists]: https://github.com/phac-nml/profile_dists
-[gas mcluster]: https://github.com/phac-nml/genomic_address_service
diff --git a/docs/output.md b/docs/output.md
@@ -6,38 +6,48 @@ This document describes the output produced by the pipeline.
 
 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
 
-- append: The passed metadata to the pipeline appended to cluster addresses defined by the clustering component.
-- ArborView: The ArborView visualization of a dendrogram alongside metadata.
-- clusters: The identified clusters from the [genomic_address_service](https://github.com/phac-nml/genomic_address_service).
-- distances: Distances between genomes from [profile_dists](https://github.com/phac-nml/profile_dists).
-- merged: The merged MLST JSON files into a single MLST profiles file.
-- pipeline_info: Information about the pipeline's execution
+- **append**: The metadata passed to the pipeline, appended to sample-sample distance pairings.
+- **distances**: Distances between genomes from [profile_dists](https://github.com/phac-nml/profile_dists).
+- **input**: MLST JSON files processed to ensure that the sample ID provided in the sample sheet matches the IDs provided in the MLST JSON file.
+- **merged**: The MLST JSON files merged into a single MLST profiles file.
+- **pipeline_info**: Information about the pipeline's execution.
+- **process**: Processed sample-sample distance pairings.
 
 The IRIDA Next-compliant JSON output file will be named `iridanext.output.json.gz` and will be written to the top-level of the results directory. This file is compressed using GZIP and conforms to the [IRIDA Next JSON output specifications](https://github.com/phac-nml/pipeline-standards#42-irida-next-json).
 
 ## Pipeline overview
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [Locidex merge](#locidex-merge) - Merges MLST profile JSON files into a single profiles file.
-- [Profile dists](#profile-dists) - Computes pairwise distances between genomes using MLST allele differences.
-- [GAS mcluster](#gas-mcluster) - Generates a hierarchical cluster tree alongside cluster addresses.
-- [Append metadata](#append-metadata) - Appends the passed input metadata to the identified cluster addresses.
-- [ArborView](#arborview) - Generates a visualization of the cluster tree alongside metadata.
-- [IRIDA Next Output](#irida-next-output) - Generates a JSON output file that is compliant with IRIDA Next
-- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+- [Input Assure](#input-assure) - Assures that the sample IDs provided in the sample sheet match the IDs provided in the MLST JSON files associated with each sample.
+- [Locidex Merge Query](#locidex-merge) - Merges query MLST profile JSON files into a single profiles file.
+- [Locidex Merge References](#locidex-merge) - Merges reference MLST profile JSON files into a single profiles file.
+- [Profile Dists](#profile-dists) - Computes pairwise distances between genomes using MLST allele differences.
+- [Append Metadata](#append-metadata) - Appends the passed input metadata to the pairwise distances.
+- [Process Output](#process-output) - Processes sample-sample distance pairings by distance threshold.
 
-### Locidex merge
+### Input Assure
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `input/`
+  - ID-corrected MLST JSON files: `sample1.mlst.json.gz`
+
+</details>
+
+### Locidex Merge
 
 <details markdown="1">
 <summary>Output files</summary>
 
 - `merged/`
-  - Merged MLST profiles: `profile.tsv`
+  - Merged MLST query profiles: `locidex.merge.profile_query.tsv`
+  - Merged MLST query and reference profiles: `locidex.merge.profile_reference.tsv`
 
 </details>
 
-### Profile dists
+### Profile Dists
 
 <details markdown="1">
 <summary>Output files</summary>
@@ -66,36 +76,24 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 </details>
 
-### GAS mcluster
-
-<details markdown="1">
-<summary>Output files</summary>
-
-- `clusters/`
-  - The computed cluster addresses: `clusters.text`
-  - Information on the GAS mcluster run: `run.json`
-  - Thesholds used to compute cluster addresses: `thresholds.json`
-  - Hierarchical clusters as a newick file: `tree.nwk`
-
-</details>
-
-### Append metadata
+### Append Metadata
 
 <details markdown="1">
 <summary>Output files</summary>
 
 - `append/`
-  - The passed input metadata columns appended to the cluster addresses file: `clusters_and_metadata.tsv`
+  - The passed input metadata columns appended to the pairwise distances: `distances_and_metadata.tsv`
 
 </details>
 
-### ArborView
+### Process Output
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `ArborView/`
-  - The ArborView visualization of clusters and metadata: `clustered_data_arborview.html`
+- `process/`
+  - Pairwise distance results meeting specifications in TSV-format: `results.tsv`
+  - Pairwise distance results meeting specifications in XLSX-format: `results.xlsx`
 
 </details>
 
@@ -109,7 +107,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 </details>
 
-### Pipeline information
+### Pipeline Information
 
 <details markdown="1">
 <summary>Output files</summary>