Merge pull request #124 from sanger-tol/fixes
Module updates but tests are not running
priyanka-surana authored Aug 8, 2023
2 parents d592835 + af026dc commit f959061
Showing 182 changed files with 1,979 additions and 1,645 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/linting.yml
@@ -84,7 +84,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install nf-core
pip install nf-core==2.8.0
- name: Run nf-core lint
env:
2 changes: 2 additions & 0 deletions .nf-core.yml
@@ -5,6 +5,8 @@ lint:
- docs/images/nf-core-treeval_logo_light.png
- docs/images/nf-core-treeval_logo_dark.png
files_unchanged:
- .github/workflows/linting.yml
- .github/CONTRIBUTING.md
- LICENSE
- .github/ISSUE_TEMPLATE/bug_report.yml
- assets/sendmail_template.txt
60 changes: 52 additions & 8 deletions CHANGELOG.md
100644 → 100755
@@ -3,16 +3,60 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.0] - Ancient Atlantis - [2023-06-12]
## [1.0.0] - Ancient Atlantis - [2023-06-27]

Initial release of sanger-tol/treeval, created with the [nf-core](https://nf-co.re/) template.

The essential pathways of the gEVAL pipeline have now been converted to Nextflow DSL2 from vr-runner, snakemake and wr. Of the original pipeline there is only Bionano left to implement.

### `Added`

### `Fixed`

### `Dependencies`

### `Deprecated`
### Enhancements & Fixes

- Updated to nf-core/tools template v2.8.0.
- Subworkflow to generate channels from input yaml.
- Subworkflow to generate genome summary file using samtools
- Subworkflow to generate busco gene tracks and ancestral busco mapping.
- Subworkflow to generate HiC maps with cooler, juicebox and pretext.
- Subworkflow to generate gene alignments using miniprot and minimap2.
- Subworkflow to generate insilico digest tracks.
- Subworkflow to generate longread coverage tracks from pacbio data.
- Subworkflow to generate punchlists detailing regions of interest in the genome.
- Subworkflow to generate repeat density tracks.
- Subworkflow to generate tracks detailing self complementary regions.
- Subworkflow to generate syntenic alignments to high quality genomes.
- Subworkflow to generate tracks containing telomeric sites.
- Custom Groovy for reporting to provide file metrics and resource usage.

### Parameters

| Old Parameter | New Parameter |
| ------------- | ------------- |
| - | --input |

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.

| Module | Old Version | New Versions |
| ------------------------------ | ----------- | ---------------- |
| bedtools | - | 2.31.0 |
| busco | - | 5.4.3 |
| bwa-mem2 | - | 2.2.1 |
| cat | - | 2.3.4 |
| cooler | - | 0.9.2 |
| gnu-sort | - | 8.25 |
| minimap2 + samtools | - | 2.24 + 1.14 |
| miniprot | - | 0.11--he4a0461_2 |
| mummer | - | 3.23 |
| paftools (minimap2 + samtools) | - | 2.24 + 1.14 |
| pretextmap + samtools | - | 0.1.9 + 1.17 |
| samtools | - | 1.17 |
| seqtk | - | 1.4 |
| tabix | - | 1.11 |
| ucsc | - | 377 |
| windowmasker (blast) | - | 2.14.0 |

### Fixed

### Dependencies

### Deprecated
Empty file modified CITATIONS.md
100644 → 100755
Empty file.
Empty file modified CODE_OF_CONDUCT.md
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion LICENSE
100644 → 100755
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2022-2023 Genome Research Ltd.
Copyright (c) 2022 - 2023 Genome Research Ltd.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
80 changes: 22 additions & 58 deletions README.md
100644 → 100755
@@ -7,63 +7,25 @@

## Introduction

**sanger-tol/treeval** is a bioinformatics best-practice analysis pipeline for the generation of data supplemental to the curation of reference quality genomes. This pipeline has been written to generate flat files compatable with [JBrowse2](https://jbrowse.org/jb2/).
**sanger-tol/treeval** is a bioinformatics best-practice analysis pipeline for the generation of data supplemental to the curation of reference quality genomes. This pipeline has been written to generate flat files compatible with [JBrowse2](https://jbrowse.org/jb2/).

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

## Pipeline summary

The version 1 pipeline will be made up of the following steps, (r) = Steps run in Rapid:

- INPUT_READ (r)

> The reading of the input yaml and conversion into channels for the sub-workflows.
- GENERATE_GENOME (r)

> Generate .genome for the input genome using SAMTOOLS FAIDX.
- GENERATE_ALIGNMENT

> Peptides will run pep_alignment.nf with Miniprot.
> CDNA, RNA and CDS will run through nuc_alignment.nf with Minimap2.
- INSILICO DIGEST

> Generates a map of enzymatic digests using 3 Bionano enzymes.
- SELFCOMP

> Identifies regions of self-complementary sequencs using Mummer.
- SYNTENY

> Generates syntenic alignments between other high quality genomes via Minimap2.
- BUSCO_ANNOTATION

> Lepidopteran Element Analysis. Using BUSCO and custom python scripts to parse ancestral Lepidoptera gene. This will eventually have a number of clade specific sub-workflows.
> BUSCO genes extraction based on BUSCO full_table.tsv.
- LONGREAD_COVERAGE (r)

> Calculating the coverage of reads across the genome.
- FIND_GAPS (r)

> Identifying gaps in the input genome using seqtk cutn.
- FIND_TELOMERE (r)

> Identify sites of a given telomeric sequence.
- REPEAT_DENSITY (r)

> Generate a graph showing the relative amount of repeat in a given chunk.
- HIC_MAPPING (r)
> Generation of HiC maps for the curation of a genome, these include: pretext_hires, pretext_lowres and cooler maps.
The treeval pipeline has a sister pipeline currently named [curationpretext](https://github.com/sanger-tol/curationpretext) which acts to regenerate the pretext maps and accessory files during genomic curation in order to confirm interventions. This pipeline is sufficiently different to the treeval implementation that it is written as it's own pipeline.

1. Parse input yaml ( YAML_INPUT )
2. Generate my.genome file ( GENERATE_GENOME )
3. Generate insilico digests of the input assembly ( INSILICO_DIGEST )
4. Generate gene alignments with high quality data against the input assembly ( GENE_ALIGNMENT )
5. Generate a repeat density graph ( REPEAT_DENSITY )
6. Generate a gap track ( GAP_FINDER )
7. Generate a map of self complementary sequence ( SELFCOMP )
8. Generate syntenic alignments with a closely related high quality assembly ( SYNTENY )
9. Generate a coverage track using PacBio data ( LONGREAD_COVERAGE )
10. Generate HiC maps, pretext and higlass using HiC cram files ( HIC_MAPPING )
11. Generate a telomere track based on input motif ( TELO_FINDER )
12. Run Busco and convert results into bed format ( BUSCO_ANNOTATION )
13. Ancestral Busco linkage if available for clade ( BUSCO_ANNOTATION:ANCESTRAL_GENE )

## Usage

@@ -72,15 +34,17 @@ The version 1 pipeline will be made up of the following steps, (r) = Steps run i
> to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline)
> with `-profile test` before running the workflow on actual data.
Currently, it is advised to run the pipeline with docker or singularity as a small number of major modules do not currently have a conda env associated with them.

Now, you can run the pipeline using:

```bash
nextflow run main.nf -profile singularity --input treeval.yaml -entry {FULL|RAPID} --outdir {OUTDIR}
```

## Documentation
An example treeval.yaml can be found [here](assets/local_testing/nxOscDF5033.yaml).

The sanger-tol/treeval pipeline comes with documentation about the pipeline [usage](https://nf-co.re/treeval/usage), [parameters](https://nf-co.re/treeval/parameters) and [output](https://nf-co.re/treeval/output).
Further documentation about the pipeline can be found in the following files: [usage](https://nf-co.re/treeval/usage), [parameters](https://nf-co.re/treeval/parameters) and [output](https://nf-co.re/treeval/output).

> **Warning:**
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
@@ -94,11 +58,11 @@ sanger-tol/treeval has been written by Damon-Lee Pointon (@DLBPointon), Yumi Sim
We thank the following people for their extensive assistance in the development of this pipeline:

<ul>
<li>@muffato - For code reviews and code support</li>
<li>@gq1 - For building the infrastructure around TreeVal</li>
<li>@ksenia-krasheninnikova - For help with C code implementation and YAML parsing</li>
<li>@priyanka-surana - For help with the majority of code reviews and code support</li>
<li>@mcshane - For guidance on algorithms </li>
<li>@muffato - For code reviews and code support</li>
<li>@priyanka-surana - For help with the majority of code reviews and code support</li>
</ul>

## Contributions and Support
2 changes: 1 addition & 1 deletion assets/adaptivecard.json
100644 → 100755
@@ -16,7 +16,7 @@
"type": "TextBlock",
"size": "Large",
"weight": "Bolder",
"color": "<% if (success) { %>Good<% } else { %>Attention<%} %>",
"color": "<% if ( success ) { %>Good<% } else { %>Attention<%} %>",
"text": "sanger-tol/treeval v${version} - ${runName}",
"wrap": true
},
10 changes: 5 additions & 5 deletions assets/digest/digest.as
100644 → 100755
@@ -1,9 +1,9 @@
table insilico_digest
"bionano digest cut sites"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of feature on chromosome"
uint chromEnd; "End position of feature on chromosome"
string name; "Name of enzyme"
string length; "length of fragment"
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of feature on chromosome"
uint chromEnd; "End position of feature on chromosome"
string name; "Name of enzyme"
string length; "length of fragment"
)
1 change: 0 additions & 1 deletion assets/full_s3_treeval_test.yaml
100644 → 100755
@@ -1,5 +1,4 @@
assembly:
sizeClass: "" # S if {genome => 4Gb} else L
level: scaffold
sample_id: nxOscDoli1
classT: nematode
2 changes: 1 addition & 1 deletion assets/gene_alignment/assm_cdna.as
100644 → 100755
@@ -8,5 +8,5 @@ string name; "Name of gene"
uint score; "Score"
char[1] strand; "+ or - for strand"
string geneSymbol; "Gene Symbol"
string ensemblId; "Ensembl Accession number"
string ensemblId; "Ensembl Accession number"
)
2 changes: 1 addition & 1 deletion assets/gene_alignment/assm_cds.as
100644 → 100755
@@ -8,5 +8,5 @@ string name; "Name of gene"
uint score; "Score"
char[1] strand; "+ or - for strand"
string geneSymbol; "Gene Symbol"
string ensemblId; "Ensembl Accession number"
string ensemblId; "Ensembl Accession number"
)
2 changes: 1 addition & 1 deletion assets/gene_alignment/assm_pep.as
100644 → 100755
@@ -8,5 +8,5 @@ string name; "Name of gene"
uint score; "Score"
char[1] strand; "+ or - for strand"
string geneSymbol; "Gene Symbol"
string ensemblId; "Ensembl Accession number"
string ensemblId; "Ensembl Accession number"
)
2 changes: 1 addition & 1 deletion assets/gene_alignment/assm_rna.as
100644 → 100755
@@ -8,5 +8,5 @@ string name; "Name of gene"
uint score; "Score"
char[1] strand; "+ or - for strand"
string geneSymbol; "Gene Symbol"
string ensemblId; "Ensembl Accession number"
string ensemblId; "Ensembl Accession number"
)
31 changes: 0 additions & 31 deletions assets/local_testing/nxOsc-2023-05-02.dp.TEST.md

This file was deleted.

3 changes: 1 addition & 2 deletions assets/local_testing/nxOscDF5033.yaml
100644 → 100755
@@ -4,8 +4,7 @@ assembly:
sample_id: Oscheius_DF5033
latin_name: to_provide_taxonomic_rank
classT: nematode
asmVersion: Oscheius_DF5033_1
dbVersion: "1"
asmVersion: 1
gevalType: DTOL
reference_file: /lustre/scratch123/tol/resources/treeval/nextflow_test_data/Oscheius_DF5033/assembly/draft/DF5033.hifiasm.noTelos.20211120/DF5033.noTelos.hifiasm.purged.noCont.noMito.fasta
#/lustre/scratch123/tol/resources/treeval/nextflow_test_data/Oscheius_DF5033/assembly/draft/DF5033.hifiasm.noTelos.20211120/DF5033.noTelos.hifiasm.purged.noCont.noMito.fasta
3 changes: 1 addition & 2 deletions assets/local_testing/nxOscSUBSET.yaml
100644 → 100755
@@ -4,8 +4,7 @@ assembly:
sample_id: OscheiusSUBSET
latin_name: to_provide_taxonomic_rank
classT: nematode
asmVersion: OscheiusSUBSET_1
dbVersion: "1"
asmVersion: 1
gevalType: DTOL
reference_file: /lustre/scratch123/tol/resources/treeval/nextflow_test_data/Oscheius_SUBSET/assembly/draft/SUBSET_genome/Oscheius_SUBSET.fasta
#/lustre/scratch123/tol/resources/treeval/nextflow_test_data/Oscheius_DF5033/assembly/draft/DF5033.hifiasm.noTelos.20211120/DF5033.noTelos.hifiasm.purged.noCont.noMito.fasta
4 changes: 2 additions & 2 deletions assets/methods_description_template.yml
100644 → 100755
@@ -3,8 +3,8 @@ description: "Suggested text and references to use when describing pipeline usag
section_name: "sanger-tol/treeval Methods Description"
section_href: "https://github.com/sanger-tol/treeval"
plot_type: "html"
## TODO nf-core: Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline
## You inject any metadata in the Nextflow '${workflow}' object
## TODO nf-core: Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline
## You inject any metadata in the Nextflow '${workflow}' object
data: |
<h4>Methods</h4>
<p>Data was processed using sanger-tol/treeval v${workflow.manifest.version} ${doi_text} of the sanger-tol collection of workflows, created using nf-core (<a href="https://doi.org/10.1038/s41587-020-0439-x">Ewels <em>et al.</em>, 2020</a>).</p>
Empty file modified assets/multiqc_config.yml
100644 → 100755
Empty file.
10 changes: 5 additions & 5 deletions assets/nematode/csv_data/s3_Gae_Host.Gae-data.csv
100644 → 100755
@@ -1,5 +1,5 @@
org,type,data_file
Gae_host.Gae,cdna,https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host5000cdna.MOD.fa
Gae_host.Gae,cds,https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host12003cds.MOD.fa
Gae_host.Gae,pep,https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host12005pep.MOD.fa
Gae_host.Gae,rna,https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host18005rna.MOD.fa
org, type, data_file
Gae_host.Gae, cdna, https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host5000cdna.MOD.fa
Gae_host.Gae, cds, https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host12003cds.MOD.fa
Gae_host.Gae, pep, https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host12005pep.MOD.fa
Gae_host.Gae, rna, https://tolit.cog.sanger.ac.uk/test-data/Gae_host/genomic_data/gene_alignment/Gae_host18005rna.MOD.fa
Binary file removed assets/nf-core-treeval_logo_light.png
Binary file not shown.
1 change: 0 additions & 1 deletion assets/s3_treeval_test.yaml
100644 → 100755
@@ -1,5 +1,4 @@
assembly:
sizeClass: "" # S if {genome => 4Gb} else L
level: scaffold
sample_id: nxOscDoli1
classT: nematode