Ensembl Genomio Pipelines:

Genomio prepare pipeline

Module [Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf]

Genome prepare pipeline for BRC/Metazoa

Description

Retrieve data for a genome from INSDC and prepare the following files in a separate folder for each genome:

  • FASTA for DNA sequences
  • FASTA for protein sequences
  • GFF gene models
  • JSON functional annotation
  • JSON seq_region
  • JSON genome
  • JSON manifest

The JSON files follow the schemas defined in the src/python/ensembl/io/genomio/data/schemas folder.

These files can then be fed to the Genome loader pipeline.

How to run

init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf \
    --host $HOST --port $PORT --user $USER --pass $PASS \
    --hive_force_init 1 \
    --pipeline_dir temp/prepare \
    --data_dir $INPUT \
    --output_dir $OUTPUT \
    ${OTHER_OPTIONS}
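
The command above relies on several shell variables being set beforehand. The following is a minimal pre-flight sketch, not part of the pipeline: the helper name `check_prepare_env` is hypothetical, and it only checks that the variables used in the command are non-empty and that `$INPUT` is an existing directory. Note that most shells already set `$USER` to the login name, so the database user may need to be set explicitly.

```shell
# Hypothetical helper (an assumption, not shipped with the pipeline):
# checks the variables used by the init_pipeline.pl command above.
check_prepare_env() {
    missing=""
    for name in HOST PORT USER PASS INPUT OUTPUT; do
        # Look up the value of the variable whose name is in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing variables:$missing" >&2
        return 1
    fi
    # --data_dir must point at the directory holding the genome JSON files
    if [ ! -d "$INPUT" ]; then
        echo "INPUT is not a directory: $INPUT" >&2
        return 1
    fi
    return 0
}
```

Calling `check_prepare_env && init_pipeline.pl ...` would then stop early with a message instead of failing partway through initialisation.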

Parameters

| option | default value | meaning |
| --- | --- | --- |
| --pipeline_name | brc4_genome_prepare | name of the hive pipeline |
| --pipeline_dir | | temp directory for this pipeline run |
| --data_dir | | directory with JSON files for each genome to prepare, following the format set by src/python/ensembl/io/genomio/data/schemas/genome.json |
| --output_dir | | directory where the prepared files are to be stored |
| --merge_split_genes | 0 | Gene features are sometimes split in a GFF file. Ensembl expects genes to be contiguous, so this option merges the parts into one. |
| --exclude_seq_regions | | Do not include these seq_regions (applies to all genomes; this should seldom be used) |
| --validate_gene_id | 0 | Enforce a strong gene ID pattern (replaced by the GeneID if available) |
| --ensembl_mode | 0 | By default, set additional metadata for BRC genomes. With this parameter, use vanilla Ensembl metadata. |
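
As an illustration of how the optional parameters listed above could be passed through `${OTHER_OPTIONS}`, here is a small dry-run sketch. The function name `build_prepare_cmd` is hypothetical, and the chosen flag values are only an example. It prints the command for review rather than running it, and masks the password in the preview.

```shell
# Hypothetical dry-run helper (an assumption, not part of the pipeline):
# prints the init_pipeline.pl command with extra options appended.
build_prepare_cmd() {
    # $1: extra options as a single string, e.g. "--merge_split_genes 1"
    echo "init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf" \
        "--host $HOST --port $PORT --user $USER --pass ***" \
        "--hive_force_init 1" \
        "--pipeline_dir temp/prepare" \
        "--data_dir $INPUT" \
        "--output_dir $OUTPUT" \
        "$1"
}

# Example: merge split genes and enforce the gene ID pattern
OTHER_OPTIONS="--merge_split_genes 1 --validate_gene_id 1"
build_prepare_cmd "$OTHER_OPTIONS"
```

Each option is written as a flag followed by its value, matching the `--hive_force_init 1` form used in the run command.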