A repository of scripts/code used across multiple projects.
- Calculates common assembly statistics for one or more FASTA files.
- Input: takes either a single FASTA or a directory of FASTA files.
Usage:
asm_stats.py --fasta <FASTA>
asm_stats.py --fasta_dir <fasta_dir/>
Parameters:
-f FILE, --file FILE Input FASTA file
--fasta_dir FASTA_DIR Directory containing FASTA files
--fasta FASTA Single FASTA file
--gap GAP Minimum gap length to be considered a scaffold (optional) [2]
--output OUTPUT Output file prefix (optional) ['sample']
--version Show program's version number and exit
- Output: CSV file containing assembly metrics with columns:
- sampleid : Based on the first part of the input FASTA filename (delimiter: '.')
- assembly_length_bp : Assembly length (bp)
- scaffold_count : Number of scaffolds
- scaffold_N50_bp : Scaffold N50 (bp)
- scaffold_N90_bp : Scaffold N90 (bp)
- contig_count : Number of contigs
- contig_N50_bp : Contig N50 (bp)
- contig_N90_bp : Contig N90 (bp)
- GC_perc : GC content (%)
- gaps_count : Number of gaps (min length to define a gap can be changed with --gap parameter)
- gaps_sum_bp : Total gap length (bp)
- gaps_perc : Proportion of genome composed of gaps (%)
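The N50/N90 and contig metrics above can be illustrated with a minimal sketch. It is not the code in `asm_stats.py`; the function names `nx` and `split_contigs` are hypothetical, and it assumes that a gap is a run of `N` characters at least `--gap` long and that contigs are obtained by splitting scaffolds at those gaps.

```python
import re


def nx(lengths, fraction=0.5):
    """Return the Nx length: the largest L such that sequences of length >= L
    together cover at least `fraction` of the total length (0.5 gives N50)."""
    if not lengths:
        return 0
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


def split_contigs(scaffold, min_gap=2):
    """Split a scaffold into contigs at runs of N of at least `min_gap` bases.
    Assumption: this mirrors what the --gap parameter controls."""
    pattern = r"[Nn]{%d,}" % min_gap
    return [contig for contig in re.split(pattern, scaffold) if contig]


scaffold_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(scaffold_lengths, 0.5)
n90 = nx(scaffold_lengths, 0.9)
contigs = split_contigs("ACGTNNACGNA", min_gap=2)
```

With `min_gap=2`, the single `N` in the example stays inside its contig, while the `NN` run splits the scaffold in two.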
- Calculates the Adjusted Rand Index between clustering schemes.
- Input: takes two or more tab- or comma-separated files.
- Input format: sample_name,cluster_id
Usage:
ari.py <file1> <file2> <file3>
- Output: pairwise ARI values in both list and matrix format
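For reference, the Adjusted Rand Index for one pair of clustering schemes can be computed as in the sketch below. This is a plain-Python illustration of the standard formula, not necessarily how `ari.py` computes it; the function name `adjusted_rand_index` is hypothetical, and the two label lists are assumed to be in the same sample order.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same samples")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Count sample pairs that share a cluster in both, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. every sample in one cluster in both schemes.
        return 1.0
    return (both - expected) / (maximum - expected)


same = adjusted_rand_index([1, 1, 2, 2], ["a", "a", "b", "b"])
opposed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Cluster labels only need to be consistent within a file, so relabelled but otherwise identical schemes score 1.0.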
- Calculates the cosine similarity, Jaccard similarity, longest common substring and, optionally, Abstract Syntax Tree (AST) similarity (not fully tested) between input files.
- Input: two directories of files to be compared (currently limited to .md, .R or .py files)
Usage:
code_similarity.py --dir1 <first directory/> --dir2 <second directory/>
Parameters:
options:
-h, --help show this help message and exit
--dir1 DIR1 Path to the first directory of scripts
--dir2 DIR2 Path to the second directory of scripts
--include-comments Include comments in the similarity comparison
--enable-ast Enable AST-based similarity comparison for Python files
--keep-duplicates Keep duplicate lines in the similarity comparison
- Output: pairwise similarity values in both list and matrix format
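The three text-based measures can be sketched as follows. This is an illustration only: whitespace tokenisation and the function names are assumptions, and `code_similarity.py` may tokenise, strip comments and de-duplicate lines differently.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def tokenize(text):
    # Assumption: tokens are whitespace-separated words.
    return text.split()


def cosine_similarity(a, b):
    """Cosine of the angle between the token-count vectors of two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Shared tokens divided by all distinct tokens across both texts."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def longest_common_substring(a, b):
    """Longest run of characters that appears contiguously in both texts."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


cos = cosine_similarity("a b c", "a b d")
jac = jaccard_similarity("a b c", "a b d")
lcs = longest_common_substring("xxhelloyy", "zzhellozz")
```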
- Counts the pairwise row and column differences between files.
- Input: two directories of CSV files to be compared.
Usage:
code_similarity.py --dir1 <first directory/> --dir2 <second directory/> --output <output file prefix>
Parameters:
options:
-h, --help show this help message and exit
--dir1 DIR1 Path to the first directory
--dir2 DIR2 Path to the second directory
--output OUTPUT Output TSV file for comparison results
- Output: 3 files:
- {prefix}.tsv: Pairwise comparisons between files, with filenames (cols 1 and 2), counts of row differences (col 3) and column differences (col 4), names of rows that have changed (col 5) and names of columns that have changed (col 6)
- {prefix}.row_diff_matrix.tsv: Matrix of pairwise row differences (counts)
- {prefix}.col_diff_matrix.tsv: Matrix of pairwise column differences (counts)
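One way to count row and column differences between two tables is sketched below. The definitions here are assumptions, not taken from the script: the first column is treated as the row name, a row counts as changed if it is missing from one file or any of its cells differ, and a column counts as changed if any cell in a shared row differs. The helper names `load_table` and `diff_tables` are hypothetical.

```python
import csv
import io


def load_table(text):
    """Parse CSV text into {row_name: {column_name: value}}."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0][1:]
    return {row[0]: dict(zip(header, row[1:])) for row in rows[1:]}


def diff_tables(table_a, table_b):
    """Return the sorted names of changed rows and changed columns."""
    changed_rows, changed_cols = set(), set()
    for name in table_a.keys() | table_b.keys():
        row_a, row_b = table_a.get(name), table_b.get(name)
        if row_a is None or row_b is None:
            changed_rows.add(name)  # row present in only one file
            continue
        for col in row_a.keys() | row_b.keys():
            if row_a.get(col) != row_b.get(col):
                changed_rows.add(name)
                changed_cols.add(col)
    return sorted(changed_rows), sorted(changed_cols)


first = load_table("id,x,y\ns1,1,2\ns2,3,4\n")
second = load_table("id,x,y\ns1,1,2\ns2,3,5\ns3,0,0\n")
rows_changed, cols_changed = diff_tables(first, second)
```

The lengths of the two returned lists correspond to the row and column difference counts for that pair of files.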
- Renames, subsets and/or sorts FASTA files.
- Input: FASTA file
Usage:
fa_select.py -f <FASTA>
Parameters:
-f FILE, --file FILE Input FASTA file
-s, --sort Sort by header name
-l LENGTH, --length LENGTH Keep only contigs longer than the specified length
-p PREFIX, --prefix PREFIX Append prefix to each contig header
-i INCLUDE, --include INCLUDE File with list of contig headers to include
-e EXCLUDE, --exclude EXCLUDE File with list of contig headers to exclude
-o OUTPUT, --output OUTPUT Output file name
-v, --version Show program's version number and exit
- Output: FASTA file
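The length filter, prefix and sort options can be pictured with the sketch below. It is a simplified stand-in for `fa_select.py`, with hypothetical function names, and it assumes the prefix is placed at the start of each header and that the length filter is strictly "longer than".

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select(records, min_length=0, prefix="", sort=False):
    """Keep sequences longer than min_length, add a prefix, optionally sort."""
    kept = [(prefix + h, s) for h, s in records if len(s) > min_length]
    if sort:
        kept.sort(key=lambda record: record[0])
    return kept


fasta_lines = [">b", "ACGT", "AC", ">a", "AC", ">c", "ACGTACGT"]
records = read_fasta(fasta_lines)
selected = select(records, min_length=2, prefix="s1_", sort=True)
```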
- Splits an Excel spreadsheet into individual files, one per sheet, each named after its tab.
Usage:
split_spreadsheet.py -e <spreadsheet> -o <output directory name>
Parameters:
options:
-h, --help show this help message and exit
-e EXCEL_FILE_PATH, --excel_file_path EXCEL_FILE_PATH
Path to the Excel file.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Directory to save the output files.
-f {csv,tsv}, --format {csv,tsv}
Output format: csv or tsv (default: csv)
- Output: Directory with individual TSV or CSV files
- Switches the names of two files.
Usage:
switch_names.py --file1 <first_file> --file2 <second_file>
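Swapping two filenames needs a temporary third name so that neither file is overwritten. The sketch below shows one way to do this with the standard library; it is not necessarily how `switch_names.py` works, and it assumes both files are on the same filesystem.

```python
import os
import tempfile


def switch_names(path1, path2):
    """Swap the names of two files via a temporary placeholder name."""
    directory = os.path.dirname(os.path.abspath(path1))
    # Reserve a unique temporary name next to the first file.
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    os.replace(path1, tmp)    # file 1 moves to the temporary name
    os.replace(path2, path1)  # file 2 takes the first name
    os.replace(tmp, path2)    # file 1 takes the second name
```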
- For a specified column, returns one representative row for each unique value in the dataframe. Similar to 'uniq' on a single column, but returns the whole row.
- Input: takes a tab-separated file with a header. The relevant column must be specified with the --col parameter. The --mode parameter determines whether the first row containing each value is retained or a row is selected at random.
Usage:
unique_values.py --input <input file> --col <column name> --mode <first|random>
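The two modes can be illustrated with the standard library alone. This is a sketch rather than the script's implementation; the function name `unique_rows` and the `seed` argument are hypothetical additions for reproducibility.

```python
import csv
import io
import random


def unique_rows(text, col, mode="first", seed=None):
    """Return one row (as a dict) per unique value in column `col`."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    groups = {}
    for row in reader:
        # Group rows by their value in the chosen column, in file order.
        groups.setdefault(row[col], []).append(row)
    if mode == "first":
        return [rows[0] for rows in groups.values()]
    if mode == "random":
        rng = random.Random(seed)
        return [rng.choice(rows) for rows in groups.values()]
    raise ValueError("mode must be 'first' or 'random'")


table = "id\tgroup\na\tx\nb\tx\nc\ty\n"
first_rows = unique_rows(table, "group", mode="first")
random_rows = unique_rows(table, "group", mode="random", seed=1)
```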