james hadfield edited this page Aug 14, 2017 · 15 revisions

Data formats in bioinformatics can be problematic so I have tried to make this detailed enough. While we try to identify errors upon parsing, this is hard to do. Please get in touch if your data doesn't work.

The files mentioned on this page are taken from the examples on phandango.net. Please note that in most cases individual files cannot be loaded on their own - for instance, metadata cannot be loaded without a phylogeny. Individual files are referenced here solely to illustrate the required file formats. The example datasets in their entirety are available here.

Phylogenies (Trees)

Phylogenies form the backbone of the visualisation as they link together all the other data. While it is possible to use Phandango without them for GWAS-type graphs, all metadata, recombination blocks and pan-genome content rely on them.

Trees must be in Newick format and the file name must end in .tre or .tree (example here). Newick is the standard output of most, but not all, tree-building software (e.g. RAxML). If your tree is in a different format, try converting it with FigTree, but watch out - single quotes are often added around taxon names, and these must be removed manually! Regrettably, Nexus files are not currently supported.
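If a converted tree comes back with quoted taxon names, the quotes can be stripped with a few lines of Python rather than by hand. This is a minimal sketch and not part of Phandango: the function name and the example tree are made up for illustration, and it assumes the labels contain no characters (spaces, brackets, colons, commas) that would need quoting in Newick.

```python
import re


def strip_single_quotes(newick: str) -> str:
    """Remove single quotes wrapped around taxon labels in a Newick string."""
    # Replace every 'label' with label, leaving the rest of the tree untouched.
    return re.sub(r"'([^']*)'", r"\1", newick)


# Hypothetical tree, as it might look after export with quoted names.
quoted_tree = "(('taxon_A':0.1,'taxon_B':0.2):0.05,taxon_C:0.3);"
clean_tree = strip_single_quotes(quoted_tree)
print(clean_tree)
```

Check the result by eye before loading it: if a label did need its quotes, removing them will leave an invalid tree.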

Metadata

Metadata is displayed to the right of the tree. A corresponding tree with matching taxon names must be loaded for the metadata to be displayed. Which columns are displayed can be controlled in the settings menu, and a key can be toggled by pressing k.

Format:

  • comma separated values (CSV) file (example here)
  • File ending in .csv
  • The first line is used for the column headers
  • The first column contains the taxon names, which must match those in the tree

Colour selection:

  • The colour scale depends on the type of data in each column (binary, ordinal or continuous), which is inferred from the data, but this is far from perfect!
  • Adding :o or :c to the end of a column name (in the first row) forces the scale to be ordinal or continuous, respectively. E.g. a header named year:c forces the colours to be drawn from a continuous scale.
  • If you want multiple columns to use the same colours for the same values (e.g. so that the value 42 is the same colour in each column), then group these columns by adding an integer to the suffix - e.g. :o1. You can have as many groups as you like.
  • You can specify your own colours as hex values in a separate column. The header of the hex-value column must be the same as that of the data column with :colour attached, and it must come after the data column. E.g. if you have a column named year you can add a second column titled year:colour containing hex values to use as colours.
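The header rules above are easier to see in a concrete file. The sketch below builds a small metadata CSV with Python's csv module; every taxon name, column name and value is invented for illustration, and the layout is only one reading of the rules described here (a continuous column, two ordinal columns sharing colour group 1, and a data column followed by its :colour column).

```python
import csv
import io

# Hypothetical column headers: the first column holds the taxon names.
header = ["taxon", "year:c", "serotype_2010:o1", "serotype_2015:o1",
          "clade", "clade:colour"]

# Hypothetical rows; taxon names must match the tips of the loaded tree.
rows = [
    ["taxon_A", 1998, "19F", "19A", "alpha", "#1f77b4"],
    ["taxon_B", 2004, "19A", "19A", "beta", "#ff7f0e"],
    ["taxon_C", 2011, "23F", "19F", "alpha", "#1f77b4"],
]


def build_metadata_csv(header, rows):
    """Return the metadata table as CSV text with the header on line one."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


metadata_csv = build_metadata_csv(header, rows)
print(metadata_csv)
```

Saving this text to a file ending in .csv gives a table in the shape described in the Format list.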

Genome Annotations

Annotations appear in the top right of the display and are required for visualising recombination / GWAS results. They must be in GFF3 format and end in .gff or .gff3. Parsing GFF files is error prone so it's worth looking at an example file, especially the first two lines:

##gff-version 3
##sequence-region <chromosome name> 1 <chromosome length>
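These two header lines can be checked before loading a file. The following is a rough sketch of such a check, not Phandango's own parser: the function name is hypothetical, and it assumes the header appears exactly as shown above, with the version line first and the sequence-region line second.

```python
def read_gff3_header(text):
    """Return (chromosome name, start, end) from the first two GFF3 lines."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least two header lines")
    if lines[0].split() != ["##gff-version", "3"]:
        raise ValueError("first line should be '##gff-version 3'")
    parts = lines[1].split()
    if len(parts) != 4 or parts[0] != "##sequence-region":
        raise ValueError(
            "second line should be '##sequence-region <name> 1 <length>'")
    name, start, end = parts[1], int(parts[2]), int(parts[3])
    return name, start, end


# Hypothetical header for a 5000 bp chromosome called chr1.
example_header = "##gff-version 3\n##sequence-region chr1 1 5000\n"
print(read_gff3_header(example_header))
```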

Converting to GFF3:

  • Can often be done using Artemis
  • Can be done on the command line with seqret via seqret -sequence EMBL_FILE_NAME -feature -fformat embl -fopenfile GFF_FILE_NAME -osformat gff -auto

Display:

  • All of the semi-colon separated fields are read and displayed when you hover over a gene / region.
  • If colour appears in the info field then genes are coloured similarly to Artemis.

Genomic data (recombination blocks, pan genome output)

Currently three different file types are parsed, but it shouldn't be too hard to convert any block-like data into one of these formats.

Gubbins

Gubbins output is in GFF3 format and the file name must end in .gff or .gff3, as for the genome annotation (example here). If you have an old Gubbins output file (e.g. *rec.tab) then there is a simple Python script here which will convert it for you. The Gubbins software is available here.

Essential fields:

  • The second field of each line (except the headers) must be GUBBINS, to distinguish these files from annotation GFFs.
  • The semi-colon separated info string (field 9) must contain the following keys: neg_log_likelihood, taxa and snp_count
  • Values are surrounded with double quotes, e.g. snp_count="7";
  • The taxa field is a list of whitespace separated taxon names which must match taxa in the tree in order to be displayed.
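The essential fields listed above can be pulled out of a single line as follows. This is a sketch of one way to read such a line, not the parser Phandango uses: the function name and the example line are invented, and it assumes tab-separated fields with the start and end co-ordinates in fields 4 and 5, as in standard GFF3.

```python
REQUIRED_KEYS = {"neg_log_likelihood", "taxa", "snp_count"}


def parse_gubbins_line(line):
    """Parse one Gubbins GFF3 line; return None for non-Gubbins lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[1] != "GUBBINS":
        return None
    info = {}
    for item in fields[8].split(";"):
        if "=" not in item:
            continue  # skips the empty string after the trailing semi-colon
        key, value = item.split("=", 1)
        # Values are wrapped in double quotes, e.g. snp_count="7"
        info[key.strip()] = value.strip().strip('"')
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    return {
        "start": int(fields[3]),
        "end": int(fields[4]),
        "neg_log_likelihood": float(info["neg_log_likelihood"]),
        "taxa": info["taxa"].split(),  # whitespace-separated taxon names
        "snp_count": int(info["snp_count"]),
    }


# Hypothetical line; real files will carry their own values.
example = ('SEQUENCE\tGUBBINS\tCDS\t1000\t2000\t0.000\t.\t0\t'
           'neg_log_likelihood="1234.5";taxa=" taxon_A taxon_B";snp_count="7";')
block = parse_gubbins_line(example)
print(block)
```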

BRAT NextGen

A tab-separated text file ending in .txt - the default file name is something like segments_tabular.txt (example here). The BRATNextGen software is available here.

Format:

  • The first line must be LIST OF FOREIGN GENOMIC SEGMENTS:
  • The second line (the header) is not used
  • Subsequent lines have 6 fields corresponding to (1) block start co-ordinate (integer), (2) block end co-ordinate (integer), (3) origin cluster (integer), (4) home cluster (integer), (5) not used, (6) taxon name (string).
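The layout above translates into a short reader. The sketch below follows the three rules as stated here and nothing more: the function name, the header text on the second line and the data values are all illustrative assumptions, and the fields are assumed to be separated by tabs.

```python
FIRST_LINE = "LIST OF FOREIGN GENOMIC SEGMENTS:"


def parse_bratnextgen(text):
    """Return a list of recombination blocks from BRATNextGen tabular text."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != FIRST_LINE:
        raise ValueError("first line must be " + repr(FIRST_LINE))
    blocks = []
    # lines[1] is the header, which is not used.
    for line in lines[2:]:
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 6:
            raise ValueError("expected 6 fields, got %d" % len(fields))
        blocks.append({
            "start": int(fields[0]),
            "end": int(fields[1]),
            "origin_cluster": int(fields[2]),
            "home_cluster": int(fields[3]),
            # fields[4] is not used
            "taxon": fields[5],
        })
    return blocks


# Hypothetical file contents; the second line is only a placeholder header.
example_text = (
    "LIST OF FOREIGN GENOMIC SEGMENTS:\n"
    "Start\tEnd\tOrigin\tHomeCluster\tBAPSIndex\tStrainName\n"
    "1200\t3400\t2\t1\t1\ttaxon_A\n"
    "5000\t5600\t3\t1\t1\ttaxon_B\n"
)
segments = parse_bratnextgen(example_text)
print(segments)
```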

ROARY pan genome

The output file gene_presence_absence.csv is used and this contributes both the annotation data and the block data (example file). The ROARY software is available here.

This CSV file is often huge and can cause browsers to crash. There is a simple Python script here which minimises this file. Very large files also seem to crash the SVG export - if yours is this big then consider a screenshot instead!

Manhattan plots

GWAS results are in PLINK format, i.e. a tab-delimited file with a header line similar to #CHR SNP BP minLOG10(P) log10(p) r^2

  • The 3rd column is used as the genome co-ordinate
  • For seer output, where each result is a k-mer rather than a single base, the 3rd column should be a range of the form x1..x2, e.g. 140..160
  • The 5th column - r^2 - contributes the colour.
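The two forms of the 3rd column (a single base, or an x1..x2 range for k-mers) can be handled with one small helper. This is an illustrative sketch only; the function name is hypothetical and it returns a (start, end) pair in both cases so that single bases and k-mers can be treated alike.

```python
def parse_position(field):
    """Convert a 3rd-column value into a (start, end) pair of integers."""
    field = field.strip()
    if ".." in field:
        # k-mer range such as 140..160
        start, end = field.split("..", 1)
        return int(start), int(end)
    # Single base: start and end are the same co-ordinate.
    position = int(field)
    return position, position


print(parse_position("140..160"))
print(parse_position("1500"))
```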