sub-optimal performance on fungal genome #76
Comments
Your config looks OK at first glance. To get a feeling of what went wrong I would start by looking at the reference annotation vs. the Companion annotation of the reference sequence in Artemis. Is there a pattern to the gene mismatches? E.g. are real genes miscalled as pseudogenes? Are they missing completely? Are the predictions too short compared to the real genes or missing exons? It might be useful to look at the intermediate GFFs created by RATT, AUGUSTUS, LAST etc. and check which one of the results is picked for a locus, e.g. by looking at them in Artemis etc. For a legitimate gene, there are usually multiple sources of evidence at the same locus:
The integration step tries to pick the 'best' explanation for a locus from all of these by comparing length, reading frame consistency, source etc. (unfortunately, not all of these criteria are exposed in the config file). It looks like the wrong source is picked, or the right one is left out, for a substantial number of genes. I would probably also open all of these evidence tracks in Artemis to try to figure out what happened there; maybe one of the sources is wrong but favoured too much. It might also help to disable RATT and/or Exonerate in an attempt to isolate the culprit. There are also quite a few heuristics in Companion that flag genes with slightly "weird" intron/exon structures (missing splice site motifs, ...) as pseudogenes late in the pipeline. This might also be the reason for so many pseudogenes.
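As a complement to inspecting the tracks in Artemis, a quick tally of feature types per evidence source can show whether one source dominates the pseudogene calls. The following is a minimal sketch, not part of Companion: it assumes the GFF3 files follow the standard nine-column layout, with the producing tool named in column 2 and the feature type in column 3. The function name `count_features_by_source` is hypothetical.

```python
from collections import Counter


def count_features_by_source(gff3_lines):
    """Count (source, type) pairs in an iterable of GFF3 lines.

    Assumption: standard 9-column GFF3, source in column 2, type in column 3.
    """
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines rather than guessing
        counts[(cols[1], cols[2])] += 1
    return counts
```

Opening an intermediate or final GFF3 file and passing the file handle to this function gives a table that can be compared between runs, for example with and without RATT. The actual source labels in column 2 depend on how each tool writes its output, so they should be checked in the files themselves.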
That's great, thanks for the prompt response. I will take the steps you recommend and get back with any progress I make after the Vietnamese Tet holiday.
So, I think it is a problem with the version of the annotation being used by Companion. For example, in the annotation.gff3 in I downloaded the same reference genome from NCBI, here, and the CNAG_07303 gene still starts at 5928, but the first CDS starts at 6209, which gives a much more sensible-looking CDS. I'm not sure what happened; I guess there must have been an erroneous version of this annotation in RefSeq, but I'm sure it isn't helping Companion! Is there an easy way to swap in the current, correct annotation?
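To compare two versions of an annotation systematically, rather than gene by gene, one option is to compute for every gene the distance between the gene boundary and its first CDS. The sketch below is an illustration only: it assumes standard GFF3 with `ID` and `Parent` attributes linking CDS features to a `gene` feature (directly or via an mRNA), and the function names are hypothetical.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for part in field.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def first_cds_offsets(gff3_lines):
    """Map gene ID -> distance from the gene's 5' boundary to its first CDS."""
    genes = {}      # gene ID -> (start, end, strand)
    parent_of = {}  # non-CDS feature ID (e.g. mRNA) -> parent ID
    cds = []        # (parent ID, start, end)
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        ftype, start, end, strand = cols[2], int(cols[3]), int(cols[4]), cols[6]
        attrs = parse_attrs(cols[8])
        fid, parent = attrs.get("ID"), attrs.get("Parent")
        if ftype == "gene" and fid:
            genes[fid] = (start, end, strand)
        elif ftype == "CDS" and parent:
            cds.append((parent.split(",")[0], start, end))
        elif fid and parent:
            parent_of[fid] = parent.split(",")[0]
    offsets = {}
    for parent, start, end in cds:
        gene_id, seen = parent, set()
        # Walk up the Parent chain (CDS -> mRNA -> gene) until a gene is found.
        while gene_id not in genes and gene_id in parent_of and gene_id not in seen:
            seen.add(gene_id)
            gene_id = parent_of[gene_id]
        if gene_id not in genes:
            continue
        g_start, g_end, strand = genes[gene_id]
        # On the minus strand the first CDS is the one closest to the gene end.
        offset = g_end - end if strand == "-" else start - g_start
        if gene_id not in offsets or offset < offsets[gene_id]:
            offsets[gene_id] = offset
    return offsets
```

Running this on both annotation versions and listing the genes whose offsets differ would show how widespread the discrepancy is. With the coordinates quoted above (gene start 5928, first CDS start 6209 on the forward strand), the offset would be 281.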
Any idea of while running /home/xin/.nextflow/assets/sanger-pathogens/companion/bin/update_references.lua
@xinliu005 Please open a separate issue here: https://github.com/sanger-pathogens/companion/issues/new
Hello,
Not sure who is monitoring these issues now that @satta has left Sanger, but I wasn't sure where else to send it.
Sascha showed me how to run Companion on the command line using a FungiDB reference. I have attached the config I used.
crypto.config.txt
It took me some time to get around to quality-checking the annotation, and there seems to be a bit of a problem. I ran a couple of my own samples through and there were a large number of pseudo-genes (3372 of 5237 in total without Exonerate, 2727 of 5514 with Exonerate). These were not just slightly miscalled as pseudo-genes: when I took the protein output of Companion and BLASTed it against the reference proteome, only 3515 of 5514 proteins had 60% reciprocal coverage (i.e. 60% of the query protein was covered by a hit which covered 60% of the reference protein).
As another quality check, I ran the reference genome FASTA through Companion; it is identical to the sequence in FungiDB and should yield a very similar annotation. There were 2693 pseudo-genes, and only 3811 of 5681 proteins had 60% reciprocal coverage vs. the reference proteome.
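For anyone wanting to reproduce the reciprocal coverage check, here is a minimal sketch of one way to compute it. It is not the exact script used above. It assumes BLAST tabular output with a custom column order, `qseqid sseqid qstart qend qlen sstart send slen` (all standard BLAST+ format specifiers), and it measures coverage per single hit line rather than merging multiple HSPs. The function name `reciprocal_coverage_hits` is hypothetical.

```python
def reciprocal_coverage_hits(blast_lines, min_cov=0.6):
    """Return the set of query IDs with at least one hit that covers
    >= min_cov of the query AND >= min_cov of the subject.

    Assumed tab-separated columns (an assumption, not a BLAST default):
    qseqid sseqid qstart qend qlen sstart send slen
    """
    passing = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # skip lines that do not match the assumed layout
        qseqid = fields[0]
        qstart, qend, qlen, sstart, send, slen = (int(x) for x in fields[2:8])
        # Fraction of each sequence spanned by this single alignment.
        q_cov = (abs(qend - qstart) + 1) / qlen
        s_cov = (abs(send - sstart) + 1) / slen
        if q_cov >= min_cov and s_cov >= min_cov:
            passing.add(qseqid)
    return passing
```

The size of the returned set, divided by the number of predicted proteins, gives the kind of fraction quoted above. Because HSPs are not merged, a protein split across several partial alignments would be counted as failing, which makes this a conservative estimate.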
I was just wondering if I could get some pointers as to where to start debugging. Perhaps in the RATT parameters? The fact that lots of pseudo-genes are still called, even given an identical reference genome, suggests the transfer is not working well.
Best,
Phil