VEP112 predicts "inframe_insertion, stop_retained_variant" in cases where previously was predicted as "frameshift_variant, stop_gained" #1710

Open
jperales opened this issue Jun 27, 2024 · 8 comments
@jperales

jperales commented Jun 27, 2024

I noticed that, for certain insertions (rare cases), VEP 112 predicts a consequence nearly opposite to the one predicted by previous versions. It predicts inframe_insertion,stop_retained_variant where frameshift_variant,stop_gained was predicted before. Notably, these insertions fall in exons of protein-coding transcripts and their length is not divisible by 3, so I would expect a frameshift. Moreover, I doubt that stop_retained_variant makes sense in this region: I have seen this happen near splice donor sites of the first exons of protein-coding transcripts, where I would not expect a stop codon. See below for one example case. Thank you!

Example case:

Variant: 3:56591278-56591278 T>TGGGGTAAGCA. This is a 10-bp insertion in the CCDC66 gene. Let's focus on the canonical transcript ENST00000394672, where the variant affects the end of the first exon, almost at the splice site.

VEP command line for VEP 111 & its output

$vep --no_stats -id "3 56591278 . T TGGGGTAAGCA . . ." -o "STDOUT" --tab --assembly GRCh37 --symbol --numbers --cache --offline | grep 'ENST00000394672'
3_56591279_-/GGGGTAAGCA 3:56591278-56591279     GGGGTAAGCA      ENSG00000180376 ENST00000394672 Transcript      stop_gained,frameshift_variant  78-79   8-9     3       L/LG*AX ttg/ttGGGGTAAGCAg       -       HIGH    -       1       -       CCDC66  HGNC    27709   1/18    -

VEP command line for VEP 112 & its output

$vep --no_stats -id "3 56591278 . T TGGGGTAAGCA . . ." -o "STDOUT" --tab --assembly GRCh37 --symbol --numbers --cache --offline | grep 'ENST00000394672'
3_56591279_-/GGGGTAAGCA 3:56591278-56591279     GGGGTAAGCA      ENSG00000180376 ENST00000394672 Transcript      inframe_insertion,stop_retained_variant 78-79       8-9     3       L/LG*AX ttg/ttGGGGTAAGCAg       -       MODERATE        -       1       -       CCDC66  HGNC    27709   1/18    -
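
The difference between the two runs can also be extracted programmatically. The following is a minimal sketch, assuming the default `--tab` column order in which Consequence is the seventh field; the helper name `consequences` is hypothetical, and the lines are split on any whitespace because the spacing shown above may not preserve the original tabs. Only the first eleven columns of each output line are reproduced.

```python
def consequences(vep_tab_line: str) -> set[str]:
    """Return the consequence terms from one VEP --tab output line.

    Assumes Consequence is the 7th whitespace-separated field.
    """
    fields = vep_tab_line.split()
    return set(fields[6].split(","))


# First columns of the two output lines reported above.
vep111 = ("3_56591279_-/GGGGTAAGCA 3:56591278-56591279 GGGGTAAGCA "
          "ENSG00000180376 ENST00000394672 Transcript "
          "stop_gained,frameshift_variant 78-79 8-9 3 L/LG*AX")
vep112 = ("3_56591279_-/GGGGTAAGCA 3:56591278-56591279 GGGGTAAGCA "
          "ENSG00000180376 ENST00000394672 Transcript "
          "inframe_insertion,stop_retained_variant 78-79 8-9 3 L/LG*AX")

# Terms reported only by VEP 111, and terms reported only by VEP 112.
only_111 = sorted(consequences(vep111) - consequences(vep112))
only_112 = sorted(consequences(vep112) - consequences(vep111))
print("only in 111:", only_111)
print("only in 112:", only_112)
```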

System

  • VEP version: 112 vs 111
  • VEP Cache version: 112 GRCh37 & 111 GRCh37
  • Perl version: v5.34.0
  • OS: Ubuntu
  • tabix installed yes

Full error message

None

Data files (if applicable)

They include:

  • The input file : id in command line
  • The output file: output shown above
  • The custom file(s) : none
    NOTE: Beyond that example, I can provide a more complete list of cases found upon request
@nuno-agostinho
Contributor

Hi @jperales,

Thanks for reporting this case. What do you think the expected consequences would be for this example?

If you have more cases like this, please send them to us so we can take a look and see how they behave.

Cheers,
Nuno

@nuno-agostinho nuno-agostinho self-assigned this Jun 27, 2024
@jperales
Author

jperales commented Jun 27, 2024

Regarding the example above, it looks like the expected consequence would be inframe_insertion, stop_gained. stop_gained is the most relevant consequence, and it was correctly predicted by previous versions of VEP (tested with VEP 104 & 111). The insertion creates a stop codon within the inserted sequence, so the variant would be an inframe_insertion: the reading frame is kept and translation stops prematurely, before the shift takes effect. VEP 112 predicts the inframe_insertion better. Thus the issue is mainly with the annotation of the stop.

The reference CCDS sequence data for that transcript would be (source: https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS46852 , 2847 nt , 948 aa):

# Reference. 1st line: nucleotides; 2nd line: translation into amino acids
atg aac ttg gga gat ggt tta aag ctt [...]
 M   N   L   G   D   G   L   K   L [...]

Inserting the variant into the transcript changes the sequence as follows. Please note that it shifts the downstream reading frame and introduces a premature stop codon TAA (denoted as '-'):

# Variant 'chr3_56591279_-/GGGGTAAGCA'
atg aac ttg ggg taa gca ggg aga tgg ttt aaa gct t[...]
 M   N   L   G   -   A   G   R   W   F   K   A[...]
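
The translation above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard genetic code and that the insertion falls between CDS positions 8 and 9, as reported in the CDS_position column of the VEP output; the function name `translate` and the variable names are illustrative, and only the first 27 nucleotides of the CCDS are used.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate complete codons only; a stop codon is written as '*'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


ref_cds = "atgaacttgggagatggtttaaagctt"  # first 27 nt of the reference CDS
inserted = "GGGGTAAGCA"                  # 10-bp insertion
alt_cds = ref_cds[:8] + inserted + ref_cds[8:]  # insert after CDS position 8

print(len(inserted) % 3)                 # non-zero: length not divisible by 3
print(translate(ref_cds))                # reference peptide
print(translate(alt_cds))                # variant peptide, '*' marks the stop
print(translate(alt_cds).split("*")[0])  # peptide up to the premature stop
```

The stop codon TAA lies entirely within the inserted bases, which is why the peptide ends at the fifth codon regardless of the downstream shift.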

I have found only one other example affected by this issue: 19 52888074 . G GATCATGAGGTCAGGAGATCGAGACCATCCTGGCTAACAAGGTGAAACCC . . .

Thank you very much for the efforts on this and your great work!
Best,
Javier

@nuno-agostinho
Contributor

Hi Javier,

Thank you for sending those examples. :)

I will go through them with my team and see how we can improve VEP based on them.

Cheers,
Nuno

@jperales
Author

Hi @nuno-agostinho ,
Just to let you know that I have edited my previous message after discarding most candidates from the list and identifying only two variants affected by this issue.

Moreover, I realized that the issue is only with the consequence term stop_retained_variant in VEP 112. It is actually a stop_gained, as previous versions of VEP correctly predicted. You can see in the output from my first message that both VEP 111 and VEP 112 annotate Amino_acids (change) = L>LG*AX at Protein_position 3. Thus VEP knows there is a new stop codon (at the beginning of the protein). However, VEP 112 does something different and considers it a stop_retained_variant. Finally, I think the second consequence term, inframe_insertion (from VEP 112), is more appropriate than the frameshift_variant previously predicted by VEP 111, because the variant terminates the reading frame prematurely within the insertion, before the shift affects the 3' downstream sequence.

Please note that the correct prediction of a stop_gained consequence is very important. A stop_gained has one of the most severe impacts on a protein-coding gene, while a stop_retained_variant has low severity.

Thank you very much!
Best,
Javier

@jperales
Author

Hi @nuno-agostinho ,
Is there any update on this issue? Hopefully you have evaluated it, so that we know whether or not your team considers it a bug. Thanks!

Best,
Javier

@nuno-agostinho
Contributor

nuno-agostinho commented Aug 21, 2024

Hi @jperales,

The issue is indeed a bug and occurs when the first amino acid is identical between the reference and alternative peptide sequences (and there is a stop insertion in the alternative sequence, *), as tested by VariationEffect.pm#L1340.
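
As an illustration of the behaviour expected here, the stop-related term can be derived from whether each peptide in the Amino_acids column contains a stop (`*`). This is a simplified sketch and not VEP's implementation (which is written in Perl); the function `stop_consequence` is hypothetical and ignores the positional checks a real predictor needs.

```python
from typing import Optional


def stop_consequence(ref_pep: str, alt_pep: str) -> Optional[str]:
    """Classify the stop-related consequence from ref/alt peptide strings.

    Simplified illustration only: it looks at the presence of '*' and
    nothing else, so a shared first residue does not affect the result.
    """
    ref_has_stop = "*" in ref_pep
    alt_has_stop = "*" in alt_pep
    if alt_has_stop and not ref_has_stop:
        return "stop_gained"            # new stop introduced by the variant
    if ref_has_stop and alt_has_stop:
        return "stop_retained_variant"  # a stop was already present
    if ref_has_stop and not alt_has_stop:
        return "stop_lost"              # the original stop is removed
    return None                         # no stop codon involved


# Amino_acids column from the report: L/LG*AX (first residue identical).
print(stop_consequence("L", "LG*AX"))
```

Under this simplified rule, the reported case is a stop_gained: the reference peptide `L` contains no stop, so there is nothing to retain.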

We aim to have a bug fix in place for the next version of VEP (113). Sorry for the inconvenience.

Best,
Nuno

@jperales
Author

Hi @olaaustine ,
Could you please confirm whether the bug fix for this issue is included in VEP 113? I have read that the next release is coming soon. Thanks

Best,
Javier

@olaaustine
Contributor

Hi @jperales,
I hope this finds you well.
Thank you very much for your patience.
We are still working on this bug and will update this ticket when the issue has been resolved.
Thank you
Ola.
