Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stopgap fix for Stockholm format alignment export #1050

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

twalsh-ebi
Copy link
Contributor

@twalsh-ebi twalsh-ebi commented Oct 21, 2024

Description

Because of a bug in the Stockholm alignment writer in BioPerl 1.6.x, homologue alignment downloads in Stockholm format have sequence identifiers that run on into the accession field:

# STOCKHOLM 1.0
#=GF ID NoName
#=GS ENSMUSP00000089009_Mmus/1-3623AC unknown
#=GS ENSMUSP00000087102_Mmus/1-435AC unknown
#=GS ENSMUSP00000152236_Mmus/1-540AC unknown
#=GS ENSMUSP00000099398_Mmus/1-1385AC unknown

This prevents the alignment from being loaded into (for example) Biopython, making it more difficult to process downloaded Stockholm files automatically.

This bug was fixed in bioperl-live PR 122, so code incorporating a bugfix is available in the event of an upgrade to BioPerl 1.7.0+.

In the meantime, this PR corrects the issue by fixing the generated Stockholm alignment string. As an illustration:

# STOCKHOLM 1.0
#=GF ID NoName
#=GS ENSMUSP00000089009_Mmus/1-3623 AC unknown
#=GS ENSMUSP00000087102_Mmus/1-435 AC unknown
#=GS ENSMUSP00000152236_Mmus/1-540 AC unknown
#=GS ENSMUSP00000099398_Mmus/1-1385 AC unknown

This makes it easier for Compara team members to test — and Ensembl users to work with — downloaded Stockholm files.

Views affected

This affects export of Stockholm format alignments of homologues.

To see the effect of this PR, try previewing or downloading Stockholm alignments for the following two examples:

Possible complications

As these changes only have an effect on the GS lines of Stockholm-format alignments in which the sequence identifier runs on into the accession field, no complications are expected.

Merge conflicts

None detected.

Related JIRA Issues (EBI developers only)

N/A

my @export_lines = split(/\n/, $export);
foreach my $i (0 .. $#export_lines) {
if (substr($export_lines[$i], 0, 4) eq '#=GS') {
$export_lines[$i] =~ s/^(#=GS \S+)(AC unknown)$/$1 $2/;
Copy link
Contributor

@azangru azangru Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jyothishnt is wondering if this is always "AC unknown"? In the PR to biomart-perl, AC is hard-coded, but "unknown" is not.

Copy link
Contributor Author

@twalsh-ebi twalsh-ebi Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jyothishnt and @azangru.. With the current Ensembl codebase, I believe this suffix is indeed always "AC unknown" in practice.

The $acc is set from the $seq->accession_number in Bio::AlignIO::stockholm, and a sequence accession_number is not set anywhere in the Compara codebase, though it is accessed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants