Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can ML Tags be zero length arrays? #728

Closed
gileshall opened this issue Jun 9, 2023 · 5 comments
Closed

Can ML Tags be zero length arrays? #728

gileshall opened this issue Jun 9, 2023 · 5 comments

Comments

@gileshall
Copy link

Currently, htsjdk will throw an exception if the ML record is present but empty. This post on the GATK forum documents an example of this exception in the wild. However, after reading section 1.7 on Modified Bases in the SAMtags spec, it is unclear to me if htsjdk is correctly interpreting this constraint. For example, is this arrangement of records out of spec?

< ... snip ...> MM:Z:C+m?; ML:B:C < ... snip ...>

To quote section 1.7:

Note it is permitted for the coordinate list to be empty (for example ’MM:Z:C+m;’), which may be used as an explicit indicator that this base modification is not present.

and

The optional ML tag lists the probability of each modification listed in the MM tag being correct, in the order that they occur.

If I understand this correctly, the expected count of ML entries is driven by the number of list of modifications mentioned by MM.
However, since MM:Z:C+m; explicitly indicates no modifications occurred, there are no events for ML to annotate.

I think there are two possible but mutually exclusive interpretations to this use case, but the question is which one?

  1. Empty ML arrays are legal, and htsjdk is acting too stringently
  2. Empty ML arrays are illegal, and should only be populated in records when the array is not empty
@jmarshall
Copy link
Member

jmarshall commented Jun 10, 2023

The exception encountered (Empty array passed to setUnsignedArrayAttribute for tag [tagname]) is not specific to ML but would be triggered for any tagged field. Zero-length arrays are explicitly valid in general since 2018 (see #135 and #326), and HTSJDK (and hence Picard and GATK) was adjusted accordingly in samtools/htsjdk#1194. Unfortunately setUnsignedArrayAttribute() was overlooked in that pull request, but it's clear that its check should have been removed too and this exception being thrown is an HTSJDK bug.

So much for the proximate cause. However there may be HTSJDK base modification code that would have been run for ML in particular but did not execute as this exception was thrown first, that might not handle a zero-length array; i.e., there may be additional reasons why HTSJDK might not handle a zero-length ML field — so we still need an answer to whether an empty ML array is intended to be valid.

In my opinion, such an ML value is very much valid and HTSJDK ought to accept it without issue.

SAMtags.pdf §1.7 also says

[ML] uses a byte array of type ‘C’ with the number of elements matching the summation of the number of modifications listed as being present in the MM tag

This is quite explicit that for a record that contains MM:Z:C+m; (which ¶8 of the MM description explicitly says is valid), that MM value lists a total of zero modifications, so the corresponding ML array has zero elements.

Practically speaking, it would be silly to require methylation calling software to include code to check whether their MM tag being emitted lists no modifications and suppress emitting an ML field accordingly. Far better to regularise the corner case of a correspondingly empty ML value.

So the correct interpretation is your (1), and this is simply an HTSJDK bug.

One commenter on the GATK discussion considered that it is unclear whether the SAM spec allows an empty ML value or not. We could add explicit text noting that an empty ML:B:C array would correspond to the largely empty MM:Z:C+m; value that is explicitly discussed. However personally I don't think that's necessary, as it is fairly clear (from the equality in the paragraph I quoted above) — easily derivable albeit not explicit — and so the only reason the GATK commenter felt that there was any doubt was that they were misled by the faulty HTSJDK / GATK behaviour.

@jmarshall
Copy link
Member

jmarshall commented Jun 13, 2023

PR samtools/htsjdk#1669 addresses the immediate issue, namely the incorrectly raised exception that has been encountered when using HTSJDK.

@jkbonfield
Copy link
Contributor

Fully agree with @jmarshall's comment.

I don't really see the need to be explicit here. Zero length arrays are legal, and we don't need to mention it for ML as it may lead people to the wrong conclusion that permitting zero-length here is an exception rather than the norm.

@jkbonfield jkbonfield moved this from New items to Progressing in GA4GH File Formats Jul 17, 2023
@jkbonfield jkbonfield moved this from Progressing to New items in GA4GH File Formats Jul 17, 2023
@jkbonfield
Copy link
Contributor

We are closing this as we do not think there is merit in singling out ML tags for additional clarification in the specification, but thank you for raising the issue. This is an bug in a specific implementation, for which a fix has already been provided and will be merged in due course.

@jmarshall
Copy link
Member

The HTSJDK fix has now been released in HTSJDK 4.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

4 participants