Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for 0-length B arrays in SAM files to conform to 1.6 spec #1194

Merged
merged 20 commits into from
Oct 24, 2018

Conversation

lbergelson
Copy link
Member

Making changes to allow 0-length arrays in SAM because this is allowed in SAM 1.6

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@lbergelson
Copy link
Member Author

@yfarjoun Could you take a look at this to make sure I haven't done anything crazy... I removed all the checks for 0 lengths that I could find and the test passes for bam/sam/cram but it's possible there's something else relying on it that I haven't though of.

@codecov-io
Copy link

codecov-io commented Oct 10, 2018

Codecov Report

Merging #1194 into master will increase coverage by 0.302%.
The diff coverage is 88.889%.

@@               Coverage Diff               @@
##              master     #1194       +/-   ##
===============================================
+ Coverage     68.414%   68.715%   +0.302%     
- Complexity      8019      8065       +46     
===============================================
  Files            542       543        +1     
  Lines          32701     32751       +50     
  Branches        5530      5541       +11     
===============================================
+ Hits           22372     22505      +133     
+ Misses          8125      8043       -82     
+ Partials        2204      2203        -1
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/SAMRecord.java 67.678% <ø> (+0.274%) 314 <0> (-1) ⬇️
src/main/java/htsjdk/samtools/TextTagCodec.java 77.922% <100%> (+24.374%) 54 <1> (+21) ⬆️
src/main/java/htsjdk/samtools/SAMLineParser.java 62.36% <50%> (+2.583%) 41 <0> (+3) ⬆️
...sjdk/samtools/util/Md5CalculatingOutputStream.java 66.667% <0%> (-7.692%) 7% <0%> (-1%)
...ava/htsjdk/variant/vcf/VCFStandardHeaderLines.java 75.581% <0%> (-0.475%) 11% <0%> (ø)
...ain/java/htsjdk/samtools/cram/io/CramIntArray.java 88.889% <0%> (ø) 4% <0%> (?)
...rc/main/java/htsjdk/samtools/SamReaderFactory.java 67.464% <0%> (+0.957%) 7% <0%> (ø) ⬇️
...n/java/htsjdk/samtools/cram/structure/ReadTag.java 71.806% <0%> (+3.524%) 65% <0%> (+3%) ⬆️
... and 8 more

Copy link
Contributor

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments

final int length = Array.getLength(value);
if(length == 0){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space after if

@@ -316,6 +326,16 @@ private Object covertStringArrayToObject(final String stringVal) {
}
}

private static Object createEmptyArray(char elementType) {
switch( Character.toLowerCase(elementType) ) {
case 'c': { return new byte[0]; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need { } around case or default body. Also, usually case body statements are formatted across lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reformatted

@@ -316,6 +326,16 @@ private Object covertStringArrayToObject(final String stringVal) {
}
}

private static Object createEmptyArray(char elementType) {
switch( Character.toLowerCase(elementType) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd whitespace here, I would expect to see switch (Character.toLowerCase(elementType)) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/test/java/htsjdk/samtools/SAMRecordUnitTest.java Outdated Show resolved Hide resolved
.setCreateMd5File(false)
.setCreateIndex(false);
final Path reference = IOUtil.getPath("src/test/resources/htsjdk/samtools/one-contig.fasta");
try(final SAMFileWriter samFileWriter = writerFactory.makeWriter(samRecords.getHeader(), false, tmp, reference)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space after try here and on line 1237 below

src/test/java/htsjdk/samtools/SAMRecordUnitTest.java Outdated Show resolved Hide resolved

@DataProvider
public Object[][] getEmptyArrays(){
final String BAM = BamFileIoUtils.BAM_FILE_EXTENSION;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be replaced with streaming operations, e.g.

    public Iterator<Object[]> getEmptyArrays() {
        return Stream.of(BamFileIoUtils.BAM_FILE_EXTENSION, IOUtil.SAM_FILE_EXTENSION, CramIO.CRAM_FILE_EXTENSION)
                .flatMap(ext -> Stream.of(new int[0], new short[0], new byte[0], new float[0])
                        .map(array -> new Object[]{array, array.getClass(), ext})).iterator();
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally, I do not find that to be more clear...less line? yes...but not more clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the sentiment but not this particular stream operation. Instead I've added a new cartesianProduct method which allows you to combine dataproviders in a nice way.

if(length == 0){
return "";
}
final StringBuilder ret = new StringBuilder(Array.get(value, 0).toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be tested with non-zero length.

case 'c': { return new byte[0]; }
case 's': { return new short[0]; }
case 'i': { return new int[0]; }
case 'f': { return new float[0]; }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what to do with F

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not accept F

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not accepting F


final int numberOfTokens = StringUtil.splitConcatenateExcessTokens(stringVal, elementTypeAndValue, ',');
if (!(numberOfTokens == 1 || numberOfTokens == 2)) {
throw new SAMFormatException("Tag of type B requires an element type");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a test-case that will fall into this hole?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't come up with one... removed this...

Assert.assertEquals(record.getAttribute(arrayTag), new char[0]);
record.setAttribute(arrayTag, null);
Assert.assertNull(record.getStringAttribute(arrayTag));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove excess nls

Assert.assertEquals(attribute.getClass(), expectedClass);
Assert.assertEquals(Array.getLength(attribute), 0);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove NLs

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for doing this.

@lbergelson lbergelson added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label Oct 15, 2018
Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the suggestions are a nice addition.

lbergelson and others added 9 commits October 17, 2018 14:25
* adding explicit test dependency on a recent version of guava, we
already have a transitive test dependency on an older version so this shouldn't be
a big deal (unless of course it breaks something unexpected)
* adding new method for combining dataproviders to TestNGUtils
@lbergelson
Copy link
Member Author

@yfarjoun Could you take a further quick look when you get a chance, I made some additional changes to things including changing our test dependencies slightly and added additional tests.

The way that arrays and particularly unsigned arrays are retrofitted into the tag encoder is really gross and I want to replace it.

src/main/java/htsjdk/samtools/TextTagCodec.java Outdated Show resolved Hide resolved
src/main/java/htsjdk/samtools/TextTagCodec.java Outdated Show resolved Hide resolved
src/main/java/htsjdk/samtools/TextTagCodec.java Outdated Show resolved Hide resolved
}

@Test(dataProvider = "getArraysAndLists")
public void testObjectsToLists(Object[][] objects, List<List<Object>> lists){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many cases here where the code is missing a space before a {, and missing a space after a comma above (e.g. line 36). It might be easier to reformat the file completely in the IDE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done... I'm not sure it helps much though, this is a gross block of cases

/**
* @return the cartesian product of two or more DataProviders than can be used as a new dataprovider
*/
public static Object[][] cartesianProduct(Object[][] ... dataProviders){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to write this and preserve types (e.g. avoid SuppressWarnings). I'm pretty sure it can also be written using generics (vs Object) but didn't have a chance to finish that.

    public static Object[][] cartesianProduct(Object[][]... dataProviders) {
        List<List<List<Object>>> lists = Arrays.stream(dataProviders)
                .map(Main::nestedArraysToNestedLists)
                .collect(Collectors.toList());
        final List<List<List<Object>>> product = Lists.cartesianProduct(lists);
        final List<List<Object>> mergeProduct = product.stream()
                .map(list -> {
                    List<Object> result = new ArrayList<>();
                    list.forEach(result::addAll);
                    return result;
                }).collect(Collectors.toList());
        return nestedListsToNestedArrays(mergeProduct);
    }

    /**
     * @param dataProvider a nested Object array
     * @return an equivalent nested List
     */
    public static<T> List<List<T>> nestedArraysToNestedLists(T[][] dataProvider) {
        return Arrays.stream(dataProvider)
                .map(Arrays::asList)
                .collect(Collectors.toList());
    }
    /**
     * @param lists a nested List
     * @return an equivalent nested array
     */
    public static Object[][] nestedListsToNestedArrays(List<List<Object>> lists) {
        return lists.stream().map(List::toArray).toArray(Object[][]::new);
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think stream().map(... list.forEach().. return) is really flatMap. something like

        final List<List<Object>> mergeProduct = product.stream().flatMap(list -> list.stream()).collect(Collectors.toList());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 👍 to the first comment, thats better.
  • I don't think this can be done with generics in any sane way, since the It's an Object[][] of totally unrelated types. There's no shared type other than Object. Also, it's what dataproviders expect.
  • That flatMap flattens the wrong level of list, this works but I'm not sure it's any clearer
  final List<List<List<Object>>> product = Lists.cartesianProduct(lists);
        final List<List<Object>> mergeProduct = product.stream()
                .map( l -> l.stream()
                        .flatMap(Collection::stream)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not sure much can be done here to make the code clearer. Regarding generics you can definitely make cartesianProduct generic. I ran into trouble with nestedListsToNestedArrays though. But as you say, the caller is always passing Object so not much is gained there.

@lbergelson
Copy link
Member Author

@yfarjoun Did you have any final comments on this one? I'd like to get it in soon.

final int length = Array.getLength(value);
for (int i = 1; i < length; ++i) {
final StringBuilder ret = new StringBuilder();
for (int i = 0; i < length; ++i) {
ret.append(',');
ret.append(Array.get(value, i).toString());
}
return ret.toString();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove nl

if (StringUtil.splitConcatenateExcessTokens(stringVal, elementTypeAndValue, ',') != 2) {
throw new SAMFormatException("Tag of type B should have an element type followed by comma");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove NL

final char elementType = elementTypeAndValue[0].charAt(0);
if (numberOfTokens == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be removed and fall through with length==0?

public void testEmptyArrayAttributeHasNoCommaWhenWrittenToSAM(){
final SAMFileHeader header = new SAMFileHeader();
final SAMRecord record = new SAMRecord(header);
record.setAttribute("xa", new int[0]);
Copy link
Contributor

@yfarjoun yfarjoun Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARRAY_TAG?

? textTagCodec.encode(tagName, array)
: textTagCodec.encodeUnsignedArray(tagName, array);
Assert.assertEquals(encodedTag, expectedTag);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra NL

@@ -57,4 +64,6 @@ public static Path getTribbleFileInJimfs(String vcf, String index, FileSystem fi
}
return Files.copy(vcfPath, vcfDestination);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra NLs

@DataProvider
public Object[][] getArraysAndLists() {
return new Object[][]{
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add empty set,
add actual products of more than 2 lists (i.e. with more than one element each)
add a prodcut with an empty set
add a product with a empty set object.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for keeping us compliant!

List<List<List<Object>>> lists = Arrays.stream(dataProviders)
.map(TestNGUtils::nestedArraysToNestedLists)
.collect(Collectors.toList());
final List<List<List<Object>>> product = Lists.cartesianProduct(lists);
Copy link
Contributor

@pshapiro4broad pshapiro4broad Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, didn't realize this required adding a guava dependency. It seems like there should be a way to do this using jdk 8 streams directly, by using map(...flatMap()) I can see how to do this for 2, 3, etc, but not sure how to do it for N.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not actually adding a guava dependency. We have a it as a transient test dependency already through jimfs. I'm just making it an explicit dependency and also bumping to the most recent version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it's still a test dependency.

@lbergelson
Copy link
Member Author

@yfarjoun Any further comments? I'd like to get this merged soon.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. thanks!

src/test/java/htsjdk/utils/TestNGUtils.java Outdated Show resolved Hide resolved
src/test/java/htsjdk/utils/TestNGUtilsTest.java Outdated Show resolved Hide resolved
@lbergelson
Copy link
Member Author

@yfarjoun Some day I'll write a commit without a typo... but today is not that day.

@lbergelson lbergelson merged commit 44baddf into master Oct 24, 2018
@lbergelson lbergelson deleted the lb_binary_array_0_len branch October 24, 2018 21:17
@jmarshall
Copy link
Member

jmarshall commented Jun 10, 2023

I removed all the checks for 0 lengths that I could find and the test passes for bam/sam/cram but it's possible there's something else

Unfortunately there was (at least) one 0-length check that was missed, in setUnsignedArrayAttribute(). This is the proximate cause of samtools/hts-specs#728 and this GATK forum posting linked from there.

Now addressed by PR #1669.

jmarshall added a commit to jmarshall/htsjdk that referenced this pull request Jun 13, 2023
This function was omitted from PR samtools#1194. Zero-length 'B' arrays
are valid, so remove this check too and add a test case.
@gileshall gileshall mentioned this pull request Aug 3, 2023
lbergelson pushed a commit that referenced this pull request Aug 3, 2023
* Remove setUnsignedArrayAttribute zero-length-array check

This function was omitted from PR #1194. Zero-length 'B' arrays
are valid, so remove this check too and add a test case.

* Remove Slice setAttribute zero-length-array check

Slices' optional tags follow the rules for BAM tagged fields, so
zero-length 'B' arrays are valid. Slice tags are currently unused,
so the question is somewhat moot and there are no test cases.

* removed unused setAttribute methods

---------

Co-authored-by: John Marshall <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for Review This PR is waiting for a reviewer to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants