-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CRAMReferenceRegion updating. #1708
Conversation
d769bce
to
2d5b20f
Compare
bf67f68
to
969bdf1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Back to you with my review comments
if (newSequenceRecord == null) { | ||
throw new IllegalArgumentException( | ||
String.format("Requested reference sequence index %d not found", referenceIndex)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you call the existing getSAMSequenceRecord()
method here? It does exactly the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, done.
if ((referenceIndex != this.referenceIndex) || | ||
regionStart != 0 || | ||
(regionLength < referenceBases.length)) { | ||
(regionLength != newSequenceRecord.getSequenceLength())) { | ||
setCurrentSequence(referenceIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment that the setCurrentSequence()
call needs to happen first in this if
block (in particular, before the assignment to regionLength
below). Or alternatively, you could have regionLength = newSequenceRecord.getSequenceLength();
below to eliminate the dependency on order of operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latter is done.
{ new File(TEST_DATA_DIR, "CEUTrio.HiSeq.WGS.b37.NA12878.20.21.10m-10m100.cram"), | ||
new File("src/test/resources/htsjdk/samtools/reference/human_g1k_v37.20.21.fasta.gz"), | ||
false, false }, | ||
{ new File(TEST_DATA_DIR, "NA12878.unmapped.cram"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add one-line comments explaining the provenance of these three crams, and what they are testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with as much specificity as I have about these, since they've been in the repo for a while.
// these tests use lenient equality to only validate read names, bases and qual scores | ||
{ new File(TEST_DATA_DIR, "mitoAlignmentStartTestGATKGen.cram"), | ||
new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false }, | ||
{ new File(TEST_DATA_DIR, "mitoAlignmentStartTest.cram"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comments explaining what these files are as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with as much specificity as I have about these, since they've been in the repo for a while.
new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false }, | ||
// files created by rewriting the htsjdk test file src/test/resources/htsjdk/samtools/cram/mitoAlignmentStartTest.cram | ||
// using code that replicates the first read (which is aligned to position 1 of the mito contig) either | ||
// 10,000 or 20,000 times, to create a file with 2 or 3 containers, respectively, that have reads aligned to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you confirm via direct inspection of the file that it did in fact create 2 or 3 containers with reads mapped to position 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. You can also see this in the GATK PR, which uses files that were created using the same code that created these files, to test the detector on files with multiple bad containers. Alternatively, you can run GATK PrintFileDiagnostics
on these files and see the container alignments.
assertRoundTripFidelity(cramSourceFile, tempOutCRAM, referenceFile, false); | ||
assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile); | ||
assertRoundTripFidelity(cramSourceFile, tempOutCRAM, referenceFile, lenientEquality, emitDetail); | ||
assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile, lenientEquality, emitDetail); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining why we also do a test with samtools
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its basically testing interoperability with samtools for all of the different encodings, to make sure both tool sets agree on the data. Added a comment saying its for interop testing.
@Test(dataProvider = "roundTripTestFiles") | ||
public final void testRoundTripDefaultEncodingStrategy(final File sourceFile, final File referenceFile) throws IOException { | ||
@Test(dataProvider = "defaultStrategyRoundTripTestFiles") | ||
public final void testRoundTripDefaultEncodingStrategy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment before this test method describing the purpose of the test (since we're not just testing "encoding strategies" here....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is testing the default encoding strategy. Comment added.
} | ||
|
||
@Test(dataProvider = "roundTripTestFiles") | ||
public final void testAllEncodingStrategyCombinations(final File cramSourceFile, final File referenceFile) throws IOException { | ||
@Test(dataProvider = "encodingStrategiesTestFiles") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be value in having this testAllEncodingStrategyCombinations
test use the defaultStrategyRoundTripTestFiles
DataProvider as well, with its many additional test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, but there are currently 81 encoding strategies, so using all of those files would make this test take a really long time, and I would guess, dominate the CI test time, especially now that we have the large CRAM test file in that list. So instead I separated them this way.
@@ -167,6 +168,39 @@ public void testGetReferenceBasesByRegionPositive( | |||
Assert.assertEquals(bases, Arrays.copyOfRange(fullContigBases, requestedOffset, requestedOffset + requestedLength)); | |||
} | |||
|
|||
// simulate the state transitions that occur when writing a CRAM file | |||
@Test | |||
public void testSerialStateTransitions() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here explaining and referencing the bug that prompted us to write this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// now transition back to the full sequence | ||
cramReferenceRegion.fetchReferenceBases(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO); | ||
Assert.assertEquals(cramReferenceRegion.getRegionLength(), fullRegionFragmentLength); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is CRAMReferenceRegion
ever used across multiple threads? Give how the fetch methods update the internal state of the object on every call, it's clearly not safe to use in a parallelized manner. Do we need to make these methods synchronized
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is currently not used anywhere across threads. As you noted, it's not thread-safe. I guess I could add synchronized
, but that still wouldn't make it thread-safe, since the class is is stateful, and the usage pattern is fetch...
followed by one or more getCurrent..
methods. So synchronized won't fix the statefulness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment saying it's not thread-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Back to you with one final comment
if ((referenceIndex != this.referenceIndex) || | ||
regionStart != 0 || | ||
(regionLength < referenceBases.length)) { | ||
(regionLength != newSequenceRecord.getSequenceLength())) { | ||
setCurrentSequence(referenceIndex); | ||
referenceBases = referenceSource.getReferenceBases(sequenceRecord, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a potential order-of-operations issue here: since the call to getReferenceBases()
uses the instance variable sequenceRecord
, the call to setCurrentSequence()
must happen first. If the order of the statements were reversed in some future refactoring, we might have another terrible bug. Either add a comment explaining that setCurrentSequence()
must happen first, or eliminate the order dependency by using newSequenceRecord
in the getReferenceBases()
call.
Fix for broadinstitute/gatk#8768.