-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Name Tokenization Codec (Update CRAM Codecs to CRAM 3.1) #1663
base: master
Are you sure you want to change the base?
Add Name Tokenization Codec (Update CRAM Codecs to CRAM 3.1) #1663
Conversation
…dded in the streams).
…ift methods for RANS Nx16 Order 0 and Order 1, RANS Nx16 Order 0 and Order 1 with format flags = 1 works as expected when N=4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats it for this round. When these comments are addressed I'll do another round.
src/main/java/htsjdk/samtools/cram/compression/nametokenisation/NameTokenisationEncode.java
Outdated
Show resolved
Hide resolved
|
||
case TOKEN_DELTA0: | ||
tokenStream.get(TOKEN_DELTA0).getByteBuffer().put((byte)Integer.parseInt(encodeToken.getRelativeTokenValue())); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about having a default case as elsewhere: there should be a multi-value case statement here with all of the remaining type values, with a break
stmt, and a comment saying they are deliberately dropped, followed by a default with a throw.
inputByteBuffer.get(dataBytes, 0, clen); // offset in the dst byte array | ||
final ByteBuffer uncompressedDataByteBuffer; | ||
if (useArith != 0) { | ||
RangeDecode rangeDecode = new RangeDecode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely suggest caching just one of these (probably in TokenStreams) so we don't have to recreate these for each stream, and then just reset them for each use.
uncompressedDataByteBuffer = rangeDecode.uncompress(ByteBuffer.wrap(dataBytes)); | ||
|
||
} else { | ||
RANSDecode ransdecode = new RANSNx16Decode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here - this needs to be cached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we do the integration of these with the CRAM reader/writer, we may even want to cache/reuse these at an even higher level, so that we don't have to create these for each slice. But for now caching them in TokenStreams
will be a big win in terms of reducing allocation/GC.
final ByteBuffer inBuffer, | ||
final String separator) { | ||
inBuffer.order(ByteOrder.LITTLE_ENDIAN); | ||
final int uncompressedLength = inBuffer.getInt() & 0xFFFFFFFF; //unused variable. Following the spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further review, we're really missing the boat here, since we're not taking advantage of this value. I suspect that the uncompressed length is serialized to the stream specifically to allow decode implementations to efficiently allocate memory for it up front, but we're not doing that because we're using a List<List<>> instead of a byte buffer for the output. As I've mentioned elsewhere, we may wind up wanting to retain the List approach, since it will be more efficient to integrate with the reader/writer that way, but I just wanted to note this as part of the review.
RANSDecode ransdecode = new RANSNx16Decode(); | ||
uncompressedDataByteBuffer = ransdecode.uncompress(ByteBuffer.wrap(dataBytes)); | ||
} | ||
this.getTokenStreamByType(tokenType).add(tokenPosition,new Token(uncompressedDataByteBuffer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to use a local method call here to get the stream (this.getTokenStreamByType(tokenType)
), when everywhere else in this method you just do tokenStreams.get(tokenType)
directly. I would either use it everywhere, or (preferably) eliminate the local wrapper since it does't add much.
TokenStreams tokenStreams = new TokenStreams(inBuffer, useArith, numNames); | ||
List<List<String>> tokensList = new ArrayList<>(numNames); | ||
for(int i = 0; i < numNames; i++) { | ||
tokensList.add(new ArrayList<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these ArrayLists be preallocated to the correct length, or some reasonable length, rather than relying on them being serially reallocated as they grow ? I think even overallocating somewhat would be preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon looking further, it looks like these arrays need to have a size equal to the number of tokens which the names are broken into. While you may not be able to predict that up front, you could probably guess and use a an estimate. The default for array list is 10 anyway, but I would suggest maybe 15 as a reasonable default (add a constant with a comment saying its an estimate for preallocation. 15 will probably be too big most of the time but specifying a size value would reinforce what these are expected to be.
// contains a ByteBuffer of length = number of names | ||
// This ByteBuffer helps determine the type of each of the token at the specicfied pos | ||
|
||
this(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it seems like tokenStreams
array is always of size 13 (number of token types) x numNames
. If thats true, then when you combine these constructors, can you preallocate the list of Arrays to size numNames
?
byte currentByte = inputBuffer.get(); | ||
while (currentByte != 0) { | ||
resultStringBuilder.append((char) currentByte); | ||
currentByte = inputBuffer.get(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal in this case, but in the spirit of keeping narrow variable scope, I would always prefer something like this:
byte currentByte = inputBuffer.get(); | |
while (currentByte != 0) { | |
resultStringBuilder.append((char) currentByte); | |
currentByte = inputBuffer.get(); | |
} | |
for (byte currentByte = inputBuffer.get(); currentByte != 0; currentByte = inputBuffer.get()) { | |
resultStringBuilder.append((char) currentByte); | |
} |
while (value.length() < len) { | ||
value = "0" + value; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also super inefficient. It would be much preferable to not reallocate this string (possibly) len
times. Instead, either use a single StringBuilder, or add apache.StringUtils and use leftPad.
… stripe, nosize, cat flags
7d2d0c5
to
67ad384
Compare
NOTE: This PR is in draft as it is dependent on RANS NX16 PR and Range Codec PR
Description
This PR is part of an effort to upgrade CRAM to v3.1. It adds the Name Tokenization Decoder implementation.
List of Changes:
Add Name Tokenization Decoder
Add NameTokenizationInteropTest to test the Name Tokenization Decoder using the test files from htscodecs. These interop tests use the files from samtools installation (samtools-1.14/htslib-1.14/htscodes/tests/names)