Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Name Tokenization Codec (Update CRAM Codecs to CRAM 3.1) #1663

Open
wants to merge 83 commits into
base: master
Choose a base branch
from

Conversation

yash-puligundla
Copy link
Contributor

NOTE: This PR is in draft as it is dependent on RANS NX16 PR and Range Codec PR

Description
This PR is part of an effort to upgrade CRAM to v3.1. It adds the Name Tokenization Decoder implementation.

List of Changes:
Add Name Tokenization Decoder
Add NameTokenizationInteropTest to test the Name Tokenization Decoder using the test files from htscodecs. These interop tests use the files from samtools installation (samtools-1.14/htslib-1.14/htscodes/tests/names)

yash-puligundla and others added 30 commits October 19, 2023 17:56
…ift methods for RANS Nx16 Order 0 and Order 1, RANS Nx16 Order 0 and Order 1 with format flags = 1 works as expected when N=4
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats it for this round. When these comments are addressed I'll do another round.


case TOKEN_DELTA0:
tokenStream.get(TOKEN_DELTA0).getByteBuffer().put((byte)Integer.parseInt(encodeToken.getRelativeTokenValue()));
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about having a default case as elsewhere: there should be a multi-value case statement here with all of the remaining type values, with a break stmt, and a comment saying they are deliberately dropped, followed by a default with a throw.

inputByteBuffer.get(dataBytes, 0, clen); // offset in the dst byte array
final ByteBuffer uncompressedDataByteBuffer;
if (useArith != 0) {
RangeDecode rangeDecode = new RangeDecode();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely suggest caching just one of these (probably in TokenStreams) so we don't have to recreate these for each stream, and then just reset them for each use.

uncompressedDataByteBuffer = rangeDecode.uncompress(ByteBuffer.wrap(dataBytes));

} else {
RANSDecode ransdecode = new RANSNx16Decode();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here - this needs to be cached.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we do the integration of these with the CRAM reader/writer, we may even want to cache/reuse these at an even higher level, so that we don't have to create these for each slice. But for now caching them in TokenStreams will be a big win in terms of reducing allocation/GC.

final ByteBuffer inBuffer,
final String separator) {
inBuffer.order(ByteOrder.LITTLE_ENDIAN);
final int uncompressedLength = inBuffer.getInt() & 0xFFFFFFFF; //unused variable. Following the spec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further review, we're really missing the boat here, since we're not taking advantage of this value. I suspect that the uncompressed length is serialized to the stream specifically to allow decode implementations to efficiently allocate memory for it up front, but we're not doing that because we're using a List<List<>> instead of a byte buffer for the output. As I've mentioned elsewhere, we may wind up wanting to retain the List approach, since it will be more efficient to integrate with the reader/writer that way, but I just wanted to note this as part of the review.

RANSDecode ransdecode = new RANSNx16Decode();
uncompressedDataByteBuffer = ransdecode.uncompress(ByteBuffer.wrap(dataBytes));
}
this.getTokenStreamByType(tokenType).add(tokenPosition,new Token(uncompressedDataByteBuffer));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange to use a local method call here to get the stream (this.getTokenStreamByType(tokenType)), when everywhere else in this method you just do tokenStreams.get(tokenType) directly. I would either use it everywhere, or (preferably) eliminate the local wrapper since it does't add much.

TokenStreams tokenStreams = new TokenStreams(inBuffer, useArith, numNames);
List<List<String>> tokensList = new ArrayList<>(numNames);
for(int i = 0; i < numNames; i++) {
tokensList.add(new ArrayList<>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these ArrayLists be preallocated to the correct length, or some reasonable length, rather than relying on them being serially reallocated as they grow ? I think even overallocating somewhat would be preferable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon looking further, it looks like these arrays need to have a size equal to the number of tokens which the names are broken into. While you may not be able to predict that up front, you could probably guess and use a an estimate. The default for array list is 10 anyway, but I would suggest maybe 15 as a reasonable default (add a constant with a comment saying its an estimate for preallocation. 15 will probably be too big most of the time but specifying a size value would reinforce what these are expected to be.

// contains a ByteBuffer of length = number of names
// This ByteBuffer helps determine the type of each of the token at the specicfied pos

this();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems like tokenStreams array is always of size 13 (number of token types) x numNames. If thats true, then when you combine these constructors, can you preallocate the list of Arrays to size numNames ?

Comment on lines +148 to +152
byte currentByte = inputBuffer.get();
while (currentByte != 0) {
resultStringBuilder.append((char) currentByte);
currentByte = inputBuffer.get();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal in this case, but in the spirit of keeping narrow variable scope, I would always prefer something like this:

Suggested change
byte currentByte = inputBuffer.get();
while (currentByte != 0) {
resultStringBuilder.append((char) currentByte);
currentByte = inputBuffer.get();
}
for (byte currentByte = inputBuffer.get(); currentByte != 0; currentByte = inputBuffer.get()) {
resultStringBuilder.append((char) currentByte);
}

Comment on lines +158 to +160
while (value.length() < len) {
value = "0" + value;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also super inefficient. It would be much preferable to not reallocate this string (possibly) len times. Instead, either use a single StringBuilder, or add apache.StringUtils and use leftPad.

@yash-puligundla yash-puligundla force-pushed the yp_cram_3_1_name_tokeniser branch from 7d2d0c5 to 67ad384 Compare January 30, 2024 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants