Implements collections of bundles. #1702

Merged
merged 9 commits into from
Aug 13, 2024
14 changes: 8 additions & 6 deletions src/main/java/htsjdk/beta/io/bundle/Bundle.java
@@ -5,14 +5,14 @@

import java.io.Serializable;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
* An immutable collection of related resources (a primary resource, such as "reads", "variants",
* "features", or "reference"), plus zero or more related companion resources ("index", "dictionary",
* An immutable collection of related resources, including a (single, required) primary resource, such as "reads",
* "variants", "features", or "reference", plus zero or more related secondary resources ("index", "dictionary",
* "MD5", etc.).
* <p>
* Each resource in a {@link Bundle} is represented by a {@link BundleResource}, which in turn describes
@@ -22,14 +22,14 @@
* in {@link BundleResourceType}.
* <p>
* A {@link Bundle} must have one resource that is designated as the "primary" resource, specified
* by a content type string. A resource with "primary content type" is is guaranteed to be present in
* by a content type string. A resource with "primary content type" is guaranteed to be present in
* the {@link Bundle}.
* <p>
* Since each resource in a {@link Bundle} has a content type that is unique within that {@link Bundle},
* a Bundle can not be used to represent a list of similar items where each item is equivalent to
* each other item (i.e., a list of shards, where each shard in the list is equivalent to each other
* shard). Rather {@link Bundle}s are used to represent related resources where each resource has a unique
* character or role relative to the other resources (i.e., a "reads" resource and a corresponding "index"
* character or role relative to the other resources (i.e., a "reads" resource and a corresponding "index"
* resource).
* <p>
* Bundles that contain only serializable ({@link IOPathResource}) resources may be serialized to, and
@@ -38,7 +38,9 @@
public class Bundle implements Iterable<BundleResource>, Serializable {
private static final long serialVersionUID = 1L;

private final Map<String, BundleResource> resources = new LinkedHashMap<>();
// don't use LinkedHashMap here; using HashMap resolves unnatural resource ordering issues that arise
Member

I'm not sure I follow what's going on here when you have a LinkedHashMap. Shouldn't serializing a linked hashmap deserialize it to the same map?

Collaborator Author

Hm. I was having problems when roundtripping these through JSON, but I can no longer reproduce the issue, so reverting to our beloved LinkedHashMap.

Member

Long Live LinkedHashMap!

Collaborator Author

So I realized what the issue was here. JSONObject internally uses a HashMap (implying, I think, that JSON doesn't preserve the serialized order of JSON attributes), so when you roundtrip through JSON, the iteration order from JSON is based on the HashMap order. If we use LinkedHashMap in Bundle, then the order after a roundtrip gets scrambled, and tests fail (but only for some cases because sometimes the roundtrip order matches and sometimes it differs).
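
The ordering effect described above can be reproduced with the JDK alone. The following is a minimal sketch, not the PR's code: `roundTrip` is a hypothetical stand-in for a parser that stores object members in a `HashMap`, and the content types and paths are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MapOrderRoundTrip {
    // Copies the entries into a HashMap, standing in (as an assumption) for a
    // parser that keeps object members in a HashMap and so drops insertion order.
    static Map<String, String> roundTrip(final Map<String, String> original) {
        return new HashMap<>(original);
    }

    // Returns the keys in the map's iteration order.
    static List<String> keyOrder(final Map<String, String> map) {
        return new ArrayList<>(map.keySet());
    }

    public static void main(final String[] args) {
        final Map<String, String> resources = new LinkedHashMap<>();
        resources.put("READS", "my.bam"); // hypothetical content types and paths
        resources.put("INDEX", "my.bai");
        resources.put("MD5", "my.md5");

        final Map<String, String> restored = roundTrip(resources);

        // Map.equals compares entries only, so the two maps are equal.
        System.out.println("maps equal: " + resources.equals(restored));
        // The key sequences may or may not match, depending on how the keys hash.
        System.out.println("same key order: " + keyOrder(resources).equals(keyOrder(restored)));
    }
}
```

Whether the second line prints `true` or `false` depends on the keys, which matches the observation that the tests failed only for some cases.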

Member

Ahah, that makes sense. Seems like a weird decision to not maintain internal order, but it's good to know about. Can we change our tests to use an order-independent comparison?

Collaborator Author

I added a resource-order-independent equals method for use in the tests.
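
One way such an order-independent comparison might look is sketched below. This is an assumption for illustration, not the method added in the PR: `sameElements` is a hypothetical helper, and it relies on elements being unique, as resources with unique content types would be.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

class OrderIndependentCompare {
    // True when both collections hold the same elements, ignoring order.
    // Assumes elements are unique and implement equals/hashCode consistently.
    static <T> boolean sameElements(final Collection<T> a, final Collection<T> b) {
        return a.size() == b.size() && new HashSet<>(a).equals(new HashSet<>(b));
    }

    public static void main(final String[] args) {
        final List<String> before = List.of("READS", "INDEX", "MD5");
        final List<String> after = List.of("MD5", "READS", "INDEX");
        System.out.println(before.equals(after));        // false: List.equals is order-sensitive
        System.out.println(sameElements(before, after)); // true
    }
}
```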

// when creating a bundle from serialized files or strings
private final Map<String, BundleResource> resources = new HashMap<>(); // content type -> resource
private final String primaryContentType;

/**
2 changes: 1 addition & 1 deletion src/main/java/htsjdk/beta/io/bundle/BundleBuilder.java
@@ -54,7 +54,7 @@ public BundleBuilder addSecondary(final BundleResource resource) {
* least one (primary) resource must have been previously added to create a valid bundle.
*
* @return a {@link Bundle}
* @throws IllegalStateException if no primary resouuce has been added
* @throws IllegalStateException if no primary resource has been added
*/
public Bundle build() {
if (primaryResource == null) {
173 changes: 137 additions & 36 deletions src/main/java/htsjdk/beta/io/bundle/BundleJSON.java
@@ -4,10 +4,12 @@
import htsjdk.io.IOPath;
import htsjdk.samtools.util.Log;
import htsjdk.utils.ValidationUtils;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
@@ -30,11 +32,12 @@ public class BundleJSON {
public static final String JSON_PROPERTY_PRIMARY = "primary";
public static final String JSON_PROPERTY_PATH = "path";
public static final String JSON_PROPERTY_FORMAT = "format";

public static final String JSON_SCHEMA_NAME = "htsbundle";
public static final String JSON_SCHEMA_VERSION = "0.1.0"; // TODO: bump this to 1.0.0
Member

Should this be 0.2.0 now?

Collaborator Author

Will update the schema in a separate PR when I move the bundle classes out of the beta package.


final private static Set<String> TOP_LEVEL_PROPERTIES = Collections.unmodifiableSet(
new HashSet<String>() {
new HashSet<>() {
Member

Can these static declarations be migrated to use Set.of() instead of these anonymous classes?

Collaborator Author

Oh yeah, that's an improvement. Done.
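
For reference, the `Set.of()` form suggested here might look like the sketch below. The `schemaName` and `schemaVersion` values are hypothetical placeholders (only `"primary"` appears in the diff above); the real values live in `BundleJSON`'s constants.

```java
import java.util.Set;

class TopLevelProperties {
    // Hypothetical property names used only for this illustration.
    static final String SCHEMA_NAME = "schemaName";
    static final String SCHEMA_VERSION = "schemaVersion";
    static final String PRIMARY = "primary";

    // Set.of (Java 9+) returns an unmodifiable set directly, so neither an
    // anonymous HashSet subclass nor Collections.unmodifiableSet is needed.
    static final Set<String> TOP_LEVEL_PROPERTIES = Set.of(SCHEMA_NAME, SCHEMA_VERSION, PRIMARY);

    public static void main(final String[] args) {
        System.out.println(TOP_LEVEL_PROPERTIES.contains(PRIMARY)); // true
        try {
            TOP_LEVEL_PROPERTIES.add("path");
        } catch (final UnsupportedOperationException e) {
            System.out.println("set is unmodifiable");
        }
    }
}
```

Note that `Set.of` rejects null and duplicate elements at construction time, which is stricter than the `HashSet` it replaces.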

private static final long serialVersionUID = 1L;
{
add(JSON_PROPERTY_SCHEMA_NAME);
@@ -43,7 +46,7 @@ public class BundleJSON {
}});

/**
* Serialize this bundle to a JSON string representation. All resources in the bundle must
* Serialize a bundle to a JSON string representation. All resources in the bundle must
* be {@link IOPathResource}s for serialization to succeed. Stream resources cannot be serialized.
*
* @param bundle the {@link Bundle} to serialize to JSON
@@ -74,78 +77,176 @@ public static String toJSON(final Bundle bundle) {
}

/**
* Create a Bundle from jsonString.
*
* @param jsonString a valid JSON string conforming to the bundle schema
* @return a {@link Bundle} created from jsonString
* Convert a (non-empty) Collection of Bundles to a JSON array string representation.
* @param bundles a Collection of Bundles to serialize to JSON
* @return a JSON string (array) representation of the collection of bundles
* @throws IllegalArgumentException if the collection is empty
*/
public static Bundle toBundle(final String jsonString) {
return toBundle(ValidationUtils.nonEmpty(jsonString, "resource list"), HtsPath::new);
public static String toJSON(final Collection<Bundle> bundles) {
if (bundles.isEmpty()) {
throw new IllegalArgumentException("A bundle collection must contain at least one bundle");
}
return bundles.stream()
.map(BundleJSON::toJSON)
.collect(Collectors.joining(",\n", "[", "]"));
}

/**
* Create a Bundle from a jsonString.
*
* @param jsonString a valid JSON string conforming to the bundle schema (for compatibility, a bundle list is also
* accepted, as long as it only contains a single bundle)
* @return a {@link Bundle} created from jsonString
*/
public static Bundle toBundle(final String jsonString) {
return toBundle(ValidationUtils.nonEmpty(jsonString, "resource list"), HtsPath::new);
}

/**
* Create a Bundle from jsonString using a custom class that implements {@link IOPath} for all resources.
* (For compatibility, a bundle list string is also accepted, as long as it only contains a single bundle).
*
* @param jsonString a valid JSON string conforming to the bundle schema
* @param ioPathConstructor a function that takes a string and returns an IOPath-derived class of type <T>
* @param <T> the IOPath-derived type to use for IOPathResources
* @return a newly created {@link Bundle}
* @return a newly created {@link Bundle}
*/
public static <T extends IOPath> Bundle toBundle(
final String jsonString,
final Function<String, T> ioPathConstructor) {
ValidationUtils.nonEmpty(jsonString, "JSON string");
ValidationUtils.nonNull(ioPathConstructor, "IOPath-derived class constructor");
try {
return toBundle(new JSONObject(jsonString), ioPathConstructor);
} catch (JSONException | UnsupportedOperationException e) {
// see if the user provided a collection instead of a single bundle, and if so, present it as
// a Bundle as long as it only contains one Bundle
try {
final Collection<Bundle> bundles = toBundleCollection(jsonString, ioPathConstructor);
if (bundles.size() > 1) {
throw new IllegalArgumentException(
String.format("A JSON string with more than one bundle was provided but only a single Bundle is allowed",
Member

The format string is missing the template variables.

Collaborator Author

Oh, fixed.
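
The issue flagged here is easy to miss because `String.format` does not complain about surplus arguments. A small sketch of the behavior follows; the class and method names are hypothetical, and the message text is borrowed from the diff.

```java
class FormatPlaceholders {
    // Each %s consumes one argument, in order.
    static String describe(final String asBundle, final String asCollection) {
        return String.format(
                "JSON can be interpreted neither as an individual bundle (%s) nor as a bundle collection (%s)",
                asBundle, asCollection);
    }

    public static void main(final String[] args) {
        // With no placeholders in the format string, extra arguments are ignored without error.
        System.out.println(String.format("only a single Bundle is allowed", "ignored"));
        System.out.println(describe("not an object", "not an array"));
    }
}
```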

e.getMessage(),
e.getMessage()));
}
return bundles.stream().findFirst().get();
} catch (JSONException | UnsupportedOperationException e2) {
throw new IllegalArgumentException(
String.format("JSON can be interpreted neither as an individual bundle (%s) nor as a bundle collection (%s)",
e2.getMessage(),
e.getMessage()),
e);
}
}
}

final List<BundleResource> resources = new ArrayList<>();
String primaryContentType;
/**
* Create a Collection<Bundle> from a jsonString, using a custom class that implements {@link IOPath} for all
* resources.
* @param jsonString the json string must conform to the bundle schema, and may contain an array or single object
* @param ioPathConstructor constructor to use to create the backing IOPath for all resources
* @return Collection<Bundle>
* @param <T> IOPath-derived class to use for IOPathResources
*/
public static <T extends IOPath> Collection<Bundle> toBundleCollection(
Member

Do we really want this to be Collection and not something with an order? I would think List might be better? The input order is usually important in some way, changing the encounter order can often change either errors generated or floating point aggregations even if it's something that seems like it should be file order agnostic.

Collaborator Author

Changed to Lists.
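
The point about floating point aggregation can be seen with a tiny, self-contained sketch (the `EncounterOrder` class is hypothetical and unrelated to the bundle code): summing the same values in a different encounter order yields a different `double`.

```java
import java.util.List;

class EncounterOrder {
    // Sums the values left to right, in the list's encounter order.
    static double sum(final List<Double> values) {
        double total = 0.0;
        for (final double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(final String[] args) {
        System.out.println(sum(List.of(0.1, 0.2, 0.3))); // 0.6000000000000001
        System.out.println(sum(List.of(0.3, 0.2, 0.1))); // 0.6
    }
}
```

Returning a `List` makes the encounter order part of the API contract, so callers that aggregate over the bundles get reproducible results.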

final String jsonString,
final Function<String, T> ioPathConstructor) {
ValidationUtils.nonEmpty(jsonString, "json bundle string");
ValidationUtils.nonNull(ioPathConstructor, "IOPath-derived class constructor");

final List<Bundle> bundles = new ArrayList<>();
try {
final JSONObject jsonDocument = new JSONObject(jsonString);
if (jsonString.length() < 1) {
final JSONArray jsonArray = new JSONArray(jsonString);
jsonArray.forEach(element -> {
if (! (element instanceof JSONObject jsonObject)) {
throw new IllegalArgumentException(
String.format("Bundle collections may contain only Bundle objects, found %s",
element.toString()));
}
bundles.add(toBundle(jsonObject, ioPathConstructor));
});
} catch (JSONException | UnsupportedOperationException e) {
// see if the user provided a single bundle instead of a collection, if so, wrap it up as a collection
try {
bundles.add(toBundle(new JSONObject(jsonString), ioPathConstructor));
} catch (JSONException | UnsupportedOperationException e2) {
throw new IllegalArgumentException(
String.format("JSON file parsing failed %s", jsonString));
String.format("JSON can be interpreted neither as an individual bundle (%s) nor as a bundle collection (%s)",
e2.getMessage(),
e.getMessage()),
e);
}
}
if (bundles.size() < 1) {
Member

Nitpicky, but I think isEmpty might be better.

Collaborator Author

Done.

throw new IllegalArgumentException("JSON bundle collection must contain at least one bundle");
}
return bundles;
}

/**
* Create a Collection<Bundle> from a jsonString.
*
* @param jsonString a JSON string that conforms to the bundle schema; may be an array or single object
* @return a {@link Collection<Bundle>} created from a Collection of jsonStrings
*/
public static Collection<Bundle> toBundleCollection(final String jsonString) {
return toBundleCollection(jsonString, HtsPath::new);
}

private static <T extends IOPath> Bundle toBundle(
final JSONObject jsonObject, // must be a single Bundle object
final Function<String, T> ioPathConstructor) {
try {
// validate the schema name
final String schemaName = getRequiredPropertyAsString(jsonDocument, JSON_PROPERTY_SCHEMA_NAME);
final String schemaName = getRequiredPropertyAsString(jsonObject, JSON_PROPERTY_SCHEMA_NAME);
if (!schemaName.equals(JSON_SCHEMA_NAME)) {
throw new IllegalArgumentException(
String.format("Expected bundle schema name %s but found %s", JSON_SCHEMA_NAME, schemaName));
}

// validate the schema version
final String schemaVersion = getRequiredPropertyAsString(jsonDocument, JSON_PROPERTY_SCHEMA_VERSION);
final String schemaVersion = getRequiredPropertyAsString(jsonObject, JSON_PROPERTY_SCHEMA_VERSION);
if (!schemaVersion.equals(JSON_SCHEMA_VERSION)) {
throw new IllegalArgumentException(String.format("Expected bundle schema version %s but found %s",
JSON_SCHEMA_VERSION, schemaVersion));
}
primaryContentType = getRequiredPropertyAsString(jsonDocument, JSON_PROPERTY_PRIMARY);

jsonDocument.keySet().forEach((String contentType) -> {
if (! (jsonDocument.get(contentType) instanceof JSONObject jsonDoc)) {
return;
}

if (!TOP_LEVEL_PROPERTIES.contains(contentType)) {
final String format = jsonDoc.optString(JSON_PROPERTY_FORMAT, null);
final IOPathResource ioPathResource = new IOPathResource(
ioPathConstructor.apply(getRequiredPropertyAsString(jsonDoc, JSON_PROPERTY_PATH)),
contentType,
format == null ?
null :
jsonDoc.optString(JSON_PROPERTY_FORMAT, null));
resources.add(ioPathResource);
}
});
if (resources.isEmpty()) {
LOG.warn("Empty resource bundle found: ", jsonString);
final String primaryContentType = getRequiredPropertyAsString(jsonObject, JSON_PROPERTY_PRIMARY);
final Collection<BundleResource> bundleResources = toBundleResources(jsonObject, ioPathConstructor);
if (bundleResources.isEmpty()) {
LOG.warn("Empty resource bundle found in: ", jsonObject.toString());
Member

Should we allow empty bundles at all?

Collaborator Author

@cmnbroad, Mar 12, 2024

Actually, we already don't. I'll remove this (the Bundle constructor will throw anyway), and add a negative test to BundleTest to demonstrate that.

}
return new Bundle(primaryContentType, bundleResources);
} catch (JSONException | UnsupportedOperationException e) {
throw new IllegalArgumentException(e);
}
}
private static <T extends IOPath> IOPathResource toBundleResource(
final String contentType,
final JSONObject jsonObject,
final Function<String, T> ioPathConstructor) {
final String format = jsonObject.optString(JSON_PROPERTY_FORMAT, null);
return new IOPathResource(
ioPathConstructor.apply(getRequiredPropertyAsString(jsonObject, JSON_PROPERTY_PATH)),
contentType,
format == null ? null : format);
Member

I think this ternary is unnecessary since you're not dereferencing format here.

Collaborator Author

Fixed.

}
private static <T extends IOPath> Collection<BundleResource> toBundleResources(
final JSONObject jsonResources,
final Function<String, T> ioPathConstructor) {

return new Bundle(primaryContentType, resources);
final List<BundleResource> bundleResources = new ArrayList<>(); // default capacity of 10 seems right
jsonResources.keySet().forEach(key -> {
if (!TOP_LEVEL_PROPERTIES.contains(key)) {
if (jsonResources.get(key) instanceof JSONObject resourceObject) {
bundleResources.add(toBundleResource(key, resourceObject, ioPathConstructor));
} else {
throw new IllegalArgumentException(
String.format("Bundle resources may contain only BundleResource objects, found %s", key));
}
}
});
return bundleResources;
}

private static String getRequiredPropertyAsString(JSONObject jsonDocument, String propertyName) {
2 changes: 1 addition & 1 deletion src/main/java/htsjdk/beta/plugin/HtsContentType.java
@@ -9,7 +9,7 @@
* of HTS data such as "aligned reads". Each content type has an associated set of interfaces that
* are used with that type (for example, codecs with content type {@link #ALIGNED_READS} expose a
* set interfaces for reading and writing aligned reads data). The content types and the packages
* containing the the common interfaces that are defined for each type are:
* containing the common interfaces that are defined for each type are:
* <p>
* <ul>
* <li> For {@link HtsContentType#HAPLOID_REFERENCE} codecs, see the {@link htsjdk.beta.plugin.hapref} package </li>
2 changes: 1 addition & 1 deletion src/main/java/htsjdk/beta/plugin/reads/ReadsBundle.java
@@ -23,7 +23,7 @@
import java.util.function.Function;

/**
* A {@link Bundle} specifically for reads and reads-related resources. A {@link ReadsBundle} has a
* A class for creating a {@link Bundle} for reads and reads-related resources. A {@link ReadsBundle} has a
* primary resource with content type {@link BundleResourceType#ALIGNED_READS}; and an optional index
* resource. ReadsBundles can also contain other resources.
*