Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hash join when writing sparkey #1

Closed
wants to merge 1 commit into from
Closed

Conversation

aslotnick
Copy link
Owner

@aslotnick aslotnick commented Jun 24, 2024

When writing to sparkey, allShards represents every expected shard even if there is no corresponding data in shards for that shard number.

Starting in Scio 14, possibly related to spotify#5208, shards.rightOuterJoin(allShards) would fail when a shard contained approximately 2GB or more data, leading to the bug described in spotify#5300: java.lang.OutOfMemoryError: Required array length 2147483639 + 15534 is too large.

This PR replaces rightOuterJoin with hashFullOuterJoin (note that there is no hashRightOuterJoin implementation). A hash join is a good fit because the right-hand side contains very little data (only the keys of the shards) and it doesn't need to use an array to represent the large left-hand side's values. As a result shards with more than 2GB data can once again be processed successfully.

@aslotnick aslotnick closed this Jun 24, 2024
@aslotnick aslotnick deleted the as/sparkey-join-fix-1 branch July 8, 2024 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant