Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shuffle by #258

Merged
merged 10 commits into from
Nov 11, 2023
Merged

shuffle by #258

merged 10 commits into from
Nov 11, 2023

Conversation

chenziliang
Copy link
Collaborator

@chenziliang chenziliang commented Nov 7, 2023

PR checklist:

  • Did you run ClangFormat ?
  • Did you separate headers to a different section in existing community code base ?
  • Did you surround proton: starts/ends for new code in existing community code base ?

Please write user-readable short description of the changes:

Closed #256.

Streaming part will be done in a separate PR

@chenziliang chenziliang changed the title shuffle by expr shuffle by Nov 7, 2023
@chenziliang chenziliang force-pushed the feature/issue-256-light-shuffle branch 2 times, most recently from 2049fc4 to 2edf66e Compare November 10, 2023 07:45
@chenziliang chenziliang marked this pull request as ready for review November 10, 2023 07:45
@chenziliang chenziliang requested a review from yl-lisen November 10, 2023 07:49
@chenziliang chenziliang self-assigned this Nov 10, 2023
Comment on lines +202 to +203
if (elem->as<ASTIdentifier>())
shuffle_by_columns.push_back(elem->getColumnName());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If shuffle by keys are an expression or constant, ignore them ?
I guess the shuffle by keys should be calculated before light shuffling, so we can always use shuffle_by_columns.push_back(elem->getColumnName()) even if it is not an identifier

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will still do the work i think and it shall be fine just like regular columns.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning if end user do something like aliasing a constant or an expression / function as an identifier, we will do the shuffle work as expected.

When the logic reaches here, the ast of the shuffle by can be asserted to ASTIdentifier since it is already validated in the AST parser.

if (shuffle_by_expression_list)
{
for (const auto & elem : shuffle_by_expression_list->children)
if (!elem->as<ASTIdentifier>())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Identifier may be an alias of expression/function.
If want to validate whether is column, we can call IdentifierSemantic::getMembership after TranslateQualifiedNamesVisitor in TreeRewriter,

Copy link
Collaborator Author

@chenziliang chenziliang Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, thought about this. For now, probably it is fine and we may extend this soon to support expression for shuffle by.

Comment on lines +44 to +45
const auto & key_col = columns[key_column_position]->convertToFullColumnIfConst();
const auto & key_col_no_lc = recursiveRemoveLowCardinality(recursiveRemoveSparse(key_col));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need this code, since ColumnConst/ColumnLowCardinality/ColumnSparse have been implemented this method updateWeakHash32

Copy link
Collaborator Author

@chenziliang chenziliang Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked the code. I am not quite sure if they will produce same hash results. Have you tested it out ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing mean, 2 or more different low cardinality blocks for example, they contain some overlapping lower cardinality rows but with other rows different, when we do the hash, we will need make sure same LC rows across different blocks will shuffled to same shard (having the same hash). Same for sparse column.

{
if (!input.isFinished() && input.hasData())
{
shuffled_chunks.push_back(input.pull());
Copy link
Collaborator

@yl-lisen yl-lisen Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE, for streaming aggregating, we need to use pull(/*set_not_needed*/true), since the inputs can receive heartbeat chunk, then trigger upstream again, next loop above actions, which will cause pipeline stuck, there are some reasons:

  1. The upstream is triggered first, resulting in no thread calling the downstream processor.
  2. Missing propagate heartbeat chunk can also cause pipeline stuck.

Of course, it works in historical aggregating.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pure historical data. Streaming/AggregatingTransform.cpp is not covered yet

@chenziliang chenziliang force-pushed the feature/issue-256-light-shuffle branch from 2edf66e to 621497a Compare November 10, 2023 19:27
@chenziliang chenziliang merged commit 1653257 into develop Nov 11, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Light shuffle
2 participants