[FLINK-37005][table] Make StreamExecDeduplicate output insert only where possible #26051
base: master
…ime and proc time deduplicate when keeping first row
@lincoln-lil @xuyangzhong can you take a look at this? You were recently involved in a couple of changes around here: FLINK-36837 and FLINK-34702. I could try to implement the async append-only row-time version, but I would probably need a bit of guidance from your side. For example, the function that I introduced
@pnowojski IIUC, the async state operator currently processes watermarks only after all prior async state requests have been completed. That means when the event time timer is triggered, all preceding async state access operations have already been processed. Please correct me if there is any mistake, @Zakelly.
@pnowojski Glad to see you here. @xuyangzhong is right, you have that guarantee. Records and timers for the same key are processed in order of arrival. That is, when the watermark advances, the timer fires only after all records that have already arrived for that key have finished processing. Any records for that key that arrive after the watermark advances start processing only after the timer finishes.
Thanks for your answers @Zakelly and @xuyangzhong! One more thing. At the moment I only have a sync implementation of the append-only deduplicate operator. I guess there is no harm if this is merged as is, so that the synchronous version will be used despite
@pnowojski Yes. And the async state version could be implemented in a separate PR. One suggestion if so:
    return new AsyncKeyedProcessOperator<>(
            new RowTimeDeduplicateKeepFirstRowFunction(
                    rowTypeInfo, stateRetentionTime, rowtimeIndex));
The sync version of the operator should be used if the function is written for sync state:
    - return new AsyncKeyedProcessOperator<>(
    -         new RowTimeDeduplicateKeepFirstRowFunction(
    -                 rowTypeInfo, stateRetentionTime, rowtimeIndex));
    + return new KeyedProcessOperator<>(
    +         new RowTimeDeduplicateKeepFirstRowFunction(
    +                 rowTypeInfo, stateRetentionTime, rowtimeIndex));
What is the purpose of the change
This PR introduces RowTimeDeduplicateKeepFirstRowFunction, which uses watermarks to avoid retracting previous results. Thanks to that, the planner can avoid costly operators like SinkUpsertMaterializer downstream, and the whole query can be append-only, removing the need for retracting/upserting results from the output table.
Currently there is no variant for the async state backend.
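The idea of holding back the first row until the watermark passes it can be illustrated with a minimal, Flink-free sketch. This is not the code from this PR: the class KeepFirstRowSketch, its Row type, and the onRow/onWatermark methods are hypothetical names chosen for illustration, and the sketch assumes that no rows arrive behind the watermark. It only shows why waiting for the watermark lets each key emit exactly one insert and never a retraction.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of watermark-driven keep-first-row deduplication (hypothetical, not Flink code). */
public class KeepFirstRowSketch {

    /** A row with a deduplication key, an event-time timestamp and a payload. */
    public static final class Row {
        public final String key;
        public final long rowtime;
        public final String payload;

        public Row(String key, long rowtime, String payload) {
            this.key = key;
            this.rowtime = rowtime;
            this.payload = payload;
        }
    }

    // Earliest row seen so far per key that has not been emitted yet.
    private final Map<String, Row> pending = new LinkedHashMap<>();
    // Keys whose first row was already emitted; later rows for them are dropped.
    private final Set<String> emitted = new HashSet<>();

    /** Buffers the row if it is the earliest one seen so far for its key. */
    public void onRow(Row row) {
        if (emitted.contains(row.key)) {
            return; // duplicate of a key that already produced its single insert
        }
        pending.merge(row.key, row, (old, neu) -> neu.rowtime < old.rowtime ? neu : old);
    }

    /**
     * Emits every buffered row whose rowtime is covered by the watermark.
     * Assumption: no row with a smaller rowtime can arrive after this point,
     * so the emitted row is final and never needs to be retracted.
     */
    public List<Row> onWatermark(long watermark) {
        List<Row> out = new ArrayList<>();
        Iterator<Map.Entry<String, Row>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Row row = it.next().getValue();
            if (row.rowtime <= watermark) {
                out.add(row);
                emitted.add(row.key);
                it.remove();
            }
        }
        return out;
    }
}
```

In this model the output contains only inserts, which is the property that would let a downstream consumer treat the stream as append-only. The trade-off is added latency: a row is emitted only once the watermark reaches its rowtime.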
Verifying this change
This adds new tests and is also covered by existing ITCases.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation