Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: aes 291 continuous queries in olap aggregator #538

Merged

Conversation

samika98
Copy link
Contributor

@samika98 samika98 commented Sep 23, 2024

The OLAP aggregator will continuously run the data event and time event queries and store the results into Basin. This code here adds logic to run the queries continuously.

We opted not to utilize Data Fusion's streaming capabilities, as they lack the features and flexibility required for our use case. Data Fusion's streaming functionality is relatively limited in scope.

Copy link

linear bot commented Sep 23, 2024

@samika98 samika98 changed the title Feature/aes 291 continuous queries in olap aggregator Feature: aes 291 continuous queries in olap aggregator Sep 23, 2024
@samika98 samika98 changed the title Feature: aes 291 continuous queries in olap aggregator feat: aes 291 continuous queries in olap aggregator Sep 23, 2024
@samika98 samika98 marked this pull request as ready for review September 23, 2024 22:06
@samika98 samika98 requested review from nathanielc and a team as code owners September 23, 2024 22:06
@samika98 samika98 requested review from gvelez17 and removed request for a team September 23, 2024 22:06
/// Represents a processor for continuous stream processing of conclusion feed data.
struct ContinuousStreamProcessor {
ctx: SessionContext,
last_processed_index: Cell<u64>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is last_processed_index inside a Cell? Why not accept a &mut self in the process_batch method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not want ctx to be mutable on the struct. Taking &self can make the API more flexible. I think interior mutability can make the API more flexible. Example :

    let mut processor = ContinuousStreamProcessor::new();
    
    // This works fine
    processor.process_batch(&[Data::new()]);
    
    // But what if we want to do this?
    let processors = vec![&processor, &processor];
    
    for p in &processors {
        p.process_batch(&[Data::new()]); // This won't compile
    }```
    Don't feel too strongly either ways.

debug!("Received shutdown signal, stopping continuous stream processing");
break;
}
result = processor.process_batch() => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_batch only ever returns true. Should we simplify this logic such that it returns Result<()>?


let df = batch.cache().await?;
let df_clone = df.clone();
process_feed_batch(self.ctx.clone(), df_clone).await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, since the name df_clone doesn't convey any meaning I'd say this should just be process_feed_batch(self.ctx.clone(), df.clone()).await?;

.filter(col("index").gt(lit(self.last_processed_index.get())))?
.limit(0, Some(100))?;

let df = batch.cache().await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment here that we need a cache because we do two passes over the data would be good. First pass is the normal process_feed_batch pass while the second pass is to get the highest index

#[test(tokio::test)]
// TODO : Create a test that checks how many times quereies run. They should run continuously in a loop.
// #[tokio::test]
// async fn test_continuous_stream_processing() -> Result<()> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any luck on getting a test working with the custom context?

@samika98 samika98 added this pull request to the merge queue Oct 8, 2024
Merged via the queue into main with commit 0e21e77 Oct 8, 2024
5 checks passed
@samika98 samika98 deleted the feature/aes-291-continuous-queries-in-olap-aggregator branch October 8, 2024 21:56
@smrz2001 smrz2001 mentioned this pull request Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants