grid-based transform limits for downsampling #9424
Comments
Cross-referencing an example of an applied thinning algorithm: vega/vegafusion#350 (comment). It’s in Python, but it should be possible to do something similar in JS.
M4 is exactly the sort of thing I am looking for. I skimmed the JS implementation and it seems simple and fast. Even better, the data isn't required to be sorted, though if it is, the three loops required can be reduced to just one. The Vega dataflow model of add/rem/mod makes it difficult to enforce ordering for transforms, which is why I'm excited about M4.

Aside: How could ordering be enforced with add/rem/mod? We only care about ordering on inserts to the add array, I think(??). With pan, inserts would go to the beginning or end of that array, but with zoom or sampling they could go anywhere. This excludes insertions based on an internal auto-incrementing row index. The best approach would be to maintain an ordered stack of removals used to play back inserts in reverse. The transform would maintain an ordered history array of Vega tuple IDs (basically an internal auto-incrementing row index) as well as a map from tuple IDs to tuples. On a pulse, the map and the add array would be inner-joined on the history list (ordered by the history list), and we'd walk over the result in reverse.

I see that you used

The simplest path towards integration would be to first merge all the data from add/mod/rem, and then run M4 on the result. I assume that REFLOW +ADD -REM (plus the MOD updates) constitutes the visible data and nothing more. If not, and the Vega dataflow calculates transforms for the entire backing source array, we're in trouble. I'm not familiar with Vega, but I really hope not.

What's nice about M4 is that it can likely be adapted for incremental usage, i.e. as a moving window. Rather than being applied to the entire visible data, I think it can be applied independently to the ADD and REM arrays (where MOD contributes to both) and used to update the 4 internal accumulated statistics (the "M4"). The only alternative to replacing the chart (or data) would be to implement M4 as a transformation.
Unfortunately, I seem to be hitting a wall, as vega-lite refuses to recognize and load my custom vega transforms. Until that is resolved, my hands are tied.
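For readers unfamiliar with M4, here is a minimal sketch of the aggregation discussed above: for each pixel column of the visible domain, keep only the first, last, minimum-value and maximum-value points. The function name `m4` and the field names `t` (time) and `v` (value) are assumptions for illustration, not part of Vega or of the implementation referenced in this thread. The input does not need to be sorted.

```javascript
// M4 sketch: per pixel column keep the first, last, min and max point.
// `t` and `v` are hypothetical field names; `width` is the plot width in pixels.
function m4(rows, t0, t1, width) {
  if (!(t1 > t0) || width < 1) throw new Error("need t1 > t0 and width >= 1");
  const span = t1 - t0;
  const bins = new Map(); // column index -> { first, last, min, max }
  for (const row of rows) {
    if (row.t < t0 || row.t > t1) continue; // outside the visible domain
    // Map the timestamp to a pixel column; clamp t === t1 into the last column.
    const col = Math.min(width - 1, Math.floor(((row.t - t0) / span) * width));
    const b = bins.get(col);
    if (!b) {
      bins.set(col, { first: row, last: row, min: row, max: row });
      continue;
    }
    if (row.t < b.first.t) b.first = row;
    if (row.t > b.last.t) b.last = row;
    if (row.v < b.min.v) b.min = row;
    if (row.v > b.max.v) b.max = row;
  }
  // A point can fill several roles at once, so deduplicate by identity.
  const out = new Set();
  for (const b of bins.values()) {
    out.add(b.first).add(b.last).add(b.min).add(b.max);
  }
  return [...out].sort((a, b) => a.t - b.t);
}
```

The output holds at most `4 * width` points regardless of input size, which is what makes it attractive for an 800-pixel-wide chart.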
The reference I included was a test of how this could be integrated within VegaFusion. In the past, we have had 'good-enough' experiences setting up params (signals in Vega) in the chart specification and using the Vega View API to replace the data in the chart on the fly. A dedicated backend or WASM engine then executes the custom queries and returns only the data that fits within the current chart view. We capped the chart at roughly 5K data points. By the way, a JupyterChart object can interact with the Vega View API from Python; if Python is not part of your stack, the JupyterChart feature is noise here.
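A rough sketch of that wiring, under several assumptions: `view` is a Vega `View`, `vega` is the Vega module (both passed in so the snippet has no imports), and `query` is a hypothetical synchronous function standing in for the backend or WASM engine. The signal and dataset names are whatever the chart specification defines. A remote backend would need an asynchronous variant that discards stale responses.

```javascript
// Sketch: replace a dataset whenever a domain signal changes.
// `query(domain)` is a hypothetical, synchronous stand-in for the backend call
// that returns only the rows fitting the current view (e.g. capped at ~5K).
function bindDomainToData(view, vega, signalName, datasetName, query) {
  const handler = (_name, domain) => {
    const rows = query(domain);
    // Remove every existing tuple, then insert the freshly queried rows.
    const changes = vega.changeset().remove(() => true).insert(rows);
    view.change(datasetName, changes).run();
  };
  view.addSignalListener(signalName, handler);
  // Return a function that detaches the listener again.
  return () => view.removeSignalListener(signalName, handler);
}
```

Because the chart only ever holds the queried rows, the cost of each pan or zoom step depends on the cap rather than on the size of the full dataset.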
I love the ideas here. Do you think we would need a new transform in Vega or can this be done entirely in Vega-Lite (by generating specific Vega)?
Enhancement Description
Describe the feature's goal, motivating use cases, and its expected behavior.
I have 50,000 temporal observations and I want to make them all available in an 800-pixel-wide plot of mark type `point` so I can pan and zoom into any time period. I don't need to see all 50K points at once. In fact I don't want to, because it slows vega-lite to a crawl. I'd like to limit the number of points visible at any given time to around 500, or ideally `width*2`. When I'm zoomed out, 49.5K points are removed but I still get to see an overview. When I'm zoomed in sufficiently (i.e., to a day), no points are removed and I see all the data.

I tried applying a signal-based `sample` filter which averages to `X` points displayed: `total_count_orig / (grid_count_orig / X)`. The problem is that the more points the transform removes, the slower vega-lite becomes. Significantly slower. Fully zoomed in, drawing 50K points was faster than sampling 50K points down to 10 points. Panning those 10 points was excruciatingly slow.

I'm proposing a grid-based sampling filter. I believe the original `sample` was slow because it would constantly operate on the full dataset. I'm interested in a new sampler whose parameter describes the maximum number of data objects to include in the grid (current view). Any data outside would be dropped to make the transform more efficient.

Those are the minimum requirements, but some additional features would be nice.

Transforms that reduce the data change the order of the data objects. It would be nice to re-insert the tuples in their original locations using an internal `rowid` key. I drop down into JS to log some statistics to a text mark, and constantly sorting the data has a performance penalty.

Sampling is not the ideal decimation algorithm. If there's anything you can think of to improve the sampling, let us know!
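To make the request concrete, here is a minimal sketch of the behaviour described above, written as a plain function rather than a Vega transform. The name `gridLimit` and the fields `t` and `rowid` are hypothetical, and the input is assumed to arrive in `rowid` order. It drops everything outside the current domain, returns the visible rows untouched when there are at most `maxPoints` of them, and otherwise keeps one row per grid cell, emitted in original row order. It uses simple per-cell thinning in place of random sampling.

```javascript
// Grid-limited thinning sketch (not an existing Vega transform).
// `domain` is [t0, t1]; `maxPoints` is the cap on visible rows, e.g. width * 2.
function gridLimit(rows, domain, maxPoints) {
  const [t0, t1] = domain;
  // Data outside the current view is dropped before any thinning happens.
  const visible = rows.filter(r => r.t >= t0 && r.t <= t1);
  if (visible.length <= maxPoints) return visible; // zoomed in: nothing removed
  const span = t1 - t0;
  const kept = new Map(); // grid cell -> row with the lowest rowid in that cell
  for (const r of visible) {
    const cell = span > 0
      ? Math.min(maxPoints - 1, Math.floor(((r.t - t0) / span) * maxPoints))
      : 0;
    const cur = kept.get(cell);
    if (!cur || r.rowid < cur.rowid) kept.set(cell, r);
  }
  // Restore the original order using the rowid key.
  return [...kept.values()].sort((a, b) => a.rowid - b.rowid);
}
```

The work per update is proportional to the number of rows scanned for the domain filter; an index on `t` (for example a binary search over sorted timestamps) would reduce that further, which is outside the scope of this sketch.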
Here is the Stack Overflow session that inspired this issue: sample or density transform based on zoom domain. I've also tried to add my own transform, but that didn't work out: create a custom vega.transform in vega-lite.
Also, in the meantime, do you know of any Vega optimization projects that implement this exact type of transform?