More investigation on epi_df duplicate detection #598

Open
brookslogan opened this issue Jan 23, 2025 · 1 comment
Comments


brookslogan commented Jan 23, 2025

This comment identifies some alternative approaches (conversion to data.table paired with other approaches, as well as vctrs::vec_duplicate_any), along with some details on memory benchmarking. If as_epi_df() is still consuming a lot of time in some operations (I need to package up the archive -> archive slide mentioned in the issue), then we may want to look at these some more. (The memory aspect probably only matters for epi_archive duplicate-key detection, not epi_df duplicate-key detection.)
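For reference, here is a rough sketch of what those alternatives look like on a toy key table. The data and the column names (geo_value, time_value) are made up for illustration, and this is not the check that as_epi_df() currently runs; it only shows that the three calls answer the same question.

```r
# Hypothetical toy keys; row 4 repeats row 1.
keys <- data.frame(
  geo_value = c("ca", "ca", "ny", "ca"),
  time_value = as.Date("2025-01-01") + c(0, 1, 0, 0)
)

# vctrs: a data frame is treated as a vector of rows, so this asks
# whether any row occurs more than once.
has_dup_vctrs <- vctrs::vec_duplicate_any(keys)

# data.table: anyDuplicated() returns the index of the first duplicated
# row, or 0 if there is none. The conversion cost is part of this approach.
keys_dt <- data.table::as.data.table(keys)
has_dup_dt <- anyDuplicated(keys_dt, by = c("geo_value", "time_value")) != 0L

# Base R data.frame method, as a baseline to compare against.
has_dup_base <- anyDuplicated(keys) != 0L
```

All three flags should agree; which one is fastest or lightest on memory is exactly what would need benchmarking.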

The first step is probably to benchmark some code to see whether further optimization is worth the time. (profvis may not attribute time properly if native code is involved; be sure to check and instrument accordingly.)
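A possible starting point for that benchmark, assuming the bench package is acceptable here (it is not mentioned above, so treat it as a suggestion). The key table is synthetic and all-unique, which should be the expensive case since no check can stop early. Note that bench's mem_alloc column only tracks allocations made through R, so memory allocated directly by native code would not show up there either.

```r
# Hypothetical synthetic keys: 200 geos x 200 dates, every row unique.
geos <- sprintf("g%03d", 1:200)
dates <- as.Date("2024-01-01") + 0:199
keys <- data.frame(
  geo_value = rep(geos, each = length(dates)),
  time_value = rep(dates, times = length(geos))
)

# Each expression returns a single logical, so bench::mark()'s default
# equality check also confirms that the approaches agree.
results <- bench::mark(
  base = anyDuplicated(keys) != 0L,
  vctrs = vctrs::vec_duplicate_any(keys),
  # Conversion is included so the timing reflects the full cost.
  data.table = anyDuplicated(data.table::as.data.table(keys)) != 0L,
  min_iterations = 5
)

results[c("expression", "median", "mem_alloc")]
```

Sizes closer to real epi_df and epi_archive inputs would be needed before drawing any conclusions.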

@brookslogan

Running vctrs::vec_run_sizes on the arranged result is another approach to look into.
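A minimal sketch of that idea, using vctrs::vec_sort() as a stand-in for whatever arranging step already happens (an assumption; the real code may sort differently) and made-up key columns. Once the keys are sorted, identical rows sit next to each other, so any run longer than 1 means a duplicated key.

```r
# Hypothetical toy keys; ("ca", 2025-01-01) appears twice.
keys <- data.frame(
  geo_value = c("ny", "ca", "ca", "ca"),
  time_value = as.Date("2025-01-01") + c(0, 1, 0, 0)
)

# Sort rows by geo_value, then time_value, so equal keys become adjacent.
sorted_keys <- vctrs::vec_sort(keys)

# Lengths of consecutive runs of identical rows.
run_sizes <- vctrs::vec_run_sizes(sorted_keys)

has_dup_runs <- any(run_sizes > 1L)
```

If the data is already being arranged for other reasons, this would only add a single linear pass on top.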
