More investigation on `epi_df` duplicate detection #598

brookslogan · 2025-01-23T18:32:11Z

This comment identifies some alternative approaches (conversion to data.table paired with other approaches, as well as `vctrs::vec_duplicate_any`), along with some memory-benchmarking details. If `as_epi_df()` is still consuming a lot of time in some operations (I need to package up the archive -> archive slide mentioned in the issue), then we may want to look at these some more. (The memory aspect probably only matters for `epi_archive` duplicate-key detection, not `epi_df` duplicate-key detection.)
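For concreteness, here is a minimal sketch of two of the alternatives named above. It runs on a hypothetical toy data frame rather than a real `epi_df`; the key columns `geo_value` and `time_value` and all variable names are assumptions for illustration, not the package's implementation.

```r
library(vctrs)
library(data.table)

# Hypothetical toy data standing in for an epi_df;
# row 3 repeats the (geo_value, time_value) key of row 1.
toy <- data.frame(
  geo_value = c("ca", "ca", "ca", "ny"),
  time_value = as.Date("2025-01-01") + c(0, 1, 0, 0),
  value = c(1, 2, 3, 4)
)
key_cols <- c("geo_value", "time_value") # assumed key columns

# Approach 1: vctrs, checking only the key columns for duplicated rows.
has_dup_vctrs <- vec_duplicate_any(toy[key_cols])

# Approach 2: convert to data.table and use its anyDuplicated() method,
# which returns the index of the first duplicate row (0 if none).
has_dup_dt <- anyDuplicated(as.data.table(toy), by = key_cols) > 0L
```

Both flags should agree on any input; which is faster or lighter on memory at realistic sizes is exactly what would need benchmarking.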

The first step is probably to benchmark some code to see whether further optimization is worth the time. (profvis may not report time properly when native code is involved; be sure to check and instrument accordingly.)
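A rough sketch of what such a benchmark could look like, using `bench::mark()` as one possible tool (an assumption; the issue does not name a benchmarking package). The synthetic key table and its size are made up for illustration. Note that `mem_alloc` only counts allocations made through R's allocator, so memory obtained directly via `malloc()` in native code would not show up.

```r
library(bench)
library(vctrs)
library(data.table)

# Synthetic key columns (hypothetical sizes, chosen only for illustration).
set.seed(1)
n <- 1e4
keys <- data.frame(
  geo_value = sample(letters, n, replace = TRUE),
  time_value = as.Date("2020-01-01") + sample.int(5000, n, replace = TRUE)
)
# Force at least one duplicated key so every approach should report TRUE.
keys <- rbind(keys, keys[1, ])

results <- bench::mark(
  base = anyDuplicated(keys) > 0L,
  vctrs = vec_duplicate_any(keys),
  data_table = anyDuplicated(as.data.table(keys)) > 0L,
  check = TRUE # also verifies that all approaches return the same answer
)

# Timing and R-level memory allocation per approach.
results[c("expression", "median", "mem_alloc")]
```

The `data_table` row includes the conversion cost, which seems like the fair comparison if the input starts out as a data frame or tibble.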

brookslogan · 2025-01-23T19:43:06Z

`vctrs::vec_run_sizes` on the arranged result is also an approach to look into.
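A minimal sketch of that idea, assuming a reasonably recent vctrs that provides `vec_run_sizes()`. Once the keys are sorted, identical keys are adjacent, so any run longer than 1 indicates a duplicate. The toy data and names are hypothetical, and `vec_sort()` stands in here for whatever arrange step the real code already performs.

```r
library(vctrs)

# Hypothetical key columns; rows 1 and 4 share the same key.
toy_keys <- data.frame(
  geo_value = c("ca", "ny", "ca", "ca"),
  time_value = as.Date("2025-01-01") + c(0, 0, 1, 0)
)

# Sort so that identical keys end up next to each other.
arranged <- vec_sort(toy_keys)

# Length of each run of consecutive identical rows.
run_sizes <- vec_run_sizes(arranged)

# Any run longer than 1 means a duplicated key.
has_dup <- any(run_sizes > 1L)
```

If the data is already arranged by key for other reasons, this check reuses that work instead of hashing the keys again.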

brookslogan added the performance label Jan 23, 2025
