Draft assertLargeDatasetEqualityV2 #181

zeotuan · 2024-12-08T10:31:52Z

No description provided.

zeotuan · 2024-12-09T22:07:18Z

@alfonsorr @SemyonSinchenko The benchmark results are promising: the DataFrame-only API runs 10x faster than the RDD-based version
for a DataFrame with 1M rows and only 1 column. I will add more complicated use cases.

zeotuan · 2025-01-04T01:22:06Z

core/src/main/scala/com/github/mrpowers/spark/fast/tests/DatasetComparer.scala

+   * @param checkKeyUniqueness
+   *   if true, will check if the primary key is actually unique
+   */
+  def assertLargeDatasetEqualityV2[T: ClassTag](


How should we approach this? Should I add another version that is closer to what we had before, or should we simply replace the older version with this one?
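To make the question concrete, here is a minimal sketch of what a key-based comparison with an optional uniqueness check could look like. It is not the code in this PR: `Person`, `KeyedComparisonSketch`, `hasUniqueKeys`, and `countMismatches` are hypothetical names, and it assumes Spark SQL is on the classpath, a typed `Dataset` of case classes, and a single `id` column as the primary key.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

// Hypothetical row type, used only for this illustration.
case class Person(id: Long, name: String)

object KeyedComparisonSketch {

  // True when no `id` value occurs more than once in the dataset.
  def hasUniqueKeys(ds: Dataset[Person]): Boolean =
    ds.groupBy(col("id")).count().filter(col("count") > 1).isEmpty

  // Full outer join on the key. A pair counts as a mismatch when one side
  // is missing (null) or when the two rows with the same key differ.
  def countMismatches(actual: Dataset[Person], expected: Dataset[Person]): Long = {
    val joined = actual.joinWith(expected, actual("id") === expected("id"), "full_outer")
    joined
      .filter((pair: (Person, Person)) =>
        pair._1 == null || pair._2 == null || pair._1 != pair._2)
      .count()
  }
}
```

An assertion built on this would fail when `countMismatches` is non-zero, and could call `hasUniqueKeys` first when a flag like `checkKeyUniqueness` is set. The two inputs are assumed to be built independently, since joining a dataset with itself on `actual("id") === expected("id")` would be ambiguous.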

zeotuan · 2025-01-04T01:24:57Z

benchmarks/src/main/scala/com/github/mrpowers/spark/fast/tests/DatasetComparerBenchmark.scala

TODO: add a performance benchmark for typed vs. column filters.
In theory, filtering by column should allow better query plan generation.
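For reference, the two filter styles being compared might look like the sketch below. `Event`, `FilterStyleSketch`, `typedFilter`, and `columnFilter` are hypothetical names, not part of this PR or its benchmark, and the example assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

// Hypothetical row type, used only for this illustration.
case class Event(id: Long, value: Int)

object FilterStyleSketch {

  // Typed filter: the predicate is an arbitrary Scala function, so the
  // optimizer cannot inspect it and rows are turned into Event objects first.
  def typedFilter(ds: Dataset[Event]): Dataset[Event] =
    ds.filter((e: Event) => e.value > 10)

  // Column filter: the predicate is a Column expression that the optimizer
  // can analyze, for example to push it closer to the data source.
  def columnFilter(ds: Dataset[Event]): Dataset[Event] =
    ds.filter(col("value") > 10)
}
```

Both variants return the same rows; calling `explain(true)` on each result is one way to see how the plans differ before measuring them in a benchmark.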

zeotuan added 2 commits December 8, 2024 19:56

assertLargeDatasetEqualityV2

8e470d7

use assert API for benchmark

3e6b254

zeotuan force-pushed the nonRddAssert branch from c51d164 to 3e6b254 · December 8, 2024 10:43

Add benchmark with join column

90ec1e1

zeotuan requested a review from alfonsorr December 9, 2024 22:00

zeotuan and others added 4 commits December 10, 2024 18:35

Add benchmark with multiple join column

37eca14

Add benchmark with multiple join column

ba7c5cb

Improve equal comparison option

cd173ef

Fix assert not work on DF with single column

a528229

zeotuan force-pushed the nonRddAssert branch from e6caa63 to a528229 · December 28, 2024 04:10

Typed outer join

9c589f4

zeotuan force-pushed the nonRddAssert branch from 0a93bc2 to 9c589f4 · January 1, 2025 05:22

Use builtin joinWith

6d68eb7

zeotuan force-pushed the nonRddAssert branch from 0f92a08 to 6d68eb7 · January 2, 2025 04:19
