
Data deduplication #44

Open

wants to merge 2 commits into main
Conversation

Collaborator
@Ciroye Ciroye commented Jul 27, 2021

Add

  • All the necessary functions to run the deduplication pipeline

@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Collaborator

@galv galv left a comment

You are missing a BUILD file for deduplicate.py. It's fine to put deduplicate.py in galvasr2/, but let's talk about BUILD file creation before merging this.
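For reference, a minimal BUILD entry for deduplicate.py might look like the sketch below. This is only an illustration of the kind of file the comment is asking for: the rule name, the dependency label for align_lib, and how pip packages are wired in are assumptions about this repo's Bazel setup, not taken from the PR.

```starlark
# Hypothetical galvasr2/BUILD entry for deduplicate.py (labels are guesses).
py_library(
    name = "deduplicate",
    srcs = ["deduplicate.py"],
    deps = [
        "//galvasr2/align/spark:align_lib",  # assumed existing target
        # datasketch / nltk pip requirements would also be listed here,
        # in whatever form this workspace uses for pip dependencies.
    ],
)
```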

@@ -0,0 +1,123 @@
import logging
from galvasr2.align.spark.align_lib import load_audio_id_text_id_mapping, load_transcripts
from datasketch import MinHash, MinHashLSH, MinHashLSHForest
Collaborator

can you add the appropriate packages to environment.yml? I don't think we have datasketch or nltk right now.
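For context on what datasketch's MinHash is doing in this pipeline, here is a stdlib-only sketch of the idea: each document's token set is compressed into a fixed-length signature of per-seed minimum hash values, and the fraction of matching signature slots estimates Jaccard similarity. The function names below are illustrative, not the datasketch API.

```python
import hashlib


def minhash_signature(tokens, num_perm=64):
    """Build a MinHash signature: for each of num_perm seeded hash
    functions, keep the minimum hash value seen over the token set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two near-duplicate transcripts score high; unrelated ones score low.
a = "the quick brown fox jumps over the lazy dog".split()
b = "the quick brown fox jumped over a lazy dog".split()
sim = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

datasketch's MinHashLSH then buckets these signatures so that near-duplicates can be found without comparing every pair.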

logging.getLogger("py4j").setLevel(logging.ERROR)
catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
Collaborator

I think you should delete "Comment this out to load everything.".

Okay to keep the note about how long loading takes. By the way, is that an old comment? My expectation was that our spark 3.1.2 upgrade fixed the slowdown with loading transcripts.

catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
if self.num_rows > 1:
Collaborator

I am not enthusiastic about self.num_rows == 1 being a special case. I would recommend declaring num_rows: Optional[int] = None in __init__ instead. Then you can use if self.num_rows is not None: as the condition here.
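The suggested pattern looks like the sketch below. The class and method names are illustrative, not from the PR; only the Optional[int] default and the is-not-None check are what the comment proposes.

```python
from typing import Optional


class Deduplicator:
    def __init__(self, num_rows: Optional[int] = None):
        # None means "no limit": process every row.
        self.num_rows = num_rows

    def maybe_limit(self, rows):
        # Explicit None check, so a limit of 1 is not a special case.
        if self.num_rows is not None:
            return rows[: self.num_rows]
        return rows
```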
