
Data deduplication #44

Open

wants to merge 2 commits into main
Conversation

Collaborator
@Ciroye Ciroye commented Jul 27, 2021

Add

  • All the necessary functions to run the deduplication pipeline

@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Collaborator

@galv galv left a comment

You are missing a BUILD file for deduplicate.py. It's fine to put deduplicate.py in galvasr2/, but let's talk about BUILD file creation before merging this.
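For reference, a minimal BUILD entry for deduplicate.py might look like the sketch below. This is only an illustration of the kind of file the comment is asking for: the rule name, the dependency label for align_lib, and how pip packages are wired in are assumptions about this repo's Bazel setup, not taken from the PR.

```starlark
# Hypothetical galvasr2/BUILD entry for deduplicate.py (labels are guesses).
py_library(
    name = "deduplicate",
    srcs = ["deduplicate.py"],
    deps = [
        "//galvasr2/align/spark:align_lib",  # assumed existing target
        # datasketch / nltk pip requirements would also be listed here,
        # in whatever form this workspace uses for pip dependencies.
    ],
)
```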

@@ -0,0 +1,123 @@
import logging
from galvasr2.align.spark.align_lib import load_audio_id_text_id_mapping, load_transcripts
from datasketch import MinHash, MinHashLSH, MinHashLSHForest
Collaborator

can you add the appropriate packages to environment.yml? I don't think we have datasketch or nltk right now.
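For context on what datasketch's MinHash is doing in this pipeline, here is a stdlib-only sketch of the idea: each document's token set is compressed into a fixed-length signature of per-seed minimum hash values, and the fraction of matching signature slots estimates Jaccard similarity. The function names below are illustrative, not the datasketch API.

```python
import hashlib


def minhash_signature(tokens, num_perm=64):
    """Build a MinHash signature: for each of num_perm seeded hash
    functions, keep the minimum hash value seen over the token set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two near-duplicate transcripts score high; unrelated ones score low.
a = "the quick brown fox jumps over the lazy dog".split()
b = "the quick brown fox jumped over a lazy dog".split()
sim = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

datasketch's MinHashLSH then buckets these signatures so that near-duplicates can be found without comparing every pair.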

logging.getLogger("py4j").setLevel(logging.ERROR)
catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
Collaborator

I think you should delete "Comment this out to load everything.".

Okay to keep the note about how long loading takes. By the way, is that an old comment? My expectation was that our spark 3.1.2 upgrade fixed the slowdown with loading transcripts.

catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
if self.num_rows > 1:
Collaborator

I am not enthusiastic about self.num_rows == 1 being a special case. I would recommend declaring num_rows: Optional[int] = None in __init__ instead. Then you can use if self.num_rows is not None: as the condition here.
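The suggested pattern looks like the sketch below. The class and method names are illustrative, not from the PR; only the Optional[int] default and the is-not-None check are what the comment proposes.

```python
from typing import Optional


class Deduplicator:
    def __init__(self, num_rows: Optional[int] = None):
        # None means "no limit": process every row.
        self.num_rows = num_rows

    def maybe_limit(self, rows):
        # Explicit None check, so a limit of 1 is not a special case.
        if self.num_rows is not None:
            return rows[: self.num_rows]
        return rows
```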
