Data deduplication #44
base: main
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
You are missing a BUILD file for deduplicate.py. It's fine to put deduplicate.py in galvasr2/, but let's talk about BUILD file creation before merging this.
@@ -0,0 +1,123 @@
import logging
from galvasr2.align.spark.align_lib import load_audio_id_text_id_mapping, load_transcripts
from datasketch import MinHash, MinHashLSH, MinHashLSHForest
can you add the appropriate packages to environment.yml? I don't think we have datasketch or nltk right now.
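For reference, the requested `environment.yml` change might look like the fragment below. The channel/pip placement is an assumption; the maintainers would pick whatever matches the existing file.

```yaml
# Illustrative additions to environment.yml; exact channels and pins are
# assumptions, not taken from this repository.
dependencies:
  - nltk
  - pip:
      - datasketch
```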
logging.getLogger("py4j").setLevel(logging.ERROR)
catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
I think you should delete "Comment this out to load everything.".
Okay to keep the note about how long loading takes. By the way, is that an old comment? My expectation was that our spark 3.1.2 upgrade fixed the slowdown with loading transcripts.
catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
if self.num_rows > 1:
I am not enthusiastic about self.num_rows == 1 being a special case. I would recommend declaring num_rows: Optional[int] = None in __init__ instead. Then you can use if self.num_rows is not None: as the condition here.
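The suggested pattern can be sketched as follows. The class name, the `maybe_limit` helper, and the attribute usage are hypothetical stand-ins for the PR's actual code; only the `Optional[int] = None` signature and the `is not None` check come from the review comment.

```python
from typing import Optional

class Deduplicator:
    # num_rows=None means "process everything"; an int limits the sample.
    # (Class and method names are illustrative, not from the PR.)
    def __init__(self, num_rows: Optional[int] = None):
        self.num_rows = num_rows

    def maybe_limit(self, rows):
        # Explicit None check instead of treating a sentinel integer
        # value as a special case.
        if self.num_rows is not None:
            return rows[: self.num_rows]
        return rows

print(Deduplicator().maybe_limit([1, 2, 3]))            # [1, 2, 3]
print(Deduplicator(num_rows=2).maybe_limit([1, 2, 3]))  # [1, 2]
```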