Module 3 Preprocessing_Project - Suggestion to improve idempotency #36

jellomoat · 2024-07-08T01:16:10Z

LOCATION
Preprocessing_Project.ipynb
Phrase Modeling

ISSUE
The following code block is not idempotent: the first line builds tokens from df["pp_text"], and the last line overwrites df["pp_text"] with the phrased text. If the block is executed more than once, each model's phrasegram count decreases with every run, because later runs train on text whose phrases have already been joined.

# Create bigram and trigram models
tokens = [doc.split(" ") for doc in df['pp_text']]

bigram = Phrases(tokens, min_count=10, threshold=100)
trigram = Phrases(bigram[tokens], min_count=10, threshold=50)
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# Form trigrams
df['pp_text'] = [' '.join(trigram_phraser[bigram_phraser[doc]]) for doc in tokens]

DETAIL
This issue can be replicated and verified by running the above code block, executing len(bigram_phraser.phrasegrams.keys()), and repeating the process.

Writing the result on the last line to a new column, rather than back to pp_text, would likely resolve this issue.
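The effect can be illustrated without gensim. The sketch below is a simplified stand-in, not the notebook's code: learn_pairs and apply_pairs are hypothetical toy replacements for Phrases and Phraser, a plain dict of lists stands in for the DataFrame, and pp_text_phrased is just an example name for the new column. It runs the same "cell" twice, once overwriting its input and once writing to a separate column.

```python
from collections import Counter


def learn_pairs(docs, min_count=2):
    """Toy stand-in for Phrases: adjacent token pairs seen at least min_count times."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def apply_pairs(tokens, pairs):
    """Toy stand-in for Phraser: join each learned pair with an underscore."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def run_cell(table, src, dst):
    """Mimic the notebook cell: read text from `src`, write phrased text to `dst`."""
    docs = [text.split(" ") for text in table[src]]
    pairs = learn_pairs(docs)
    table[dst] = [" ".join(apply_pairs(doc, pairs)) for doc in docs]
    return len(pairs)  # analogous to the phrasegram count


texts = ["new york is big", "i love new york", "new york never sleeps"]

# Overwriting the input column: the second run only sees "new_york",
# so the pair ("new", "york") can no longer be learned.
in_place = {"pp_text": list(texts)}
in_place_counts = [run_cell(in_place, "pp_text", "pp_text") for _ in range(2)]

# Writing to a separate (hypothetically named) column: every run sees the same input.
separate = {"pp_text": list(texts)}
separate_counts = [run_cell(separate, "pp_text", "pp_text_phrased") for _ in range(2)]

print(in_place_counts)   # [1, 0]
print(separate_counts)   # [1, 1]
```

In the notebook itself, the equivalent change would presumably be limited to the assignment target on the last line, leaving the model-building lines as they are.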
