Module 3 Preprocessing_Project - Suggestion to improve idempotency #36

Open
jellomoat opened this issue Jul 8, 2024 · 0 comments

LOCATION
Preprocessing_Project.ipynb
Phrase Modeling

ISSUE
The following code block is not idempotent: df["pp_text"] is read to build tokens on the first line, then overwritten with the phrased text on the last line. If the block is executed more than once, each run after the first trains on text whose phrases have already been joined, so the number of phrasegrams in each model decreases from run to run.

# Create bigram and trigram models
tokens = [doc.split(" ") for doc in df['pp_text']]

bigram = Phrases(tokens, min_count=10, threshold=100)
trigram = Phrases(bigram[tokens], min_count=10, threshold=50)  
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# Form trigrams
df['pp_text'] = [' '.join(trigram_phraser[bigram_phraser[doc]]) for doc in tokens]

DETAIL
This issue can be reproduced by running the code block above, checking len(bigram_phraser.phrasegrams.keys()), and then repeating both steps and comparing the counts.

Changing the column assigned on the last line from pp_text to a different name would likely resolve this issue, because the input column would then stay unchanged between runs.
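To illustrate why the output column matters, here is a minimal, dependency-free sketch. It does not use gensim: learn_phrases is a toy stand-in for Phrases that keeps adjacent token pairs by raw count rather than by gensim's scoring, a plain dict stands in for the DataFrame, and the column name pp_text_phrased is a hypothetical choice made for this example.

```python
from collections import Counter


def learn_phrases(docs, min_count=2):
    """Toy stand-in for gensim's Phrases: keep adjacent pairs seen >= min_count times."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def apply_phrases(doc, phrases):
    """Join each learned pair into one token, scanning left to right."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out


def run_cell(table, source_col, target_col):
    """One execution of the cell: read source_col, write target_col, return phrase count."""
    tokens = [text.split(" ") for text in table[source_col]]
    phrases = learn_phrases(tokens)
    table[target_col] = [" ".join(apply_phrases(doc, phrases)) for doc in tokens]
    return len(phrases)


texts = ["new york is big", "i love new york", "new york never sleeps"]

# Overwriting the input column: later runs see already-merged tokens.
overwrite = {"pp_text": list(texts)}
overwrite_counts = [run_cell(overwrite, "pp_text", "pp_text") for _ in range(3)]

# Writing to a separate (hypothetical) column: every run sees the same input.
separate = {"pp_text": list(texts)}
separate_counts = [run_cell(separate, "pp_text", "pp_text_phrased") for _ in range(3)]

print(overwrite_counts)  # [1, 0, 0]
print(separate_counts)   # [1, 1, 1]
```

In the overwrite case the pair ("new", "york") is found once, merged into new_york, and is therefore no longer available as a pair on the next run. With a separate output column the source text is never modified, so repeated runs give the same result.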
