ISSUE
The following code block is not idempotent: df["pp_text"] supplies the tokens on the first line and is then overwritten with the phrased output on the last line. On a second run the models are trained on text whose phrases have already been merged into single tokens, so the number of phrasegrams in each model decreases with every run.
# Create bigram and trigram models
tokens = [doc.split(" ") for doc in df['pp_text']]
bigram = Phrases(tokens, min_count=10, threshold=100)
trigram = Phrases(bigram[tokens], min_count=10, threshold=50)
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)
# Form trigrams
df['pp_text'] = [' '.join(trigram_phraser[bigram_phraser[doc]]) for doc in tokens]
DETAIL
To reproduce, run the code block above, evaluate len(bigram_phraser.phrasegrams.keys()), and repeat. The count drops on each run.
Writing the result to a new column on the last line, instead of back to pp_text, would likely resolve this issue.
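A minimal sketch of why the overwrite matters, using only the standard library. The helpers learn_bigrams, apply_bigrams, and run_cell, the column name pp_phrased, and the sample documents are hypothetical stand-ins for illustration. They are not gensim's Phrases or Phraser, and a plain dict stands in for the DataFrame.

```python
from collections import Counter


def learn_bigrams(token_lists, min_count=2):
    """Toy stand-in for a phrase model: adjacent pairs seen at least min_count times."""
    counts = Counter(
        pair for tokens in token_lists for pair in zip(tokens, tokens[1:])
    )
    return {pair for pair, n in counts.items() if n >= min_count}


def apply_bigrams(tokens, bigrams):
    """Merge each learned pair into one token joined by an underscore."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def run_cell(table, source, target):
    """Mimic the notebook cell: read `source`, write phrased text to `target`.

    Returns the number of phrases learned on this run.
    """
    tokens = [doc.split(" ") for doc in table[source]]
    bigrams = learn_bigrams(tokens)
    table[target] = [" ".join(apply_bigrams(doc, bigrams)) for doc in tokens]
    return len(bigrams)


docs = ["new york city", "new york state", "visit new york"]

# Overwriting the source column: the second run sees already-merged tokens.
overwrite = {"pp_text": list(docs)}
overwrite_counts = [run_cell(overwrite, "pp_text", "pp_text") for _ in range(2)]

# Writing to a separate column: every run sees the same input.
separate = {"pp_text": list(docs)}
separate_counts = [run_cell(separate, "pp_text", "pp_phrased") for _ in range(2)]

print(overwrite_counts, separate_counts)
```

With the source column left untouched, rerunning the cell learns the same phrases each time. The same idea should carry over to the notebook by assigning the joined trigram output to a column other than pp_text.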
LOCATION
Preprocessing_Project.ipynb
Phrase Modeling