Loading output from twitter-intact-stream failed #7

Open
TamHHM opened this issue Nov 25, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@TamHHM

TamHHM commented Nov 25, 2020

Hi, I used the crawler from twitter-intact-stream to collect tweets. I then uncompressed the output file, added the .jsonl extension, and loaded it with birdspotter. The following error occurred:

Extracting raw tweets: 6186it [00:03, 1872.15it/s]
Traceback (most recent call last):
  File "", line 1, in
  File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/BirdSpotter.py", line 56, in __init__
    self.extractTweets(path, tweetLimit = tweetLimit, embeddings=embeddings)
  File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/BirdSpotter.py", line 241, in extractTweets
    for temp_user, temp_tweet, temp_content, temp_description, temp_cascade in itertools.chain(*map(self.process_tweet, tqdm(raw_tweets, desc="Extracting raw tweets"))):
  File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/BirdSpotter.py", line 142, in process_tweet
    temp_text = (j['text'] if 'text' in j.keys() else j['full_text'])
KeyError: 'full_text'

@rohitram96
Collaborator

It looks like twitter-intact-stream manipulates the JSON object it receives from Tweepy, so its output doesn't follow the Tweet JSON specification supplied by Twitter. I'd recommend using twarc to rehydrate a list of the IDs from the data.

Note that not all the information birdspotter requires is in the JSON supplied by twitter-intact-stream.

@TamHHM
Author

TamHHM commented Nov 26, 2020

Thanks for your reply. Could you please be a bit more specific about how to prepare the input for birdspotter? As I understand your answer, birdspotter cannot use the output of twitter-intact-stream directly and it has to go through twarc first, is that right? If possible, could you provide some examples?

@rohitram96
Collaborator

On closer inspection, it looks like twitter-intact-stream does produce output in the correct format, assuming no post-processing has been done. I'll need to investigate further why one or more of the JSON objects don't have a "text" or "full_text" field.

@rohitram96
Collaborator

It looks like the twitter-intact-stream output contains lines that indicate rate limiting, such as

{"limit":{"track":283540,"timestamp_ms":"1483189188944"}}

Normally, birdspotter would ignore corrupted lines that aren't valid JSON, but this line is valid JSON.

The workaround at the moment is to filter out lines that look like the one above and then feed the result into birdspotter.

I think it would be better if birdspotter handled cases like this itself. I'm going to put a temporary fix in so that it ignores objects without a text or full_text field.
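A minimal sketch of that workaround, assuming the file holds one JSON object per line and that a rate-limit notice is an object whose only top-level key is "limit", as in the example above. The function names and file paths are hypothetical and are not part of birdspotter or twitter-intact-stream.

```python
import json


def is_limit_notice(line):
    """Return True if the line is a JSON object whose only top-level key is 'limit'."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        # Corrupted lines are left alone here; birdspotter already skips them.
        return False
    return isinstance(obj, dict) and set(obj) == {"limit"}


def filter_limit_notices(src_path, dst_path):
    """Copy src_path to dst_path, dropping rate-limit notices.

    Returns a (kept, dropped) tuple of line counts.
    """
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if is_limit_notice(line):
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    return kept, dropped
```

Calling `filter_limit_notices("tweets.jsonl", "tweets.filtered.jsonl")` (both paths are placeholders) would write a copy that can then be passed to birdspotter in place of the original file.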

I'll leave this open till that is implemented.

@rohitram96 rohitram96 added the enhancement New feature or request label Nov 27, 2020
@andrei-rizoiu
Member

Rate limit messages are normal when using the search API; they give the number of lost tweets. You would expect them when using other Twitter API tools too, so it would be good if birdspotter filtered them out automatically.

rohitram96 pushed a commit that referenced this issue Nov 27, 2020
@rohitram96
Collaborator

db93307 in the development branch should fix this problem, at least temporarily. There should probably be a more robust check of the JSON object to verify its validity, but we'll wait until there is more interest.
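As a rough illustration of what a stricter check could look like (this is not the code in db93307), the sketch below accepts an object only if it carries a text field, an ID, and a user object. The function name is hypothetical, and the choice of required fields is an assumption based on common Tweet JSON fields, not a statement of what birdspotter actually needs.

```python
def looks_like_tweet(obj):
    """Heuristic check that a parsed JSON value resembles a tweet object.

    Assumption: a usable tweet has 'text' or 'full_text', an 'id' or
    'id_str', and a nested 'user' object.
    """
    if not isinstance(obj, dict):
        return False
    has_text = isinstance(obj.get("text"), str) or isinstance(obj.get("full_text"), str)
    has_id = "id_str" in obj or "id" in obj
    has_user = isinstance(obj.get("user"), dict)
    return has_text and has_id and has_user
```

Under these assumptions, a rate-limit notice such as the one quoted earlier is rejected because it has none of the three fields.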
