Sampling the public Twitter stream
Unlike in previous years, participants in the TREC 2013 microblog track will not be able to crawl a copy of the official corpus. Participants will instead access the corpus through a REST API (details will be made available elsewhere).
Should you wish to gather a parallel sample of tweets from the Twitter public stream covering the same time period as the official corpus, you may do so using the cc.twittertools.stream.GatherStatusStream tool.
IMPORTANT: Crawling your own copy of the tweets is not required for participation in the TREC 2013 microblog track, but it may be helpful.
Accessing the Twitter public stream with the GatherStatusStream tool requires creating a developer account on Twitter and obtaining OAuth credentials to access Twitter's API. After creating an account on the Twitter developer site, you can obtain these credentials by creating an "application". After you've created an application, create an access token by clicking on the button "Create my access token".
In order to run GatherStatusStream, you must save your Twitter API OAuth credentials in a file named twitter4j.properties in your current working directory. See this page for more information about Twitter4j configurations. The file should contain the following (replace the ********** instances with your information):
oauth.consumerKey=**********
oauth.consumerSecret=**********
oauth.accessToken=**********
oauth.accessTokenSecret=**********
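Before starting a long crawl, it can save time to sanity-check the credentials file. The sketch below is not part of the tool; it is a minimal Python check (the check_properties name is our own) that parses the simple key=value Java properties format and confirms all four OAuth keys are present:

```python
# Sketch: verify that twitter4j.properties contains the four OAuth keys
# required by GatherStatusStream. The properties format is simple
# "key=value" lines, so a minimal parser suffices.
REQUIRED_KEYS = {
    "oauth.consumerKey",
    "oauth.consumerSecret",
    "oauth.accessToken",
    "oauth.accessTokenSecret",
}

def check_properties(path="twitter4j.properties"):
    found = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything that isn't key=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            found[key.strip()] = value.strip()
    missing = REQUIRED_KEYS - found.keys()
    if missing:
        raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    return found
```

This catches the most common failure mode (a missing or misnamed key) before the crawler starts and silently fails to authenticate.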
Once you have created the twitter4j.properties file, you can begin sampling from the public stream using the following invocation:
target/appassembler/bin/GatherStatusStream
The tool will download JSON statuses continuously until it is stopped. Statuses are saved in the current working directory and compressed hourly. It is recommended that you run the crawler from a server with a good network connection; crawling from EC2 is a good choice.
As an example of what you'd expect in a crawl, consider data from 2013/01/23 (times in UTC):
$ du -h statuses.log.2013-01-23-*
79M statuses.log.2013-01-23-00.gz
84M statuses.log.2013-01-23-01.gz
87M statuses.log.2013-01-23-02.gz
90M statuses.log.2013-01-23-03.gz
78M statuses.log.2013-01-23-04.gz
64M statuses.log.2013-01-23-05.gz
54M statuses.log.2013-01-23-06.gz
50M statuses.log.2013-01-23-07.gz
48M statuses.log.2013-01-23-08.gz
50M statuses.log.2013-01-23-09.gz
57M statuses.log.2013-01-23-10.gz
68M statuses.log.2013-01-23-11.gz
80M statuses.log.2013-01-23-12.gz
89M statuses.log.2013-01-23-13.gz
96M statuses.log.2013-01-23-14.gz
93M statuses.log.2013-01-23-15.gz
85M statuses.log.2013-01-23-16.gz
77M statuses.log.2013-01-23-17.gz
73M statuses.log.2013-01-23-18.gz
72M statuses.log.2013-01-23-19.gz
79M statuses.log.2013-01-23-20.gz
87M statuses.log.2013-01-23-21.gz
88M statuses.log.2013-01-23-22.gz
84M statuses.log.2013-01-23-23.gz
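The 24 hourly files above total roughly 1.8 GB, so plan on about 2 GB of compressed output per day. A short sketch for totaling one day's output, assuming the statuses.log.YYYY-MM-DD-HH.gz naming shown above (the daily_size_mb helper is our own):

```python
import glob
import os

def daily_size_mb(day, directory="."):
    """Sum the sizes (in MB) of the hourly gzipped status logs for one day.

    Assumes the statuses.log.YYYY-MM-DD-HH.gz naming used by
    GatherStatusStream; `day` is a string like "2013-01-23".
    """
    pattern = os.path.join(directory, f"statuses.log.{day}-*.gz")
    return sum(os.path.getsize(p) for p in glob.glob(pattern)) / 1e6
```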
Here are per-hour counts for the same day:
2013-01-23 00 213981 22021
2013-01-23 01 226296 20615
2013-01-23 02 232266 21520
2013-01-23 03 240487 21694
2013-01-23 04 211955 22423
2013-01-23 05 175153 20096
2013-01-23 06 150733 20564
2013-01-23 07 132684 15812
2013-01-23 08 125808 13876
2013-01-23 09 127156 11929
2013-01-23 10 143035 12153
2013-01-23 11 169064 14078
2013-01-23 12 200296 16107
2013-01-23 13 222173 17709
2013-01-23 14 240975 20703
2013-01-23 15 237227 20556
2013-01-23 16 222692 22860
2013-01-23 17 205008 20898
2013-01-23 18 196170 22187
2013-01-23 19 197398 23250
2013-01-23 20 210420 21005
2013-01-23 21 228628 20463
2013-01-23 22 232572 25613
2013-01-23 23 219770 19348
The first column is the date, the second column is the hour (in UTC), the third column is the number of JSON messages, and the fourth column is the number of deletes.
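Counts like those above can be reproduced from the gzipped logs. The sketch below assumes one JSON message per line (as written by the crawler) and relies on the fact that delete notices in the Streaming API carry a top-level "delete" key; the count_statuses name is our own:

```python
import gzip
import json

def count_statuses(path):
    """Count (total JSON messages, delete notices) in one hourly log.

    Assumes one JSON message per line. Delete notices are identified
    by their top-level "delete" key, per the Twitter Streaming API.
    """
    messages = deletes = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            messages += 1
            if "delete" in json.loads(line):
                deletes += 1
    return messages, deletes
```

The two returned values correspond to the third and fourth columns of the table above.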
Notes:
- For the official corpus to be used in the TREC 2013 microblog evaluation, crawling of the public stream sample will commence on 2013/02/01 00:00:00 UTC and continue through the entire months of February and March, ending 2013/03/31 23:59:59 UTC.
- Not all JSON messages returned by the API correspond to tweets. In particular, some messages correspond to deleted tweets. See the Twitter Streaming API page for details.
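When processing a crawl, these non-tweet messages usually need to be filtered out. A minimal sketch, assuming one JSON message per line and that actual tweets carry top-level "id" and "text" fields (housekeeping messages such as delete notices do not); the iter_tweets name is our own:

```python
import gzip
import json

def iter_tweets(path):
    """Yield only tweet statuses from an hourly log, skipping delete
    notices and other non-tweet housekeeping messages."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            msg = json.loads(line)
            # Actual tweets have "id" and "text" at the top level;
            # messages like {"delete": ...} or {"limit": ...} do not.
            if "id" in msg and "text" in msg:
                yield msg
```

Because it is a generator, this can be chained over a month of hourly files without loading any of them fully into memory.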