Sampling the public Twitter stream

Jimmy Lin edited this page Jun 6, 2013 · 2 revisions

Unlike in previous years, participants in the TREC 2013 microblog track will not be able to crawl a copy of the official corpus. Participants will instead access the corpus through a REST API (details will be made available elsewhere).

Should you wish to gather a parallel sample of tweets from the Twitter public stream from the same time period as the official corpus, you may do so using the cc.twittertools.stream.GatherStatusStream tool.

IMPORTANT: Crawling your own copy of the tweets is not required for participation in the TREC 2013 microblog track, but it may be helpful.

Accessing the Twitter public stream with the GatherStatusStream tool requires creating a developer account on Twitter and obtaining OAuth credentials to access Twitter's API. After creating an account on the Twitter developer site, you can obtain these credentials by creating an "application". After you've created an application, create an access token by clicking on the button "Create my access token".

In order to run GatherStatusStream, you must save your Twitter API OAuth credentials in a file named twitter4j.properties in your current working directory. See this page for more information about Twitter4j configurations. The file should contain the following (replace the ********** instances with your information):

oauth.consumerKey=**********
oauth.consumerSecret=**********
oauth.accessToken=**********
oauth.accessTokenSecret=**********

Once you have created the twitter4j.properties file, you can begin sampling from the public stream using the following invocation:

etc/run.sh cc.twittertools.stream.GatherStatusStream

The tool will download JSON statuses continuously until it is stopped. Statuses are saved in the current working directory, and the output file is rotated and compressed hourly. It is recommended that you run the crawler from a server with a good network connection; crawling from EC2 is a good choice.
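Each hourly file contains one JSON message per line. As a sketch (the file name in the comment is taken from the example listing below and is only illustrative), the compressed output can be read directly with Python's standard library:

```python
import gzip
import json

def read_statuses(path):
    """Yield one parsed JSON message per line of a gzipped statuses file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example usage with an hourly file produced by GatherStatusStream:
# for message in read_statuses("statuses.log.2013-01-23-00.gz"):
#     if "text" in message:  # ordinary tweet (as opposed to a delete notice)
#         print(message["text"])
```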

As an example of what you'd expect from a crawl, consider data from 2013/01/23 (times in UTC):

 $ du -h statuses.log.2013-01-23-*
 79M	statuses.log.2013-01-23-00.gz
 84M	statuses.log.2013-01-23-01.gz
 87M	statuses.log.2013-01-23-02.gz
 90M	statuses.log.2013-01-23-03.gz
 78M	statuses.log.2013-01-23-04.gz
 64M	statuses.log.2013-01-23-05.gz
 54M	statuses.log.2013-01-23-06.gz
 50M	statuses.log.2013-01-23-07.gz
 48M	statuses.log.2013-01-23-08.gz
 50M	statuses.log.2013-01-23-09.gz
 57M	statuses.log.2013-01-23-10.gz
 68M	statuses.log.2013-01-23-11.gz
 80M	statuses.log.2013-01-23-12.gz
 89M	statuses.log.2013-01-23-13.gz
 96M	statuses.log.2013-01-23-14.gz
 93M	statuses.log.2013-01-23-15.gz
 85M	statuses.log.2013-01-23-16.gz
 77M	statuses.log.2013-01-23-17.gz
 73M	statuses.log.2013-01-23-18.gz
 72M	statuses.log.2013-01-23-19.gz
 79M	statuses.log.2013-01-23-20.gz
 87M	statuses.log.2013-01-23-21.gz
 88M	statuses.log.2013-01-23-22.gz
 84M	statuses.log.2013-01-23-23.gz
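The hourly sizes above add up to roughly 1.8 GB of compressed data per day, which gives a back-of-the-envelope storage estimate for the full February–March crawl (assuming, as a rough approximation, that this one day's volume is representative):

```python
# Hourly compressed sizes in MB for 2013-01-23, copied from the listing above.
hourly_mb = [79, 84, 87, 90, 78, 64, 54, 50, 48, 50, 57, 68,
             80, 89, 96, 93, 85, 77, 73, 72, 79, 87, 88, 84]

daily_mb = sum(hourly_mb)            # total compressed MB for one day
crawl_days = 28 + 31                 # February + March 2013
estimated_gb = daily_mb * crawl_days / 1024.0

print(daily_mb)               # prints 1812
print(round(estimated_gb))    # prints 104
```

So plan for on the order of 100 GB of compressed data for the full crawl period.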

Here are per-hour message counts from the same day:

 2013-01-23	00	213981	22021
 2013-01-23	01	226296	20615
 2013-01-23	02	232266	21520
 2013-01-23	03	240487	21694
 2013-01-23	04	211955	22423
 2013-01-23	05	175153	20096
 2013-01-23	06	150733	20564
 2013-01-23	07	132684	15812
 2013-01-23	08	125808	13876
 2013-01-23	09	127156	11929
 2013-01-23	10	143035	12153
 2013-01-23	11	169064	14078
 2013-01-23	12	200296	16107
 2013-01-23	13	222173	17709
 2013-01-23	14	240975	20703
 2013-01-23	15	237227	20556
 2013-01-23	16	222692	22860
 2013-01-23	17	205008	20898
 2013-01-23	18	196170	22187
 2013-01-23	19	197398	23250
 2013-01-23	20	210420	21005
 2013-01-23	21	228628	20463
 2013-01-23	22	232572	25613
 2013-01-23	23	219770	19348

The first column is the date, the second column is the hour (in UTC), the third column is the number of JSON messages, and the fourth column is the number of delete notices.
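Counts like those above can be reproduced from the hourly files themselves. The sketch below (the file name in the comment is illustrative) distinguishes delete notices, which the Streaming API emits as JSON objects with a top-level "delete" key, from ordinary tweets:

```python
import gzip
import json

def count_messages(path):
    """Return (total_messages, deletes) for one gzipped statuses file."""
    total = deletes = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            message = json.loads(line)
            total += 1
            if "delete" in message:  # delete notice, not a tweet
                deletes += 1
    return total, deletes

# Example usage:
# total, deletes = count_messages("statuses.log.2013-01-23-00.gz")
```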

Notes:

  • For the official corpus to be used in the TREC 2013 microblog evaluation, crawling of the public stream sample will commence on 2013/02/01 00:00:00 UTC and continue for the entire months of February and March, ending 2013/03/31 23:59:59 UTC.

  • Not all JSON messages returned by the API correspond to tweets. In particular, some messages correspond to deleted tweets. See the Twitter Streaming API page for details.