A simple and easy-to-use crawler for web sources (Facebook, Twitter, NodeBB, etc.)
Favara is a Siculo-Arabic word meaning "water source". Siculo-Arabic is now a dead language (IX-XIV century), but we believe the word favara sounds great and its meaning really reflects the purpose of the project.
- crawls posts and events from several sources, inserting them into a database
- only Facebook is supported at the moment
- A recent version of ruby and rubygems installed.
- A Postgres database, where Favara will put the crawled data.
- Clone this repo
- Install the dependencies via `bundle install`
- Configure the database. You can override the settings in `database.yml` using the following environment variables (see the example after this list):
  - FAVARA_DB_ADAPTER
  - FAVARA_DB_ENCODING
  - FAVARA_DB_POOL
  - FAVARA_DB_USERNAME
  - FAVARA_DB_PASSWORD
  - FAVARA_DB_HOST
  - FAVARA_DB_DATABASE
- Configure the sources in `config.yml`.
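For example, assuming the database variables are read from the environment at run time, you can override the connection settings by exporting them in your shell before running any favara task (the values below are just placeholders): `export FAVARA_DB_HOST=localhost FAVARA_DB_DATABASE=favara_dev FAVARA_DB_USERNAME=favara FAVARA_DB_PASSWORD=secret`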
You will then have to make a choice regarding the ownership of the database tables favara uses:
- If you want to run Favara with isamuni, let isamuni create the tables for you; no other configuration is required.
- You can edit the models in the `models` folder to reflect your tables' structure.
- You can ask favara to create the required tables via `rake create_tables`.
- If you run a Rails app, you can generate a new migration and then copy the contents of `migrations/001_init.rb` into it (see the sketch after this list).
- You can also create the required tables manually yourself, referring to `migrations/001_init.rb`.
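If you take the Rails route, the wrapper migration might look like the sketch below; the file and class names are arbitrary, and the actual table definitions must be copied from `migrations/001_init.rb`.

```ruby
# db/migrate/20240101000000_create_favara_tables.rb
# Hypothetical wrapper migration: generate it with
#   rails generate migration CreateFavaraTables
# and paste the table definitions from migrations/001_init.rb into change.
class CreateFavaraTables < ActiveRecord::Migration[7.0]  # adjust to your Rails version
  def change
    # ...contents of migrations/001_init.rb go here...
  end
end
```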
- Run favara by issuing `rake favara` to crawl only the latest contents.
- Run `rake "favara[true]"` to crawl all posts from all sources.
- Run `clockwork clock.rb` to leave favara running and automatically crawl the latest posts at regular intervals (the default configuration runs a complete crawl between 11pm and 5am); a sketch of a clock file is shown after this list.
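For reference, a clock file for the clockwork gem generally looks like the sketch below; the repository's own `clock.rb` may differ, and the intervals and job names here are only illustrative.

```ruby
# Illustrative clock file for the clockwork gem; the real clock.rb may differ.
require 'clockwork'

module Clockwork
  # Crawl the latest contents every 30 minutes (interval chosen only as an example).
  every(30 * 60, 'favara.latest') do
    system('rake favara')
  end

  # Run a complete crawl once a day, starting at 23:00 as in the default schedule.
  every(24 * 60 * 60, 'favara.full', at: '23:00') do
    system('rake "favara[true]"')
  end
end
```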
Favara is designed to import the crawled contents into a database. If that doesn't suit your needs, feel free to copy the files in `crawlers/lib/*`, which contain the database-independent logic, and use them as you would any other Ruby library.
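As a rough illustration only (every file, class, and method name below is hypothetical; check the actual files under `crawlers/lib/` for the real API):

```ruby
# Hypothetical usage sketch: the real file, class, and method names under
# crawlers/lib/ may differ from the ones used here.
require_relative 'crawlers/lib/facebook'                # hypothetical path

crawler = Favara::Facebook.new(page: 'example_page')   # hypothetical constructor
crawler.latest_posts.each do |post|                     # hypothetical method
  puts post
end
```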
We also provide a very thin Sinatra web service. It is not meant to be used in production, but it may come in handy for testing or diagnostics. To run it, simply run `ruby server.rb`, then point your browser to `localhost:4567`. You can check the crawled events under `/events` and the crawled posts under `/posts`.
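For a quick scripted check (assuming the server is running locally on the default port), something like the following will print the raw responses:

```ruby
# Minimal sketch: fetch the diagnostic endpoints and print a slice of each response.
require 'net/http'

%w[/events /posts].each do |path|
  body = Net::HTTP.get(URI("http://localhost:4567#{path}"))
  puts "#{path}: #{body[0, 200]}"   # show only the first 200 characters
end
```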