Home
The Sparkler Crawl Environment ("Sparkler CE") provides the tools to collect webpages from the Internet and make them available for Memex search tools. It includes the ability to:
- build a domain discovery model which will guide the data collection,
- crawl data from the Web, and
- output it in a format that is usable by Memex search tools.
Technically, Sparkler CE consists of a domain discovery tool and a Web crawler called Sparkler. Sparkler is inspired by Apache Nutch, runs on top of Apache Spark, and stores the crawled data in Apache Solr. To make it easy to install in multiple environments, Sparkler CE is configured as a multi-container Docker application. All of these technologies are open source and freely available for download, use, and improvement.
Sparkler CE can run on any machine or in any environment on which Docker can be installed.
You can use Docker Community Edition (CE) from https://www.docker.com/community-edition. Choose your platform from the download section and follow the instructions.
Note: you can also use Docker Enterprise Edition.
For a Linux install, you will also need to:
- Install Docker Compose from https://docs.docker.com/compose/. This is included in Docker for Mac, Docker for Windows, and Docker Toolbox, and so requires no additional installation for those environments.
- Set up docker so it can be managed by a non-root user. Those instructions are here: https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user.
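Once Docker is installed (and, on Linux, configured for non-root use), you can verify the setup with Docker's standard hello-world image. This is an optional sanity check, not part of Sparkler CE itself:
$ docker run hello-world
If the test image prints its welcome message without requiring sudo, Docker is ready for the steps below.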
Next, download Sparkler CE itself.
Option 1. Download the zip file from https://github.com/memex-explorer/sce/archive/master.zip and unzip it.
Option 2. If you are familiar with git and have it installed, clone the repository with the following command:
$ git clone https://github.com/memex-explorer/sce.git
Run the following command from within the sce directory. The script will download and install all of the dependencies for Sparkler CE to run and may take as long as 20 minutes depending on your Internet connection speed.
$ ./kickstart.sh
Congratulations! You've installed and launched the Sparkler Crawl Environment. This is now running as a service on your machine. In order to stop it from running but leave it installed, use the following command:
$ ./kickstart.sh -stop
If you wish to stop it from running and remove the containers so that you can completely uninstall it:
$ ./kickstart.sh -down
If you do stop it and remove the containers, you can start everything again with:
$ ./kickstart.sh -start
The logs that are created during running of kickstart.sh are stored in sce/logs/kickstart.log.
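To confirm that the services actually came up, you can issue a simple HTTP request against the Domain Discovery Tool. This is an optional check and assumes you are running locally on the default port used later in this guide (5000):
$ curl -I http://0.0.0.0:5000/explorer
An HTTP 200-series response indicates the explorer interface is reachable.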
The Domain Discovery Model determines the relevancy of webpages as they are collected from the Internet.
- Open the Domain Discovery Tool in your browser. The url is http://<domain.name>/explorer. If you are running this on your own computer and not remotely, it is http://0.0.0.0:5000/explorer. You should see a screen like this:
  [Screenshot: Domain Discovery Tool (DDTool0616.png)]
Find and Mark Relevant Web Pages
- Enter search terms in the box on the left and click the magnifying glass icon to return web pages relevant to your search.
Tip: Good search terms are keywords that are relevant to the domain and can be multiple keywords put together. Try different things. The more searches you perform and pages you mark the relevancy of, the more accurate the domain classifier will be.
- For each webpage shown, select whether it is Highly Relevant, Relevant, or Not Relevant to the domain:
  - Highly Relevant pages contain exactly the type of information that you want to collect.
  - Relevant pages are on the correct topic, but may not contain information that will help answer questions about your domain.
  - Not Relevant pages are self-explanatory.
Tip: The key to building a good Domain Discovery Model is to make sure that you have:
  - Included content that covers all the important areas of your domain.
  - Included at least 10 pages for each of the Highly Relevant, Relevant, and Not Relevant categories.
The seed file is the starting point for all of the data collection that Sparkler CE retrieves from across the Internet.
- Create this file in any text editor and save it as a .txt file (see the example below).
- Click the Upload Seed File button in the left hand side column and follow the instructions.
TECH NOTE: this seed file will be saved in the sce directory.
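For reference, the seed file is simply a plain-text list of the URLs you want the crawl to start from, one per line. A minimal example (the file name and URLs below are placeholders) might look like this:
$ cat seed_mydomain.txt
https://www.example.org/
https://www.example.com/listings/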
Data collection on the web is called crawling. Web crawling at its most basic consists of retrieving the content on web pages from a seed list. In addition to readable text, these pages also contain links to other pages, and so these links are then followed, and those pages collected as well. The content of each page is given a relevancy score by the Domain Discovery Model, which determines if the page's content is saved, and if the links on the page are followed.
In order to launch this process, simply click the "Start Crawl" link from the left hand side bar.
The crawl can also be launched from the command line from the sce directory with the following command:
$ ./sce.sh -sf seed_<your-domain>.txt
Note: this will launch a crawl that will run until you stop it. In order to run a short crawl, you can add the -i flag to the command, as shown below. More details are in the Additional Options for sce.sh section.
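For example, a short test crawl that stops after 10 iterations (the seed file name is a placeholder) could be launched with:
$ ./sce.sh -sf seed_<your-domain>.txt -i 10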
To see what is being collected in real time, view the dashboard by clicking the Launch Dashboard button or by visiting http://<domain.name>/banana. If you are running this on your own computer and not remotely, the url is http://0.0.0.0:8983/banana/.
In order to make the data that you've collected available to Memex Search Tools, it must be output in the right format: the Memex Common Data Repository schema version 3.1 (CDRv3.1). We call this process dumping the data.
Before dumping the data, stop the crawl with the "Stop Crawler" link in the left hand side bar of the /explorer interface. This may take up to 30 minutes to stop to ensure that no data is lost. If you are in a terrible rush, you can also hit the "Halt Crawler" link and everything will stop immediately, although recently collected pages are likely to be lost if you do this.
Running the following command in the sce directory will dump the data out of Sparkler CE's native database and upload it to a common data repository where it can be retrieved and used by the Memex Search Tools.
$ ./dumper.sh
Congratulations! You have now run through the basics of using the Sparkler Crawl Environment. For more information, and for special circumstances, check out the Technical Information section below.
In order to upgrade your Sparkler CE installation, follow these simple steps:
- Upgrade the Sparkler CE install script.
Option 1. Download the zip file from https://github.com/memex-explorer/sce/archive/master.zip and unzip it.
Option 2. If you installed using git, run the following command in the sce directory:
$ git pull
- Run the following command in the sce directory to upgrade all the dependencies:
$ ./kickstart.sh
This may take up to 20 minutes to complete.
Success!
In order to completely remove Sparkler CE from your computer, do the following:
1. First, stop all running containers:
$ ./kickstart.sh -down
2. Check whether any containers are still running:
$ docker ps
You should not see any running containers. If you do see any containers running, then run:
$ docker stop $(docker ps -aq)
3. Now delete all images on your machine:
$ docker rmi -f $(docker images -aq)
Check whether any images remain on your machine:
$ docker images
This should show 0 images.
4. After the above, you can go ahead and delete the sce directory:
$ sudo rm -rf sce
CAUTION: THIS WILL ALSO DELETE ALL YOUR CRAWL DATA.
To avoid deleting your crawl data, do not delete the sce directory. To re-install without deleting the sce directory, just run the following from the sce directory after step 3:
$ git pull --all
$ ./kickstart.sh
sce.sh has additional options that allow customization of the crawl being launched:
$ ./sce.sh -sf /path/to/seed -i num_iterations -id job_id [-l /path/to/log]
-sf specify the seed file path
-i  the number of iterations to run the crawl for. The first iteration
    collects all of the pages in the seed list. The second and successive iterations
    collect all of the links found on the pages from the previous round, and so on
    (with some limits to keep the round size reasonable). For test runs, start with 10
    iterations and then look at the data that was collected to make sure you like it.
-id name the job to make it easy to identify in the list of running processes
-l  specify the location of the log file
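As an illustration, a complete invocation that combines these options might look like the following (the seed file name, job id, and log path are placeholders):
$ ./sce.sh -sf seed_mydomain.txt -i 10 -id mydomain-test -l logs/sce.log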
While crawling, the data collected are continually indexed into an Apache Solr index at http://<domain.name>/solr. If you are running this on your own computer and not remotely, the url is http://0.0.0.0:8983/solr/#/.
This link will take you to directly view and access the raw data that is being collected. To see an overview in real time, use the Banana interface available from http://<domain.name>/banana. If you are running this on your own computer and not remotely, the url is http://0.0.0.0:8983/banana/.
While Sparkler CE is running, you would normally train the model first (see the "Domain Discovery" subsection of "First Run Through") and then run Sparkler to crawl the injected URLs through the "sce.sh" script (see the "Collect Web Content" subsection). However, you can also get into the sparkler container and use the tools described in the "Commands" section directly (see also the "Collect Web Content" and "Output to Search Tools" subsections of "First Run Through") for domain discovery purposes:
$ docker exec -it $(docker ps -a -q --filter="name=compose_sparkler_1") bash
root@9fb9b04ef5bd:/data#
You should not need to get into the sparkler container if you use the utility script named "sce.sh".
The sce.sh script manages all of this for you, but if you need more control, it is possible to access the commands that underlie the script. First, get into the Sparkler container as described above; you can then use the sparkler.sh bash script (located in the bin folder of the main Sparkler folder) to run the Sparkler crawler within the environment. Specifically, this section shows how to inject URLs, run crawls, and dump the crawl data by using the "sparkler.sh" script.
From the main folder of Sparkler, you can run bin/sparkler.sh to see the commands provided by Sparkler:
$ bin/sparkler.sh
Sub Commands:
inject : edu.usc.irds.sparkler.service.Injector
- Inject (seed) URLS to crawldb
crawl : edu.usc.irds.sparkler.pipeline.Crawler
- Run crawl pipeline for several iterations
dump : edu.usc.irds.sparkler.util.FileDumperTool
- Dump files in a particular segment dir
The bin/sparkler.sh inject command is used to inject URLs into Sparkler. This command provides the following options:
$ bin/sparkler.sh inject
-cdb (--crawldb) VAL : Crawl DB URI.
-id (--job-id) VAL : Id of an existing Job to which the urls are to be
injected. No argument will create a new job
-sf (--seed-file) FILE : path to seed file
-su (--seed-url) STRING[] : Seed Url(s)
Here is an example of injecting the Sparkler crawler with a file named "seed.txt" containing two URLs:
$ bin/sparkler.sh inject -sf ~/work/sparkler/seed.txt
2017-05-24 17:58:05 INFO Injector$:98 [main] - Injecting 2 seeds
>>jobId = sjob-1495673885495
You can also provide the URLs to inject with the -su option directly within the command line. Furthermore, you can add more URLs to the crawl database by updating an existing job with the -id option.
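For example, a hypothetical follow-up injection that adds one more URL to the job created above would look like this:
$ bin/sparkler.sh inject -id sjob-1495673885495 -su http://www.example.org/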
Launch a crawl
The bin/sparkler.sh crawl command is used to run a crawl against the URLs previously injected. This command provides the following options:
$ bin/sparkler.sh crawl
Option "-id (--id)" is required
-aj (--add-jars) STRING[] : Add sparkler jar to spark context
-cdb (--crawldb) VAL : Crawdb URI.
-fd (--fetch-delay) N : Delay between two fetch requests
-i (--iterations) N : Number of iterations to run
-id (--id) VAL : Job id. When not sure, get the job id from
injector command
-ke (--kafka-enable) : Enable Kafka, default is false i.e. disabled
-kls (--kafka-listeners) VAL : Kafka Listeners, default is localhost:9092
-ktp (--kafka-topic) VAL : Kafka Topic, default is sparkler
-m (--master) VAL : Spark Master URI. Ignore this if job is started
by spark-submit
-o (--out) VAL : Output path, default is job id
-tg (--top-groups) N : Max Groups to be selected for fetch..
-tn (--top-n) N : Top urls per domain to be selected for a round
Here is an example of crawling the URLs injected for a given job identifier (e.g., sjob-1495673885495) in local mode using only one iteration:
$ bin/sparkler.sh crawl -id sjob-1495673885495 -m local[*] -i 1
The bin/sparkler.sh dump command is used to dump out the crawled data. This command provides the following options:
$ bin/sparkler.sh dump
Option "-i (--input)" is required
--mime-stats : Use this to skip dumping files matching the
provided mime-types and dump the rest
--skip : Use this to skip dumping files matching the
provided mime-types and dump the rest
-i (--input) VAL : Path of input segment directory containing the
part files
-m (--master) VAL : Spark Master URI. Ignore this if job is started
by spark-submit
-mf (--mime-filter) STRING[] : A space separated list of mime-type to dump i.e
files matching the given mime-types will be
dumped, default no filter
-o (--out) VAL : Output path for dumped files
Here is an example of dumping out the data that have been crawled within a path (e.g., sjob-1495673885495/20170524183747) in local mode:
$ bin/sparkler.sh dump -i sjob-1495673885495/20170524183747 -m local[*]
All of the scripts save their logs in the sce/logs directory so they can be reviewed later by using your favorite text editor. In case you need to see the log messages while the environment is running, you can do so with the following command:
$ tail -f /path/to/log
where the -f option causes tail to not stop when end of file is reached, but rather to wait for additional data to be appended to the input.
- kickstart.sh: generates sce/logs/kickstart.log
- sce.sh: generates sce/logs/sce.log
- dumper.sh: generates sce/logs/dumper.log
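For example, to follow the crawler's log while a crawl is in progress (path taken from the list above):
$ tail -f sce/logs/sce.log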
Once the installation procedure has completed, the "kickstart.sh" script automatically starts the docker containers in the background (i.e., detached mode) and checks if all of them are properly running.
You can also use the docker images command to list the top-level images and confirm that they are all present. For Sparkler CE, you should see the following:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
sujenshah/sce-domain-explorer latest 5fe5e4586eec 13 hours ago 1.53 GB
sujenshah/sce-sparkler latest 00e0e46a0ae6 14 hours ago 2.44 GB
selenium/standalone-firefox-debug latest d7b329a44b94 6 weeks ago 705 MB
Furthermore, you can use the docker ps command to check that the containers have been built, created, started and attached for a service:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9fb9b04ef5bd sujenshah/sce-sparkler "/bin/sh -c '/data..." 34 hours ago Up 34 hours 0.0.0.0:8983->8983/tcp compose_sparkler_1
c4d7c48332ad selenium/standalone-firefox-debug "/opt/bin/entry_po..." 34 hours ago Up 34 hours 0.0.0.0:4444->4444/tcp, 0.0.0.0:9559->5900/tcp compose_firefox_1
a255097415ea sujenshah/sce-domain-explorer "python run.py" 34 hours ago Up 34 hours 0.0.0.0:5000->5000/tcp compose_domain-discovery_1
The services are started through the docker-compose up command, which is automatically executed by the "kickstart.sh" script.
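If one of the expected containers is missing from the docker ps listing, you can inspect that container's output with docker logs, using the container name shown above, for example:
$ docker logs compose_sparkler_1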
If anything should glitch when you're using the tool, the easiest way to get things going again is to stop the docker containers and then start them again. This will preserve any data that you've collected, including your domain discovery model and your collected web data. Do that with the following commands:
$ ./kickstart.sh -stop
$ ./kickstart.sh -start
Or use the shorter form:
$ ./kickstart.sh -restart
If, however, you want to remove your data and just start over (your crawl data will still be preserved), you can do that by bringing everything down and then back up:
$ ./kickstart.sh -down
$ ./kickstart.sh -up