Apache Accumulo Wikipedia Search Example
----------------------------------------

This project contains a sample application for ingesting and querying
Wikipedia data. It is a fork of Apache/accumulo-wikisearch, with the goal of
being simpler to set up and use.

Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running.
2. One or more Wikipedia dump files
   (http://dumps.wikimedia.org/backup-index.html) must be placed in an HDFS
   directory. You will want the files whose link name ends in
   pages-articles.xml.bz2.
3. Though not strictly required, ingest will go more quickly if the files
   are decompressed first (see the staging sketch at the end of this README):

       $ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml

INSTRUCTIONS
------------

Configuration and Build
-----------------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and
   edit it to specify your Accumulo information (a hedged sketch of this
   file appears in the examples below). For parallel ingest, instead copy
   ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml.
2. Copy webapp/src/main/resources/app.properties.example to
   webapp/src/main/resources/app.properties and edit its contents as in
   step 1.
3. From the wikisearch directory, run:

       $ mvn package

Ingest
------
1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it.
2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to
   $ACCUMULO_HOME/lib/ext.
3. Run bin/ingest.sh with one argument: the name of the HDFS directory where
   the Wikipedia XML files reside. This starts a MapReduce job that ingests
   the data into Accumulo (see the ingest sketch below). For parallel
   ingest, instead run ingest/bin/ingest_parallel.sh.

Query
-----
1. Copy the following jars from the query/target/dependency directory to
   $ACCUMULO_HOME/lib/ext:

       commons-jexl-*.jar
       guava-*.jar
       kryo-*.jar
       minlog-*.jar

2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext.
3. In the Accumulo shell, give the user authorizations for the wikis that
   you loaded, for example:

       setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki

4. cd into webapp and run:

       $ mvn jetty:run

5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/.
   You can issue queries through this user interface or via the REST URL
   <host>/accumulo-wikisearch/rest/query (see the query sketch below).
6. Press Ctrl-C to stop the Jetty container.
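Example: Staging a Dump in HDFS
-------------------------------
As a worked version of prerequisites 2 and 3, the commands below fetch one
English dump file and stage it in HDFS, decompressing on the fly. The dump
date (20200101) and the target directory /wikipedia are assumptions;
substitute your own.

    # Download one dump file (the date shown is hypothetical; pick a real dump).
    $ wget https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles.xml.bz2
    # Create the target HDFS directory.
    $ hadoop fs -mkdir -p /wikipedia
    # Decompress on the fly and stream the XML into HDFS via stdin.
    $ bunzip2 < enwiki-20200101-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml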
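Example: wikipedia.xml Sketch
-----------------------------
A minimal sketch of ingest/conf/wikipedia.xml, assuming the property names
used by the upstream example file (wikipedia.accumulo.* and
wikipedia.ingest.partitions); verify them against your copy of
wikipedia.xml.example, since this fork may differ. All values shown are
placeholders.

    <?xml version="1.0"?>
    <configuration>
      <!-- Accumulo connection details (placeholder values). -->
      <property>
        <name>wikipedia.accumulo.zookeepers</name>
        <value>zkhost1:2181,zkhost2:2181</value>
      </property>
      <property>
        <name>wikipedia.accumulo.instance_name</name>
        <value>accumulo</value>
      </property>
      <property>
        <name>wikipedia.accumulo.user</name>
        <value>root</value>
      </property>
      <property>
        <name>wikipedia.accumulo.password</name>
        <value>secret</value>
      </property>
      <!-- Table the ingest job writes to. -->
      <property>
        <name>wikipedia.accumulo.table</name>
        <value>wikipedia</value>
      </property>
      <!-- Number of partitions the ingest job writes across. -->
      <property>
        <name>wikipedia.ingest.partitions</name>
        <value>1</value>
      </property>
    </configuration>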
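Example: Ingest Sketch
----------------------
A condensed sketch of the Ingest section, assuming the tarball was already
copied to a cluster node and the dump XML sits in /wikipedia (both
assumptions carried over from the staging sketch above).

    # Unpack the ingest distribution.
    $ tar xzf wikisearch-ingest-*.tar.gz
    $ cd wikisearch-ingest-*
    # Make the ingest code and its protobuf dependency visible to Accumulo.
    $ cp lib/wikisearch-ingest-*.jar lib/protobuf-java-*.jar $ACCUMULO_HOME/lib/ext
    # Launch the MapReduce job over the dump files staged in HDFS.
    $ bin/ingest.sh /wikipedia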
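Example: Query Sketch
---------------------
A sketch of the query-side setup and one REST call, using bash. The
setauths command runs inside the Accumulo shell (here via its -e flag,
which prompts for the user's password). The curl parameter names (query,
auths) and the JEXL field name TEXT are assumptions; check the webapp
source if the call fails.

    # Copy the query dependencies and the query jar to Accumulo.
    $ cp query/target/dependency/{commons-jexl,guava,kryo,minlog}-*.jar $ACCUMULO_HOME/lib/ext
    $ cp query/target/wikisearch-query-*.jar $ACCUMULO_HOME/lib/ext
    # Grant the querying user authorizations for the wikis that were loaded.
    $ $ACCUMULO_HOME/bin/accumulo shell -u root -e "setauths -u root -s all,enwiki"
    # In one terminal: start the webapp.
    $ cd webapp && mvn jetty:run
    # In another terminal: issue a JEXL query over REST (parameter names assumed).
    $ curl "http://localhost:8080/accumulo-wikisearch/rest/query?query=TEXT=='apache'&auths=enwiki"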