Inverted Index

Given a directory of books in txt format, write a MapReduce job that outputs an inverted index, i.e., a table that associates each word with the books in which it occurs and its positions in them (http://en.wikipedia.org/wiki/Inverted_index).

Dataset URL: here

HDFS Cluster DataSet path: /user/hadoop/mapreduce/data/books

hint 1: Suggested output example (not real data):

Love alice_in_wonderland.txt:100,the_prince.txt:900,the_prince.txt:1050
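To make the target structure concrete, here is a minimal in-memory sketch of an inverted index that prints entries in the style of hint 1. It is not the MapReduce solution itself. The function names, the sample text, the tab separator, and the choice of a 0-based word offset as the "position" are all assumptions made for illustration; the task statement leaves them open.

```python
from collections import defaultdict


def build_inverted_index(books):
    """Map each word to a list of (filename, position) pairs.

    `books` is a dict of filename -> text. Position is the 0-based index
    of the word within its book (an assumption, not part of the task).
    """
    index = defaultdict(list)
    for filename in sorted(books):
        for position, word in enumerate(books[filename].split()):
            index[word].append((filename, position))
    return index


def format_entry(word, postings):
    """Render one index entry as 'word<TAB>file:pos,file:pos,...'."""
    joined = ",".join("{}:{}".format(name, pos) for name, pos in postings)
    return "{}\t{}".format(word, joined)


if __name__ == "__main__":
    # Made-up sample text, not taken from the dataset.
    sample = {
        "a.txt": "love is love",
        "b.txt": "all you need is love",
    }
    index = build_inverted_index(sample)
    for word in sorted(index):
        print(format_entry(word, index[word]))
```

The MapReduce version splits this work in two: the mapper emits one (word, filename:position) record per occurrence, and the reducer joins the records that share a word.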

hint 2: The mapper doesn't receive the filename as input. A Hadoop configured parameter, which Hadoop Streaming exposes to the mapper as an environment variable, could help to retrieve the filename from which each word comes.
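A possible mapper along the lines of hint 2 is sketched below. Hadoop Streaming exports job configuration values as environment variables with the dots replaced by underscores, so the input file is available as map_input_file on older releases and as mapreduce_map_input_file on newer ones; the sketch tries both. Everything else is an assumption for illustration: the python3 shebang, the function names, the tokenizing regex, the lowercasing of words, and above all the meaning of "position", which here is a running word count over the lines the task receives. That count matches the position in the book only when a whole file is handled by a single map task.

```python
#!/usr/bin/env python3
import os
import re
import sys

# Assumed tokenizer: runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def current_filename(environ=os.environ):
    """Return the base name of the file this map task is reading.

    Checks the newer variable name first, then the older one, and falls
    back to 'unknown' (e.g. when run outside Hadoop).
    """
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")
    return os.path.basename(path)


def map_lines(lines, filename):
    """Yield 'word<TAB>filename:position' records, one per word occurrence.

    Position is a running word count over `lines` (an assumption, see
    the note above the code).
    """
    position = 0
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield "{}\t{}:{}".format(word, filename, position)
            position += 1


if __name__ == "__main__":
    for record in map_lines(sys.stdin, current_filename()):
        print(record)
```

Keeping the logic in functions that take an iterable of lines makes it possible to test the mapper without a cluster by passing a plain list.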

To run this in a Hadoop environment, first set up the run_mapreduce function and the hs alias in ~/.bashrc:

# Usage: run_mapreduce <mapper> <reducer> <input_dir> <output_dir>
run_mapreduce() {
        hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-x.x.x.jar -mapper "$1" -reducer "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}

alias hs=run_mapreduce

Because ~/.bashrc runs automatically only when a new shell starts, evaluate the file manually to make the changes take effect in the current shell:

$ source ~/.bashrc

Once the alias has been set up, you can run the process as a MapReduce job using the aliased command hs.

For example:

hs mapper.py reducer.py forum_data inverted_index

where:

  • "forum_data" is the folder in HDFS containing the forum node text records
  • "inverted_index" is the output data folder; this folder must not already exist, or the job will fail.
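A reducer.py to pair with the command above could be sketched as follows. It assumes the mapper emits lines of the form word<TAB>filename:position, a record format chosen here for illustration rather than taken from the original. Hadoop Streaming sorts mapper output by key before it reaches the reducer, so all records for one word arrive on consecutive lines and can be grouped in a single pass. The order of the postings within a word is not guaranteed unless a secondary sort is configured.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def parse(line):
    """Split 'word<TAB>posting' into a (word, posting) pair."""
    word, _, posting = line.rstrip("\n").partition("\t")
    return word, posting


def reduce_lines(lines):
    """Yield 'word<TAB>posting,posting,...' for each run of equal words.

    Relies on the input being sorted by word, as Hadoop Streaming
    guarantees for reducer input.
    """
    records = (parse(line) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda record: record[0]):
        yield "{}\t{}".format(word, ",".join(posting for _, posting in group))


if __name__ == "__main__":
    for output_line in reduce_lines(sys.stdin):
        print(output_line)
```

Before submitting to the cluster, the pair can be checked locally by piping a sample book through the mapper, then sort, then the reducer, since sort stands in for the shuffle phase.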