Anil Shanbhag edited this page Sep 14, 2015 · 1 revision

Scenario 1: Input files are partitioned and distributed across the different machines. Ensure the files live in the same directory on every machine. Check scripts/fabfile.py and adapt the code to point to the right directories. Then run the following three commands:

fab bulk_sample_gen
fab create_robust_tree
fab write_partitions
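The per-machine layout the three tasks above rely on can be sketched in plain Python. This is a hypothetical illustration of the kind of settings scripts/fabfile.py expects; the variable names (HOSTS, DATA_DIR) and the part-file naming scheme are assumptions, not the repo's actual identifiers.

```python
import os.path

# Hypothetical configuration: every host keeps its input partition
# under the same directory path (DATA_DIR), as the setup requires.
HOSTS = ["node1", "node2", "node3"]   # hypothetical host names
DATA_DIR = "/data/input"              # hypothetical shared directory path

def partition_path(host_index):
    """Path of the input partition on a given machine (identical layout on every host)."""
    return os.path.join(DATA_DIR, "part-%05d" % host_index)

paths = [partition_path(i) for i in range(len(HOSTS))]
```

Because the directory is the same everywhere, each fab task can address every machine's partition with one path template instead of per-host configuration.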

Scenario 2: Input files are in HDFS. In this case, use the Spark shell to sample the data and write the sample to a file named sample. Then run:

fab create_robust_tree
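The sampling step above is essentially a uniform random sample of the input records. A minimal stdlib sketch of that idea is below, assuming newline-delimited records and an illustrative sampling rate; the real job would be done in the Spark shell against HDFS (e.g. with RDD sampling), not with this code.

```python
import random

def bernoulli_sample(lines, rate, seed=42):
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]

# Illustrative input: 10,000 synthetic records.
records = ["row-%d" % i for i in range(10000)]
sample = bernoulli_sample(records, rate=0.01)
# The resulting sample would be written to a file named `sample`,
# which create_robust_tree then reads.
```

The sample only needs to be representative, not exact, so an independent per-record coin flip is sufficient and parallelizes trivially across HDFS blocks.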

Writing out partitions by reading files from HDFS is currently unimplemented.
