This is a git repository for CS434-AdvancedProgramming at POSTECH.
This repository implements the final project which is distributed sorting using multiple servers.
You can find the implementation details here.
Among three teams Red, Blue, and Green, our team (Blue) achieved the best performance by a significant margin. You can find the final results here.
Clone the git repository.
git clone https://github.com/qwercxzsda/cs434-project.git
Run sbt assembly
.
cd cs434-project
sbt assembly
Two files master.jar
and worker.jar
will be created at the project root directory.
Run the following code.
java -jar master.jar [worker_number]
For example,
java -jar master.jar 1
Run the following code. Use 30962
for master_port
. See below for more information.
java -jar worker.jar [master_ip]:[master_port] -I [input_directory1] [input_directory2] ... -O [output_directory]
For examble,
java -jar worker.jar 2.2.2.142:30962 -I /home/blue/test2/input_dir -O /home/blue/test2/output_dir
Output will be created in the output_directory
as partition*
.
Disregard the tmp1
, tmp2
folders created in the output_directory
. These are not part of the output. They are temporary files used during the worker
execution.
master_port
is always assumed to be 30962
. If you use other number as master_port
it will just be disregarded.
For master and worker to run properly, port 30962
must be free(not in use). If you really need to change the port as port 30962
is already in use, change the variable NetworkConfig.port
in the file cs434-project/utils/src/main/scala/NetworkConfig.scala
and run sbt assembly
again(for both master
and worker
).
1 Master
4 Workers each with 20000 * 2490 = 49.8 M Records(49.8 M * 100 B = 4.98 GB)
Total 49.8 M * 4 = 199.2 M Records(199.2 M * 100 B = 19.92 GB)
1 Master
2 Workers each with 2 Records(2 * 100 B = 200 B)
Total 2 * 2 = 4 Records(4 * 100 B = 400 B)
1 Master
4 Workers, 1 directory, each directory with 100,000,000 = 100 M Records(100 M * 100 B = 10 GB)
Totla 100 M * 4 = 400 M Records(400 M * 100 B = 40 GB)
100,000,000 * 4 = 400 M Records
Created using the below code on each workers.
gensort -b[start_index] [number] partition0
For examble, on worker 3,
gensort -b200000000 100000000
99586002 + 94949919 + 107194027 + 98270052 = 400,000,000 = 400 M
All sorted, no duplicates
1 Master
8 Workers, 3 directories, each directory with 100,000 = 100 K Records(100 K * 100 B = 10 MB)
Total 100 K * 3 * 8 = 2.4 M Records(2.4 M * 100 B = 240 MB)
100,000 * 3 * 8 = 2,400,000 = 2.4 M Records
Created using the below code on each workers.
gensort -b[start_index] [number] partition0
For examble,
283952 + 295771 + 308125 + 307138 + 303761 + 305078 + 307586 + 288589 = 2,400,000 = 2.4 M
All sorted, no duplicates