## How To Run

You must have a Spark master and at least one worker running on your machine (unless you use local[*]). Follow the [Spark documentation](https://spark.apache.org/docs/3.0.3/spark-standalone.html#launching-spark-applications) to run them correctly.
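
With the standalone scripts shipped with Spark 3.0.3, starting them boils down to something like the following, run from the Spark installation directory (7077 is the default master port; adjust to your setup):
```
./sbin/start-master.sh
./sbin/start-slave.sh spark://<SPARK MASTER IP>:7077
```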

There are two methods for running the application; both require providing some arguments to the main class.

Mandatory arguments:
- **-m** is the address of the currently running Spark master, including the spark:// prefix (it is also possible to use local[*]).
- **-d** is the path of the file on which K-Means clustering will be performed; it must be a .seq file (see the [Support material](#support-material) section for downloading the .seq files).
- **-j** is the path of the support jar (generated by sbt package) that will be propagated to all the workers during execution so that they can run the job.

Optional arguments:
- **-cs** determines how the centroids will be chosen. If omitted, "first_n" is used.
- **-cn** sets the number of centroids to use.
- **-mr** selects the map-reduce version to run.
- **-ec** determines the end condition that terminates the computation ("max" if omitted).
- **-sl** enables Spark logging; omit it to disable Spark logs and read only the program output.

### Run Method 1: sbt run

First compile the code and generate the support jar via sbt using:
```
sbt compile
sbt package
```
The jar will be generated at ./target/scala-2.12/app_2.12-1.0.jar

Then you can run the program (arguments can be placed in any order):
```
sbt "run -m=<SPARK MASTER IP> -d=<FILE PATH> -j=<SUPPORT JAR PATH> -cs=<CENTROIDS SELECTOR> -cn=<CENTROIDS NUMBER> -mr=<MAP-REDUCE VERSION> -ec=<END CONDITION> -sl"
```
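
For instance, assuming a master at spark://192.168.1.10:7077 and a points_1M.seq file in the project root (the address, file name, and argument values here are purely illustrative):
```
sbt "run -m=spark://192.168.1.10:7077 -d=./points_1M.seq -j=./target/scala-2.12/app_2.12-1.0.jar -cs=first_n -cn=10 -ec=max"
```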

### Run Method 2: running standalone jar

As in Method 1, first compile the code and generate the support jar via sbt using:
```
sbt compile
sbt package
```
The jar will be generated at ./target/scala-2.12/app_2.12-1.0.jar

Then run
```
sbt assembly
```
to create a standalone jar in ./target/scala-2.12/app-assembly-1.0.jar
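
The assembly task is provided by the sbt-assembly plugin; should it not already be configured, a typical project/plugins.sbt entry looks like this (the version shown is only indicative):
```
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```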

Then run the following (all arguments except the standalone jar path passed to -jar can be placed in any order):
```
java -jar <STANDALONE JAR PATH> -m=<SPARK MASTER IP> -d=<FILE PATH> -j=<SUPPORT JAR PATH> -cs=<CENTROIDS SELECTOR> -cn=<CENTROIDS NUMBER> -mr=<MAP-REDUCE VERSION> -ec=<END CONDITION> -sl
```
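
Again purely as an illustration (the master address and .seq file name are placeholders):
```
java -jar ./target/scala-2.12/app-assembly-1.0.jar -m=spark://192.168.1.10:7077 -d=./points_1M.seq -j=./target/scala-2.12/app_2.12-1.0.jar
```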

When passing the 10M-point file to -d, you will need to increase the Java heap size (1 GB by default) using the -Xmx option (2 GB should be enough; increase it if you want to be safe):

```
java -Xmx2048m -jar <STANDALONE JAR PATH> ...
```

In the [Support material](#support-material) section **both app-assembly-1.0.jar and app_2.12-1.0.jar are available for download**, so you can skip building the jars yourself.

## Versions implemented
### Centroids selection (-cs)
- **first_n**: selects the first -cn points of the data array as the initial centroids.
More details and results can be read in the **technical report**.

## Support material
In the [release section](https://github.com/Tale152/big_data_assignment_2/releases) of this repository, alongside the source code, it's possible to download:
- **app-assembly-1.0.jar**:<br />
Standalone jar described in the [how to run](#how-to-run) section.
- **app_2.12-1.0.jar**:<br />
Support jar described in the [how to run](#how-to-run) section.
- **seq_files.rar**:<br />
Archive containing three different **.seq files** that can be passed to the -d argument; the files contain 1, 2, and 10 million points to analyze.
- **vm_spark_hadoop.rar**:<br />
Archive containing a virtual machine with Spark and Hadoop already set up.
