## How To Run

You must have a Spark master and at least one worker running on your machine (unless you use local[*]). Follow the [Spark documentation](https://spark.apache.org/docs/3.0.3/spark-standalone.html#launching-spark-applications) to run them correctly.
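
With the standalone scripts shipped with Spark 3.0.3, starting them boils down to something like the following, run from the Spark installation directory (7077 is the default master port; adjust to your setup):
```
./sbin/start-master.sh
./sbin/start-slave.sh spark://<SPARK MASTER IP>:7077
```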

There are two methods for running the application; both require providing some arguments to the main class.

Mandatory arguments:
- **-m** is the address of the currently running Spark master, including the spark:// prefix (it is also possible to use local[*]).
- **-d** is the path of the file on which K-Means clustering will be performed; it must be a .seq file (see the [Support material](#support-material) section for downloading the .seq files).
- **-j** is the path of the support jar (generated by sbt package) that will be propagated to all the workers during execution so that they can run the job.

Optional arguments:
- **-cs** determines how the centroids will be chosen. If omitted, "first_n" is used.
- **-cn** sets the number of centroids to use.
- **-mr** selects the map-reduce version to run.
- **-ec** determines the end condition that terminates the computation ("max" if omitted).
- **-sl** enables Spark logging; omit it to disable Spark logs and read only the program output.

### Run Method 1: sbt run

First compile the code and generate the support jar via sbt using:
```
sbt compile
sbt package
```
The jar will be generated at ./target/scala-2.12/app_2.12-1.0.jar

Then you can run the program (arguments can be placed in any order):
```
sbt "run -m=<SPARK MASTER IP> -d=<FILE PATH> -j=<SUPPORT JAR PATH> -cs=<CENTROIDS SELECTOR> -cn=<CENTROIDS NUMBER> -mr=<MAP-REDUCE VERSION> -ec=<END CONDITION> -sl"
```
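
For instance, assuming a master at spark://192.168.1.10:7077 and a points_1M.seq file in the project root (the address, file name, and argument values here are purely illustrative):
```
sbt "run -m=spark://192.168.1.10:7077 -d=./points_1M.seq -j=./target/scala-2.12/app_2.12-1.0.jar -cs=first_n -cn=10 -ec=max"
```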

### Run Method 2: running standalone jar

As in Method 1, first compile the code and generate the support jar via sbt using:
```
sbt compile
sbt package
```
The jar will be generated at ./target/scala-2.12/app_2.12-1.0.jar

Then run
```
sbt assembly
```
to create a standalone jar in ./target/scala-2.12/app-assembly-1.0.jar
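
The assembly task is provided by the sbt-assembly plugin; should it not already be configured, a typical project/plugins.sbt entry looks like this (the version shown is only indicative):
```
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```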

Then run the following (all arguments except the standalone jar path passed to -jar can be placed in any order):
```
java -jar <STANDALONE JAR PATH> -m=<SPARK MASTER IP> -d=<FILE PATH> -j=<SUPPORT JAR PATH> -cs=<CENTROIDS SELECTOR> -cn=<CENTROIDS NUMBER> -mr=<MAP-REDUCE VERSION> -ec=<END CONDITION> -sl
```
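
Again purely as an illustration (the master address and .seq file name are placeholders):
```
java -jar ./target/scala-2.12/app-assembly-1.0.jar -m=spark://192.168.1.10:7077 -d=./points_1M.seq -j=./target/scala-2.12/app_2.12-1.0.jar
```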

When passing the 10M-point file to -d, you will need to increase the Java heap size (1 GB by default) using the -Xmx option (2 GB should be enough; increase it if you want to be safe):

```
java -Xmx2048m -jar <STANDALONE JAR PATH> ...
```

In the [Support material](#support-material) section **both app-assembly-1.0.jar and app_2.12-1.0.jar are available for download**, so you can skip building the jars yourself.

## Versions implemented
### Centroids selection (-cs)
- **first_n**: selects the first -cn points of the data array as the initial centroids.
More details and results can be read in the **technical report**.

## Support material
In the [release section](https://github.com/Tale152/big_data_assignment_2/releases) of this repository, alongside the source code, it's possible to download:
- **app-assembly-1.0.jar**:<br />
Standalone jar described in the [how to run](#how-to-run) section.
- **app_2.12-1.0.jar**:<br />
Support jar described in the [how to run](#how-to-run) section.
- **seq_files.rar**:<br />
Archive containing three different **.seq files** that can be passed to the -d argument; the files contain 1, 2, and 10 million points to analyze.
- **vm_spark_hadoop.rar**:<br />
Archive containing a virtual machine with Spark and Hadoop already set up.
