COM6012 Scalable Machine Learning 2023 by Haiping Lu at The University of Sheffield
Accompanying lectures: YouTube video lectures recorded in Year 2020/21.
- Task 1: To finish in the lab session on 10th Feb. Critical
- Task 2: To finish in the lab session on 10th Feb. Critical
- Task 3: To finish in the lab session on 10th Feb. Essential
- Task 4: To finish in the lab session on 10th Feb. Essential
- Task 5: To finish by the following Wednesday 15th Feb. Exercise
- Task 6: To explore further. Optional
Suggested reading:
- Spark Overview
- Spark Quick Start (Choose Python rather than the default Scala)
- Chapters 2 to 4 of PySpark tutorial (several sections in Chapter 3 can be safely skipped)
- Reference: PySpark documentation
- Reference: PySpark source code
Note - Please READ before proceeding:
- HPC nodes are shared resources (like buses/trains) relying on considerate usage of every user. When requesting resources, if you ask for too much (e.g. 50 cores), it will take a long time to get allocated, particularly during "rush hours" (e.g. close to deadlines) and once allocated, it will leave much less for the others. If everybody is asking for too much, the system won't work and everyone suffers.
- We have five nodes (each with 40 cores, 768GB RAM) reserved for this module. You can specify
-P rse-com6012
(e.g. afterqrshx
) to get access. However, these nodes are not always more available, e.g. if all of us are using it. There are 100+ regular nodes, many of which may be idle. - Please follow ALL steps (step by step without skipping) unless you are very confident in handling problems by yourself.
- Please try your best to follow the study schedule above to finish the tasks on time. If you start early/on time, you will find your problems early so that you can make good use of the labs and online sessions to get help from the instructors and teaching assistants to fix your problems early, rather than getting panic close to an assessment deadline. Based on our experience from the past five years, rushing towards an assessment deadline in this module is likely to make you fall, sometimes painfully.
Unless you are on the campus network, you MUST first connect to the university's VPN.
Follow the official instruction from our university. I have get your HPC account created already due to the need of this module. You have been asked to complete and pass the HPC Driving License test by Thursday 9th Feb. If you have not done so, please do it as soon as possible.
Use your university username such as abc18de
and the associated password to log in. You are required to use Multi-factor authentication (MFA) to connect to VPN. If you have problem logging in, do the following in sequence:
- Check the Frequently Asked Questions to see whether you have a similar problem listed there, e.g.
bash-4.x$
being displayed instead of your username at the bash prompt. - Come to the labs on Fridays and office hours on Mondays to get in-person help and online sessions on Wednesdays for online help.
Following the official instructions for Windows or Mac OS/X and Linux to open a terminal and connect to sharc via SSH by
ssh $USER@sharc.shef.ac.uk # Use lowercase for your username, without `$`
You need to replace $USER
with your username. Let's assume it is abc1de
, then you do ssh [email protected]
(using lowercase and without $
). If successful, you should see
[abc1de@sharc-login1 ~]$
abc1de
should be your username.
- You can save the host, username (and password if your computer is secure) as a Session if you want to save time in future.
- You can edit
settings --> keyboard shortcuts
to customise the keyboard shortcuts, e.g. change the paste shortcut from the defaultShift + Insert
to our familiarCtrl + V
. - You can DRAG your file or folder to the left directory pane of MobaXterm.
- You can open multiple sessions (but do not open more than what you need as these are shared resources).
- YOu can directly open a file to edit and then save it.
- You can use VSCode to write and manage your code and scripts on HPC by following the VSCode Remote HPC instructions.
- After performing the steps in the above repo, you will be able to 1) start a remote code server on the HPC and 2) connect to it via your web browser and edit/manage your code with access to the remote filesystem on the HPC.
- Using VSCode via the browser provides similar functionality as a desktop VSCode installation but having some restrictions on the marketplace and extensions. See Why can't code-server use Microsoft's extension marketplace?.
NOTE: While using VScode provides a level of convenience, it is also good to get familiar with writing and managing code from the terminal using vim/nano.
Type qrshx
for a *regular- node or qrshx -P rse-com6012
for a com6012-reserved node. If successful, you should see
[abc1de@sharc-node*** ~]$ # *** is the node number
Otherwise, try qrshx
or qrshx -P rse-com6012
again. You will not be able to run the following commands if you are still on the login node.
module load apps/java/jdk1.8.0_102/binary
module load apps/python/conda
conda create -n myspark python=3.9.1
When you are asked whether to proceed, say y
. When seeing Please update conda by running ...
, do NOT try to update conda following the given command. As a regular user, you will NOT be able to update conda.
source activate myspark
The prompt says to use conda activate myspark
but it does not always work. You must see (myspark) [abc1de@sharc-nodeXXX ~]$
, i.e. (myspark) in front, before proceeding. Otherwise, you did not get the proper environment. Check the above steps.
pip install pyspark==3.3.1
When you are asked whether to proceed, say y
. You should see the last line of the output as
Successfully installed py4j-0.10.9.5 pyspark-3.3.1
[]py4j
](https://www.py4j.org/) enables Python programmes to Java objects. We need it because Spark is written in scala, which is a Java-based language.
pyspark
You should see spark version 3.3.1 displayed like below
......
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Python version 3.9.1 (default, Dec 11 2020 14:32:07)
Spark context Web UI available at http://sharc-node007.shef.ac.uk:4040
Spark context available as 'sc' (master = local[*], app id = local-1675603301275).
SparkSession available as 'spark'.
>>>
Bingo! Now you are in pyspark! Quit pyspark shell by Ctrl + D
.
You are expected to have passed the HPC Driving License test and become familiar with the HPC environment.
Terminal/command line: learn the basic use of the command line in Linux, e.g. use pwd
to find out your current directory.
Transfer files: learn how to transfer files to/from ShARC HPC. I recommend MobaXterm for Windows and FileZilla for Mac/Linux. In MobaXterm, you can drag and drop files between HPC and your local machine.
Line ending WARNING!!!: if you are using Windows, you should be aware that line endings differ between Windows and Linux. If you edit a shell script (below) in Windows, make sure that you use a Unix/Linux compatible editor or do the conversion before using it on HPC.
File recovery: your files on HPC are regularly backed up as snapshots so you could recover files from them following the instructions on recovering files from snapshots.
NOTE: You may skip this part 1.4.
This module focuses on the HPC terminal. You are expected to use the HPC terminal to complete the labs. ALL assessments use the HPC terminal.
Installation of PySpark on your own machine is more complicated than installing a regular python library because it depends on Java (i.e. not pure python). The following steps are typically needed:
- Install Java 8, i.e. java version 1.8.xxx. Most instructions online ask you to install Java SDK, which is heavier. *Java JRE- is lighter and sufficient for pyspark.
- Install Python 3.7+ (if not yet)
- Install PySpark 3.3.1 with Hadoop 2.7
- Set up the proper environments (see references below)
As far as I know, it is not necessary to install Scala.
Different OS (Windows/Linux/Mac) may have different problems. We provide some references below if you wish to try but it is *not required- and we can provide only very limited support on this task (i.e. we may not be able to solve all problems that you may encounter).
If you do want to install PySpark and run Jupyter Notebooks on your own machine, you need to complete the steps above with reference to the instructions below for your OS (Windows/Linux/Mac).
If you follow the steps in these references, be aware that they are not up to date so you should install the correct versions: Java 1.8, Python 3.7+, PySpark 3.3.1 with Hadoop 2.7. *Scala- is optional.
-
Windows: 1) Install Spark on Windows (PySpark) (with video) 2) How to install Spark on Windows in 5 steps.
-
Linux: 1) Install PySpark on Ubuntu (with video); 2)Installing PySpark with JAVA 8 on ubuntu 18.04
-
Mac: 1) Install Spark on Mac (PySpark) (with video); 2) Install Spark/PySpark on Mac
Here we provide detailed instructions only for Windows.
- Install Java
- Download
jre-8u...
and install Java 8 JRE. - Find the path for the installed Java under
Program files\Java\jre1.8.0_xxx
(replacexxx
with the number you see) and set two environment variables to know where to find Java:JAVA_HOME = C:\Progra~1\Java\jdk1.8.0_xxx
PATH += C:\Progra~1\Java\jdk1.8.0_xxx\bin
- Check: open a command prompt and type
java -version
. If you can see the version displayed, congratulations. Otherwise, check the above.
- Download
- Install Python
- Install Python 3.7+. Open a command and type
python --version
to check your version to be 3.6+.
- Install Python 3.7+. Open a command and type
- Install PySpark (Alternatively, you may try
pip install pyspark==3.3.1
)- Download Spark 3.3.1 for Hadoop 2.7, i.e.
spark-3.3.1-bin-hadoop2.7.tgz
. - Extract the
.tgz
file (e.g. using 7zip) intoC:\Spark
so that extracted files are atC:\Spark\spark-3.3.1-bin-hadoop2.7
. - Set the environment variables:
SPARK_HOME = C:\Spark\spark-3.3.1-bin-hadoop2.7
PATH += C:\Spark\spark-3.3.1-bin-hadoop2.7\bin
- Download winutils.exe for hadoop 2.7 and move it to
C:\Spark\spark-3.3.1-bin-hadoop2.7\bin
- Set the environment variable:
HADOOP_HOME = C:\Spark\spark-3.3.1-bin-hadoop2.7
PYTHONPATH = %SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%
(just check what py4j version you have in yourspark/python/lib
folder to replace<version>
(source).
- Download Spark 3.3.1 for Hadoop 2.7, i.e.
Now open a command prompt and type pyspark
. You should see pyspark 3.3.1 running as above.
Known issue on Windows There may be a ProcfsMetricsGetter
warning. If you press Enter
, the warning will disappear. I did not find a better solution to get rid of it. It does not seem harmful either. If you know how to deal with it. Please let me know. Thanks. Reference 1; Reference 2; Reference 3.
From this point on, we will assume that you are using the HPC terminal unless otherwise stated. Run PySpark shell on your own machine can do the same job.
Once PySpark has been installed, after each log-in, you need to do the following to run PySpark.
-
Get a node via
qrshx
orqrshx -P rse-com6012
. -
Activate the environment by
module load apps/java/jdk1.8.0_102/binary module load apps/python/conda source activate myspark
Alternatively, put
HPC/myspark.sh
under your root directory (see above on how to transfer files) and run the above three commands in sequence viasource myspark.sh
(see more details here). You could modify it further to suit yourself better.
Run pyspark (optionally, specify to use multiple cores):
pyspark # pyspark --master local[4] for 4 cores
You will see the spark splash above. spark
(SparkSession) and sc
(SparkContext) are automatically created.
Check your SparkSession and SparkContext object and you will see something like
>>> spark
<pyspark.sql.session.SparkSession object at 0x2b3a2ad4c630>
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
Let us do some simple computing (squares)
>>> nums = sc.parallelize([1,2,3,4])
>>> nums.map(lambda x: x*x).collect()
[1, 4, 9, 16]
NOTE: Review the two common causes to the file not found
or cannot open file
errors below (line ending and relative path problems), and how to deal with them.
This example deals with Semi-Structured data in a text file.
Firstly, you need to make sure the file is in the proper directory and change the file path if necessary, on either HPC or local machine, e.g. using ``pwdto see the current directly,
ls` (or `dir` in Windows) to see the content. Also review how to transfer files to HPC and MobaXterm tips for Windows users.
Now quit pyspark by Ctrl + D
. Take a look at where you are
(myspark) [abc1de@sharc-node175 ~]$ pwd
/home/abc1de
abc1de
should be your username. Let us make a new directory called com6012
and go to it
mkdir com6012
cd com6012
Let us make a copy of our teaching materials at this directory via
git clone --depth 1 https://github.com/haipinglu/ScalableML
If ScalableML
is not empty (e.g. you have cloned a copy already), this will give you an error. You need to delete the cloned version (the whole folder) via rm -rf ScalableML
. Be careful that you can NOT undo this delete so make sure you do not have anything valuable (e.g. your assignment) there if you do this delete.
You are advised to create a separate folder for your own work under com6012
, e.g. mywork
.
Let us check
(myspark) [abc1de@sharc-node175 com6012]$ ls
ScalableML
(myspark) [abc1de@sharc-node175 com6012]$ cd ScalableML
(myspark) [abc1de@sharc-node175 ScalableML]$ ls
Code Data HPC Lab 1 - Introduction to Spark and HPC.md Output README.md Slides
(myspark) [abc1de@sharc-node175 ScalableML]$ pwd
/home/abc1de/com6012/ScalableML
You can see that files on the GitHub has been downloaded to your HPC directory /home/abc1de/com6012/ScalableML
. Now start spark shell by pyspark
(again you should see the splash) and now we
- read the log file
NASA_Aug95_100.txt
under the folderData
- count the number of lines
- take a look at the first line
>>> logFile=spark.read.text("Data/NASA_Aug95_100.txt")
>>> logFile
DataFrame[value: string]
>>> logFile.count()
100
>>> logFile.first()
Row(value='in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
You may open the text file to verify than pyspark is doing the right things.
Question: How many accesses are from Japan?
Now suppose you are asked to answer the question above. What do you need to do?
- Find those logs from Japan (by IP domain
.jp
) - Show the first 5 logs to check whether you are getting what you want.
>>> hostsJapan = logFile.filter(logFile.value.contains(".jp"))
>>> hostsJapan.show(5,False)
+--------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------+
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866|
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:22 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0 |
+--------------------------------------------------------------------------------------------------------------+
only showing top 5 rows
>>> hostsJapan.count()
11
Now you have used pyspark for some (very) simple data analytic task.
To run a self-contained application, you need to exit your shell, by Ctrl+D
first.
Create a file LogMining100.py
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[2]") \
.appName("COM6012 Spark Intro") \
.config("spark.local.dir","/fastdata/YOUR_USERNAME") \
.getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN") # This can only affect the log level after it is executed.
logFile=spark.read.text("Data/NASA_Aug95_100.txt").cache()
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()
print("\n\nHello Spark: There are %i hosts from Japan.\n\n" % (hostsJapan))
spark.stop()
Change YOUR_USERNAME
in /fastdata/YOUR_USERNAME
to your username. If you are running on your local machine, change /fastdata/YOUR_USERNAME
to a temporal directory such as C:\temp
.
Actually the file has been created for you under the folder Code
so you can just run it
spark-submit Code/LogMining100.py
You will see lots of logging info output such as
21/02/05 00:35:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/02/05 00:35:59 INFO SparkContext: Running Spark version 3.3.1
.....................
21/02/05 00:35:59 INFO ResourceUtils: Resources for spark.driver:
21/02/05 00:35:59 INFO ResourceUtils: ==============================================================
21/02/05 00:35:59 INFO SparkContext: Submitted application: COM6012 Spark Intro
.....................
21/02/05 00:36:03 INFO SharedState: Warehouse path is 'file:/home/abc1de/com6012/ScalableML/spark-warehouse'.
Hello Spark: There are 11 hosts from Japan.
The output is verbose so I did not show all (see Output/COM6012_Lab1_SAMPLE.txt
for the verbose output example). We can set the log level easily after sparkContext
is created but not before (it is a bit complicated). I leave two blank lines before printing the result so it is early to see.
Data: Download the August data in gzip (NASA_access_log_Aug95.gz) from NASA HTTP server access log (this file is uploaded to ScalableML/Data
if you have problems downloading, so actually it is already downloaded on your HPC earlier) and put into your Data
folder. NASA_Aug95_100.txt
above is the first 100 lines of the August data.
Question: How many accesses are from Japan and UK respectively?
Create a file LogMiningBig.py
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[2]") \
.appName("COM6012 Spark Intro") \
.config("spark.local.dir","/fastdata/YOUR_USERNAME") \
.getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN") # This can only affect the log level after it is executed.
logFile=spark.read.text("../Data/NASA_access_log_Aug95.gz").cache()
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()
hostsUK = logFile.filter(logFile.value.contains(".uk")).count()
print("\n\nHello Spark: There are %i hosts from UK.\n" % (hostsUK))
print("Hello Spark: There are %i hosts from Japan.\n\n" % (hostsJapan))
spark.stop()
Spark can read gzip file directly. You do not need to unzip it to a big file. Also note the use of cache() above.
See how to submit batch jobs to ShARC and follow the instructions for SGE. Reminder: The more resources you request, the longer you need to queue.
Interactive mode will be good for learning, exploring and debugging, with smaller data. For big data, it will be more convenient to use batch processing. You submit the job to the node to join a queue. Once allocated, your job will run, with output properly recorded. This is done via a shell script.
Create a file Lab1_SubmitBatch.sh
#!/bin/bash
#$ -l h_rt=6:00:00 # time needed in hours:mins:secs
#$ -pe smp 2 # number of cores requested
#$ -l rmem=8G # size of memory requested
#$ -o ../Output/COM6012_Lab1.txt # This is where your output and errors are logged
#$ -j y # normal and error outputs into a single file (the file above)
#$ -M [email protected] # notify you by email, remove this line if you don't want to be notified
#$ -m ea # email you when it finished or aborted
#$ -cwd # run job from current directory
module load apps/java/jdk1.8.0_102/binary
module load apps/python/conda
source activate myspark
spark-submit ../Code/LogMiningBig.py # .. is a relative path, meaning one level up
- Get necessary files on your ShARC.
- Start a session with command
qrshx
. - Go to the
HPC
directory to submit your job via theqsub
command (can be run at the login node). - The output file will be under
Output
.
cd HPC
qsub Lab1_SubmitBatch.sh # or qsub HPC/Lab1_SubmitBatch.sh if you are at /home/abc1de/com6012/ScalableML
Check your output file, which is COM6012_Lab1.txt
in the Output
folder specified with option -o
above. You can change it to a name you like. A sample output file named COM6012_Lab1_SAMPLE.txt
is in the GitHub Output
folder for your reference. The results are
Hello Spark: There are 35924 hosts from UK.
Hello Spark: There are 71600 hosts from Japan.
Common causes and fixes to file not found
or cannot open file
errors
-
Make sure that your
.sh
file, e.g.myfile.sh
, has Linux/Unix rather than Windows line ending. To check, do the following on HPC[abc1de@sharc-node004 HPC]$ file myfile.sh myfile.sh: ASCII text, with CRLF line terminators # Output
In the above example, it shows the file has "CRLF line terminators", which will not be recognised by Linux/Unix. You can fix it by
[abc1de@sharc-node004 HPC]$ dos2unix myfile.sh dos2unix: converting file myfile.sh to Unix format ... # Output
Now check again, and it shows no "CRLF line terminators", which means it is now in the Linux/Unix line endings and ready to go.
[abc1de@sharc-node004 HPC]$ file myfile.sh myfile.sh: ASCII text # Output
-
Make sure that you are at the correct directory and the file exists using
pwd
(the current working directory) andls
(list the content). Check the status of your queuing/ running job(s) usingqstat
(jobs not shown are finished already).qw
means the job is in the queue and waiting to be scheduled.eqw
means the job is waiting in error state, in which case you should check the error and useqdel JOB_ID
to delete the job.r
means the job is running. If you want to print out the working directory when your code is running, you would useimport os print(os.getcwd())
If you have verified that you can run the same command in interactive mode, but cannot run it in batch mode, it may be due to the environment you are using has been corrupted.
I suggest you to remove and re-install the environment. You can do this by
- Remove the
myspark
environment by runningconda remove --name myspark --all
, following conda's managing environments documentation and redo Lab 1 (i.e. install everything) to see whether you can run spark-submit in batch mode again. - If the above does not work, delete the
myspark
environment (folder) at/home/abc1de/.conda/envs/myspark
via the terminal folder window on the left of the screen on mobax term or use linux command. Then redo Lab 1 (i.e. install everything) to see whether you can run spark-submit in batch mode again. - If the above still does not work, you may have installed
pyspark==3.3.1
wrongly, e.g. before but not after activating themyspark
environment. If you made this mistake, when reinstallingpyspark==3.3.1
, you may be prompted withRequirement already satisfied: pyspark==3.3.1
andRequirement already satisfied: py4j==0.10.9.5
. To fix the problem, you can try unstallpyspark
andpy4j
before activatingmyspark
environment bypip uninstall pyspark==3.3.1
andpip uninstall py4j==0.10.9.5
and then activate themyspark
environment bysource activate myspark
and reinstall pyspark bypip install pyspark==3.3.1
.
The analytic task you are doing above is Log Mining. You can imaging nowadays, log files are big and manual analysis will be time consuming. Follow examples above, answer the following questions on NASA_access_log_Aug95.gz.
- How many requests are there in total?
- How many requests are from
gateway.timken.com
? - How many requests are on 15th August 1995?
- How many 404 (page not found) errors are there in total?
- How many 404 (page not found) errors are there on 15th August?
- How many 404 (page not found) errors from
gateway.timken.com
are there on 15th August?
You are encouraged to try out in the pyspark shell first to figure out the right solutions and then write a Python script, e.g. Lab1_exercise.py
with a batch file (e.g. Lab1_Exercise_Batch.sh
to produce the output neatly under Output
, e.g. in a file Lab1_exercise.txt
.
You are encouraged to explore these more challenging questions by consulting the pyspark.sql
APIs to learn more. We will not provide solutions but Session 2 will make answering these questions easier.
- How many unique hosts on a particular day (e.g., 15th August)?
- How many unique hosts in total (i.e., in August 1995)?
- Which host is the most frequent visitor?
- How many different types of return codes?
- How many requests per day on average?
- How many requests per host on average?
- Any other question that you (or your imagined clients) are interested in to find out.
- Compare the time taken to complete your jobs with and without
cache()
.
- Compare the time taken to complete your jobs with 2, 4, 8, 16, and 32 cores.
Many thanks to Twin, Will, Mike, Vamsi for their kind help and all those kind contributors of open resources.
The log mining problem is adapted from UC Berkeley cs105x L3.