COM6012 Scalable Machine Learning - University of Sheffield

Spring 2025

In this module, we will learn how to do machine learning at large scale using Apache Spark. We will use the University's High Performance Computing (HPC) cluster systems. If you are NOT on the University's network, you must use the VPN (Virtual Private Network) to connect to the HPC.

This edition uses PySpark 3.5.4, the latest stable release of Spark (Dec 20, 2024), and consists of the 10 sessions listed below. You can refer to the overview slides for more information, e.g. the timetable and assessment details.

Session 1: Introduction to Spark and HPC (Shuo Zhou)
Session 2: RDD, DataFrame, ML pipeline, & parallelization (Shuo Zhou)
Session 3: Scalable logistic regression and Spark configuration (Shuo Zhou)
Session 4: Scalable generalized linear models and Spark data types (Shuo Zhou)
Session 5: Scalable decision trees and ensemble models (Tahsin Khan)
Session 6: Scalable neural networks (Tahsin Khan)
Session 7: Scalable matrix factorisation for collaborative filtering in recommender systems (Tahsin Khan)
Session 8: Scalable k-means clustering and PCA for dimensionality reduction (Haiping Lu)
Session 9: Open-source software engineering practices for reproducible and reusable AI (Xianyuan Liu)
Session 10: Apache Spark in the Cloud (Xianyuan Liu)

You can also download the Spring 2024 version for preview or reference.

If you do not have one yet, we recommend that you sign up for a GitHub account to learn how to use this popular open-source software development platform.

Shuo Zhou and Haiping Lu developed the course An Introduction to Transparent Machine Learning, part of the Alan Turing Institute’s online learning courses in responsible AI. If you are interested, this introductory course, which emphasises transparency in machine learning, can support your learning of scalable machine learning.

The materials are built with references to the following sources:

The official Apache Spark documentation. Note: the latest information is here.
The PySpark tutorial by Wenqiang Feng, with PDF - Learning Apache Spark with Python. Also see the GitHub Project Page. Note: last updated in Dec 2022.
The Introduction to Apache Spark course by A. D. Joseph, University of California, Berkeley. Note: archived.
The book Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O'Reilly, by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee, with a GitHub repository.
The book Spark: The Definitive Guide by Bill Chambers and Matei Zaharia. There is also a repository for code from the book.

Many thanks to

Robert Loftin and Mauricio A Álvarez, who contributed to this module in 2024 and from 2016 to 2022, respectively. Their contributions remain reflected in the course materials.
Mike Croucher, Neil Lawrence, William Furnass, Twin Karmakharm, Mike Smith, Xianyuan Liu, Desmond Ryan, Steve Kirk, James Moore, and Vamsi Sai Turlapati for their input and inspiration since 2016.
Our teaching assistants and students who have contributed in many ways since 2017.
