This repository is a demonstration of linear regression solved using low level matrix and linear algebra operations in C and NumPy in Python.
- Quickstart in C
- Quickstart in Python
- Example usage
- Theory
- Gradient Descent
- Stochastic Gradient Descent
The C source code uses BLAS (Basic Linear Algebra Subsystems) libraries for matrix and vector operations. The implementation specifically used is OpenBLAS. I have included a helper script called setup_dependencies.sh
inside c/scripts
to download and build the specific versions of these dependencies locally.
To install all dependencies, run
cd c/scripts
source setup_dependencies.sh <NUMBER_OF_THREADS>
Building these requires a C/C++ compiler (gcc/clang
) and CMake installed. OpenBLAS takes minutes to build, and can be sped up by specifying more than one <NUMBER_OF_THREADS>
on your machine when sourcing the setup script.
Once dependencies are installed, source the build.sh
script to build the source and test files.
cd c/scripts
source build.sh <NUMBER_OF_THREADS>
Catch2 is used for testing. CMake downloads and builds Catch2 while also building tests. Testing is enabled by default.
Building Catch2 is also time-consuming. Specifying more than 1 <NUMBER_OF_THREADS>
will make compilation faster.
To run a multivariate linear regression problem with the default settings and stochastic gradient descent solver, please run
cd c/build
./run
Hyperparameters can be customized from the command line. For more details, please run ./run --help
.
To run unit tests, please run
cd c/build
ctest
Code has currently only been tested on Fedora Linux. The dependencies may or may not work as-is with other platforms. Only a Linux-based distribution is supported for now, so Windows users may look into Windows Subsystem for Linux.
The only dependencies are numpy
and matplotlib
. Python 3.10+ is supported for now.
To create a virtual environment, please run
python -m venv venv
If on Windows, run
venv\Scripts\activate.bat # For cmd.exe
venv\Scripts\Activate.ps1 # For PowerShell
If on a Mac or Unix-based OS, run
source venv/bin/activate
These commands will activate the virtual environment on your machine. Finally, run
pip install -r requirements.txt
to install all dependencies.
To run a multivariate linear regression problem using the default settings, please run
cd python
python main.py
Hyperparameters can be customized from the command line. They can be viewed using
python main.py --help
PyTest is used for unit testing. Tests can be run using
cd python
pytest
The hyperparameters that can be customized are visible in the following output of ./run --help
Usage: run [OPTION...]
A demonstration of linear regression using gradient descent and stochastic
gradient descent.
-b, --bias[=BIAS] Bias term
-B, --batch_size[=BATCH_SIZE] Batch size
-f, --test_frac[=TEST_FRAC] Fraction of data for test set
-i, --n_iter[=N_ITER] Number of iterations
-I, --noise intensity[=NOISE_INTENSITY]
Intensity of Gaussian noise to be added
-M, --n_samples=N_SAMPLES Number of samples
-n, --learning_rate[=LEARNING_RATE]
Learning rate for the gradient descent
-N, --n_features=N_FEATURES Number of features
-S, --seed[=SEED] Random number seed
-t, --tol[=TOL] Tolerance for convergence
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
The output with default hyperparameters of ./run
looks like this:
Arguments:
n_iter = 10000, tol = 0.001000,
n_features = 20, n_samples = 100000
bias = -300.700000, noise_intensity = 2.000000
learning_rate = 0.001000, batch_size = 32
test_frac = 0.200000, seed = 42
It. 100, loss = 225.8889
It. 200, loss = 202.3389
It. 300, loss = 181.5736
It. 400, loss = 163.2489
It. 500, loss = 147.0635
...
It. 9800, loss = 3.3934
It. 9900, loss = 3.3534
It. 10000, loss = 3.3150
Gradient descent took 28.313645 seconds.
MSE: 3.2978
MAE: 1.4716
R-squared: 0.9868
It. 100, loss = 201.0783
It. 200, loss = 180.0591
It. 300, loss = 161.5581
It. 400, loss = 145.2608
It. 500, loss = 130.8926
...
It. 9800, loss = 2.7880
It. 9900, loss = 2.7330
It. 10000, loss = 2.6799
Stochastic gradient descent took 6.151944 seconds.
MSE: 5.1999
MAE: 1.8582
R-squared: 0.9792
The same hyperparameters can be customized in the python code as well, as visible in the following output of python main.py --help
usage: Linear regression using SGD [-h] [--n_samples N_SAMPLES]
[--n_features N_FEATURES] [--noise NOISE]
[--bias BIAS] [--learning_rate LEARNING_RATE]
[--batch_size BATCH_SIZE] [--test_frac TEST_FRAC]
[--seed SEED] [--max_iter MAX_ITER] [--tol TOL]
A demonstration of linear regression using gradient descent in NumPy.
options:
-h, --help show this help message and exit
--n_samples N_SAMPLES
Number of samples
--n_features N_FEATURES
Number of features
--noise NOISE Intensity of Gaussian noise to be added
--bias BIAS Bias term
--learning_rate LEARNING_RATE
Learning rate for gradient descent
--batch_size BATCH_SIZE
Batch size
--test_frac TEST_FRAC
Fraction of data for test set
--seed SEED Random number seed
--max_iter MAX_ITER Maximum number of iterations
--tol TOL Tolerance for convergence
The output with default hyperparameters of python main.py
(same as that of the C code) looks like this:
INFO:root:100: loss = 1179.3173
INFO:root:200: loss = 507.3512
INFO:root:300: loss = 275.4007
INFO:root:400: loss = 170.9328
...
INFO:root:9900: loss = 4.1598
INFO:root:10000: loss = 4.1577
INFO:root:Gradient descent did not converge.
INFO:root:Gradient descent took 15.427083 seconds.
INFO:root:MSE = 4.1435
INFO:root:MAE = 1.6245
INFO:root:R2 = 1.0000
INFO:root:100: loss = 1318.9693
INFO:root:200: loss = 582.8164
INFO:root:300: loss = 379.9656
INFO:root:400: loss = 156.3315
...
INFO:root:9900: loss = 4.5390
INFO:root:10000: loss = 3.5210
INFO:root:Stochastic gradient descent did not converge.
INFO:root:Minibatch stochastic gradient descent took 0.221708 seconds.
INFO:root:MSE = 4.2095
INFO:root:MAE = 1.6373
INFO:root:R2 = 1.0000
Losses of both gradient descent (GD) and stochastic gradient descent (SGD) converge to similar values.
The general form of the multivariate linear regression model is:
where:
-
$\vec{y}$ is the dependent variable (the outcome we want to predict). -
$\vec{x_1}$ ,$\vec{x_2}$ ,$\dots$ ,$\vec{x_n}$ are the independent variables, each of dimension$m$ . -
$\beta_0$ is the y-intercept (the value of$y$ when all predictors are 0). -
$\beta_1$ ,$\beta_2$ ,$\dots$ ,$\beta_n$ are the coefficients (slopes) representing the change in$y$ for a unit change in each predictor. -
$\epsilon$ is the error term, representing the deviation of the observed values from the predicted values. -
$\vec{1_m}$ is the$m$ -dimensional column vector of ones.
- The relationship between the independent and dependent variables is linear.
- Observations are independent of each other.
- The residuals (errors) of the model are normally distributed.
- Independent variables are not too highly correlated with each other.
The coefficients
For a cost function
where:
-
$\vec{\theta}$ are the parameters of the model. -
$\alpha$ is the learning rate, a hyperparameter that determines the step size at each iteration. -
$\vec{\nabla} J(\vec{\theta})$ is the gradient of the cost function with respect to$\theta$ .
- Initialize the parameters
$\vec{\theta}$ (often with random values). - Compute the gradient
$\vec{\nabla} J(\vec{\theta})$ . - Update the parameters by moving in the direction of the negative gradient.
- Repeat the process until the change in the cost function is below a predefined threshold (convergence) or a maximum number of iterations is reached.
The update rule for Stochastic Gradient Descent is:
where:
-
$x^{i}$ ,$y^{i}$ represents a single training example. -
$\nabla J(\vec{\theta}; x^{i}, y^{i})$ is the gradient of the cost function with respect to$\vec{\theta}$ for that single example.
- Gather a batch of random examples from training set
$(x^{i}, y^{i})$ . - Compute the gradient for the random batch.
- Update the parameters
$\vec{\theta}$ using the computed gradient. - Repeat the process for a set number of epochs or until convergence.
Minibatch stochastic gradient descent is computationally less expensive (significantly) than gradient descent because of the reduced cost of calculating gradient in every epoch.