
EECS E6893 Big Data Analytics Final Project: Credit Card Fraud detection via Cluster based Scoring & Anomaly Detection

Team Members: Vedant Kumar (vrk2109), Siddharth Nijhawan (sn2951), Sushant Tiwari (st3425)

Description

This repository contains 4 Jupyter notebooks with end-to-end pipelines that apply iterative and clustering-based anomaly detection algorithms to the Credit Card Fraud Detection dataset.

Dataset is available here: https://www.kaggle.com/mlg-ulb/creditcardfraud

data_analysis.ipynb - performs initial data analysis by generating summary statistics for each feature (mean, standard deviation, minimum and maximum values, etc.). The notebook also plots a histogram for each feature and a correlation heatmap.
kmeans.ipynb - runs K-means clustering on the dataset to generate consistency scores using the following methodology:

The K-means algorithm is run 10 times.
Every run takes a bootstrapped sample, with features normalised to the range 0 to 1.
K is varied between 0 and 20, and the cluster indices, cluster centroids, and number of data points in each cluster are calculated.
Finally, a weighted score is computed for each data point, for each combination of its assigned clusters, by calculating dot products of the C centroids.
Precision-recall curves, ROC curves, AUPRC and AUROC values, and scatter plots are generated.
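
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it assumes scikit-learn's KMeans and MinMaxScaler, uses synthetic data in place of the Kaggle dataset, tries only a few values of K to stay fast, and swaps in a hypothetical score (the cluster-size-weighted distance of a point to all centroids) because the exact weighting used in kmeans.ipynb is not spelled out here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def bootstrap_kmeans_scores(X, k_values=(2, 3, 4), n_runs=10, seed=0):
    """Average a hypothetical centroid-based score over bootstrapped K-means runs."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    score_sum = np.zeros(n)
    score_count = np.zeros(n)
    for run in range(n_runs):
        # Bootstrap: sample row indices with replacement.
        idx = rng.integers(0, n, size=n)
        # Scale every feature of the sample to [0, 1].
        Xb = MinMaxScaler().fit_transform(X[idx])
        for k in k_values:
            km = KMeans(n_clusters=k, n_init=10, random_state=run).fit(Xb)
            centroids = km.cluster_centers_
            sizes = np.bincount(km.labels_, minlength=k)
            # Distance of every sampled point to every centroid, shape (n, k).
            dists = np.linalg.norm(Xb[:, None, :] - centroids[None, :, :], axis=2)
            # Hypothetical score (an assumption): distances weighted by cluster size,
            # so points far from the large clusters score high.
            score = dists @ (sizes / n)
            # Accumulate per original row; np.add.at handles repeated indices.
            np.add.at(score_sum, idx, score)
            np.add.at(score_count, idx, 1.0)
    # Rows never drawn in any bootstrap sample keep a score of 0.
    return score_sum / np.maximum(score_count, 1.0)


# Synthetic stand-in data: 200 "normal" rows and 5 far-away rows at the end.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 1.0, size=(200, 3)),
    data_rng.normal(8.0, 1.0, size=(5, 3)),
])
scores = bootstrap_kmeans_scores(X)
print("mean score, normal rows:", scores[:200].mean())
print("mean score, far-away rows:", scores[200:].mean())
```

A higher score here means a point sits far from where most of the data clusters; the resulting score vector can be passed to the usual precision-recall and ROC routines.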

isolation_forest.ipynb - runs the Isolation Forest algorithm on the dataset to generate anomaly scores using the following methodology:

The Isolation Forest algorithm is run 10 times.
Every run takes a bootstrapped sample and uses 100 trees.
Scikit-learn's built-in isolation forest class is used to build isolation trees on our dataset.
The decision_function() and predict() methods generate scores and predicted labels respectively.
The outlier fraction (ratio of fraudulent to non-fraudulent transactions) is passed to the isolation forest class.
Precision-recall curves, ROC curves, AUPRC and AUROC values, and scatter plots are generated.
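
A minimal sketch of this procedure is shown below. It uses scikit-learn's IsolationForest with synthetic data standing in for the Kaggle dataset. Passing the outlier fraction through the contamination parameter, fitting on the bootstrapped sample while scoring the full data, and the variable names are all assumptions made for illustration rather than details taken from isolation_forest.ipynb.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 normal rows (label 0) and 10 anomalous rows (label 1).
X = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 4)),
    rng.normal(6.0, 1.0, size=(10, 4)),
])
y = np.array([0] * 500 + [1] * 10)

# Outlier fraction as described above: fraudulent / non-fraudulent.
outlier_fraction = float(y.sum() / (y == 0).sum())

n_runs = 10
auprc, auroc = [], []
for run in range(n_runs):
    # Bootstrap: sample row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    model = IsolationForest(
        n_estimators=100,                # number of trees
        contamination=outlier_fraction,  # assumed mapping of the outlier fraction
        random_state=run,
    )
    model.fit(X[idx])
    # decision_function() is lower for more abnormal rows, so negate it
    # to obtain a score where higher means more anomalous.
    scores = -model.decision_function(X)
    preds = model.predict(X)  # -1 for outliers, +1 for inliers
    auprc.append(average_precision_score(y, scores))
    auroc.append(roc_auc_score(y, scores))

print("mean AUPRC over runs:", np.mean(auprc))
print("mean AUROC over runs:", np.mean(auroc))
```

Note that contamination only sets the threshold used by predict(); the ranking produced by decision_function(), and therefore AUPRC and AUROC, does not depend on it.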

local_outlier_factor.ipynb - runs the Local Outlier Factor algorithm on the dataset to generate anomaly scores using the following methodology:

The Local Outlier Factor (LOF) algorithm is run 10 times.
It computes LOF(X) = (average LRD of X’s neighbors) / LRD(X)
LRD(X) = Local Reachability Density (X) = 1 / (average reachability distance of X from its neighbors)
Scores and predictions are generated using the negative_outlier_factor_ attribute and the fit_predict() method of the LOF class.
The “Minkowski” distance metric is used, with the number of neighbors set to 20.
Precision-recall curves, histograms of the score distribution, and ROC curves are plotted.
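
To make the LOF and LRD formulas above concrete, here is a small from-scratch sketch in plain Python. It is not the scikit-learn implementation used in the notebook: the function name lof_scores is hypothetical, Euclidean distance is assumed, and ties at the k-th neighbor distance are ignored for simplicity.

```python
import math


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point, using k nearest neighbors.

    Assumes the points are distinct enough that no average reachability
    distance is zero (otherwise LRD would be undefined).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # LRD(X) = 1 / (average reachability distance of X from its neighbors),
    # where reachability(X, O) = max(k_distance(O), d(X, O)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(1.0 / (sum(reach) / k))

    # LOF(X) = (average LRD of X's neighbors) / LRD(X).
    return [sum(lrd[j] for j in neighbors[i]) / k / lrd[i] for i in range(n)]


# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the square get a LOF of about 1 (their density matches that of their neighbors), while the far-away point gets a LOF well above 1. scikit-learn reports the negated value in negative_outlier_factor_, so the most anomalous points there have the most negative scores.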

fgethell/EECS_E6893_Final_Project
