Malicious URL Detector built utilizing several data mining, machine learning and data science concepts, techniques and algorithms (PAs 1 and 2 from Applied Data Mining course - DCOMP - UFSJ).
All the project dependencies are listed in this section (languages, libraries, package managers, frameworks, ...), as well as the instructions to install each of them.
Python3 and pip package manager:
sudo apt install python3 python3-pip build-essential python3-dev
Node.js package manager - npm (Optional):
sudo apt-get install npm
scikit-learn library:
pip install -U scikit-learn
xgboost library:
pip install xgboost
mlxtend library:
pip install mlxtend
imbalanced-learn library:
pip install imbalanced-learn
pandas library:
pip install pandas
joblib library:
pip install joblib
Matplotlib library:
pip install matplotlib
seaborn library:
pip install seaborn
numpy library:
pip install numpy
Beautiful Soup library:
pip install beautifulsoup4
mechanize library:
pip install mechanize
Random User Agents library:
pip install random_user_agent
PyCryptodome library:
pip install pycryptodomex
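After installing the Python packages above, a quick check that each one can be imported may save some debugging later. The snippet below is only a sketch and not part of the project: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list assumes the usual import names for these packages, which sometimes differ from the pip name (for example, `scikit-learn` is imported as `sklearn` and `pycryptodomex` as `Cryptodome`).

```python
import importlib.util

# Import names for the pip packages listed above (assumed, not taken from
# the project source). Some differ from the pip package name.
REQUIRED_MODULES = [
    "sklearn",            # scikit-learn
    "xgboost",
    "mlxtend",
    "imblearn",           # imbalanced-learn
    "pandas",
    "joblib",
    "matplotlib",
    "seaborn",
    "numpy",
    "bs4",                # beautifulsoup4
    "mechanize",
    "random_user_agent",
    "Cryptodome",         # pycryptodomex
]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All Python dependencies are importable.")
```

Any module reported as missing can be installed with the corresponding `pip install` command from this section.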
To install all GUI dependencies:
npm i
Vue.js framework:
npm install -g @vue/cli
Bootstrap framework:
npm install [email protected] --save
axios library:
npm i axios
Font Awesome tool kit:
npm i --save @fortawesome/free-solid-svg-icons && npm i --save @fortawesome/vue-fontawesome@latest-2
All the instructions for exploring the project functionalities are listed in this section, as well as the commands to execute each application.
You can explore all functionalities (different models, datasets, ...) by modifying (or uncommenting) a few parts of the source code.
Inside the src directory, execute the command using the following template:
python3 cli <url> <algorithm>
Example with a phishing URL:
python3 cli XGB
Open two terminal instances and execute the following commands in each one of them, respectively.
Terminal 1 - Back-end (inside src directory):
python3 server
Terminal 2 - Front-end (inside url-detector directory):
npm run serve
You should receive two URLs as output (http://localhost:<port number>). To view the application, open either of them in a browser of your choice. The front-end server (GUI) should be running at: http://localhost:8080
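If the page does not load, it can help to confirm that something is actually listening on the expected port. The following is a minimal sketch, not part of the project: `is_port_open` is a hypothetical helper, and 8080 is the front-end port mentioned above. The back-end port is whatever the server prints on startup.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


if __name__ == "__main__":
    # 8080 is the front-end port stated above; adjust for the back-end.
    print("front-end reachable:", is_port_open("localhost", 8080))
```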
Finally, feel free to test the model with your own URLs! 🍾
Because the model was trained on the Kaggle dataset, its reliability can vary considerably depending on the format of the URL the user provides. Most of the URLs in the Kaggle dataset don't have their communication protocol specified (HTTP, HTTPS, ...), which could introduce a large bias in the trained models and their results, making the classifications quite unstable.
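One way to experiment with this limitation is to remove the protocol from a URL before classifying it, so the input looks more like the dataset entries. The sketch below only illustrates that idea: `strip_scheme` is a hypothetical helper that does not come from the project, and whether it improves results depends on how the features are extracted.

```python
import re

# Matches a leading URI scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def strip_scheme(url: str) -> str:
    """Return url without surrounding whitespace or a leading scheme."""
    return _SCHEME_RE.sub("", url.strip())


if __name__ == "__main__":
    for raw in ("https://example.com/login?x=1", "example.com/a"):
        print(raw, "->", strip_scheme(raw))
```

Only a scheme at the very start of the string is removed, so a URL embedded in a query string (for example `example.com/?next=http://other.test`) is left untouched.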