Machine Learning-Based Cyberdefenses Competition

The data_features_combined folder contains a small dataset of extracted features. To recreate the full dataset, see the Generate Dataset section.

The data_test folder has executables for testing the model. ATTENTION: this folder contains real malware executables which can be harmful.

Quick start
Build the sample solution
Train and test model
Requirements
Generate Dataset
PE Files Datasets
- Datasets Used
- Other Datasets (Not Used)

Quick start

Instead of building the solution from source, download the competition Docker image from here.

An additional Docker image with a better overall model is provided here.

Before you proceed, you must install Docker Engine for your operating system.

Load the Docker image:

docker load -i ml.rar

Run the docker container:

docker run -itp 8080:8080 --memory=1g ml

Test the solution on malicious and benign samples of your choosing via:

python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign

Build the sample solution

Before you proceed, you must install Docker Engine for your operating system.

A sample solution that you may modify is included in the defender folder.

Install Python requirements needed to test the solution:

pip install -r requirements.txt

OPTIONAL: To obfuscate the code, first copy the defender folder somewhere else (the obfuscation is applied in place), then run:

pyminify defender/ --in-place --remove-literal-statements

To compile the Python code so that it runs faster and is slightly obfuscated, run:

python out.py

Some trained models can be found in defender/saved_models.

Copy the *.pkl file you want to use as the model into docker/models/. The model to load is selected later, when running the container with docker run.

From the root folder that contains the Dockerfile, build the solution:

docker build -t ml .

Run the docker container:

docker run -itp 8080:8080 --memory=1g ml

The flag -p 8080:8080 maps the container's port 8080 to the host's port 8080.

The flag --memory=1g limits the container to 1 GB of RAM.

The flag --env MODEL_FILE="models/ml_classifier.pkl" can be added to specify which model to run.
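As a rough illustration of how a server entry point might honor MODEL_FILE, here is a minimal sketch. It assumes the model is a pickled object. The names resolve_model_path and load_model are hypothetical, and the default path is simply the value from the example flag above, not necessarily what the sample solution uses.

```python
import os
import pickle

# Assumption: the default mirrors the example value shown for --env MODEL_FILE.
DEFAULT_MODEL_FILE = "models/ml_classifier.pkl"


def resolve_model_path(environ=os.environ):
    """Return the path from MODEL_FILE, or the default when unset or empty."""
    return environ.get("MODEL_FILE") or DEFAULT_MODEL_FILE


def load_model(path):
    """Unpickle a model from disk. Only load pickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Passing the environment as a parameter keeps the lookup easy to test without touching the real process environment.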

Test the solution on malicious and benign samples of your choosing via:

python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign

You can also use the system folder C:\Windows\System32\ as benign samples.

Sample collections may be in a folder, or in an archive of type zip, tar, tar.bz2, tar.gz or tgz.

You do not need to unzip the archive, and it is strongly recommended that you do not unzip archives of malicious samples.
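Reading samples straight from an archive, without writing them to disk, could look like the sketch below. It is not the repository's test harness: iter_samples is a hypothetical helper built only on the standard library, and it does not handle password-protected archives.

```python
import tarfile
import zipfile
from pathlib import Path


def iter_samples(location):
    """Yield (name, bytes) for each file in a folder, zip, or tar archive.

    Archive members are read into memory only; nothing is extracted to disk.
    """
    path = Path(location)
    if path.is_dir():
        for entry in sorted(path.rglob("*")):
            if entry.is_file():
                yield str(entry), entry.read_bytes()
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    yield info.filename, archive.read(info)
    elif tarfile.is_tarfile(path):
        # Mode "r:*" auto-detects plain tar, gzip, and bzip2 compression.
        with tarfile.open(path, mode="r:*") as archive:
            for member in archive:
                if member.isfile():
                    yield member.name, archive.extractfile(member).read()
    else:
        raise ValueError(f"unsupported sample collection: {location}")
```

Because the malicious bytes never land on the filesystem as standalone executables, there is less risk of running one by accident or of triggering an antivirus quarantine mid-test.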

Train and test model

Once you have a trained model, test it by running:

python test.py -m model_path.pkl

To train a Random Forest model, see defender/models/ml.py. Train and test it with:

python -m defender.models.ml && python test.py -m defender/ml.pkl

To train a Deep Learning model, see defender/models/malware_gpt.py. Train and test it with:

./scripts/run.sh && python test.py -m defender/ml.pkl

Requirements

Minimum scores

FPR (false positive rate): at most 1%
TPR (true positive rate): at least 95%
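To check a model against these thresholds, the two rates can be computed from ground-truth labels and predictions. The sketch below assumes the 1% figure is an upper bound on FPR and the 95% figure a lower bound on TPR, with 1 meaning malicious and 0 meaning benign. The function names are illustrative and do not come from the repository.

```python
def rates(labels, predictions):
    """Return (fpr, tpr) for binary labels where 1 = malicious, 0 = benign."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    positives = sum(1 for y, _ in pairs if y == 1)
    negatives = sum(1 for y, _ in pairs if y == 0)
    # Guard against empty classes so the sketch never divides by zero.
    fpr = false_pos / negatives if negatives else 0.0
    tpr = true_pos / positives if positives else 0.0
    return fpr, tpr


def meets_minimum_scores(labels, predictions, max_fpr=0.01, min_tpr=0.95):
    """Assumed reading of the thresholds: FPR <= 1% and TPR >= 95%."""
    fpr, tpr = rates(labels, predictions)
    return fpr <= max_fpr and tpr >= min_tpr
```

For example, with 100 benign and 100 malicious samples, one false alarm and five missed detections sit exactly on both thresholds.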

Constraints

1 GB of RAM
Response time of at most 5 seconds per sample

A valid submission for the defense track consists of the following

a Docker image
listens on port 8080
accepts POST / with header Content-Type: application/octet-stream and the contents of a PE file in the body
returns {"result": 0} for benign files and {"result": 1} for malicious files
for files up to 2**21 bytes (2 MiB), must respond in less than 5 seconds (a timeout results in a benign verdict)
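A container exposing the interface above could be exercised with a small standard-library client such as the sketch below. The helper names (build_request, parse_verdict, query) and the localhost URL are assumptions made for illustration. Treating a timeout as a benign verdict mirrors the rule stated in the list.

```python
import json
import socket
import urllib.error
import urllib.request

# Assumption: the container started with "docker run" is reachable locally.
DEFAULT_URL = "http://127.0.0.1:8080/"


def build_request(pe_bytes, url=DEFAULT_URL):
    """Build the POST / request with the raw PE bytes as the body."""
    return urllib.request.Request(
        url,
        data=pe_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def parse_verdict(body):
    """Extract the integer verdict from a JSON body such as {"result": 1}."""
    return int(json.loads(body)["result"])


def query(pe_bytes, url=DEFAULT_URL, timeout=5.0):
    """Return 1 (malicious) or 0 (benign); a timeout counts as benign."""
    request = build_request(pe_bytes, url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_verdict(response.read())
    except (socket.timeout, TimeoutError):
        # Raised when the server accepts the request but answers too slowly.
        return 0
    except urllib.error.URLError as err:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return 0
        raise
```

With the container running, query(open(path, "rb").read()) would return the verdict for a single file.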

Generate Dataset

The datasets used are listed in data.txt.

To run the feature extractor on a folder of PE files and save the extracted features for model training, use:

python -m defender.dataset -s save_folder/save_name [--dike, --windows, --programs, --benign, --malware]

Different parameters allow creating a dataset from

--large_dataset from Practical Security Analytics dataset
--dike from the DikeDataset
--windows from the Windows files on the local machine
--programs from the Program Files and Drivers
--benign to specify any number of folders considered benign
--malware to specify any number of folders considered malware
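The options listed above could be modeled with argparse roughly as follows. This is a hypothetical reconstruction, not the actual parser in defender/dataset. It assumes the dataset selectors are boolean switches and that --benign and --malware accept zero or more folder paths.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="defender.dataset")
    parser.add_argument(
        "-s",
        dest="save",
        required=True,
        help="output location, e.g. save_folder/save_name",
    )
    # Assumption: each predefined source is a simple on/off switch.
    for flag in ("--large_dataset", "--dike", "--windows", "--programs"):
        parser.add_argument(flag, action="store_true")
    # Any number of extra folders may be labeled benign or malware.
    parser.add_argument("--benign", nargs="*", default=[])
    parser.add_argument("--malware", nargs="*", default=[])
    return parser
```

Under these assumptions, several sources can be combined in a single invocation, with the extra folders appended to whichever predefined datasets are switched on.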

PE Files Datasets

Datasets Used

Combined Datasets

Malware Datasets

Benign dataset

https://github.com/bormaa/Benign-NET

Other Datasets (Not Used)

BODMAS: https://whyisyoung.github.io/BODMAS/
https://github.com/elastic/ember
https://github.com/Virus-Samples/Malware-Sample-Sources
https://bazaar.abuse.ch/
Malware Detection PE-Based Analysis Using Deep Learning Algorithm Dataset: the same files as DikeDataset, except that DikeDataset has also passed them through the VirusTotal API. It contains malicious and benign PE files and is released under the CC BY 4.0 license.

vibalcam/ml-malware-detection