The data_features_combined folder contains a small dataset of extracted features. To recreate the full dataset, see the Generate Dataset section.
The data_test folder has executables for testing the model. ATTENTION: this folder contains real malware executables which can be harmful.
- Quick start
- Build the sample solution
- Train and test model
- Requirements
- Generate Dataset
- PE Files Datasets
Instead of building the solution from source, you can download the competition Docker image from here.
An additional Docker image with a better overall model is provided here.
Before you proceed, you must install Docker Engine for your operating system.
Load the Docker image:
docker load -i ml.rar
Run the docker container:
docker run -itp 8080:8080 --memory=1g ml
Test the solution on malicious and benign samples of your choosing via:
python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign
Before you proceed, you must install Docker Engine for your operating system.
A sample solution that you may modify is included in the defender folder.
Install Python requirements needed to test the solution:
pip install -r requirements.txt
OPTIONAL: To obfuscate the code, first copy the defender folder somewhere else, since the obfuscation is applied in place, then run
pyminify defender/ --in-place --remove-literal-statements
To compile the Python code so that it runs faster and is slightly obfuscated, run
python out.py
Some trained models can be found in defender/saved_models.
Add the *.pkl file to use as the model to docker/models/; the model to use is selected later, during docker run.
From the root folder that contains the Dockerfile, build the solution:
docker build -t ml .
Run the docker container:
docker run -itp 8080:8080 --memory=1g ml
The flag -p 8080:8080 maps the container's port 8080 to the host's port 8080.
The flag --memory=1g limits the container to 1 GB of RAM.
The flag --env MODEL_FILE="models/ml_classifier.pkl" can be added to specify which model to run.
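As an illustration of how a MODEL_FILE setting like the one above could be consumed inside the container, here is a minimal sketch. The function names and the fallback path are assumptions for this example and are not taken from the defender code.

```python
import os
import pickle

# Assumed fallback for illustration only; the real default used by the
# service may differ.
DEFAULT_MODEL_FILE = "models/ml_classifier.pkl"


def resolve_model_path(environ=os.environ):
    """Return the model path from MODEL_FILE, or the assumed default."""
    return environ.get("MODEL_FILE", DEFAULT_MODEL_FILE)


def load_model(path):
    """Unpickle a trained model. Only load pickles from trusted sources."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Passing the environment as a parameter keeps the lookup easy to test without modifying the real process environment.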
Test the solution on malicious and benign samples of your choosing via:
python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign
You can also use the system folder C:\Windows\System32\ as benign samples.
Sample collections may be in a folder, or in an archive of type zip, tar, tar.bz2, tar.gz or tgz.
Unzipping the archive is not required, and it is strongly recommended that you do not unzip archives of malicious samples.
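To show why extraction is unnecessary, the sketch below reads samples from a folder or archive directly into memory, so malicious files are never written to disk. The helper name iter_samples is hypothetical and this is not the test module's actual implementation; password-protected archives are not handled.

```python
import tarfile
import zipfile
from pathlib import Path

TAR_SUFFIXES = (".tar", ".tar.bz2", ".tar.gz", ".tgz")


def iter_samples(source):
    """Yield (name, bytes) pairs from a folder, zip archive, or tar archive."""
    path = Path(source)
    name = path.name.lower()
    if path.is_dir():
        for item in sorted(path.rglob("*")):
            if item.is_file():
                yield str(item), item.read_bytes()
    elif name.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    # Reads the member into memory without extracting it.
                    yield info.filename, archive.read(info)
    elif name.endswith(TAR_SUFFIXES):
        # "r:*" lets tarfile detect the compression (none, gz, bz2).
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if member.isfile():
                    yield member.name, archive.extractfile(member).read()
    else:
        raise ValueError(f"unsupported sample collection: {source}")
```

Each yielded byte string could then be sent to the running container as the body of a request.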
Once you have a trained model, it can be tested by running
python test.py -m model_path.pkl
To train a Random Forest model, see defender/models/ml.py:
python -m defender.models.ml && python test.py -m defender/ml.pkl
To train a Deep Learning model, see defender/models/malware_gpt.py:
./scripts/run.sh && python test.py -m defender/ml.pkl
Minimum scores
- FPR of at most 1%
- TPR of at least 95%
Constraints
- 1GB of RAM
- Response time 5 seconds per sample
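The score targets above can be checked with a few lines of code. The following is a minimal sketch, assuming labels and predictions are encoded as 1 for malicious and 0 for benign; the function names are illustrative and not part of the repository.

```python
def rates(labels, predictions):
    """Return (fpr, tpr) for binary labels, where 1 = malicious, 0 = benign."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(labels, predictions):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


def meets_targets(fpr, tpr, max_fpr=0.01, min_tpr=0.95):
    """Check the scores against the FPR 1% and TPR 95% targets."""
    return fpr <= max_fpr and tpr >= min_tpr
```

FPR is the fraction of benign files flagged as malicious, and TPR is the fraction of malicious files that are detected.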
A valid submission for the defense track consists of the following:
- a Docker image
- listens on port 8080
- accepts POST / with header Content-Type: application/octet-stream and the contents of a PE file in the body
- returns {"result": 0} for benign files and {"result": 1} for malicious files
- for files up to 2**21 bytes (2 MiB), must respond in less than 5 seconds (a timeout results in a benign verdict)
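The interface described above can be exercised with a small client. This is a minimal sketch using only the Python standard library; the function names and the default URL (a container published on local port 8080) are assumptions for illustration.

```python
import json
import socket
import urllib.error
import urllib.request


def build_request(pe_bytes, url="http://127.0.0.1:8080/"):
    """Build the POST / request carrying the raw PE file in the body."""
    return urllib.request.Request(
        url,
        data=pe_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def parse_verdict(body):
    """Extract the 0 (benign) or 1 (malicious) verdict from the JSON reply."""
    result = json.loads(body)["result"]
    if result not in (0, 1):
        raise ValueError(f"unexpected result value: {result!r}")
    return result


def classify(pe_bytes, url="http://127.0.0.1:8080/", timeout=5.0):
    """Send one sample; a timeout is treated as a benign verdict (0)."""
    try:
        with urllib.request.urlopen(build_request(pe_bytes, url), timeout=timeout) as resp:
            return parse_verdict(resp.read())
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return 0
        raise
    except socket.timeout:
        # Timeouts while reading the response are raised directly.
        return 0
```

With the container running, calling classify on the bytes of a PE file should return 0 or 1. Other connection errors are re-raised rather than being silently counted as benign.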
The datasets used are listed in data.txt.
To apply the feature extractor to a folder of PE files and save the extracted features for training models, use
python -m defender.dataset -s save_folder/save_name [--dike, --windows, --programs, --benign, --malware]
Different parameters allow creating a dataset from different sources:
- --large_dataset: from the Practical Security Analytics dataset
- --dike: from the DikeDataset
- --windows: from the local Windows system files
- --programs: from the Program Files and Drivers
- --benign: to specify any number of folders considered benign
- --malware: to specify any number of folders considered malware
- https://practicalsecurityanalytics.com/pe-malware-machine-learning-dataset/
- DikeDataset: https://github.com/iosifache/DikeDataset
- Malware-Feed: https://github.com/MalwareSamples/Malware-Feed
- https://github.com/ytisf/theZoo
- https://github.com/fabrimagic72/malware-samples
- https://github.com/InQuest/malware-samples
- https://github.com/mstfknn/malware-sample-library
- https://github.com/wolfvan/some-samples
- https://github.com/RamadhanAmizudin/malware
- BODMAS: https://whyisyoung.github.io/BODMAS/
- https://github.com/elastic/ember
- https://github.com/Virus-Samples/Malware-Sample-Sources
- https://bazaar.abuse.ch/
- Malware Detection PE-Based Analysis Using Deep Learning Algorithm Dataset. It has the same files as DikeDataset, but DikeDataset has passed them through the VirusTotal API. It contains malicious and benign PE files and is licensed under CC BY 4.0.