This repository contains the labels and code for the paper Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation. Only part of the training data is included here because the full dataset is too large; please contact the authors to obtain the full data.
| File | Description |
|---|---|
| code/clasifier | Code for the classifier (BERT-CodeBERT model with a cross-model self-attention layer) |
| code/generator | Code for the generator (BART-CodeBERT model with a cross-model self-attention layer) |
| data/clasifier | Labels for training the classifier |
| data/generator | Labels for training the generator |
We combine the pre-trained BART model with the pre-trained CodeBERT model to test the capability of a model that captures both natural-language and code information. One drawback of pre-trained Transformer models is that their input dimensions and parameters are fixed, so we cannot simply concatenate the two encoders' outputs: the concatenated representation would be incompatible with the dimensions of the pre-trained decoder's input. To solve this problem, we use a cross-model self-attention layer that takes the query states (Q) from the BART encoder and the key (K) and value (V) states from the CodeBERT encoder, and computes the importance of each CodeBERT output token to each BART encoder output token. The input of the BART decoder is the output of a residual connection between the BART encoder output and the cross-model self-attention layer.
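The sketch below is a minimal illustration of such a cross-model attention layer in PyTorch; the hidden size, head count, use of `nn.MultiheadAttention`, and variable names are assumptions for illustration, and the actual implementation is in code/generator.

```python
import torch
import torch.nn as nn

class CrossModelAttention(nn.Module):
    """Illustrative cross-model attention: queries come from the BART encoder,
    keys and values come from the CodeBERT encoder."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        # batch_first=True expects tensors of shape (batch, seq_len, hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, bart_enc_out, codebert_out, codebert_padding_mask=None):
        # Q from the BART encoder, K/V from the CodeBERT encoder: the attention
        # weights score each CodeBERT token against each BART encoder token.
        attn_out, _ = self.attn(
            query=bart_enc_out,
            key=codebert_out,
            value=codebert_out,
            key_padding_mask=codebert_padding_mask,
        )
        # Residual connection with the BART encoder output; the result keeps the
        # BART hidden size, so it remains compatible with the pre-trained decoder.
        return bart_enc_out + attn_out
```

The fused output can then be fed to the BART decoder in place of the plain BART encoder output (for example, as the `encoder_hidden_states` argument when calling the decoder directly in Hugging Face Transformers).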
The code is based on the CodeBERT project. The file paths in all scripts need to be changed to match your local environment.
According to Section 4.1 of the paper, the classification results are shown below:
| | CodeBERT | BERT | BERT-CodeBERT | Transformer | LSTM |
|---|---|---|---|---|---|
| AUC | 0.91 | 0.89 | 0.57 | 0.80 | 0.71 |
CodeBERT achieves the best result (0.91 AUC), followed by BERT (0.89 AUC). Both perform much better than the non-pre-trained models (Transformer and LSTM).
| | Commit Message | Added & Deleted Code Segments | All Code Segments | Commit Message & Added & Deleted Code Segments | Commit Message & All Code Segments |
|---|---|---|---|---|---|
| AUC | 0.55 | 0.67 | 0.62 | 0.80 | 0.91 |
Inputs combining commit messages and code segments give better results than inputs with only a commit message or only code segments (0.80-0.91 AUC versus 0.55-0.67 AUC), indicating that both commit messages and code segments are useful for silent dependency alert detection.
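For reference, the AUC values reported above can be computed from the classifier's predicted probabilities, for example with scikit-learn; this is an illustrative snippet, and `y_true` and `y_score` are hypothetical placeholders rather than names from this repository.

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 if the commit is a silent vulnerability fix, 0 otherwise
# y_score: the classifier's predicted probability for the positive class
y_true = [1, 0, 1, 1, 0]
y_score = [0.92, 0.30, 0.75, 0.64, 0.41]

print(f"AUC: {roc_auc_score(y_true, y_score):.2f}")
```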
According to Section 4.2 of the paper, the generation results are shown below:
The pre-trained models (BART, CodeBERT and BART-CodeBERT) achieve much better results than the non-pre-trained Transformer and LSTM models on all four key aspects, demonstrating the advantage of model pre-training.
Inputs with both commit messages and code content achieve much better results than those using only commit messages or only code content, indicating that both commit messages and code content are important for the generation task.