Neural Network Malware Binary Classification

PyTorch implementation of [1] Malware Detection by Eating a Whole EXE, [2] Learning the PE Header, Malware Detection with Minimal Domain Knowledge, and other derived custom models for malware detection.

Quickstart

Clone this repository via

git clone https://github.com/jaketae/deep-malware-detection.git
cd deep-malware-detection

Then, create and activate a Python virtual environment and install the dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

If you have pipenv, you can also type

pipenv install -r requirements.txt

Train the model in the Jupyter notebook titled run.ipynb, or start training from the terminal via

python train.py

Optional flags are documented below.

Implementation Notes

While [2] used LSTMs for the sequential model, we tested both GRUs and LSTMs and found that GRUs were easier to train.
We combined the models presented in [1] and [2] to derive a custom model that concatenates the feature vector produced by the entry-point 1D-CNN layer with the output of the RNN units that follow. We denote these custom models with a "Res" prefix in the table below.
We also extended the attention-based model in [2] with this residual approach.
While [1] used the entire binary of each PE file, our approach more closely resembles that of [2]. Due to computational constraints, we used only the first 4096 bytes of each PE file, thus creating a 4096-dimensional sequential feature vector for every file.
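
As a rough illustration of the notes above, the following sketch shows one way the fixed-length byte input and the "Res" concatenation could be wired together in PyTorch. It is not the code from this repository: the class name `ResGRUCNNSketch`, the helper `bytes_to_tensor`, the padding id, and all layer sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

HEADER_LEN = 4096  # number of leading bytes kept per file, as described above
PAD_ID = 256       # assumed padding index for files shorter than HEADER_LEN


def bytes_to_tensor(raw: bytes) -> torch.Tensor:
    """Truncate or pad raw file bytes to a fixed-length sequence of integer ids."""
    ids = list(raw[:HEADER_LEN])
    ids += [PAD_ID] * (HEADER_LEN - len(ids))
    return torch.tensor(ids, dtype=torch.long)


class ResGRUCNNSketch(nn.Module):
    """Illustrative only: 1D-CNN entry layer, GRU, then concatenation of both."""

    def __init__(self, embed_dim=8, channels=32, kernel=16, stride=16, hidden=64):
        super().__init__()
        # 256 byte values plus one padding id
        self.embed = nn.Embedding(257, embed_dim, padding_idx=PAD_ID)
        self.conv = nn.Conv1d(embed_dim, channels, kernel_size=kernel, stride=stride)
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.fc = nn.Linear(channels + hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        emb = self.embed(x).transpose(1, 2)       # (batch, embed_dim, seq_len)
        feats = torch.relu(self.conv(emb))        # (batch, channels, steps)
        cnn_vec = feats.max(dim=2).values         # global max pool: (batch, channels)
        _, h_n = self.gru(feats.transpose(1, 2))  # h_n: (num_layers, batch, hidden)
        rnn_vec = h_n[-1]                         # last layer state: (batch, hidden)
        combined = torch.cat([cnn_vec, rnn_vec], dim=1)  # CNN and RNN features together
        return self.fc(combined).squeeze(1)       # one logit per file


if __name__ == "__main__":
    batch = torch.stack([bytes_to_tensor(b"MZ" + bytes(100))])
    logits = ResGRUCNNSketch()(batch)
    print(logits.shape)
```

Swapping `nn.GRU` for `nn.LSTM` in this sketch would also require unpacking the hidden state from the `(h_n, c_n)` tuple that an LSTM returns.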

Results

Presented below is a table detailing the performance of each model.

Architecture	Accuracy (%)	F1
MalConvBase	91	.931
MalConv+	94	.951
MalConv+ (E16)	93	.944
MalConv+ (W64)	94	.949
MalConv+ (E16, W64)	94	.950
MalConv+ (C256)	91	.930
GRU-CNN	93	.946
BiGRU-CNN	91	.931
GRU-CNN (H128)	93	.946
ResGRU-CNN	94	.948
AttnGRU-CNN	94	.952
AttnResGRU-CNN	94	.952
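
For reference, the accuracy and F1 metrics reported in the table can be computed from binary predictions as sketched below. This is a plain-Python illustration rather than the evaluation code used to produce the table; the function name `accuracy_and_f1` is hypothetical, and treating malware as the positive class (label 1) is an assumption.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # defined as 0 when there are no positives
    return accuracy, f1


acc, f1 = accuracy_and_f1([1, 0, 1, 1], [1, 0, 0, 1])
print(f"accuracy={acc:.2f}, f1={f1:.3f}")  # accuracy=0.75, f1=0.800
```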

For visualizations of training and model evaluation, refer to images in the figures directory.

Contributing

The coding style is dictated by black. Depending on your development environment, you can toggle format-on-save options in your code editor or set up pre-commit hooks to run the formatter on every commit.

Please feel free to submit issues or pull requests if you find bugs or ways to optimize the code base. Emails to jaesungtae@gmail.com are also welcome!

References

[1] Malware Detection by Eating a Whole EXE

@misc{raff2017malware,
      title={Malware Detection by Eating a Whole EXE},
      author={Edward Raff and Jon Barker and Jared Sylvester and Robert Brandon and Bryan Catanzaro and Charles Nicholas},
      year={2017},
      eprint={1710.09435},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}

[2] Learning the PE Header, Malware Detection with Minimal Domain Knowledge

@article{Raff_2017,
   title={Learning the PE Header, Malware Detection with Minimal Domain Knowledge},
   ISBN={9781450352024},
   url={http://dx.doi.org/10.1145/3128572.3140442},
   DOI={10.1145/3128572.3140442},
   journal={Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security - AISec ’17},
   publisher={ACM Press},
   author={Raff, Edward and Sylvester, Jared and Nicholas, Charles},
   year={2017}
}
