Contrastive Malware Classification with Graph Embeddings

Description

Machine learning models for malware detection often lose accuracy over time because of concept drift: the statistical properties of both malware and benign software change. This project addresses the problem by training a Heterogeneous Graph Neural Network (HeteroGNN) with a Time-Aware Hierarchical Contrastive Loss (HCL). Each app is represented as a heterogeneous graph of API calls, classes, and permissions, which the HeteroGNN turns into an app-level embedding. The contrastive loss adjusts based on the time between samples, grouping similar malware together and keeping benign apps separated from malware. The approach is designed to maintain accuracy without constant retraining.
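
To make the graph-to-embedding step concrete, here is a minimal, untrained sketch of one message-passing round followed by a per-type readout. It is not the model in my_models.py: the function names, the averaging update, and the fixed type order are all assumptions chosen for illustration, and a real HeteroGNN would use learned, relation-specific weights.

```python
from collections import defaultdict

NODE_TYPES = ("api", "class", "permission")  # assumed type order for the readout


def mean(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def app_embedding(node_feats, node_types, edges, dim):
    """Hypothetical sketch: one synchronous message-passing round, then readout.

    node_feats: {node_id: [float, ...]} feature vector per node
    node_types: {node_id: one of NODE_TYPES}
    edges:      [(src, dst), ...] directed edges
    Returns a vector of length len(NODE_TYPES) * dim.
    """
    # Collect the original features of each node's in-neighbours.
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(node_feats[src])

    # Update: average a node with the mean of its in-neighbours.
    # Nodes without in-neighbours keep their features unchanged.
    updated = {}
    for node, feat in node_feats.items():
        if incoming[node]:
            msg = mean(incoming[node], dim)
            updated[node] = [(a + b) / 2 for a, b in zip(feat, msg)]
        else:
            updated[node] = list(feat)

    # Readout: mean-pool each node type, then concatenate in a fixed order.
    embedding = []
    for node_type in NODE_TYPES:
        members = [updated[n] for n in updated if node_types[n] == node_type]
        embedding.extend(mean(members, dim))
    return embedding
```

Pooling each node type separately keeps API, class, and permission information in distinct slots of the app-level vector, which is one simple way to respect the heterogeneity of the graph.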

The model was trained on data from 2012 and tested on data from 2013, 2015, and 2018 to simulate concept drift. The method achieved F1 scores above 0.90 and low false negative rates on future test sets, matching or slightly improving on past work. In contrast, a static SVM baseline showed a rapid deterioration in F1 score and an increase in false negative rates over the same period. This suggests that the time-aware contrastive training helps the model maintain performance over time.
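
The evaluation protocol above (train on one year, score on later years) can be sketched as follows. This is an illustrative helper, not the code in testing.py: `evaluate_by_year`, the `(year, features, label)` sample layout, and the `predict` callable are assumptions, and labels are taken to be 1 for malware and 0 for benign.

```python
from collections import defaultdict


def f1_and_fnr(y_true, y_pred):
    """Return (F1, false negative rate) for binary labels, 1 = malware."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return f1, fnr


def evaluate_by_year(samples, predict, train_year=2012):
    """Score a fixed model on every year after train_year.

    samples: iterable of (year, features, label) tuples (assumed layout)
    predict: callable mapping features to a 0/1 prediction (hypothetical)
    """
    truth, preds = defaultdict(list), defaultdict(list)
    for year, features, label in samples:
        if year > train_year:  # never score on the training period
            truth[year].append(label)
            preds[year].append(predict(features))
    return {year: f1_and_fnr(truth[year], preds[year]) for year in sorted(truth)}
```

Reporting the false negative rate alongside F1 matters here because a missed malware sample is the costly error, and drift tends to show up as rising false negatives before F1 collapses.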

The figure illustrates the objectives of the Time-Aware HCL loss: benign samples form a distinct cluster, clearly separated from all malware; malware samples from the same family form tight clusters; and malware from different families sit nearby without overlapping.
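
The clustering objectives described above can be written as a per-pair loss. The sketch below is an assumption-laden illustration, not the loss used in trainer.py: the margin scheme (pull same-family pairs together, keep different families within `margin`, push benign and malware at least `2 * margin` apart) and the exponential time weight with scale `tau` are hypothetical choices, and the report may define both differently.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def time_weight(t_i, t_j, tau=365.0):
    """Assumed weighting: pairs far apart in time (in days) count less."""
    return math.exp(-abs(t_i - t_j) / tau)


def pair_loss(z_i, z_j, label_i, label_j, family_i, family_j,
              t_i, t_j, margin=1.0, tau=365.0):
    """Hypothetical time-weighted hierarchical contrastive loss for one pair.

    label: 0 = benign, 1 = malware; family is ignored for benign samples.
    """
    d = euclidean(z_i, z_j)
    if label_i != label_j:
        # Benign vs. malware: push at least 2 * margin apart.
        base = max(0.0, 2 * margin - d)
    elif label_i == 0 or family_i == family_j:
        # Both benign, or same malware family: pull together.
        base = d
    else:
        # Different malware families: allow a gap of up to `margin`.
        base = max(0.0, d - margin)
    return time_weight(t_i, t_j, tau) * base
```

In a training loop this term would be averaged over all pairs in a batch and typically combined with a classification loss; both of those details are left out of the sketch.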

The table shows that the proposed model maintains high F1 scores and low false negative rates across all test years, unlike the SVM baseline, which degrades rapidly.

The full report can be found here.

Setup and Running

Setup Environment

python -m venv env
source env/bin/activate
pip install -r requirements.txt

git clone https://github.com/seclab-fudan/APIGraph

To download the APKs, follow the instructions at https://github.com/seclab-fudan/APIGraph

Pre-Processing

cd APIGraph/src
python getAllEntities.py
python getAllRelations.py

### If you have access to SLURM and an HPC cluster
cd ../../notebooks
# run downloader.ipynb
# run preprocess.ipynb
# run process_apk.ipynb
# run run_graphs.ipynb

### Otherwise
cd ../../
python downloader.py
python preprocess_data.py

Training

python trainer.py
python svm.py

To run the Chen et al. baseline, see the setup instructions at https://github.com/wagner-group/active-learning
