Malware Datasets with Timestamps
This article briefly summarizes the main datasets we have released for malware research. I will try to keep the list up to date with new entries, and the “changelog” at the end tracks major changes to this article.
Datasets Available
We have released the following timestamped malware datasets for research:
- Tesseract dataset (apps from 2014 to 2016): malware downloaded from AndroZoo with extracted feature spaces for DREBIN and MaMaDroid.
- S&P20 APG dataset (apps from 2017 to 2018): malware downloaded from AndroZoo with extracted feature spaces for DREBIN and MaMaDroid.
Please note that we are not releasing the goodware/malware directly, only the SHAs of the apps we considered. You can re-download the original APKs from AndroZoo.
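As a rough sketch of the re-download step, the snippet below builds the request URL for a single app from its SHA-256. The endpoint and its `apikey`/`sha256` parameters reflect AndroZoo's public download API as we understand it and should be checked against the current AndroZoo documentation. The helper name `androzoo_url`, the placeholder key, and the example hash are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against AndroZoo's current API documentation.
ANDROZOO_DOWNLOAD = 'https://androzoo.uni.lu/api/download'


def androzoo_url(sha256, apikey):
    """Build the download URL for one APK, identified by its SHA-256."""
    query = urlencode({'apikey': apikey, 'sha256': sha256})
    return '{}?{}'.format(ANDROZOO_DOWNLOAD, query)


# Placeholder values: an API key has to be requested from the AndroZoo maintainers.
print(androzoo_url('0' * 64, 'YOUR_API_KEY'))
```

The resulting URL can then be fetched with any HTTP client, one SHA at a time.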
Loading Features
We split each dataset into three JSON files: X, Y, and meta. The following function loads a dataset together with its timestamps:
import json
import logging
from datetime import datetime


def load_features(fname, shas=False):
    """Load feature set.

    Args:
        fname (str): The common prefix for the dataset.
            (e.g., 'data/features/drebin' -> 'data/features/drebin-[X|Y|meta].json')
        shas (bool): Whether to include shas. In some versions of the dataset,
            shas were included to double-check alignment - these are _not_ features
            and _must_ be removed before training.

    Returns:
        Tuple[List[Dict], List, List]: The features, labels, and timestamps
            for the dataset.
    """
    logging.info('Loading features...')
    with open('{}-X.json'.format(fname), 'r') as f:
        X = json.load(f)
    # SHA removal is left disabled here, so `shas` currently has no effect;
    # strip any 'sha256' keys yourself before training.
    # if not shas:
    #     [o.pop('sha256') for o in X]

    logging.info('Loading labels...')
    with open('{}-Y.json'.format(fname), 'rt') as f:
        y = json.load(f)
    # For non-APG files, keep only the first element of each label entry.
    if 'apg' not in fname:
        y = [o[0] for o in y]

    logging.info('Loading timestamps...')
    with open('{}-meta.json'.format(fname), 'rt') as f:
        t = json.load(f)
    t = [o['dex_date'] for o in t]
    # String timestamps use a space separator in the APG files and 'T' otherwise.
    fmt = '%Y-%m-%d %H:%M:%S' if 'apg' in fname else '%Y-%m-%dT%H:%M:%S'
    # Non-string values are treated as Unix epoch seconds (local time).
    t = [datetime.strptime(o, fmt) if isinstance(o, str)
         else datetime.fromtimestamp(o) for o in t]
    return X, y, t
Please remember to remove any SHAs from the dataset and do not consider them as features.
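A minimal sketch of that clean-up step is shown here. It assumes the SHA is stored under the key `sha256` in each feature dictionary, as the commented-out line in the loader suggests. The helper name `strip_shas` and the toy feature names are hypothetical.

```python
def strip_shas(X, key='sha256'):
    """Return (features, shas): copies of the feature dicts without `key`,
    plus the removed values (None where the key was absent)."""
    features, shas = [], []
    for row in X:
        row = dict(row)  # shallow copy, so the input list is not modified
        shas.append(row.pop(key, None))
        features.append(row)
    return features, shas


# Toy input: one entry with a SHA, one without.
X = [{'sha256': 'aaa', 'api_calls::foo': 1}, {'api_calls::bar': 1}]
X_clean, shas = strip_shas(X)
print(X_clean)
print(shas)
```

Keeping the removed SHAs in a separate list still allows checking alignment against the metadata without leaking them into the feature space.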
Memory Errors
If you are a BSc/MSc student relying on our datasets for a dissertation and do not have access to a powerful server, consider a “downsampling” strategy to shrink the dataset to a more manageable size.
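One possible strategy, sketched here under our own assumptions rather than as a recommended procedure, is to keep a fixed fraction of the apps from each calendar month, so that the temporal distribution of the dataset is roughly preserved. The function name `downsample_by_month`, its `fraction` (expected in (0, 1]) and `seed` parameters, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime


def downsample_by_month(X, y, t, fraction=0.1, seed=0):
    """Keep about `fraction` of the samples in each (year, month) bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for i, ts in enumerate(t):
        buckets[(ts.year, ts.month)].append(i)

    keep = []
    for month in sorted(buckets):
        idx = buckets[month]
        k = max(1, round(len(idx) * fraction))  # at least one sample per month
        keep.extend(rng.sample(idx, k))
    keep.sort()  # preserve the original ordering of the dataset

    return [X[i] for i in keep], [y[i] for i in keep], [t[i] for i in keep]


# Toy data: 20 apps in January 2015 and 10 in February 2015.
t = [datetime(2015, 1, 1)] * 20 + [datetime(2015, 2, 1)] * 10
X = [{'feature': i} for i in range(30)]
y = [i % 2 for i in range(30)]
X_s, y_s, t_s = downsample_by_month(X, y, t, fraction=0.5)
print(len(X_s))  # 15
```

This sketch does not stratify by label, so the malware ratio within each month may shift slightly after sampling.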
Changelog
- 24/03/2022: v1.0 published