This article gives a brief summary of the main datasets we have released for malware research. I will try to keep this list updated with new entries, and use the “changelog” at the end to track major changes to this article.

Datasets Available

The timestamped malware datasets which we have released for research are the following:

Please note that we are not releasing the goodware/malware directly, but only the SHAs of the apps we considered. You can re-download the original APKs from AndroZoo.
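As a rough sketch of the re-download step, the snippet below builds an AndroZoo download URL from a SHA-256 and, optionally, saves the APK to disk. It assumes that you have requested your own AndroZoo API key and that the download endpoint is still the one given in AndroZoo's access instructions, so please check their current documentation. The names ANDROZOO_URL, build_download_url, and download_apk are illustrative and not part of our release.

```python
import urllib.parse
import urllib.request

# Assumption: AndroZoo's documented download endpoint; verify against their docs.
ANDROZOO_URL = 'https://androzoo.uni.lu/api/download'


def build_download_url(sha256, apikey):
    """Return the download URL for one app, identified by its SHA-256."""
    query = urllib.parse.urlencode({'apikey': apikey, 'sha256': sha256})
    return '{}?{}'.format(ANDROZOO_URL, query)


def download_apk(sha256, apikey, out_dir='.'):
    """Fetch one APK and save it as <out_dir>/<sha256>.apk (needs network access)."""
    out_path = '{}/{}.apk'.format(out_dir, sha256)
    urllib.request.urlretrieve(build_download_url(sha256, apikey), out_path)
    return out_path
```

Calling download_apk once per SHA in the released list should rebuild the corpus locally. Any rate limits that apply are set by AndroZoo and are not handled in this sketch.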

Loading Features

We split the dataset into three JSON files: X (features), Y (labels), and meta (metadata, including the timestamps). The following function loads the dataset with timestamps:

import json
import logging
import time
from datetime import datetime


def load_features(fname, shas=False):
    """Load feature set.

    Args:
        fname (str): The common prefix for the dataset.
            (e.g., 'data/features/drebin' -> 'data/features/drebin-[X|Y|meta].json')

        shas (bool): Whether to include shas. In some versions of the dataset,
            shas were included to double-check alignment - these are _not_ features
            and _must_ be removed before training.

    Returns:
        Tuple[List[Dict], List, List]: The features, labels, and timestamps
            for the dataset.

    """
    logging.info('Loading features...')
    with open('{}-X.json'.format(fname), 'r') as f:
        X = json.load(f)
    # SHA removal is left disabled here, so `shas` currently has no effect:
    # if not shas:
    #     [o.pop('sha256') for o in X]

    logging.info('Loading labels...')
    with open('{}-Y.json'.format(fname), 'rt') as f:
        y = json.load(f)
    if 'apg' not in fname:
        y = [o[0] for o in y]

    logging.info('Loading timestamps...')
    with open('{}-meta.json'.format(fname), 'rt') as f:
        t = json.load(f)
    t = [o['dex_date'] for o in t]
    # Timestamps are either formatted strings or numeric epoch values; the
    # string format differs between the 'apg' dataset and the others.
    if 'apg' not in fname:
        t = [datetime.strptime(o if isinstance(o, str) else time.strftime('%Y-%m-%dT%H:%M:%S', time.localtime(o)),
                               '%Y-%m-%dT%H:%M:%S') for o in t]
    else:
        t = [datetime.strptime(o if isinstance(o, str) else time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(o)),
                               '%Y-%m-%d %H:%M:%S') for o in t]
    return X, y, t

Please remember to remove any SHAs from the dataset and do not consider them as features.
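A minimal sketch of that clean-up step follows. It assumes that each entry of X is a dict and that the hash is stored under the key 'sha256', which is the key used in the commented-out lines of load_features. Adjust the key if your copy of the dataset differs. The helper name strip_shas and the toy feature names are ours, for illustration only.

```python
def strip_shas(X, key='sha256'):
    """Return a copy of X without the SHA entry, plus the list of removed SHAs.

    Entries that do not contain `key` yield None in the returned SHA list.
    """
    features, shas = [], []
    for sample in X:
        sample = dict(sample)  # shallow copy, so the caller's data is untouched
        shas.append(sample.pop(key, None))
        features.append(sample)
    return features, shas


# Toy data with made-up feature names, only to show the call.
toy_X = [{'sha256': 'aa11', 'feature_a': 1}, {'feature_b': 1}]
toy_features, toy_shas = strip_shas(toy_X)
```

Keeping the removed SHAs in a separate list lets you still double-check alignment with the labels and timestamps without the hashes leaking into the feature space.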

Memory Errors

If you are a BSc/MSc student working on a dissertation that relies on our datasets, but you do not have access to a powerful server, consider a “downsampling” strategy to reduce the dataset to a more manageable size.
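One possible downsampling strategy, sketched here under our own assumptions, is to keep a random fraction of the samples within each (month, label) group, so that the class ratio and the temporal distribution stay roughly the same. The sketch expects X, y, and t in the shape returned by load_features, with t holding datetime objects and y holding hashable labels. The function name downsample and its parameters are hypothetical, and other strategies may suit your experiment better.

```python
import random
from collections import defaultdict


def downsample(X, y, t, fraction=0.1, seed=0):
    """Keep roughly `fraction` (0 < fraction <= 1) of each (year, month, label) group."""
    rng = random.Random(seed)  # fixed seed, so the subset is reproducible

    # Group sample indices by month and label.
    groups = defaultdict(list)
    for i, (label, ts) in enumerate(zip(y, t)):
        groups[(ts.year, ts.month, label)].append(i)

    keep = []
    for indices in groups.values():
        # Keep at least one sample so that no group disappears entirely.
        n = max(1, round(len(indices) * fraction))
        keep.extend(rng.sample(indices, n))

    keep.sort()  # preserve the original ordering of the dataset
    return [X[i] for i in keep], [y[i] for i in keep], [t[i] for i in keep]
```

Sampling indices rather than the lists themselves keeps features, labels, and timestamps aligned. Note that the memory saving only applies after the full JSON files have been loaded once, so it may be worth saving the downsampled subset to disk and reusing it.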


Changelog

  • 24/03/2022: v1.0 published