Malware detection is a critical aspect of cybersecurity, where even minor changes in source code can significantly alter compiled code. Function Call Graphs (FCGs) provide a robust representation of executable control flow, making them essential for malware detection, especially in the Android domain. In this work, we introduce Better Call Graphs (BCG), a comprehensive dataset featuring extensive FCGs extracted from recent APKs, encompassing both benign and malware samples across diverse types and families. Moreover, BCG includes graph-level APK features, capturing both structural and behavioral malware characteristics. Released under a CC-BY license, BCG facilitates free sharing and adaptation for research and development purposes.
To ensure a robust and relevant BCG dataset, we constructed it by comprehensively analyzing both APK files and their corresponding graph properties. This involved filtering and refining APKs based on various quality and relevance criteria, overcoming limitations of existing datasets. Four high-level observations guided our approach: (1) old APKs are simplistic in their structure and capabilities, (2) repackaging is very common, (3) certain malware families are overrepresented in datasets, and (4) small APKs (measured by bytecode size, not auxiliary files) are often uninteresting from a detection standpoint.
While existing datasets offer valuable properties for evaluating Android malware with FCGs, they often contain duplicate APKs with different names but identical FCG structures. Most previous datasets consist primarily of samples collected before 2017, which potentially limits their generalizability to modern malware and can mislead ongoing research on malware detection. Additionally, existing malware datasets often include numerous small APKs, which limits their utility in comprehensive malware analysis. Moreover, these datasets often lack essential APK properties, such as detailed information on the services or libraries used by the app. This gap impedes a thorough understanding of the app’s behavior and functionality and makes it difficult to classify malware accurately. To address these limitations, we ensured that BCG has four key features: (1) larger size to facilitate more robust graph classification, (2) recent data (from 2017 onward) to reflect evolving threats, (3) unique APKs to ensure a more accurate evaluation testbed, and (4) non-graph APK features (graph attributes) for a more holistic evaluation. The statistics of our dataset are given in the table below.
We extracted two basic features from the APK and manifest files: APK size and DEX size. Beyond basic size information, we use Androguard to extract various textual features from the APK. These textual features include the app and package names, the permissions requested by the app, all activities of the app, the services and libraries used by the app, and the list of broadcast receivers. We encoded all of the textual features using a 100-dimensional TensorFlow sentence encoder and further reduced their dimensionality to 2 using t-SNE for efficient processing. A detailed description of all APK features is provided in the table below:
Feature | Description |
---|---|
APK size | The size of the APK file in bytes. |
DEX size | The size of the DEX file in bytes. |
App name | The application name of the APK. |
Package name | The unique package name of the APK. |
App permission | The list of permissions requested by the app, indicating the resources and data the app needs access to. |
App main activity | The main activity of the app and entry point when users launch the app. |
App all activity | The complete list of all activities defined in the app, representing the different screens and interactions available within the app. |
Services | The list of all services used by the app, which are components that run in the background to perform long-running operations. |
Receivers | The list of all broadcast receivers in the app, which are components that respond to system-wide broadcast announcements. |
Libraries | The list of all libraries used by the app, which can include third-party libraries that provide additional functionality and support. |
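As an illustration of how the features in the table could be collected, the following is a minimal sketch and not the pipeline used to build BCG. It assumes Androguard's `AnalyzeAPK` helper and the accessor methods of its `APK` object; the function names `extract_features` and `load_and_extract`, the dictionary keys, and the choice to sum the sizes of all DEX files are assumptions made for this example.

```python
import os
from typing import Any, Dict


def extract_features(apk: Any, apk_size: int) -> Dict[str, Any]:
    """Collect the tabulated features from an Androguard-style APK object.

    `apk` is expected to expose the accessor methods of Androguard's APK
    class; `apk_size` is the size of the APK file in bytes.
    """
    return {
        "apk_size": apk_size,
        # Assumption: DEX size is the total over all DEX files in the APK.
        "dex_size": sum(len(dex) for dex in apk.get_all_dex()),
        "app_name": apk.get_app_name(),
        "package_name": apk.get_package(),
        "permissions": sorted(apk.get_permissions()),
        "main_activity": apk.get_main_activity(),
        "activities": sorted(apk.get_activities()),
        "services": sorted(apk.get_services()),
        "receivers": sorted(apk.get_receivers()),
        "libraries": sorted(apk.get_libraries()),
    }


def load_and_extract(path: str) -> Dict[str, Any]:
    """Parse the APK at `path` with Androguard and extract its features."""
    # Imported lazily so this module loads even without Androguard installed.
    from androguard.misc import AnalyzeAPK

    apk, _dex_list, _analysis = AnalyzeAPK(path)
    return extract_features(apk, os.path.getsize(path))
```

Calling `load_and_extract("sample.apk")` (a hypothetical path) would return one dictionary per APK. The textual fields would still need to be encoded and reduced as described above before being used as graph attributes.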
Our original dataset comprises 146 GB of APK data, 114 GB of actual Function Call Graph (FCG) data, and 5.4 GB of hashed FCG data. All the data (including metadata) can be downloaded separately.
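To make the idea of hashing FCGs to detect duplicates concrete, here is a minimal sketch of one way APKs with different names but identical call graphs could be identified. The hashing scheme actually used for BCG is not described here, so everything below is an assumption: each FCG is represented as a list of (caller, callee) function-name pairs, and the helper names `fcg_hash` and `deduplicate` are hypothetical. The approach matches graphs with identical labeled edge sets only; it does not test graph isomorphism and ignores isolated nodes.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (caller, callee) function names


def fcg_hash(edges: Iterable[Edge]) -> str:
    """Return a SHA-256 digest of the canonicalized edge set of an FCG."""
    # Sorting and de-duplicating edges makes the digest order-independent.
    canonical = "\n".join(f"{src}\t{dst}" for src, dst in sorted(set(edges)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(fcgs: Dict[str, List[Edge]]) -> Dict[str, str]:
    """Map each distinct FCG digest to the first APK name (sorted) with it."""
    unique: Dict[str, str] = {}
    for name in sorted(fcgs):
        unique.setdefault(fcg_hash(fcgs[name]), name)
    return unique
```

For example, two APKs whose edge lists contain the same calls in a different order would collapse to a single entry, so only one of them would be kept in the evaluation set.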