### MalwareDB
[![Test](https://github.com/malwaredb/malwaredb-rs/actions/workflows/test.yml/badge.svg)](https://github.com/malwaredb/malwaredb-rs/actions/workflows/test.yml)[![Lint](https://github.com/malwaredb/malwaredb-rs/actions/workflows/lint.yml/badge.svg)](https://github.com/malwaredb/malwaredb-rs/actions/workflows/lint.yml)[![Cross](https://github.com/malwaredb/malwaredb-rs/actions/workflows/release.yml/badge.svg)](https://github.com/malwaredb/malwaredb-rs/actions/workflows/release.yml)[![Crates.io Version](https://img.shields.io/crates/v/malwaredb)](https://crates.io/crates/malwaredb)[![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/malwaredb/malwaredb-rs/badge)](https://securityscorecards.dev/viewer/?uri=github.com/malwaredb/malwaredb-rs)[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/8234/badge)](https://www.bestpractices.dev/projects/8234)

Inspired by [VXCage](https://github.com/botherder/vxcage) and [VirusTotal](https://www.virustotal.com/), MalwareDB is a malware knowledge management system which handles the bookkeeping regarding malware/goodware samples: hashes, origination, similarity, file types, and more. Its intention is to help malware/cybersecurity researchers, forensic investigators, and others who have a need to handle malware, or other files of potentially unknown origin. This is very much a **work in progress** and **alpha-quality** project at present.

#### Key Features:
* Store malware, goodware, or unknown file samples.
* Categorize samples by:
  * Labels, a hierarchical taxonomy (not yet implemented)
  * Origin, the source of the sample.
* Group-based permissions: access to files depends on users' group membership
* Fetch samples by hash
* Search based on file similarity (requires the Postgres plugins mentioned below)
* Parse the files for features which may be useful for machine learning models
* Works on any modern operating system
* Allow encrypting the files on disk so the server does not cause problems with endpoint security or anti-virus software
* Supports the [CaRT](https://github.com/CybercentreCanada/cart) format using the [default key](https://github.com/CybercentreCanada/cart-rs/blob/f366de982a92efd6137988950fc6277cb11f765b/src/cipher.rs#L15).

#### Requirements:
* [Postgres](http://postgresql.org/) database server
* [Rust](https://www.rust-lang.org/) to compile
* [libmagic](https://www.darwinsys.com/file/), the library behind the `file` command. Install `libmagic-dev` on Linux, or run `brew install libmagic` on macOS with [Homebrew](https://brew.sh/).
  * On Windows: `cargo install cargo-vcpkg; vcpkg install libmagic; vcpkg integrate install`
  * The `MAGIC` environment variable may be used to specify the paths for the libmagic database.
* Similarity hash extensions for Postgres:
  * [LZJD](https://github.com/malwaredb/LZJD)
  * [SSDeep](https://github.com/malwaredb/ssdeep_psql)
  * [TLSH](https://github.com/malwaredb/tlsh_pg)
* Alternatively, use [docker](https://github.com/malwaredb/docker) which provides a container with the Postgres extensions already installed (though they still have to be activated, see the [readme](https://github.com/malwaredb/docker/blob/main/README.md)).

#### Status
This project is in active development and not yet stable, nor are all the features implemented.

#### Installation
Install from source. Check out the repository and build (recommended), or build from crates.io:
  * `cargo install malwaredb-client`
  * `cargo install malwaredb --features=admin,admin-gui,sqlite,vt` (activates all the features, requires some external dependencies)

Server Features (which are all opt-in):
  * `admin`: command-line administrative functionality
  * `admin-gui`: [Slint](https://slint.dev/)-powered GUI. Tested on macOS, Linux, and Windows; it may work on other platforms.
  * `sqlite`: Allow the use of [SQLite](https://www.sqlite.org/) as a database backend. Should only be used for testing and evaluation, as it lacks the similarity optimisations we have for Postgres.
  * `vt`: Allow the VirusTotal functionality (caching AV data for contained samples). The functionality must still be enabled separately.

#### Future
* Planned features:
  * Web interface as a separate application
  * GUI applications
  * Support for [Confidential Computing](https://en.wikipedia.org/wiki/Confidential_computing)
    * Initially for Enarx: [Website](https://enarx.dev/), [Code](https://github.com/enarx/enarx)
    * Learn more at the [Confidential Computing Consortium](https://confidentialcomputing.io/) website.
  * Encrypting samples, if stored, so that anti-virus software on the host system doesn't trigger alerts and the samples can't cause accidental infection.
  * Train ML models based on features of the malicious & benign files:
    * Domain-specific features (parsed features from specific file types)
    * Type-agnostic features (information about any sequence of bytes, such as n-grams, entropy, length, etc)
    * Use user input for tags/labels
    * Derive labels from VirusTotal information with tools like ClarAVy ([Code](https://github.com/NeuromorphicComputationResearchProgram/ClarAVy), [Paper](https://arxiv.org/abs/2310.11706)) or [AVClass2](https://arxiv.org/abs/2006.10615).
* Potential features:
  * File storage backends for HDFS, S3, others?
* Something missing? Get in touch: file an [issue](https://github.com/malwaredb/malwaredb-rs/issues/new) or start a [discussion](https://github.com/orgs/malwaredb/discussions)!

### Getting Started:
0. Compile from source, ideally with `--features=admin,sqlite`.
1. Create your configuration file; see the example file in the root of the repository. To use SQLite, compile with the `sqlite` feature. SQLite is intended for testing and evaluation rather than for use in a real environment.
  * If the storage section is empty (it's optional), MalwareDB stores only the metadata about the files and not the samples themselves. In that case, retrieving the original file is not possible.
2. Place the config file in `/etc/mdb_server/mdb_config.toml` on Linux, or `/usr/local/etc/mdb_server/mdb_config.toml` on FreeBSD for automatic config file detection. Otherwise, run with `mdb_server run load /path/to/file`, or `mdb_server run config` to specify arguments on the command line. Run with `--help` to see details.

#### Administrative Items
1. Since you compiled with the `admin` feature above, you can run `mdb_server admin --help` to see administrative options. Admin options require `-c /path/to/config.toml` to prevent making accidental changes. Note: the `admin` command interacts with the database directly, so the server does not need to be running.
2. List users with: `mdb_server admin -c /path/to/config.toml list users`. There is a default admin user, but no password is set. So let's set one.
3. Reset Admin's password: `mdb_server admin -c /path/to/config.toml reset-password --uname admin`. You'll be prompted for the password and it won't echo. The admin user doesn't do anything special at the moment, but that will change.
4. Files are organized by sources, and groups have access to sources. Groups and sources must therefore be added and linked before files can be added.
  * Create a source, look at the command line options: `mdb_server admin -c /path/to/config.toml create source --help`
  * Create a group, look at the command line options: `mdb_server admin -c /path/to/config.toml create group --help`
  * Add the group to the source, look at the command line options: `mdb_server admin -c /path/to/config.toml add-group-to-source --help`
  * Add the user to the group, look at the command line options: `mdb_server admin -c /path/to/config.toml add-user-to-group --help`
5. Now, use the client to log in with `mdb_client` while `mdb_server` is running: `mdb_client login http://localhost:8080 admin`, replacing the URL with the actual IP and port you chose in the server configuration file.
6. Test that the client works with `mdb_client whoami`; it should show the user information and the available groups and sources.
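
The relationship between users, groups, and sources from step 4 can be sketched as follows. This is only an illustrative in-memory model under stated assumptions: the `Access` type, its method names, and the example group and source names are hypothetical and are not MalwareDB's actual schema or API, since the server keeps this information in its database.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory model of the access rules (not MalwareDB's real schema).
#[derive(Default)]
struct Access {
    /// User name -> names of the groups the user belongs to.
    user_groups: HashMap<String, HashSet<String>>,
    /// Group name -> names of the sources the group may access.
    group_sources: HashMap<String, HashSet<String>>,
}

impl Access {
    /// Mirrors the idea of `add-user-to-group`.
    fn add_user_to_group(&mut self, user: &str, group: &str) {
        self.user_groups
            .entry(user.to_string())
            .or_default()
            .insert(group.to_string());
    }

    /// Mirrors the idea of `add-group-to-source`.
    fn add_group_to_source(&mut self, group: &str, source: &str) {
        self.group_sources
            .entry(group.to_string())
            .or_default()
            .insert(source.to_string());
    }

    /// A user may access a source if at least one of their groups is linked to it.
    fn can_access(&self, user: &str, source: &str) -> bool {
        self.user_groups.get(user).map_or(false, |groups| {
            groups.iter().any(|group| {
                self.group_sources
                    .get(group)
                    .map_or(false, |sources| sources.contains(source))
            })
        })
    }
}

fn main() {
    let mut access = Access::default();
    access.add_group_to_source("analysts", "honeypot");
    // The user is not in any group yet, so access is denied.
    println!("before: {}", access.can_access("admin", "honeypot"));
    access.add_user_to_group("admin", "analysts");
    // Once the user joins a group linked to the source, access is granted.
    println!("after: {}", access.can_access("admin", "honeypot"));
}
```

The sketch shows why all four sub-steps are needed: a user without a group, or a group without a linked source, has nothing to submit files to.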

### Loading Files
* Files may be uploaded using the client: `mdb_client submit-samples -s SOURCE_ID /path/to/files_or_dirs`. Paths may be to files or directories, and more than one path may be specified. All items will be uploaded to the same source (specified by the ID). If the file is a Zip, it will be decompressed in memory and each file submitted individually as long as it's not a known document type (like MS Office .docx, .xlsx, etc.).
* Files may also be uploaded using the admin command from the server: `mdb_server admin -c /path/to/config.toml -s SOURCE_ID -u USER_ID /path/to/files_or_dirs`. With the server admin function, a user ID must also be provided. Otherwise, this works the same way as the client: directories and files may be provided, they will be associated with the same source, and Zip files will be decompressed in memory and submitted individually if not a known MS Office format.

### Downloading Files
* Using the client, a sample may be retrieved using its hash. Hash types are detected by length, and supported hashes are: MD5, SHA1, SHA256, SHA384, and SHA512.
* `mdb_client retrieve-sample SPECIFY_HASH_HERE`. One hash per request, and it will be downloaded if it exists, and if the user has access to the group and source to which the sample is linked.
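
Detecting the hash type by length works because each supported algorithm produces a digest of a distinct size. The sketch below illustrates the idea, assuming hashes are given as hexadecimal strings; the `HashKind` enum and `detect_hash_kind` function are hypothetical names for illustration and are not taken from the MalwareDB code.

```rust
/// Hypothetical enum of the supported hash types (names are illustrative).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum HashKind {
    Md5,
    Sha1,
    Sha256,
    Sha384,
    Sha512,
}

/// Guess the hash type from the length of its hexadecimal representation.
/// Returns `None` for non-hex input or an unsupported length.
fn detect_hash_kind(hex: &str) -> Option<HashKind> {
    let hex = hex.trim();
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Each hex character encodes 4 bits, so length * 4 is the digest size in bits.
    match hex.len() {
        32 => Some(HashKind::Md5),     // 128 bits
        40 => Some(HashKind::Sha1),    // 160 bits
        64 => Some(HashKind::Sha256),  // 256 bits
        96 => Some(HashKind::Sha384),  // 384 bits
        128 => Some(HashKind::Sha512), // 512 bits
        _ => None,
    }
}

fn main() {
    // MD5 digest of empty input, used here only as a sample 32-character value.
    let md5_of_empty = "d41d8cd98f00b204e9800998ecf8427e";
    println!("{:?}", detect_hash_kind(md5_of_empty));
}
```

Because the five digest sizes differ, no extra flag is needed to say which algorithm a hash belongs to.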

### Searching for Similar Files
* Using the client, similarity hashes are calculated and submitted to the server. The sample is not sent to the server! Just hashes.
* `mdb_client find-similar /path/to/file.bin`. The same restriction with downloading applies: the user must have access to the group and source to which a potential similar file is linked. The output will be the hashes of the similar files, and by what means (similarity algorithm) the result is similar.

### Misc. Client Commands
* `mdb_client server-info` displays some statistics about the server, including version numbers, database type, and total number of files.
* `mdb_client server-types` displays a list and magic numbers of supported file types.

### Goals
Some overall goals and design:
* MalwareDB shall be easy to use.
* MalwareDB shall be a place to store *your* data and use a simple database schema so that other applications may interact with the data directly.
* MalwareDB shall collect and enrich malicious and benign files so that some features may be used for machine learning models.
* MalwareDB should provide reusable components which may benefit other projects, even if not directly related.