# LocustDB [![Build Status][bi]][bl] [![Crates.io][ci]][cl] [![Gitter][gi]][gl]

[bi]: https://github.com/cswinter/LocustDB/workflows/Test/badge.svg
[bl]: https://github.com/cswinter/LocustDB/actions
[ci]: https://img.shields.io/crates/v/locustdb.svg
[cl]: https://crates.io/crates/locustdb/
[gi]: https://badges.gitter.im/LocustDB/Lobby.svg
[gl]: https://gitter.im/LocustDB/Lobby

An experimental analytics database aiming to set a new standard for query performance and storage efficiency on commodity hardware. See [How to Analyze Billions of Records per Second on a Single Desktop PC][blogpost] and [How to Read 100s of Millions of Records per Second from a Single Disk][blogpost-2] for an overview of current capabilities.

## Usage

Download the [latest binary release][latest-release], which can be run from the command line on most x64 Linux systems, including Windows Subsystem for Linux.

For example, to load the file `test_data/nyc-taxi.csv.gz` in this repository and start the repl, run:

```Bash
./locustdb --load test_data/nyc-taxi.csv.gz --trips
```

When loading `.csv` or `.csv.gz` files with `--load`, the first line of each file is assumed to be a header containing the names of all columns. The type of each column is derived automatically, but this may break for columns that contain a mixture of numbers, strings, and empty entries.

To persist data to disk in LocustDB's internal storage format (which allows fast queries from disk after the initial load), specify the storage location with `--db-path`.

When creating or opening a persistent database, LocustDB opens a large number of files and may crash if the limit on the number of open files is too low. On Linux, you can check the current limit with `ulimit -n` and set a new limit with e.g. `ulimit -n 4096`.

The `--trips` flag configures the ingestion schema for loading the dataset of 1.46 billion taxi rides, which can be downloaded [here][nyc-taxi-trips].

For additional usage info, invoke with `--help`:

```Bash
$ ./locustdb --help
LocustDB 0.2.1
Clemens Winter
Massively parallel, high performance analytics database that will rapidly devour all of your data.

USAGE:
    locustdb [FLAGS] [OPTIONS]

FLAGS:
    -h, --help             Prints help information
        --mem-lz4          Keep data cached in memory lz4 encoded. Decreases memory usage and query speeds.
        --reduced-trips    Set ingestion schema for select set of columns from nyc taxi ride dataset
        --seq-disk-read    Improves performance on HDD, can hurt performance on SSD.
        --trips            Set ingestion schema for nyc taxi ride dataset
    -V, --version          Prints version information

OPTIONS:
        --db-path             Path to data directory
        --load                Load .csv or .csv.gz files into the database
        --mem-limit-tables    Limit for in-memory size of tables in GiB [default: 8]
        --partition-size      Number of rows per partition when loading new data [default: 65536]
        --readahead           How much data to load at a time when reading from disk during queries in MiB [default: 256]
        --schema              Comma separated list specifying the types and (optionally) names of all columns in files specified by `--load` option.
                              Valid types: `s`, `string`, `i`, `integer`, `ns` (nullable string), `ni` (nullable integer)
                              Example schema without column names: `int,string,string,string,int`
                              Example schema with column names: `name:s,age:i,country:s`
        --table               Name for the table populated with --load [default: default]
        --threads             Number of worker threads. [default: number of cores (12)]
```
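For instance, a hypothetical invocation combining several of these options (the file name, column names, table name, and path below are made up for illustration; the schema reuses the example from the help text) might look like:

```Bash
# Load a custom CSV file with an explicit schema into a named table,
# persisting it to disk so later runs can query it without reloading.
./locustdb \
    --load people.csv \
    --schema name:s,age:i,country:s \
    --table people \
    --db-path /tmp/locustdb-data
```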
## Goals

A vision for LocustDB.

### Fast

Query performance for analytics workloads is best-in-class on commodity hardware, both for data cached in memory and for data read from disk.

### Cost-efficient

LocustDB automatically achieves spectacular compression ratios, has minimal indexing overhead, and requires fewer machines to store the same amount of data than any other system. The trade-off between performance and storage efficiency is configurable.

### Low latency

New data is available for queries within seconds.

### Scalable

LocustDB scales seamlessly from a single machine to large clusters.

### Flexible and easy to use

LocustDB should be usable with minimal configuration or schema-setup as:

- a highly available distributed analytics system continuously ingesting data and executing queries
- a commandline tool/repl for loading and analysing data from CSV files
- an embedded database/query engine included in other Rust programs via cargo

## Non-goals

Until LocustDB is production ready, these are distractions at best, if not wholly incompatible with the main goals.

### Strong consistency and durability guarantees

- small amounts of data may be lost during ingestion
- when a node is unavailable, queries may return incomplete results
- results returned by queries may not represent a consistent snapshot

### High QPS

LocustDB does not efficiently execute queries inserting or operating on small amounts of data.

### Full SQL support

- All data is append only and can only be deleted/expired in bulk.
- LocustDB does not support queries that cannot be evaluated independently by each node (large joins, complex subqueries, precise set sizes, precise top n).

### Support for cost-inefficient or specialised hardware

LocustDB does not run on GPUs.

## Compiling from source

1. Install Rust: [rustup.rs][rustup]

2. Clone the repository:

   ```Bash
   git clone https://github.com/cswinter/LocustDB.git
   cd LocustDB
   ```

3. Compile with `--release` for optimal performance:

   ```Bash
   cargo run --release --bin repl -- --load test_data/nyc-taxi.csv.gz --reduced-trips
   ```

### Running tests or benchmarks

`cargo test`

`cargo bench`

[nyc-taxi-trips]: https://www.dropbox.com/sh/4xm5vf1stnf7a0h/AADRRVLsqqzUNWEPzcKnGN_Pa?dl=0
[blogpost]: https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/
[blogpost-2]: https://clemenswinter.com/2018/08/13/how-read-100s-of-millions-of-records-per-second-from-a-single-disk/
[rustup]: https://rustup.rs/
[latest-release]: https://github.com/cswinter/LocustDB/releases/download/v0.1.0-alpha/locustdb-0.1.0-alpha-x64-linux
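Both commands accept a filter string to run a subset of tests or benchmarks (standard Cargo behavior; the substring below is illustrative, not a test name from this repository):

```Bash
# Run only the tests whose names contain the given substring
cargo test group_by

# Cargo always builds benchmarks with optimizations; no --release flag needed
cargo bench
```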