Crates.io | steeldb |
lib.rs | steeldb |
version | 0.1.4 |
source | src |
created_at | 2023-12-30 23:09:34.90691 |
updated_at | 2024-01-01 19:06:28.885071 |
description | A simple database built from scratch in Rust |
homepage | |
repository | https://github.com/paolorechia/steeldb |
max_upload_size | |
id | 1084460 |
size | 82,594 |
This is a study repository. This is mostly for personal use. Building a Database from scratch in Rust. Why not? :)
Source code: https://github.com/paolorechia/steeldb/
*Database: https://docs.rs/steeldb/latest/steeldb *Parser: https://docs.rs/steeldb-parser/latest/steeldb_parser/
Should be as simple:
cargo add steeldb
cargo run
This should start a REPL:
------------------------------------------------
| |
| SteelDB |
| version: 0.1.0 |
| |
------------------------------------------------
Type 'exit;' to leave this shell
Current supported commands: [select]
>>
The only implemented clause is select, which selects columns of a previously constructed table. For example:
>> select name, annual_salary;
|---------------------------------|
| name | annual_salary |
|---------------------------------|
| John Man | 60000 |
| Lenon | 200000 |
| Mary | 3012000 |
|---------------------------------|
>>
Commands should always add with a ;
.
If you simply try the command above, you will instead see:
>> select name;
"Os { code: 2, kind: NotFound, message: \"No such file or directory\" }"
<------------------ COMMAND FAILED ------------------>
"TableNotFound"
<---------------------------------------------------->
>>
This is because the table must be pre-created. You can either create one using cargo test
and copying it,
or copying and pasting this into the file .steeldb/data/test_table.columnar
:
TABLE COLUMNAR FORMAT HEADER
Field name: final_grade; Type: f32; Number of elements: 3
4.0
3.2
5
Field name: name; Type: String; Number of elements: 3
John Man
Lenon
Mary
Field name: annual_salary; Type: i32; Number of elements: 3
60000
200000
3012000
As you can see, the table format is very naive and verbose. It stores data in ASCII. It's not meant to be efficient and will probably be replaced in the future.
This is not a binding roadmap.
A REPL shell
A tokenizer
A parser
A code generator
A virtual machine that interprets the generated code
A B+ Tree
Pager
OS Interface
We can simplify some components in the first iteration, so we have first a working end-to-end system. We can then tweak the individual components to have more capabilities.
Here are some simplifications we can do for our first iteration:
Our roadmap for the first iteration might end up looking like this:
Handle select columns properly in read method [x]
Test that everything is working as expected [x]
Tag v1.0 [x]
Add proper documentation to project []
Add another API besides the REPL to query the database []
Adds proper logging strategy to the server []
Add configuration file []
Add create table command []
Add drop table command []
Add alter table command []
Multiple tables query support (add FROM clause support) []
Support filters (add basic WHERE clause support) []
Add proper documentation to project []
Update Cargo.toml
Handle concurrency: needs research on approaches to use []
Implement B+ or similar algorithm.
Adds caching (or pagination) []
Support for transactions []
Implement inner join []
Implement left / right join []
Implement outer join []
Implement nested operations, including WHERE IN (SELECT) []
Implement aggregations []
Add date and timestamp types []
Implement more SQL functions []
Anything important missing in SQL standards
Implement the Spark.SQL API in Rust
Build a SDK in Rust to use it
Build Python bindings for the Rust SDK
If we reached this point, we actually have an impressive system :)
This is pobably the simplest component. We just need to implement a CLI shell that reads lines of input until the command end symbol is presented (like ';'). It then forwards the parsed string into the next layers of our system, and returns the result.
One should be able to generate them using https://github.com/lalrpop/lalrpop. Tutorial: http://lalrpop.github.io/lalrpop/tutorial/001_adding_lalrpop.html
This should be somehow integrated into the parser. It will become clearer once the first parser is built, how this can be written (if possible, avoiding too much coupling).
Probably the easiest way is to implement it as a stack, so it can handled nested commands. For each parsed command, we push it to the stack, and the virtual machine pop it, executing it. This works assuming we parse the commands in a depth-first way, e.g., the most inner command is always parsed first, and it's result is available to the next command execution. This should happen naturally using an auto-generated lexer/parser with lalrpop, as long as the grammar is correctly defined.
To keep it simple, however, we should start not supporting nested commands. We can still have the stack in place and ready for extension in the future.
The B+ Tree is wide-known algorithm, and is described in several places, for instance: https://en.wikipedia.org/wiki/B%2B_tree This will be skipped until much later! The goal is to first get an usable system/product that supports concurrency and possibly exposes an (HTTP) API to query data. This means things like transactions have to be possibly implemented first.