cooklang-sync-client

Crates.io	cooklang-sync-client
lib.rs	cooklang-sync-client
version	0.2.4
source	src
created_at	2024-06-19 11:56:52.119887
updated_at	2025-01-10 20:36:15.222218
description	A client library for cooklang-sync
repository	https://github.com/cooklang/cooklang-sync
id	1276796
size	130,550

Alexey Dubovskoy (dubadub)

README

REMOTE SCHEMA

Server File Journal - stores all changes

- Namespace ID (NSID)
- Relative path in namespace
- Journal ID (JID): monotonically increasing within a namespace
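A minimal in-memory sketch of such a journal, to make the "monotonically increasing within a namespace" rule concrete. The type names `JournalEntry` and `Journal`, the methods `append` and `list_since`, and the use of `u64` for IDs are assumptions made for illustration; a real server would persist the entries instead of holding them in a `Vec`.

```rust
use std::collections::HashMap;

/// One row of the server file journal (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct JournalEntry {
    pub nsid: u64,    // Namespace ID
    pub path: String, // path relative to the namespace root
    pub jid: u64,     // Journal ID, increasing within its namespace
}

/// In-memory stand-in for the journal; nothing is persisted here.
#[derive(Default)]
pub struct Journal {
    entries: Vec<JournalEntry>,
    last_jid: HashMap<u64, u64>, // latest JID handed out per namespace
}

impl Journal {
    /// Records a change and returns the JID assigned to it.
    /// JIDs start at 1 and grow by one per change, separately per namespace.
    pub fn append(&mut self, nsid: u64, path: &str) -> u64 {
        let jid = self.last_jid.entry(nsid).or_insert(0);
        *jid += 1;
        let assigned = *jid;
        self.entries.push(JournalEntry {
            nsid,
            path: path.to_string(),
            jid: assigned,
        });
        assigned
    }

    /// Returns every entry in `nsid` whose JID is greater than `cursor`.
    pub fn list_since(&self, nsid: u64, cursor: u64) -> Vec<&JournalEntry> {
        self.entries
            .iter()
            .filter(|e| e.nsid == nsid && e.jid > cursor)
            .collect()
    }
}
```

A client that remembers the last JID it has seen can pass it as `cursor` and receive only the newer entries for its namespace.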

BlockServer - can store block or retrieve block

RocksDB might work

where to store chunks? s3 is too expensive for such small files, maybe a cheap distributed key/value db?

LOCAL DB SCHEMA

files

- jid: integer
- path // relative to current dir
- format: text|binary
- modified: unix timestamp
- size: integer
- is_symlink: bool
- checksum: varchar
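The same row could be mirrored in Rust roughly like this. It is only a sketch: the names `FileRecord` and `FileFormat`, the integer widths, and the choice to model a not-yet-assigned jid as `None` are assumptions, not something the schema above states.

```rust
/// How the file content is treated (the schema lists `text|binary`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileFormat {
    Text,
    Binary,
}

/// One row of the local `files` table (illustrative mapping).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileRecord {
    pub jid: Option<i64>, // assumption: None until the server assigns a jid
    pub path: String,     // relative to current dir
    pub format: FileFormat,
    pub modified: i64,    // unix timestamp
    pub size: i64,
    pub is_symlink: bool,
    pub checksum: String,
}

impl FileRecord {
    /// True when the row has no jid yet, i.e. it was never committed.
    pub fn needs_commit(&self) -> bool {
        self.jid.is_none()
    }
}
```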

USE-CASES

client needs to update a file from meta server (MS)
- S during polling receives that file /path/bla was updated
- sends list request passing namespace and current cursor
- MS returns all JIDs since passed one and their hashes (maybe except when the same file was updated multiple times, returns only the last one?)
- S
client needs to upload a file to server
- S tries to commit current file it has commit(/path/bla, [h1,h2,h3])
- MS returns back list of
program just starts
- S checks the latest journal_id
- if local latest journal_id is the same it will do nothing
- if local latest journal_id
file was removed locally
file was moved locally
file was renamed
one line in a file was edited
one line in a file was added
one line in a file was removed

if the latest remote jid is bigger, sync: download from remote. if metadata or size is different, upload to remote and, after the commit, store the result in the local db
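The rule above can be written down as a small pure function. This is a sketch under assumptions: `SyncAction`, `Meta` and `decide` are hypothetical names, and checking the remote jid before the local metadata is one possible ordering (whether to pull or reindex first is still an open question in the TODO list).

```rust
/// What the client should do for one file (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SyncAction {
    Download,
    Upload,
    Nothing,
}

/// Metadata compared between the local DB row and the file on disk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Meta {
    pub modified: i64, // unix timestamp
    pub size: u64,
}

/// Remote changes win first; otherwise any metadata difference means upload.
pub fn decide(local_jid: u64, remote_jid: u64, stored: Meta, on_disk: Meta) -> SyncAction {
    if remote_jid > local_jid {
        SyncAction::Download
    } else if stored != on_disk {
        SyncAction::Upload
    } else {
        SyncAction::Nothing
    }
}
```

Note that this sketch ignores the conflict case where the remote jid is bigger and the local file has changed as well.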

do I need hierarchy of services or they should be all independent?
how sharing should work?
how to thread it? multiple modules and multiple files
do I need to sync file metadata as well?

We have separate threads for sniffing the file system, hashing, commit, store_batch, list, retrieve_batch, and reconstruct, allowing us to pipeline parallelize this process across many files. We use compression and rsync to minimize the size of store_batch/retrieve_batch requests.
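A tiny sketch of what such a pipeline can look like with standard-library threads and channels. Only two of the stages are modelled, and the work in each stage is a placeholder (the "hash" is just the path length); `run_pipeline` is a hypothetical name, not part of this crate.

```rust
use std::sync::mpsc;
use std::thread;

/// Pushes paths through a two-stage pipeline: "hashing" then "commit".
/// Each stage runs on its own thread, so file N+1 can be hashed while
/// file N is being committed.
pub fn run_pipeline(paths: Vec<String>) -> Vec<(String, usize)> {
    let (to_hash, hash_rx) = mpsc::channel::<String>();
    let (to_commit, commit_rx) = mpsc::channel::<(String, usize)>();

    // Stage 1: placeholder hashing, forwards (path, fake hash) downstream.
    let hasher = thread::spawn(move || {
        for path in hash_rx {
            let fake_hash = path.len();
            if to_commit.send((path, fake_hash)).is_err() {
                break; // downstream stage is gone
            }
        }
        // `to_commit` is dropped here, which lets stage 2 finish.
    });

    // Stage 2: placeholder commit, just collects what it receives.
    let committer = thread::spawn(move || commit_rx.into_iter().collect::<Vec<_>>());

    for path in paths {
        to_hash.send(path).expect("hashing stage stopped early");
    }
    drop(to_hash); // signal end of input

    hasher.join().expect("hashing thread panicked");
    committer.join().expect("commit thread panicked")
}
```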

SYNCER

checks if the database has files with no jid assigned
when it finds a file with no jid assigned it will try to commit it, after committing it will update the local DB with the new jid
if chunk is not present locally it will try to download it
if chunk is not present remotely it will try to upload it

commit("breakfast/Mexican Style Burrito.cook", "h1,h2,h3");
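One pass of the commit step could look roughly like the sketch below. The names `Row` and `sync_pass` are hypothetical, and the commit call is passed in as a closure so the example stays independent of any real network client; a `None` result stands for a commit the server did not accept.

```rust
/// Local DB row as seen by the syncer (illustrative subset of columns).
pub struct Row {
    pub path: String,
    pub hashes: Vec<String>,
    pub jid: Option<u64>, // None means "no jid assigned yet"
}

/// Commits every row without a jid and stores the jid the server returns.
/// Returns how many rows were committed in this pass.
pub fn sync_pass<F>(rows: &mut [Row], mut commit: F) -> usize
where
    F: FnMut(&str, &[String]) -> Option<u64>,
{
    let mut committed = 0;
    for row in rows.iter_mut().filter(|r| r.jid.is_none()) {
        if let Some(jid) = commit(&row.path, &row.hashes) {
            row.jid = Some(jid); // update local DB with the new jid
            committed += 1;
        }
    }
    committed
}
```

Rows whose commit fails keep `jid: None`, so the next pass picks them up again.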

problem if by line? => seek won't work, need to store block size to do the seek effectively.
where to store chunks for not yet assembled file
how to understand that a new file created remotely
how to understand that file was deleted
how to understand that

INDEXER

sync between files and local DB on schedule (once a min, f.e.)
watches changes and triggers sync
will cleanup DB once a day

do I need to copy not changed jid? or just update updated? => it makes sense to update all
what happens on delete, move?

CHUNKER

The role of the Chunker is to deal with persistence of hashes and files. It operates on text files, and chunks are not a fixed size: each chunk is one line of a file.

given a path it will produce the list of hashes of the file: fn hashify(file_path: String) -> io::Result<Vec<String>>
given a path and a list of hashes it will save a new version of the file: fn save(file_path: String, hashes: Vec<String>) -> io::Result<()>. It should raise an error if the cache doesn't have content for a specific chunk hash
can read the content of a specific chunk from the cache: fn read_chunk(chunk: String) -> io::Result<String>
can write the content of a specific chunk to the cache: fn save_chunk(chunk: String, content: String) -> io::Result<()>
given two vectors of hashes it can check whether they are the same: fn compare_sets(left: Vec<String>, right: Vec<String>) -> bool
given a hash it can check whether the cache contains content for it or not: fn check_chunk(chunk: String) -> io::Result<bool>

strings will be short, 80-100 symbols. what should be used as the hashing function? what should the size of the hash be? I'd say square root of 10. You can test it!
empty files should be different from deleted
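A rough in-memory sketch of the Chunker functions listed above, to show how they fit together. Several things here are assumptions: the cache is a plain `HashMap` rather than anything persistent, arguments are borrowed (`&str`, `&[String]`) instead of owned, `save_chunk` and `check_chunk` return plain values because an in-memory map cannot fail, and `DefaultHasher` is only a placeholder since the choice of hash function is still open.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;

/// Line-based chunker with an in-memory chunk cache (hash -> line content).
#[derive(Default)]
pub struct Chunker {
    cache: HashMap<String, String>,
}

/// Placeholder hash: 64-bit DefaultHasher rendered as 16 hex digits.
fn hash_line(line: &str) -> String {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

impl Chunker {
    /// Splits the file into lines (newline kept), caches each, returns hashes.
    pub fn hashify(&mut self, file_path: &str) -> io::Result<Vec<String>> {
        let content = fs::read_to_string(file_path)?;
        let mut hashes = Vec::new();
        for line in content.split_inclusive('\n') {
            let hash = hash_line(line);
            self.cache.insert(hash.clone(), line.to_string());
            hashes.push(hash);
        }
        Ok(hashes)
    }

    /// Rebuilds the file from cached chunks; errors if any chunk is missing.
    pub fn save(&self, file_path: &str, hashes: &[String]) -> io::Result<()> {
        let mut content = String::new();
        for hash in hashes {
            content.push_str(&self.read_chunk(hash)?);
        }
        fs::write(file_path, content)
    }

    /// Returns the cached content for a chunk hash, or a NotFound error.
    pub fn read_chunk(&self, chunk: &str) -> io::Result<String> {
        self.cache.get(chunk).cloned().ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::NotFound,
                format!("no cached content for chunk {chunk}"),
            )
        })
    }

    /// Stores content under the given chunk hash.
    pub fn save_chunk(&mut self, chunk: String, content: String) {
        self.cache.insert(chunk, content);
    }

    /// Tells whether the cache has content for the given chunk hash.
    pub fn check_chunk(&self, chunk: &str) -> bool {
        self.cache.contains_key(chunk)
    }
}

/// Two hash lists describe the same file version if they are equal in order.
pub fn compare_sets(left: &[String], right: &[String]) -> bool {
    left == right
}
```

In this sketch an empty file yields `Ok(vec![])` from `hashify`, while a deleted file yields an `Err`, so the two cases stay distinguishable.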

TODO

bundling of uploads/downloads
read-only
namespaces
proper error handling
report error on unexpected cache behaviour
don't need to throw unknown error in each non-200 response
remove clone
limit max file
configuration struct
pull changes first or reindex locally first? research possible conflict scenarios
extract shared data structures to core
garbage collection on DB
test test test
metrics for monitoring (cache saturation, miss)
protect from ddos https://github.com/rousan/multer-rs/blob/master/examples/prevent_dos_attack.rs
auto-update client

open sourcing

how to keep it available for opensource (one user?)
add documentation
draw data-flow

Commit count: 106