# mongo_sync Mongodb realtime synchronizer, which is similar to [py-mongo-sync](https://github.com/caosiyang/py-mongo-sync) # Features - Support database level and collection full and concurrent synchronize. You can use `--collection-concurrent` to define how many threads to sync a database, use `--doc-concurrent` to define how many threads to sync a collection. - Support oplog based synchronize, so we can synchronize incremental data in realtime. - Support daily rotation log, you can use it through `--log-path` option. Or else log information will be output to stdout. # Support Mongodb 3.6+ (because of official mongodb driver only support mongodb 3.6+) # Intall The recommended way to install mongo_sync is using `cargo`: ```shell cargo +nightly install mongo_sync ``` You can download [released binary](https://github.com/WindSoilder/mongo_sync/releases/) as well. # Running tests? To running integration tests, you need to config `SYNCER_TEST_SOURCE` to a testing mongodb uri, or `mongodb://localhost:27017` will be used. # Example To run synchronizer, you need to start oplog_syncer to make a realtime mongodb oplog sync first. Then you can run `db_sync` to sync database in realtime. ## oplog_syncer ```shell ./target/release/oplog_syncer --src-uri "mongodb://localhost:27017" --oplog-storage-uri "mongodb://localhost:27018/" ``` ## db_sync ```shell db_sync --src-uri "mongodb://localhost:27017/?authSource=admin" --oplog-storage-uri "mongodb://localhost:27018/?authSource=admin" --target-uri "mongodb://localhost:27019" --db test_db ``` Note that the `--oplog-storage-uri` in oplog_syncer and db_sync must be the same. # Usage help ## oplog_syncer ```shell USAGE: oplog_syncer [OPTIONS] --src-uri --oplog-storage-uri FLAGS: -h, --help Prints help information -V, --version Prints version information OPTIONS: --log-path log file path, if not specified, all log information will be output to stdout -o, --oplog-storage-uri target oplog storage uri -s, --src-uri source database uri, must be a mongodb cluster ``` ## db_sync ```shell USAGE: db_sync [OPTIONS] --src-uri --target-uri --oplog-storage-uri --db FLAGS: -h, --help Prints help information -V, --version Prints version information OPTIONS: --collection-concurrent how many threads to sync a database -c, --colls ... collections to sync, default sync all collections inside a database -d, --db database to sync --doc-concurrent how many threads to sync a collection --log-path log file path, if no specified, all log information will be output to stdout -o, --oplog-storage-uri mongodb uri which save oplogs, it's saved by `oplog_syncer` binary -s, --src-uri source mongodb uri -t, --target-uri target mongodb uri ``` # The basic arthitecture diagram ``` ┌───────────────┐ │ target db │ └───┬───────────┘ │ xxxx ▼ xxxxxxx xx ┌───────┐ x xx │db_sync│ x └───────┘ x xxxxxx ▲ ▲ x xx │ │ xxxx │ │ │ │ Full dump │ │Incr dump │ │(Real time) │ │ │ │ │ │ │ │ ┌─────────────────────┐ │ └─────────┤oplog storage db │ │ └──────▲──────────────┘ │ xxxxxxx │ xxxxx │ x ┌──────┴──────┐x │ x │Oplog syncer │x Sync oplog from source cluster │ x └──────▲──────┘x to oplog storage in real time │ xxxxxxxxx│ xxxxxxx │ │ │ │ │ │ │ ┌──────┴───────┐ └────────────┤Source cluster│ └──────────────┘ ``` According to the diagram, you can find that there are 2 basic programs provided by mongo_sync 1. oplog syncer: sync mongodb cluster's oplog to target `oplog storage db`. 2. db sync: sync data from `source cluster` to `target db`. # Benchmark It's not strictly benchmark test, I just test it manually. Scenario: When source cluster insert 50,000 records, how long the `target_db` can synchronizer these new 50,000 insert. My testing result: `db_sync` takes about 50 seconds to sync these update, and `py-mongo-sync` takes about 225 seconds to sync these update. In general, it's about 3.5x faster than `py-mongo-sync`. And, please note that `50 seconds` is not accrutely, it highly depends on your database and running machine performance. # How does the core work? - oplog_syncer just use [tailable cursor](https://docs.mongodb.com/manual/core/tailable-cursors/) to read mongodb oplogs in realtime. - db_sync full dump just use multithread with [find](https://docs.mongodb.com/manual/reference/method/db.collection.find/) to read data. - db_sync incr dump use [oplog](https://docs.mongodb.com/manual/core/replica-set-oplog/) to sync incremental update (CRUD, database command.), and this is why source database must be a cluster. # Notes 1. In incremental state, for now it only support the following command to be sync (which enough for my personal use.): - rename collection - dorp collection - create collection - drop indexes - create indexes 2. I haven't test mongo sharding as target, but it should be ok to work. 3. during running `oplog_syncer`, `oplog storage db` will create and using databse named `source_oplog`, and create and using collection named `source_oplog`. For now this is hardcoded. 4. during running `db_sync`, target databse will create a new collection named `oplog_records`, it saves the latest oplog timestamp applied to the database.