# tsumugu

An HTTP(S) syncing tool with lower overhead, for OSS mirrors.

Instead of `HEAD`ing every single file, tsumugu parses directory listing HTML and downloads only files that do not seem to be up-to-date.

## Design goals

To successfully sync from these domains, where lftp/rclone fails or struggles:

- [x] http://download.proxmox.com/
- [x] https://download.docker.com/
- [x] https://dl.winehq.org/wine-builds/

## TODOs

- [x] Add "--include": Sync even if the file is excluded by `--exclude` regex.
- [x] Add support for currently supported Debian, Ubuntu, Fedora and RHEL versions to the `--include` regex.
  - Something like `--include debian/${DEBIAN_VERSIONS}`?
- [x] Check for APT/YUM repo integrity (avoid keeping old invalid metadata files)
  - (This is experimental and may not work well)

## Usage

```console
> ./tsumugu --help
A HTTP(S) syncing tool with lower overhead, for OSS mirrors

Usage: tsumugu

Commands:
  sync  Sync files from upstream to local
  list  List files from upstream
  help  Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

> ./tsumugu sync --help
Sync files from upstream to local

Usage: tsumugu sync [OPTIONS]

Arguments:
  The upstream URL
  The local directory

Options:
      --user-agent
          Customize tsumugu's user agent [default: tsumugu]
      --dry-run
          Do not download files and cleanup
      --threads
          Threads at work [default: 2]
      --no-delete
          Do not clean up after sync
      --max-delete
          Set max delete count [default: 100]
      --timezone-file
          You can set a valid URL for guessing. Set it to "no" to disable this behavior. By default it would recursively find the first file to HEAD for guessing
      --timezone
          Manually set timezone (+- hrs). This overrides timezone_file
      --retry
          Retry count for each request [default: 3]
      --head-before-get
          Do an HEAD before actual GET.
          Otherwise when head-before-get and allow-time-from-parser are not set, when GETting tsumugu would try checking if we still need to download it
      --parser
          Choose a main parser [default: nginx] [possible values: nginx, apache-f2, docker, directory-lister, lighttpd, caddy, fancy-index, gradle, fallback]
      --parser-match
          Choose supplementary parsers. Format: "parsername:matchpattern". matchpattern is a relative path regex. Supports multiple
      --exclude
          Excluded relative path regex. Supports multiple
      --include
          Included relative path regex (even if excluded). Supports multiple
      --skip-if-exists
          Skip relative path regex if they exist. Supports multiple
      --compare-size-only
          Relative path regex for those compare size only **after** HEAD (head_before_get on) or GET (head_before_get off)
      --allow-mtime-from-parser
          Allow mtime from parser if not available from HTTP headers
      --apt-packages
          (Experimental) APT Packages file parser to find out missing packages
      --yum-packages
          (Experimental) YUM Packages file parser to find out missing packages
      --ignore-nonexist
          Ignore 404 NOT FOUND as error when downloading files
      --auto-fallback
          Allow automatically choose fallback parser when ParseError occurred
      --header
          Custom header for HTTP(S) requests in format "Headerkey: headervalue". Supports multiple
  -h, --help
          Print help
  -V, --version
          Print version

> ./tsumugu list --help
List files from upstream

Usage: tsumugu list [OPTIONS]

Arguments:
  The upstream URL

Options:
      --user-agent
          Customize tsumugu's user agent [default: tsumugu]
      --parser
          Choose a main parser [default: nginx] [possible values: nginx, apache-f2, docker, directory-lister, lighttpd, caddy, fancy-index, gradle, fallback]
      --exclude
          Excluded relative path regex. Supports multiple
      --include
          Included relative path regex (even if excluded). Supports multiple
      --upstream-base
          The upstream base starting with "/" [default: /]
      --header
          Custom header for HTTP(S) requests in format "Headerkey: headervalue". Supports multiple
  -h, --help
          Print help
  -V, --version
          Print version
```

For a very brief introduction to parsers, see [./docs/parser.md](./docs/parser.md).

## Exit code

- 0: Success
- 1: Failed to list
- 2: Failed to download
- 3: A panic!() occurred
- 4: Error when cleaning up
- 25: The limit stopped deletions

## Building with musl

Unfortunately, this requires openssl-sys, which is not included in cross's prebuilt images. Try https://github.com/clux/muslrust.

## Evaluation

Default concurrency is 2 threads.

(Note: Please see [examples](./examples/) for the latest sync commands.)

### http://download.proxmox.com/

Proxmox uses a self-hosted CDN server architecture, and unfortunately its server limits concurrency to only 1 (as far as I could test). With traditional lftp/rclone, a single sync could take more than 10 hours (even when your local files are identical to the remote ones).

Note: Consider using [Proxmox Offline Mirror](https://pom.proxmox.com/) or other tools like `apt-mirror` if you only need its APT repository.

```console
> time ./tsumugu sync --threads 1 --dry-run --exclude '^temp' http://download.proxmox.com/ /srv/repo/proxmox/
...

real    1m48.746s
user    0m3.468s
sys     0m3.385s
```

### https://download.docker.com/

We previously used [a special script](https://github.com/ustclug/ustcmirror-images/blob/master/docker-ce/tunasync/sync.py) for syncing docker-ce, but tsumugu can also handle this now. For 30x responses inside linux/centos/ and linux/rhel/, tsumugu can create symlinks, as that script did.

```console
> time ./tsumugu sync --timezone-file https://download.docker.com/linux/centos/docker-ce-staging.repo --parser docker --dry-run https://download.docker.com/ /srv/repo/docker-ce/
...

real    8m32.674s
user    0m4.532s
sys     0m2.855s
```

### https://dl.winehq.org/wine-builds/

lftp/rclone fails to handle complex HTML.

```console
> time ./tsumugu sync --parser apache-f2 --dry-run --exclude '^mageia' --exclude '^macosx' --exclude '^debian' --exclude '^ubuntu' --exclude '^fedora' --include '^debian/dists/${DEBIAN_CURRENT}' --include '^ubuntu/dists/${UBUNTU_LTS}' --include '^fedora/${FEDORA_CURRENT}' https://dl.winehq.org/wine-builds/ /srv/repo/wine/wine-builds/
...
INFO ThreadId(01) tsumugu: (Estimated) Total objects: 17514, total size: 342.28 GiB

real    0m5.664s
user    0m1.475s
sys     0m0.294s
```

## Notes

### Yuki integration

See .

YAML example:

```yaml
envs:
  UPSTREAM: http://download.proxmox.com/
  TSUMUGU_EXCLUDE: --exclude ^temp --exclude pmg/dists/.+changelog$ --exclude devel/dists/.+changelog$
  TSUMUGU_TIMEZONEFILE: http://download.proxmox.com/images/aplinfo.dat
  TSUMUGU_THREADS: 1
image: ustcmirror/tsumugu:latest
interval: 12 3 * * *
logRotCycle: 10
name: proxmox
storageDir: /srv/repo/proxmox/
```

More examples in [examples/](./examples/).

### Regex variables

See [./src/regex_process.rs](./src/regex_process.rs).

### Exclusion and inclusion

**There has been a breaking change since 20240902. User regexes with `^` and `$` are affected.**

See [./docs/exclusion.md](./docs/exclusion.md).

### Deduplication

Tsumugu relies on local file size and mtime to decide whether a file should be downloaded. Some file-level deduplicators like [jdupes](https://codeberg.org/jbruchon/jdupes) ignore file mtime when deduplicating with hard links. This can be an issue for some repos, since a file without a correct local mtime would be redownloaded on every sync.

Workarounds:

- Set `--compare-size-only`.
- Use filesystem-level/block-level deduplication like `zfs dedup`.
- Use another file-level deduplicator which considers mtime (though I don't know which would do this).

Also, if you are sure that some directory is identical to another, you can manually create a symlink for it. Tsumugu ignores symlinks during syncing.

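
The size-and-mtime check described in this section can be illustrated with a small sketch. This is not tsumugu's actual code: `FileMeta` and `needs_download` are hypothetical names, mtime is assumed to be a Unix timestamp in seconds, any mismatch in size or mtime is assumed to trigger a download, and `--compare-size-only` (a per-path regex in the real tool) is simplified to a boolean.

```rust
/// Hypothetical metadata record (not a tsumugu type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileMeta {
    /// File size in bytes.
    pub size: u64,
    /// Modification time as Unix seconds (assumed representation).
    pub mtime: i64,
}

/// Decide whether the remote file should be fetched.
///
/// Assumptions of this sketch:
/// - a missing local file is always downloaded;
/// - a size mismatch always triggers a download;
/// - an mtime mismatch triggers a download unless `compare_size_only` is set.
pub fn needs_download(
    local: Option<FileMeta>,
    remote: FileMeta,
    compare_size_only: bool,
) -> bool {
    match local {
        // No local copy yet.
        None => true,
        // Sizes differ, so the content cannot be identical.
        Some(l) if l.size != remote.size => true,
        // Sizes match and mtime is deliberately ignored.
        Some(_) if compare_size_only => false,
        // Sizes match; fall back to comparing mtime.
        Some(l) => l.mtime != remote.mtime,
    }
}
```

Under these assumptions, a file whose mtime was changed by a hard-link deduplicator has the right size but the wrong mtime, so `needs_download` keeps returning `true` for it unless `compare_size_only` is set, which mirrors the first workaround above.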
## Acknowledgements

Special thanks to [NJU Mirror](https://mirrors.nju.edu.cn/) for extensive testing and bug reporting.

## Naming

The name "tsumugu", and current branch name "pudding", are derived from the manga *A Drift Girl and a Noble Moon*.

(Image: Tsumugu in the appearance of a very simplified version of Hitori. Obviously I am not very good at drawing, though.)

The old (2020), unfinished Go version is named "traverse" and lives under the `main-old` branch.