cc-downloader

Crates.io: cc-downloader
lib.rs: cc-downloader
version: 0.1.0
source: src
created_at: 2024-06-15 21:23:06.243621
updated_at: 2024-06-15 21:23:06.243621
description: A polite and user-friendly downloader for Common Crawl data.
id: 1273133
size: 73,540
Pedro Ortiz Suarez (pjox)

README

CC-Downloader

This is an experimental, polite downloader for Common Crawl data, written in Rust. Currently it downloads Common Crawl data from CloudFront.
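
For orientation, Common Crawl publishes the file listing of each snapshot as a gzipped path file on its public HTTP endpoint, `https://data.commoncrawl.org/`, which is served through CloudFront. The sketch below shows how such URLs can be formed. The helper names `paths_url` and `file_url` are hypothetical, the snapshot and data type are example values, and this is not necessarily how cc-downloader builds its URLs internally.

```rust
/// Base URL of Common Crawl's public HTTP endpoint.
const BASE_URL: &str = "https://data.commoncrawl.org";

/// URL of the gzipped path listing for one snapshot and data type.
/// Hypothetical helper, not part of cc-downloader's API.
fn paths_url(snapshot: &str, data_type: &str) -> String {
    format!("{}/crawl-data/{}/{}.paths.gz", BASE_URL, snapshot, data_type)
}

/// URL of a single file, given one line taken from a path listing.
/// Surrounding whitespace (such as a trailing newline) is trimmed first.
fn file_url(path_line: &str) -> String {
    format!("{}/{}", BASE_URL, path_line.trim())
}

fn main() {
    // Example values only: a snapshot identifier and a data type.
    println!("{}", paths_url("CC-MAIN-2024-22", "warc"));
    // Placeholder path, standing in for a real line from a path listing.
    println!("{}", file_url("crawl-data/CC-MAIN-2024-22/placeholder.warc.gz\n"));
}
```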

Todo

  • Add retry support
  • Add Python bindings
  • Add tests
  • Refactor CLI subcommands
  • Add support for s3

Usage

Usage: cc-downloader [COMMAND]

Commands:
  download-paths  Download paths for a given snapshot
  download        Download files from a crawl
  help            Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

------

cc-downloader download -h
Download files from a crawl

Usage: cc-downloader download --path-file <PATHS> --output <OUTPUT> [PROGRESS]

Arguments:
  [PROGRESS]  Print progress [possible values: true, false]

Options:
      --path-file <PATHS>  Path file
  -o, --output <OUTPUT>    Output folder
  -h, --help               Print help

------

cc-downloader download-paths -h
Download paths for a given snapshot

Usage: cc-downloader download-paths --snapshot <SNAPSHOT> --data-type <PATHS> --output <OUTPUT> [PROGRESS]

Arguments:
  [PROGRESS]  Print progress [possible values: true, false]

Options:
      --snapshot <SNAPSHOT>  Crawl reference
      --data-type <PATHS>    Data type
  -o, --output <OUTPUT>      Output folder
  -h, --help                 Print help
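
The two subcommands are meant to be used in sequence: first fetch the path listing for a snapshot, then download the files it names. As a rough sketch, the Rust program below assembles the arguments for both steps from the help text above and runs them with `std::process::Command`. The helper functions are hypothetical, the snapshot `CC-MAIN-2024-22`, the data type `warc`, and the folder names are example values, and the name of the path file written by the first step (`paths/warc.paths.gz`) is an assumption.

```rust
use std::process::Command;

/// Arguments for `cc-downloader download-paths`, following its help text.
/// Hypothetical helper for illustration.
fn download_paths_args(snapshot: &str, data_type: &str, output: &str) -> Vec<String> {
    ["download-paths", "--snapshot", snapshot, "--data-type", data_type, "--output", output]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Arguments for `cc-downloader download`, following its help text.
/// Hypothetical helper for illustration.
fn download_args(path_file: &str, output: &str) -> Vec<String> {
    ["download", "--path-file", path_file, "--output", output]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    // Example values; the path file name produced by step one is assumed.
    let steps = [
        download_paths_args("CC-MAIN-2024-22", "warc", "paths"),
        download_args("paths/warc.paths.gz", "data"),
    ];
    for args in steps {
        // Requires `cc-downloader` on PATH; report instead of panicking if it is missing.
        match Command::new("cc-downloader").args(&args).status() {
            Ok(status) => println!("{} finished: {}", args[0], status),
            Err(err) => eprintln!("could not run cc-downloader: {err}"),
        }
    }
}
```

Building the argument vectors separately from running them keeps the command lines easy to inspect or log before any download starts.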