| Crates.io | snagger |
| lib.rs | snagger |
| version | 0.1.4 |
| created_at | 2025-11-05 01:35:48.368311+00 |
| updated_at | 2025-12-07 03:17:40.028066+00 |
| description | Grab full text across ?page=N pagination with page count discovery |
| homepage | |
| repository | https://github.com/fibnas/snagger |
| max_upload_size | |
| id | 1917278 |
| size | 106,625 |
Grab the full text of paginated articles where the next page is controlled by a ?page=N query parameter. Snagger discovers the page count, scrapes each page concurrently, and assembles the cleaned content into a single file per source URL.
rel="last", anchor inspection, or custom selector/regex pairscargo on your PATHreqwestInstall the crate:
cargo install snagger
Update existing install:
cargo install snagger --force
Or build locally:
git clone https://github.com/fibnas/snagger
cd snagger
cargo run --release
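When working from a clone, arguments can also be forwarded to the binary through cargo after `--` (standard cargo behavior), so the flags documented below work without installing; for example, assuming the `links.txt` file described next:
cargo run --release -- links.txt --out snags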
List subject URLs one per line; lines starting with `#` and blank lines are ignored. Example (`links.txt`):
https://example.com/articles/alpha
# https://example.com/articles/beta (commented out)
https://example.com/articles/gamma?page=1
snagger links.txt --out snags --concurrency 8 --discover-pages
This command:
- reads subject URLs from `links.txt`
- writes results into the `snags/` directory (creating it when absent)
- crawls up to 8 subjects concurrently and attempts page-count discovery

Add `--download-ext <EXT>` to download URLs whose path ends in that extension instead of scraping them; repeat the flag for multiple extensions (e.g. `--download-ext zip --download-ext pdf`). Downloaded files are saved under names like `example.com__archive.zip`.

Output:
- `snags/<slug>.txt` – wrapped article text built from the fetched pages
- `snags/_crawl_log.csv` – crawl metadata with the columns `url, slug, pages_fetched, last_page_url, chars, seconds, status`

Crawling behavior:
- a randomized per-page delay (`--delay`, `0.4 1.2` by default) and a per-request `--timeout`
- `--stop-on-repeat` to end pagination when a repeated MD5 hash indicates duplicate content
- `--selector ".main-article"` to target a specific content container
- a minimum character count per page (`--min-chars`), stopping early if pages are mostly boilerplate

Page-count discovery:
- enable it with `--discover-pages`
- scope detection with `--pages-selector "nav.pagination"` or supply an explicit pattern (`--pages-regex "(?i)page\\s*\\d+\\s*of\\s*(\\d+)"`)
- a `--max-pages` hard cap (default 20) applies when discovery fails

Networking:
- `--concurrency` sets how many subjects are crawled in parallel
- `--proxy http://host:port` routes requests through a proxy
- a `User-Agent` header is set to reduce blocks
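A couple of illustrative invocations combining the options above (the input file and output directory match the earlier example; the selector, page cap, and extensions are placeholders, not defaults):
snagger links.txt --out snags --discover-pages --pages-selector "nav.pagination" --max-pages 30
snagger links.txt --out snags --download-ext zip --download-ext pdf
The full flag reference: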
| Flag | Default | Description |
|---|---|---|
| `--out <DIR>` | `snags` | Output directory for scraped files and crawl logs |
| `--concurrency <N>` | `6` | Maximum concurrent subjects to crawl |
| `--selector <CSS>` | none | CSS selector for the main content container |
| `--min-chars <N>` | `200` | Minimum characters per page before keeping it |
| `--stop-on-repeat` | `false` | Stop when consecutive pages hash to the same value |
| `--wrap-width <N>` | `96` | Column width for output wrapping (`0` disables wrapping) |
| `--timeout <SECS>` | `20` | HTTP timeout per request |
| `--delay <LO HI>` | `0.4 1.2` | Randomized per-page delay range in seconds |
| `--max-pages <N>` | `20` | Hard cap on pagination depth |
| `--discover-pages` | `false` | Attempt to detect the total page count |
| `--pages-selector <CSS>` | none | Scope for page-count detection |
| `--pages-regex <REGEX>` | none | Explicit regex whose capture group yields the total page count |
| `--proxy <URL>` | none | HTTP(S) proxy endpoint |
| `--download-ext <EXT>` | none | Download URLs ending in `.<EXT>` directly instead of scraping (repeat or comma-separate) |
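As a further sketch, a cautious run that scopes extraction to one container, slows requests down, and disables wrapping might look like this (the values are arbitrary illustrations, not recommendations):
snagger links.txt --selector ".main-article" --timeout 40 --delay 1.0 3.0 --min-chars 300 --wrap-width 0 --stop-on-repeat
snagger links.txt --proxy http://host:port --concurrency 2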
For local development:
cargo check
cargo fmt
cargo clippy --all-targets -- -D warnings
Tests are not currently configured; feel free to contribute fixtures that capture additional pagination patterns.
Troubleshooting:
- Raise `--timeout` or widen `--delay` if you encounter rate limiting or slow responses
- Use `--selector` to avoid pulling navigation or comments into the combined text
- Check `_crawl_log.csv` for subjects marked empty or for early-termination causes

Licensed under the MIT License. See LICENSE for details.