snagger

Crate: snagger 0.1.4 (crates.io / lib.rs)
Description: Grab full text across ?page=N pagination with page count discovery
Repository: https://github.com/fibnas/snagger
Owner: frankstallionjr (fibnas)
Created: 2025-11-05, updated: 2025-12-07
Size: 106,625

README

snagger

Grab the full text of paginated articles where the next page is controlled by a ?page=N query parameter. Snagger discovers the page count, scrapes each page concurrently, and assembles the cleaned content into a single file per source URL.

Highlights

  • Async Rust CLI with polite throttling, concurrency control, and optional proxy support
  • Automatic pagination discovery via rel="last", anchor inspection, or custom selector/regex pairs
  • Flexible content extraction using CSS selectors with heuristics for common article layouts
  • Produces wrapped plaintext output alongside a crawl log CSV for auditability

Getting Started

Prerequisites

  • Rust 1.75+ with cargo on your PATH
  • OpenSSL or platform-native TLS libraries required by reqwest

Installation

Install the crate:

cargo install snagger

Update existing install:

cargo install snagger --force

Or build locally:

git clone https://github.com/fibnas/snagger
cd snagger
cargo build --release

Input File Format

  • Plain text file
  • One URL per line
  • Lines beginning with # and blank lines are ignored

Example (links.txt):

https://example.com/articles/alpha
# https://example.com/articles/beta  (commented out)
https://example.com/articles/gamma?page=1

Usage

snagger links.txt --out snags --concurrency 8 --discover-pages

This command:

  • Reads seeds from links.txt
  • Writes merged plaintext files to the snags/ directory (creating it when absent)
  • Allows up to eight concurrent crawls
  • Enables pagination discovery to stop once the last page is detected
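
For a fuller invocation, the flags from the Command Reference below can be combined freely; the values here are purely illustrative:

snagger links.txt --out snags --concurrency 4 --selector ".main-article" --discover-pages --max-pages 40 --delay 0.5 1.5 --timeout 30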

Direct Downloads

  • Use --download-ext <EXT> to download URLs whose path ends in that extension instead of scraping
  • Repeat the flag or provide comma-separated values to allow multiple extensions (e.g. --download-ext zip --download-ext pdf)
  • Files are saved using the slug with the requested extension, such as example.com__archive.zip
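
For example, to download archives and PDFs directly instead of scraping them (the extensions are illustrative):

snagger links.txt --out snags --download-ext zip --download-ext pdf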

Output Layout

  • snags/<slug>.txt – wrapped article text built from the fetched pages
  • snags/_crawl_log.csv – crawl metadata containing:
    • url, slug, pages_fetched, last_page_url, chars, seconds, status
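
Assuming the columns appear in the header in the order listed above, the first line of _crawl_log.csv would read:

url,slug,pages_fetched,last_page_url,chars,seconds,status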

Throttling and Reliability

  • Delay between page fetches is randomized within the configured range (--delay 0.4 1.2 by default)
  • HTTP timeout defaults to 20s; adjust via --timeout
  • Set --stop-on-repeat to end pagination when a repeated MD5 hash indicates duplicate content
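
To back off against a rate-limited or slow site, for example (the values are arbitrary):

snagger links.txt --delay 1.0 3.0 --timeout 45 --stop-on-repeat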

Content Extraction

  • Provide a CSS selector with --selector ".main-article" to target a specific container
  • Without a selector, Snagger attempts common article heuristics before falling back to full-page text
  • Minimum characters per page defaults to 200 (--min-chars); pagination stops early when pages fall below it and are mostly boilerplate
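
For example, to target a specific container and skip thin pages (the selector and threshold are illustrative):

snagger links.txt --selector ".main-article" --min-chars 400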

Page Count Discovery

  • Enable with --discover-pages
  • Optionally scope page metadata via --pages-selector "nav.pagination"
  • Supply a custom regex with a capture group (e.g. --pages-regex "(?i)page\\s*\\d+\\s*of\\s*(\\d+)")
  • Falls back to the --max-pages hard cap (default 20) when discovery fails
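
A discovery-focused invocation might look like this (the selector and regex are illustrative; backslashes are doubled for the shell, as above):

snagger links.txt --discover-pages --pages-selector "nav.pagination" --pages-regex "(?i)page\\s*\\d+\\s*of\\s*(\\d+)" --max-pages 40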

Networking

  • Concurrency is capped with --concurrency
  • Configure an HTTPS proxy via --proxy http://host:port
  • Custom headers include a desktop browser User-Agent to reduce blocks
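
For example, routing requests through a local proxy with a lower concurrency cap (the proxy address is a placeholder):

snagger links.txt --concurrency 2 --proxy http://127.0.0.1:8080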

Command Reference

Flag                    Default   Description
--out <DIR>             snags     Output directory for scraped files and crawl logs
--concurrency <N>       6         Maximum concurrent subjects to crawl
--selector <CSS>        none      CSS selector for the main content container
--min-chars <N>         200       Minimum characters per page before keeping it
--stop-on-repeat        false     Stop when consecutive pages hash to the same value
--wrap-width <N>        96        Column width for output wrapping (0 disables wrapping)
--timeout <SECS>        20        HTTP timeout per request
--delay <LO HI>         0.4 1.2   Randomized per-page delay range in seconds
--max-pages <N>         20        Hard cap on pagination depth
--discover-pages        false     Attempt to detect the total page count
--pages-selector <CSS>  none      Scope for page-count detection
--pages-regex <REGEX>   none      Explicit regex for the total-page capture group
--proxy <URL>           none      HTTPS proxy endpoint
--download-ext <EXT>    none      Direct-download URLs ending in .<EXT> (repeat/comma-separated)

Development

cargo check
cargo fmt
cargo clippy --all-targets -- -D warnings

Tests are not currently configured; feel free to contribute fixtures that capture additional pagination patterns.

Troubleshooting

  • Ensure the target site allows crawling; respect robots and rate limits
  • Increase --timeout or widen --delay if encountering rate limiting or slow responses
  • Use --selector to avoid pulling navigation or comments into the combined text
  • Inspect _crawl_log.csv for subjects marked empty and for the causes of early termination

License

Licensed under the MIT License. See LICENSE for details.
