snagger

Crate: snagger 0.1.4 (crates.io / lib.rs)
Description: Grab full text across ?page=N pagination with page count discovery
Repository: https://github.com/fibnas/snagger
Owner: frankstallionjr (fibnas)
Created: 2025-11-05, updated: 2025-12-07
Size: 106,625

README

snagger

Grab the full text of paginated articles where the next page is controlled by a ?page=N query parameter. Snagger discovers the page count, scrapes each page concurrently, and assembles the cleaned content into a single file per source URL.

Highlights

  • Async Rust CLI with polite throttling, concurrency control, and optional proxy support
  • Automatic pagination discovery via rel="last", anchor inspection, or custom selector/regex pairs
  • Flexible content extraction using CSS selectors with heuristics for common article layouts
  • Produces wrapped plaintext output alongside a crawl log CSV for auditability

Getting Started

Prerequisites

  • Rust 1.75+ with cargo on your PATH
  • OpenSSL or platform-native TLS libraries required by reqwest

Installation

Install the crate:

cargo install snagger

Update existing install:

cargo install snagger --force

Or build locally:

git clone https://github.com/fibnas/snagger
cd snagger
cargo build --release

Input File Format

  • Plain text file
  • One URL per line
  • Lines beginning with # and blank lines are ignored

Example (links.txt):

https://example.com/articles/alpha
# https://example.com/articles/beta  (commented out)
https://example.com/articles/gamma?page=1

Usage

snagger links.txt --out snags --concurrency 8 --discover-pages

This command:

  • Reads seeds from links.txt
  • Writes merged plaintext files to the snags/ directory (creating it when absent)
  • Allows up to eight concurrent crawls
  • Enables pagination discovery to stop once the last page is detected
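
For a fuller invocation, the flags from the Command Reference below can be combined freely; the values here are purely illustrative:

snagger links.txt --out snags --concurrency 4 --selector ".main-article" --discover-pages --max-pages 40 --delay 0.5 1.5 --timeout 30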

Direct Downloads

  • Use --download-ext <EXT> to download URLs whose path ends in that extension instead of scraping
  • Repeat the flag or provide comma-separated values to allow multiple extensions (e.g. --download-ext zip --download-ext pdf)
  • Files are saved using the slug with the requested extension, such as example.com__archive.zip
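
For example, to download archives and PDFs directly instead of scraping them (the extensions are illustrative):

snagger links.txt --out snags --download-ext zip --download-ext pdf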

Output Layout

  • snags/<slug>.txt – wrapped article text built from the fetched pages
  • snags/_crawl_log.csv – crawl metadata containing:
    • url, slug, pages_fetched, last_page_url, chars, seconds, status
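
Assuming the columns appear in the header in the order listed above, the first line of _crawl_log.csv would read:

url,slug,pages_fetched,last_page_url,chars,seconds,status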

Throttling and Reliability

  • Delay between page fetches is randomized within the configured range (--delay 0.4 1.2 by default)
  • HTTP timeout defaults to 20s; adjust via --timeout
  • Set --stop-on-repeat to end pagination when a repeated MD5 hash indicates duplicate content
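
To back off against a rate-limited or slow site, for example (the values are arbitrary):

snagger links.txt --delay 1.0 3.0 --timeout 45 --stop-on-repeat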

Content Extraction

  • Provide a CSS selector with --selector ".main-article" to target a specific container
  • Without a selector, Snagger attempts common article heuristics before falling back to full-page text
  • Minimum characters per page defaults to 200 (--min-chars); pagination stops early when pages fall below it and are mostly boilerplate
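
For example, to target a specific container and skip thin pages (the selector and threshold are illustrative):

snagger links.txt --selector ".main-article" --min-chars 400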

Page Count Discovery

  • Enable with --discover-pages
  • Optionally scope page metadata via --pages-selector "nav.pagination"
  • Supply a custom regex with a capture group (e.g. --pages-regex "(?i)page\\s*\\d+\\s*of\\s*(\\d+)")
  • Falls back to the --max-pages hard cap (default 20) when discovery fails
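
A discovery-focused invocation might look like this (the selector and regex are illustrative; backslashes are doubled for the shell, as above):

snagger links.txt --discover-pages --pages-selector "nav.pagination" --pages-regex "(?i)page\\s*\\d+\\s*of\\s*(\\d+)" --max-pages 40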

Networking

  • Concurrency is capped with --concurrency
  • Configure an HTTPS proxy via --proxy http://host:port
  • Custom headers include a desktop browser User-Agent to reduce blocks
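
For example, routing requests through a local proxy with a lower concurrency cap (the proxy address is a placeholder):

snagger links.txt --concurrency 2 --proxy http://127.0.0.1:8080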

Command Reference

Flag                    Default   Description
--out <DIR>             snags     Output directory for scraped files and crawl logs
--concurrency <N>       6         Maximum concurrent subjects to crawl
--selector <CSS>        none      CSS selector for the main content container
--min-chars <N>         200       Minimum characters per page before keeping it
--stop-on-repeat        false     Stop when consecutive pages hash to the same value
--wrap-width <N>        96        Column width for output wrapping (0 disables wrapping)
--timeout <SECS>        20        HTTP timeout per request
--delay <LO HI>         0.4 1.2   Randomized per-page delay range in seconds
--max-pages <N>         20        Hard cap on pagination depth
--discover-pages        false     Attempt to detect the total page count
--pages-selector <CSS>  none      Scope for page-count detection
--pages-regex <REGEX>   none      Explicit regex for the total-page capture group
--proxy <URL>           none      HTTPS proxy endpoint
--download-ext <EXT>    none      Direct-download URLs ending in .<EXT> (repeat/comma-separated)

Development

cargo check
cargo fmt
cargo clippy --all-targets -- -D warnings

Tests are not currently configured; feel free to contribute fixtures that capture additional pagination patterns.

Troubleshooting

  • Ensure the target site allows crawling; respect robots and rate limits
  • Increase --timeout or widen --delay if encountering rate limiting or slow responses
  • Use --selector to avoid pulling navigation or comments into the combined text
  • Inspect _crawl_log.csv for subjects marked empty and for the causes of early termination

License

Licensed under the MIT License. See LICENSE for details.
