scrape-cli

Crates.io: scrape-cli
lib.rs: scrape-cli
version: 0.2.0
created_at: 2026-01-16 16:22:07.933538+00
updated_at: 2026-01-20 01:47:39.842896+00
description: Command-line HTML extraction tool powered by scrape-rs
repository: https://github.com/bug-ops/scrape-rs
id: 2048870
size: 79,426
owner: Andrei G (bug-ops)

README

scrape-cli


10-50x faster HTML extraction from the command line. Rust-powered, shell-friendly.

Installation

cargo install scrape-cli
Pre-built binaries

Download from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/bug-ops/scrape-rs/releases/latest/download/scrape-darwin-aarch64.tar.gz | tar xz

# macOS (Intel)
curl -L https://github.com/bug-ops/scrape-rs/releases/latest/download/scrape-darwin-x86_64.tar.gz | tar xz

# Linux (x86_64)
curl -L https://github.com/bug-ops/scrape-rs/releases/latest/download/scrape-linux-x86_64.tar.gz | tar xz

# Linux (ARM64)
curl -L https://github.com/bug-ops/scrape-rs/releases/latest/download/scrape-linux-aarch64.tar.gz | tar xz

[!IMPORTANT] Requires Rust 1.88 or later when building from source.
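
You can also have cargo build straight from the repository; the command below assumes the scrape-cli package lives in the scrape-rs workspace, which may not match the exact layout:

# Build and install from the git repository (package name in the workspace is an assumption)
cargo install --git https://github.com/bug-ops/scrape-rs scrape-cli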

Quick start

# Extract h1 text from file
scrape 'h1' page.html

# Extract from stdin
curl -s example.com | scrape 'title'

# Extract links as JSON
scrape -o json 'a[href]' page.html

Usage

Basic extraction
# Extract text content
scrape 'h1' page.html
# Output: Welcome to Our Site

# Extract attribute value
scrape -a href 'a.nav-link' page.html
# Output: /home
#         /about
#         /contact

# First match only
scrape -1 'p' page.html
# Output: First paragraph text
Output formats
# Plain text (default)
scrape 'h1' page.html
# Output: Hello World

# JSON
scrape -o json 'a[href]' page.html
# Output: ["Link 1","Link 2"]

# Pretty JSON
scrape -o json -p 'a' page.html

# HTML fragments
scrape -o html 'div.content' page.html

# CSV (requires named selectors)
scrape -o csv -s name='td:nth-child(1)' -s price='td:nth-child(2)' table.html
# Output: name,price
#         "Product A","$10.00"
Named selectors
# Extract multiple fields
scrape -o json \
  -s title='h1' \
  -s links='a[href]' \
  -s images='img[src]' \
  page.html
# Output: {"title":["Page Title"],"links":[...],"images":[...]}
Batch processing
# Process multiple files (parallel by default)
scrape 'h1' pages/*.html
# Output: page1.html: Welcome
#         page2.html: About Us
#         page3.html: Contact

# Control parallelism
scrape -j 4 'h1' pages/*.html

[!TIP] Batch processing uses all CPU cores by default. Use -j N to limit threads.
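
For deeper directory trees, one common shell pattern is to feed the file list in via find and xargs; a small sketch, where the site/ directory is hypothetical:

# Recursively scrape every HTML file under site/ (batch mode handles the file list)
find site/ -name '*.html' -print0 | xargs -0 scrape 'h1'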

Pipeline integration
# NUL delimiter for xargs
scrape -0 -a href 'a' page.html | xargs -0 -I{} curl {}

# Suppress errors
scrape -q 'h1' *.html 2>/dev/null

# Disable filename prefix
scrape --no-filename 'h1' *.html
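
JSON output also pairs naturally with jq for downstream filtering. A small sketch, assuming jq is installed (it is not part of scrape-cli) and page.html contains the elements used earlier:

# Count extracted links
scrape -o json 'a[href]' page.html | jq 'length'

# Pull one field out of a named-selector extraction
scrape -o json -s title='h1' -s links='a[href]' page.html | jq -r '.title[0]'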

Options

| Option | Short | Description |
|---|---|---|
| --output FORMAT | -o | Output format: text, json, html, csv |
| --select NAME=SEL | -s | Named selector extraction |
| --attribute ATTR | -a | Extract attribute instead of text |
| --first | -1 | Return only first match |
| --pretty | -p | Pretty-print JSON output |
| --null | -0 | Use NUL delimiter (for xargs) |
| --color MODE | -c | Colorize: auto, always, never |
| --parallel N | -j | Parallel threads for batch processing |
| --quiet | -q | Suppress error messages |
| --with-filename | -H | Always show filename prefix |
| --no-filename | | Never show filename prefix |
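
A couple of these flags in combination, reusing the hypothetical files from the examples above:

# Always show the filename prefix, even for a single input file
scrape -H 'title' page.html

# Attribute extraction without filename prefixes, ready for piping
scrape --no-filename -a href 'a' pages/*.html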

Performance

v0.2.0 includes significant improvements:

  • SIMD-accelerated parsing — 2-10x faster class selector matching on large documents
  • Batch parallelization — Scales near-linearly with thread count when processing multiple files (see the timing sketch below)
  • Zero-copy serialization — 50-70% memory reduction in output generation
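
One way to gauge the batch parallelization gain on your own corpus is to compare a single-threaded run against the default of one thread per core; the pages/*.html glob below is hypothetical:

# Single-threaded baseline
time scrape -j 1 'h1' pages/*.html > /dev/null

# Default: all CPU cores
time scrape 'h1' pages/*.html > /dev/null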

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success, matches found |
| 1 | No matches found |
| 2 | Runtime error (invalid selector, I/O error) |
| 4 | Argument validation error |
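
The distinct codes make shell scripting straightforward; a minimal sketch, assuming a hypothetical page.html:

# Distinguish "no matches" (1) from runtime or argument errors (2, 4)
scrape -q 'meta[name="description"]' page.html > /dev/null
status=$?
if [ "$status" -eq 0 ]; then
  echo "description present"
elif [ "$status" -eq 1 ]; then
  echo "no description found"
else
  echo "scrape failed (exit code $status)" >&2
fi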

Built on Servo

Powered by battle-tested libraries from the Servo browser engine: html5ever (HTML5 parser) and selectors (CSS selector engine).

Related packages

| Platform | Package |
|---|---|
| Rust | scrape-core |
| Python | fast-scrape |
| Node.js | @fast-scrape/node |
| WASM | @fast-scrape/wasm |

License

MIT OR Apache-2.0
