| Crates.io | crawlurls |
| lib.rs | crawlurls |
| version | 0.1.1 |
| created_at | 2025-11-05 01:50:02.323603+00 |
| updated_at | 2025-11-05 17:49:54.895202+00 |
| description | A fast async Rust crawler that discovers and filters URLs by pattern without scraping content. |
| homepage | |
| repository | https://github.com/fibnas/crawlurls |
| max_upload_size | |
| id | 1917289 |
| size | 86,942 |
crawlurls is a small, production-ready Rust CLI that walks HTML pages starting from one or more URLs and reports any links that match an optional pattern. It stays fast and polite by using asynchronous networking, bounded concurrency, and sensible defaults that keep the crawl inside your domain unless you opt out.
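The crawl loop follows a simple model: starting URLs are seeded into a queue and a bounded number of requests run at the same time. The snippet below is a minimal sketch of that bounded-concurrency pattern with reqwest and tokio, not the crate's actual code; the seed URLs, user-agent string, and timeout values are placeholders.

```rust
// Sketch only: bounded-concurrency fetching in the style crawlurls describes.
// Assumed dependencies: tokio = { version = "1", features = ["full"] },
// reqwest = "0.12", futures = "0.3".
use std::time::Duration;

use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Client-level limits play the role of --connect-timeout / --request-timeout.
    let client = reqwest::Client::builder()
        .user_agent("crawlurls-sketch/0.1") // placeholder agent string
        .connect_timeout(Duration::from_secs(10))
        .timeout(Duration::from_secs(30))
        .build()?;

    // Seed URLs; a real crawler keeps feeding newly discovered links back in.
    let seeds = vec![
        "https://example.com/".to_string(),
        "https://example.com/about".to_string(),
    ];

    let concurrency = 16; // mirrors the documented default for --concurrency

    // buffer_unordered caps how many requests are in flight at once.
    let mut pages = stream::iter(seeds)
        .map(|url| {
            let client = client.clone();
            async move {
                let body = client.get(&url).send().await?.text().await?;
                Ok::<_, reqwest::Error>((url, body.len()))
            }
        })
        .buffer_unordered(concurrency);

    while let Some(result) = pages.next().await {
        match result {
            Ok((url, bytes)) => println!("fetched {url} ({bytes} bytes)"),
            Err(err) => eprintln!("request failed: {err}"),
        }
    }

    Ok(())
}
```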
Under the hood it uses:

- reqwest + tokio with configurable concurrency
- rayon so large pages are processed in parallel

You need Rust 1.74 or newer. Install from a checkout with:
cargo install --path .
The general invocation is:

crawlurls [OPTIONS] <URL>...
Key flags:
- --pattern <REGEX> – only emit URLs that match the regex
- --max-depth <N> – stop enqueuing new pages deeper than N
- --concurrency <N> – number of pages resolved in parallel (default 16)
- --allow-external – follow and report links that leave the starting host(s)
- --include-subdomains – treat subdomains of the starting host(s) as in scope
- --connect-timeout <SECS> / --request-timeout <SECS> – tame slow responses
- --max-redirects <N> – cap redirect chains
- --output <FILE> – write matched URLs to a file as well as stdout
- --exclude <REGEX> – skip URLs matching the regex (repeatable)
- --exclude-file <FILE> – load newline-delimited exclude patterns from a file
- --user-agent <STRING> – set a custom user agent

Print every link inside a domain:
crawlurls https://example.com
Find all product URLs that look like /products/ but stay on the origin host:
crawlurls https://example.com --pattern '/products/'
Discover sitemap entries across subdomains, with deeper crawl depth and a custom agent string:
crawlurls https://example.com \
https://blog.example.com \
--include-subdomains \
--max-depth 3 \
--user-agent "crawlurls-batch/1.0"
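The scope rules behind --include-subdomains and --allow-external come down to a host check on every discovered link. Below is a rough sketch of such a check, using the url crate as an assumed dependency; it illustrates the idea rather than the crate's actual implementation.

```rust
// Sketch only: the kind of host-scoping check implied by --include-subdomains.
// Assumed dependency: url = "2".
use url::Url;

/// Returns true if `candidate` stays within one of the starting hosts.
fn in_scope(candidate: &Url, start_hosts: &[String], include_subdomains: bool) -> bool {
    let Some(host) = candidate.host_str() else {
        return false;
    };
    start_hosts.iter().any(|start| {
        if host == start.as_str() {
            return true;
        }
        // Subdomains count only when the flag is set: "blog.example.com" is in
        // scope for "example.com", but "notexample.com" is not.
        let suffix = format!(".{start}");
        include_subdomains && host.ends_with(suffix.as_str())
    })
}

fn main() {
    let starts = vec!["example.com".to_string()];
    let blog = Url::parse("https://blog.example.com/post/1").unwrap();
    let other = Url::parse("https://other.net/").unwrap();

    assert!(in_scope(&blog, &starts, true));   // subdomain allowed when opted in
    assert!(!in_scope(&blog, &starts, false)); // otherwise out of scope
    assert!(!in_scope(&other, &starts, true)); // different host is never in scope
    println!("scope checks passed");
}
```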
Write the matching URLs to an artifacts file for post-processing:
crawlurls https://example.com --pattern 'example\.com/s' --output urls.txt
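Assuming the output file holds one URL per line, post-processing can be as simple as de-duplicating and sorting it:

```rust
// Sketch only: de-duplicate and sort the urls.txt written by --output,
// assuming one URL per line.
use std::collections::BTreeSet;
use std::fs;

fn main() -> std::io::Result<()> {
    let contents = fs::read_to_string("urls.txt")?;

    // BTreeSet gives de-duplication and sorted output in one step.
    let unique: BTreeSet<&str> = contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect();

    for url in &unique {
        println!("{url}");
    }
    eprintln!("{} unique URLs", unique.len());
    Ok(())
}
```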
Avoid admin dashboards while still printing marketing pages:
crawlurls https://example.com --pattern '/promo/' --exclude '/admin'
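Combining --pattern with one or more --exclude flags amounts to one include filter plus a list of exclude filters: a URL is reported only if it matches the pattern and none of the excludes. The sketch below shows that precedence with the regex crate as an assumed dependency and a hypothetical keep helper; it is not the crate's actual code.

```rust
// Sketch only: how an include pattern and repeatable exclude patterns combine.
// Assumed dependency: regex = "1".
use regex::Regex;

fn keep(url: &str, pattern: Option<&Regex>, excludes: &[Regex]) -> bool {
    // Keep a URL when it matches the include pattern (if one was given)
    // and matches none of the exclude patterns.
    pattern.map_or(true, |re| re.is_match(url))
        && !excludes.iter().any(|re| re.is_match(url))
}

fn main() {
    let include = Regex::new("/promo/").unwrap();
    let excludes = vec![Regex::new("/admin").unwrap()];

    assert!(keep("https://example.com/promo/spring", Some(&include), &excludes));
    assert!(!keep("https://example.com/admin/promo/", Some(&include), &excludes));
    assert!(!keep("https://example.com/blog/", Some(&include), &excludes));
    println!("filter checks passed");
}
```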
Both http and https schemes are crawled.
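A scheme gate like that can be written in a few lines with the url crate, again as an illustration rather than confirmed crate internals:

```rust
// Sketch only: accept http/https links and skip everything else.
// Assumed dependency: url = "2".
use url::Url;

fn is_crawlable(raw: &str) -> bool {
    Url::parse(raw)
        .map(|u| matches!(u.scheme(), "http" | "https"))
        .unwrap_or(false)
}

fn main() {
    assert!(is_crawlable("https://example.com/"));
    assert!(is_crawlable("http://example.com/"));
    assert!(!is_crawlable("mailto:hello@example.com"));
    println!("scheme checks passed");
}
```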