crawn 0.1.0 – A utility for web crawling and scraping
Repository: https://github.com/Tahaa-Dev/crawn
Author: Taha Mahmoud (Tahaa-Dev)

README

crawn

Fast async web crawler with smart keyword filtering

License: MIT


Features

  • Blazing fast – Built with Rust + tokio for async I/O
  • Smart filtering – URL-based keyword matching (no content fetching required)
  • NDJSON output – One JSON object per line for easy streaming
  • BFS crawling – Breadth-first traversal with configurable depth limits
  • Rate limiting – Randomized delay between requests to avoid overloading servers
  • Error recovery – Gracefully handles network errors and broken links
  • Rich logging – Colored, timestamped logs with context chains

Installation

  • Install from crates.io (requires cargo):
cargo install crawn
  • Or build from source:
git clone https://github.com/Tahaa-Dev/crawn.git
cd crawn
cargo build --release

Usage

  • Basic Crawling:
crawn -o output.ndjson https://example.com
  • With Logging:
crawn -o output.ndjson -l crawler.log https://example.com
  • Verbose Mode (Log All Requests):
crawn -o output.ndjson -v https://example.com
  • Custom Depth Limit:
crawn -o output.ndjson -m 3 https://example.com
  • Full HTML:
crawn -o output.ndjson --include-content https://example.com
  • Extracted text only:
crawn -o output.ndjson --include-text https://example.com

Output Format

Results are written as NDJSON (newline-delimited JSON):

{"url":"https://example.com","title":"Example Domain","depth":0}
{"url":"https://example.com/about","title":"About Us","depth":1}
{"url":"https://example.com/contact","title":"Contact","depth":1}
  • With --include-text:
{"url":"https://example.com","title":"Example Domain","depth":0,"text":"Example Domain\nThis domain is..."}
  • With --include-content:
{"url":"https://example.com","title":"Example Domain","depth":0,"content":"<!DOCTYPE html>\n<html>..."}

How It Works

  1. BFS Crawling:
  • Starts at the seed URL (depth 0)
  • Discovers links on each page
  • Processes links level-by-level (breadth-first)
  • Stops at max_depth (default: 4); the crawl loop is sketched after this list
  2. Keyword Filtering:
  • Extracts "keywords" from URL paths (sanitized, lowercased)
  • Splits by /, -, _ (e.g., /rust-tutorials/async → ["rust", "tutorials", "async"])
  • Filters stop words, numbers, short words (<3 chars)
  • Matches candidate URLs against base keywords
  • Result: Only crawls relevant pages, skips off-topic content (see the keyword-extraction sketch after this list)
  3. Rate Limiting:
  • Random delay of 200 to 500 ms between requests (illustrated in the crawl-loop sketch after this list)
  • Prevents server overload and IP bans
  • Configurable in code (not exposed as a CLI flag yet)
  4. Error Handling:
  • Network errors: Logged as warnings, crawling continues (see the crawl-loop sketch after this list)
  • HTTP 404/500: Skipped, logged as warnings
  • Parse failures: Logged, returns empty JSON
  • Fatal errors: Printed to stdout with full context chain
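
The breadth-first loop (step 1), the randomized delay (step 3), and the warn-and-continue error handling (step 4) can be pictured together. The following is a minimal sketch, not crawn's actual implementation: it assumes the tokio and rand crates, and fetch_links is a hypothetical placeholder for the real fetch-and-extract step.

use std::collections::{HashSet, VecDeque};
use std::time::Duration;

use rand::Rng;

// Hypothetical placeholder for the real fetch + link-extraction step;
// not part of crawn's public API.
async fn fetch_links(url: &str) -> Result<Vec<String>, String> {
    let _ = url;
    Ok(Vec::new())
}

#[tokio::main]
async fn main() {
    let max_depth = 4; // default depth limit described above
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, u32)> = VecDeque::new();
    queue.push_back(("https://example.com".to_string(), 0)); // seed at depth 0

    while let Some((url, depth)) = queue.pop_front() {
        // Skip revisits and anything past the depth limit.
        if depth > max_depth || !seen.insert(url.clone()) {
            continue;
        }
        // Step 3: randomized 200-500 ms delay between requests.
        let delay = rand::thread_rng().gen_range(200..=500);
        tokio::time::sleep(Duration::from_millis(delay)).await;

        match fetch_links(&url).await {
            // Step 1: new links go to the back of the queue, so pages are
            // processed level by level (breadth-first).
            Ok(links) => {
                for link in links {
                    queue.push_back((link, depth + 1));
                }
            }
            // Step 4: recoverable errors are logged as warnings and the crawl continues.
            Err(e) => eprintln!("WARN: failed to fetch {url}: {e}"),
        }
    }
}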
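
The keyword step (step 2) can be sketched as a small pure function. This is an illustration only, with a placeholder stop-word list and a hypothetical extract_keywords helper; crawn's real filtering may differ in detail.

fn extract_keywords(path: &str) -> Vec<String> {
    // Placeholder stop-word list; the crate's actual list may differ.
    const STOP_WORDS: &[&str] = &["the", "and", "for", "with"];
    path.split(|c: char| c == '/' || c == '-' || c == '_')
        .map(|w| w.to_lowercase())
        .filter(|w| w.len() >= 3)                           // drop short words
        .filter(|w| !w.chars().all(|c| c.is_ascii_digit())) // drop pure numbers
        .filter(|w| !STOP_WORDS.contains(&w.as_str()))      // drop stop words
        .collect()
}

fn main() {
    // Keywords extracted from the seed URL's path.
    let base = extract_keywords("/rust-tutorials/async");
    // A candidate URL is kept only if it shares a keyword with the seed.
    let candidate = extract_keywords("/rust-by-example");
    let relevant = candidate.iter().any(|k| base.contains(k));
    println!("{base:?} vs {candidate:?} -> relevant: {relevant}");
}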

Logging

Log Levels:

  • INFO (verbose mode only): Request logs
  • WARN (always): Recoverable errors (404, network timeouts)
  • FATAL (always): Unrecoverable errors (invalid URL, disk full)

Log Format:

2026-01-24 02:37:40.351 [INFO]:
Sent request to URL: https://example.com

2026-01-24 02:37:41.123 [WARN]:
Failed to fetch URL: https://example.com/broken-link
Caused by: HTTP 404 Not Found

Examples

  • Crawl Documentation Site:
crawn -o rust-docs.ndjson https://doc.rust-lang.org/book/
  • Crawl with Logging:
crawn -o output.ndjson -l crawler.log -v https://example.com
  • Limit to 2 Levels Deep:
crawn -o shallow.ndjson -m 2 https://example.com

Limitations

  • Same-domain only (no external links, by design)
  • No JavaScript rendering (static HTML only)
  • No authentication (public pages only)
