crawn 0.1.0 – A utility for web crawling and scraping
Repository: https://github.com/Tahaa-Dev/crawn
Author: Taha Mahmoud (Tahaa-Dev)

README

crawn

Fast async web crawler with smart keyword filtering

License: MIT


Features

  • Blazing fast – Built with Rust + tokio for async I/O
  • Smart filtering – URL-based keyword matching (no content fetching required)
  • NDJSON output – One JSON object per line for easy streaming
  • BFS crawling – Breadth-first traversal with configurable depth limits
  • Rate limiting – Randomized delay between requests to avoid overloading servers
  • Error recovery – Gracefully handles network errors and broken links
  • Rich logging – Colored, timestamped logs with context chains

Installation

  • Install from crates.io (requires cargo):
cargo install crawn
  • Or build from source:
git clone https://github.com/Tahaa-Dev/crawn.git
cd crawn
cargo build --release

Usage

  • Basic Crawling:
crawn -o output.ndjson https://example.com
  • With Logging:
crawn -o output.ndjson -l crawler.log https://example.com
  • Verbose Mode (Log All Requests):
crawn -o output.ndjson -v https://example.com
  • Custom Depth Limit:
crawn -o output.ndjson -m 3 https://example.com
  • Full HTML:
crawn -o output.ndjson --include-content https://example.com
  • Extracted text only:
crawn -o output.ndjson --include-text https://example.com

Output Format

Results are written as NDJSON (newline-delimited JSON):

{"url":"https://example.com","title":"Example Domain","depth":0}
{"url":"https://example.com/about","title":"About Us","depth":1}
{"url":"https://example.com/contact","title":"Contact","depth":1}
  • With --include-text:
{"url":"https://example.com","title":"Example Domain","depth":0,"text":"Example Domain\nThis domain is..."}
  • With --include-content:
{"url":"https://example.com","title":"Example Domain","depth":0,"content":"<!DOCTYPE html>\n<html>..."}

How It Works

  1. BFS Crawling:
  • Starts at the seed URL (depth 0)
  • Discovers links on each page
  • Processes links level-by-level (breadth-first)
  • Stops at max_depth (default: 4); the crawl loop is sketched after this list
  2. Keyword Filtering:
  • Extracts "keywords" from URL paths (sanitized, lowercased)
  • Splits by /, -, _ (e.g., /rust-tutorials/async → ["rust", "tutorials", "async"])
  • Filters stop words, numbers, short words (<3 chars)
  • Matches candidate URLs against base keywords
  • Result: Only crawls relevant pages, skips off-topic content (see the keyword-extraction sketch after this list)
  3. Rate Limiting:
  • Random delay of 200 to 500 ms between requests (illustrated in the crawl-loop sketch after this list)
  • Prevents server overload and IP bans
  • Configurable in code (not exposed as a CLI flag yet)
  4. Error Handling:
  • Network errors: Logged as warnings, crawling continues (see the crawl-loop sketch after this list)
  • HTTP 404/500: Skipped, logged as warnings
  • Parse failures: Logged, returns empty JSON
  • Fatal errors: Printed to stdout with full context chain
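
The breadth-first loop (step 1), the randomized delay (step 3), and the warn-and-continue error handling (step 4) can be pictured together. The following is a minimal sketch, not crawn's actual implementation: it assumes the tokio and rand crates, and fetch_links is a hypothetical placeholder for the real fetch-and-extract step.

use std::collections::{HashSet, VecDeque};
use std::time::Duration;

use rand::Rng;

// Hypothetical placeholder for the real fetch + link-extraction step;
// not part of crawn's public API.
async fn fetch_links(url: &str) -> Result<Vec<String>, String> {
    let _ = url;
    Ok(Vec::new())
}

#[tokio::main]
async fn main() {
    let max_depth = 4; // default depth limit described above
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, u32)> = VecDeque::new();
    queue.push_back(("https://example.com".to_string(), 0)); // seed at depth 0

    while let Some((url, depth)) = queue.pop_front() {
        // Skip revisits and anything past the depth limit.
        if depth > max_depth || !seen.insert(url.clone()) {
            continue;
        }
        // Step 3: randomized 200-500 ms delay between requests.
        let delay = rand::thread_rng().gen_range(200..=500);
        tokio::time::sleep(Duration::from_millis(delay)).await;

        match fetch_links(&url).await {
            // Step 1: new links go to the back of the queue, so pages are
            // processed level by level (breadth-first).
            Ok(links) => {
                for link in links {
                    queue.push_back((link, depth + 1));
                }
            }
            // Step 4: recoverable errors are logged as warnings and the crawl continues.
            Err(e) => eprintln!("WARN: failed to fetch {url}: {e}"),
        }
    }
}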
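
The keyword step (step 2) can be sketched as a small pure function. This is an illustration only, with a placeholder stop-word list and a hypothetical extract_keywords helper; crawn's real filtering may differ in detail.

fn extract_keywords(path: &str) -> Vec<String> {
    // Placeholder stop-word list; the crate's actual list may differ.
    const STOP_WORDS: &[&str] = &["the", "and", "for", "with"];
    path.split(|c: char| c == '/' || c == '-' || c == '_')
        .map(|w| w.to_lowercase())
        .filter(|w| w.len() >= 3)                           // drop short words
        .filter(|w| !w.chars().all(|c| c.is_ascii_digit())) // drop pure numbers
        .filter(|w| !STOP_WORDS.contains(&w.as_str()))      // drop stop words
        .collect()
}

fn main() {
    // Keywords extracted from the seed URL's path.
    let base = extract_keywords("/rust-tutorials/async");
    // A candidate URL is kept only if it shares a keyword with the seed.
    let candidate = extract_keywords("/rust-by-example");
    let relevant = candidate.iter().any(|k| base.contains(k));
    println!("{base:?} vs {candidate:?} -> relevant: {relevant}");
}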

Logging

Log Levels:

  • INFO (verbose mode only): Request logs
  • WARN (always): Recoverable errors (404, network timeouts)
  • FATAL (always): Unrecoverable errors (invalid URL, disk full)

Log Format:

2026-01-24 02:37:40.351 [INFO]:
Sent request to URL: https://example.com

2026-01-24 02:37:41.123 [WARN]:
Failed to fetch URL: https://example.com/broken-link
Caused by: HTTP 404 Not Found

Examples

  • Crawl Documentation Site:
crawn -o rust-docs.ndjson https://doc.rust-lang.org/book/
  • Crawl with Logging:
crawn -o output.ndjson -l crawler.log -v https://example.com
  • Limit to 2 Levels Deep:
crawn -o shallow.ndjson -m 2 https://example.com

Limitations

  • Same-domain only (no external links, by design)
  • No JavaScript rendering (static HTML only)
  • No authentication (public pages only)
