docrawl

Crates.io: docrawl · lib.rs: docrawl
version: 0.1.3
created_at: 2025-09-18 04:47:06.553083+00
updated_at: 2025-10-16 00:59:39.260235+00
description: Docs-focused crawler library and CLI: crawl documentation sites, extract main content, convert to Markdown, mirror paths, and save with frontmatter.
homepage: https://github.com/neur0map/docrawl
repository: https://github.com/neur0map/docrawl
id: 1844244
size: 834,022
owner: Carlos (neur0map)
documentation: https://docs.rs/docrawl

README

docrawl

______   _____  _______  ______ _______ _  _  _
|     \ |     | |       |_____/ |_____| |  |  | |
|_____/ |_____| |_____  |    \_ |     | |__|__| |_____

A documentation-focused web crawler that converts sites to clean Markdown while preserving structure and staying polite.


Installation

# Install from crates.io
cargo install docrawl

# Or build from source
git clone https://github.com/neur0map/docrawl
cd docrawl
cargo build --release

Quick Start

docrawl "https://docs.rust-lang.org"          # crawl with default depth
docrawl "https://docs.python.org" --all       # full site crawl
docrawl "https://react.dev" --depth 2         # shallow crawl
docrawl "https://nextjs.org/docs" --fast      # quick scan without assets
docrawl "https://example.com/docs" --silence  # suppress progress/status output
docrawl --update                              # update to latest version

Key Features

  • Documentation-optimized extraction - Built-in selectors for Docusaurus, MkDocs, Sphinx, Next.js docs
  • Clean Markdown output - Preserves code blocks, tables, and formatting with YAML frontmatter metadata
  • Path-mirroring structure - Maintains original URL hierarchy as folders with index.md files
  • Polite crawling - Respects robots.txt, rate limits, and sitemap hints
  • Security-first - Sanitizes content, detects prompt injections, quarantines suspicious pages
  • Self-updating - Built-in update mechanism via docrawl --update

Why docrawl?

Unlike general-purpose crawlers, docrawl is purpose-built for documentation:

| Tool      | Purpose                | Output         | Documentation Support        |
|-----------|------------------------|----------------|------------------------------|
| wget/curl | File downloading       | Raw HTML       | No extraction                |
| httrack   | Website mirroring      | Full HTML site | No Markdown conversion       |
| scrapy    | Web scraping framework | Custom formats | Requires coding              |
| docrawl   | Documentation crawler  | Clean Markdown | Auto-detects docs frameworks |

docrawl combines crawling, extraction, and conversion in a single tool optimized for technical documentation.

Library Usage

Add to your Cargo.toml:

[dependencies]
docrawl = "0.1"
tokio = { version = "1", features = ["full"] }
url = "2"   # url::Url is used to build the CrawlConfig in the example below

Minimal example:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        max_depth: Some(3),
        ..Default::default()
    };
    let stats = docrawl::crawl(cfg).await?;
    println!("Crawled {} pages", stats.pages);
    Ok(())
}

CLI Options

| Option            | Description                              | Default     |
|-------------------|------------------------------------------|-------------|
| --depth <n>       | Maximum crawl depth                      | 10          |
| --all             | Crawl entire site                        | -           |
| --output <dir>    | Output directory                         | Current dir |
| --rate <n>        | Requests per second                      | 2           |
| --concurrency <n> | Parallel workers                         | 8           |
| --selector <css>  | Custom content selector                  | Auto-detect |
| --fast            | Quick mode (no assets, rate=16)          | -           |
| --resume          | Continue previous crawl                  | -           |
| --silence         | Suppress built-in progress/status output | -           |
| --update          | Update to latest version from crates.io  | -           |

Configuration

Optional docrawl.config.json:

{
  "selectors": [".content", "article"],
  "exclude_patterns": ["\\.pdf$", "/api/"],
  "max_pages": 1000,
  "host_only": true
}
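
In library projects, a config of this shape can be deserialized with serde. The sketch below is illustrative only: it assumes serde (with the derive feature) and serde_json as dependencies, and DocsConfig and load_config are made-up names, not part of the docrawl API.

use serde::Deserialize;

// Illustrative only: mirrors the JSON shape above, not docrawl's own config type.
#[derive(Debug, Deserialize)]
struct DocsConfig {
    #[serde(default)]
    selectors: Vec<String>,        // CSS selectors tried for main content
    #[serde(default)]
    exclude_patterns: Vec<String>, // regex patterns for URLs to skip
    max_pages: Option<usize>,      // hard cap on crawled pages
    #[serde(default)]
    host_only: bool,               // stay on the starting host
}

fn load_config(path: &str) -> Result<DocsConfig, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&text)?)
}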

Output Structure

output/
└── example.com/
    ├── index.md
    ├── guide/
    │   └── index.md
    ├── assets/
    │   └── images/
    └── manifest.json
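
The mirroring rule behind this layout is simple: the host becomes the top-level folder, each URL path segment becomes a nested directory, and the page itself is saved as index.md. A rough sketch of that mapping (illustrative of the convention shown above, not docrawl's internal code; query strings and asset handling are ignored):

use std::path::{Path, PathBuf};

// Illustrative sketch of the path-mirroring convention shown above.
fn mirror_path(output: &Path, page: &url::Url) -> PathBuf {
    let mut dir = output.join(page.host_str().unwrap_or("unknown-host"));
    for segment in page.path().split('/').filter(|s| !s.is_empty()) {
        dir.push(segment);
    }
    dir.join("index.md")
}

fn main() {
    let page = url::Url::parse("https://example.com/guide/").unwrap();
    // prints: output/example.com/guide/index.md
    println!("{}", mirror_path(Path::new("output"), &page).display());
}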

Each Markdown file includes frontmatter:

---
title: Page Title
source_url: https://example.com/page
fetched_at: 2025-01-18T12:00:00Z
---
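
The frontmatter block is plain YAML delimited by ---, so post-processing scripts can peel it off without a full Markdown parser. A minimal std-only sketch, assuming the file starts with a frontmatter block like the one above:

// Illustrative: split a saved page into its frontmatter and Markdown body.
fn split_frontmatter(doc: &str) -> Option<(&str, &str)> {
    let rest = doc.strip_prefix("---\n")?;
    let end = rest.find("\n---\n")?;
    Some((&rest[..end], &rest[end + 5..]))
}

fn main() {
    let doc = "---\ntitle: Page Title\nsource_url: https://example.com/page\n---\n\n# Page Title\n";
    if let Some((frontmatter, body)) = split_frontmatter(doc) {
        println!("frontmatter:\n{frontmatter}");
        println!("body:\n{body}");
    }
}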

Performance

docrawl is optimized for speed and efficiency:

  • Fast HTML to Markdown conversion using fast_html2md
  • Concurrent processing with configurable worker pools (see the sketch after this list)
  • Intelligent rate limiting to respect server resources
  • Persistent caching to avoid duplicate work
  • Memory-efficient streaming for large sites
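
The worker-pool-plus-rate-limit model above can be pictured as a semaphore bounding in-flight pages and an interval spacing out request starts. This sketch illustrates that model only: it fetches nothing and is not docrawl's scheduler, and it assumes tokio with the full feature set, as in the Cargo.toml snippet earlier.

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{interval, Duration};

#[tokio::main]
async fn main() {
    let workers = Arc::new(Semaphore::new(8));              // --concurrency 8
    let mut ticker = interval(Duration::from_millis(500));  // --rate 2: one start every 500 ms

    let mut tasks = Vec::new();
    for page in 0..10u32 {
        ticker.tick().await;                                 // space out request starts
        let permit = workers.clone().acquire_owned().await.unwrap();
        tasks.push(tokio::spawn(async move {
            // a real crawler would fetch, extract, and convert the page here
            println!("processing page {page}");
            drop(permit);                                    // free the worker slot
        }));
    }
    for task in tasks {
        task.await.unwrap();
    }
}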

Security

docrawl includes built-in security features:

  • Content sanitization removes potentially harmful HTML
  • Prompt injection detection identifies and quarantines suspicious content
  • URL validation prevents malicious redirects
  • File system sandboxing restricts output to specified directories
  • Rate limiting prevents overwhelming target servers

Contributing

Contributions are welcome! Feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.

License

MIT
