| Crates.io | docrawl |
| lib.rs | docrawl |
| version | 0.1.3 |
| created_at | 2025-09-18 04:47:06.553083+00 |
| updated_at | 2025-10-16 00:59:39.260235+00 |
| description | Docs-focused crawler library and CLI: crawl documentation sites, extract main content, convert to Markdown, mirror paths, and save with frontmatter. |
| homepage | https://github.com/neur0map/docrawl |
| repository | https://github.com/neur0map/docrawl |
| max_upload_size | |
| id | 1844244 |
| size | 834,022 |
# docrawl
A documentation-focused web crawler that converts sites to clean Markdown while preserving structure and staying polite.
```bash
# Install from crates.io
cargo install docrawl

# Or build from source
git clone https://github.com/neur0map/docrawl
cd docrawl
cargo build --release
```
docrawl "https://docs.rust-lang.org" # crawl with default depth
docrawl "https://docs.python.org" --all # full site crawl
docrawl "https://react.dev" --depth 2 # shallow crawl
docrawl "https://nextjs.org/docs" --fast # quick scan without assets
docrawl "https://example.com/docs" --silence # suppress progress/status output
docrawl --update # update to latest version
Unlike general-purpose crawlers, docrawl is purpose-built for documentation:
| Tool | Purpose | Output | Documentation Support |
|---|---|---|---|
| wget/curl | File downloading | Raw HTML | No extraction |
| httrack | Website mirroring | Full HTML site | No Markdown conversion |
| scrapy | Web scraping framework | Custom formats | Requires coding |
| docrawl | Documentation crawler | Clean Markdown | Auto-detects docs frameworks |
docrawl combines crawling, extraction, and conversion in a single tool optimized for technical documentation.
Add to your Cargo.toml:
```toml
[dependencies]
docrawl = "0.1"
tokio = { version = "1", features = ["full"] }
url = "2"   # the example below parses the base URL with url::Url
```
Minimal example:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Set only the fields you care about; the rest fall back to defaults.
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        max_depth: Some(3),
        ..Default::default()
    };

    // crawl() is async, which is why the tokio runtime is required above.
    let stats = docrawl::crawl(cfg).await?;
    println!("Crawled {} pages", stats.pages);
    Ok(())
}
```
| Option | Description | Default |
|---|---|---|
| `--depth <n>` | Maximum crawl depth | 10 |
| `--all` | Crawl entire site | - |
| `--output <dir>` | Output directory | Current dir |
| `--rate <n>` | Requests per second | 2 |
| `--concurrency <n>` | Parallel workers | 8 |
| `--selector <css>` | Custom content selector | Auto-detect |
| `--fast` | Quick mode (no assets, rate=16) | - |
| `--resume` | Continue previous crawl | - |
| `--silence` | Suppress built-in progress/status output | - |
| `--update` | Update to latest version from crates.io | - |
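In practice several flags are combined in one run; the URL and values below are only illustrative:

```bash
# Mirror a docs site three levels deep into ./docs-mirror,
# throttled to 1 request/second with 4 parallel workers.
docrawl "https://docs.example.com" --depth 3 --output ./docs-mirror --rate 1 --concurrency 4
```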
Optional docrawl.config.json:
```json
{
  "selectors": [".content", "article"],
  "exclude_patterns": ["\\.pdf$", "/api/"],
  "max_pages": 1000,
  "host_only": true
}
```
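As a sketch, assuming docrawl picks this file up from the directory it is launched in (the lookup path is not documented here), a constrained full-site crawl could be set up like this:

```bash
# Assumption: docrawl.config.json is read from the working directory.
# Cap the crawl at 1000 pages and skip PDF and API routes.
cat > docrawl.config.json <<'EOF'
{
  "exclude_patterns": ["\\.pdf$", "/api/"],
  "max_pages": 1000
}
EOF
docrawl "https://example.com/docs" --all
```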
```text
output/
└── example.com/
    ├── index.md
    ├── guide/
    │   └── index.md
    ├── assets/
    │   └── images/
    └── manifest.json
```
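Because the mirror is plain files, it can be inspected with ordinary shell tools once a crawl completes; for example:

```bash
# List every Markdown page mirrored for the crawled host,
# then look at the crawl manifest (paths follow the layout above).
find output/example.com -name '*.md'
cat output/example.com/manifest.json
```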
Each Markdown file includes frontmatter:
```yaml
---
title: Page Title
source_url: https://example.com/page
fetched_at: 2025-01-18T12:00:00Z
---
```
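Since the frontmatter is plain YAML, downstream scripts can pull fields straight out of the mirrored files; a minimal sketch:

```bash
# Collect the source URL of every mirrored page from its frontmatter.
grep -rh --include='*.md' '^source_url:' output/example.com | sort -u
```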
docrawl is optimized for speed and efficiency; HTML-to-Markdown conversion is handled by fast_html2md. It also includes built-in security features.
Contributions are welcome! Feel free to submit a pull request; for major changes, please open an issue first to discuss what you would like to change.
MIT