Crates.io | docrawl |
lib.rs | docrawl |
version | 0.1.0 |
created_at | 2025-09-18 04:47:06.553083+00 |
updated_at | 2025-09-18 04:47:06.553083+00 |
description | Docs-focused crawler library and CLI: crawl documentation sites, extract main content, convert to Markdown, mirror paths, and save with frontmatter. |
homepage | https://github.com/neur0map/docrawl |
repository | https://github.com/neur0map/docrawl |
max_upload_size | |
id | 1844244 |
size | 826,749 |
Docs‑focused crawler that converts documentation sites to clean Markdown, mirrors the site’s path structure, and adds useful metadata — while staying polite and secure.
Features:
- Frontmatter on every page (title, source_url, fetched_at, optional security_flags, quarantined)
- index.md per directory
- /sitemap.xml seeding for broader coverage on --all
- Manifest (manifest.json) and persistent visited-URL cache

Prerequisites: Rust (stable toolchain).
cargo build --release
Binary: target/release/docrawl
You can use docrawl as a library and ship a single binary for your own CLI or service.
Add to Cargo.toml (using the Git URL until it’s published on crates.io):
[dependencies]
docrawl = { git = "https://github.com/neur0map/docrawl" }
url = "2"
tokio = { version = "1", features = ["full"] }
Minimal programmatic crawl:
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth: Some(2),     // link-hop depth from the start page
        rate_limit_per_sec: 8,  // global requests per second
        follow_sitemaps: false, // don't seed from /sitemap.xml
        concurrency: 8,         // parallel fetch workers
        timeout: None,          // no graceful-shutdown deadline
        resume: false,          // start fresh; don't reuse a persisted frontier
        config: docrawl::Config { host_only: true, skip_assets: true, ..Default::default() },
    };
    let stats = docrawl::crawl(cfg).await?;
    eprintln!("pages={} assets={}", stats.pages, stats.assets);
    Ok(())
}
Notes:
- docrawl::crawl uses the default on-disk layout (same as the CLI) and returns simple Stats.
- Map your own CLI flags into CrawlConfig and Config to keep one binary (a sketch follows below).
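For example, a thin wrapper that parses a couple of flags with clap and forwards them into the config types from the library example above could look like this. This is a hypothetical sketch: clap 4 with the "derive" feature is an added dependency, the "mytool" name is made up, and the integer type of max_depth is assumed from the Some(2) in the example.

// Hypothetical wrapper CLI (not part of docrawl): parse a few flags with
// clap and forward them into the config types shown above.
use clap::Parser;

#[derive(Parser)]
#[command(name = "mytool")]
struct Args {
    /// Absolute starting URL
    url: url::Url,
    /// Output root
    #[arg(short, long, default_value = "./out")]
    output: std::path::PathBuf,
    /// Max link depth from the start page
    #[arg(long, default_value_t = 2)]
    depth: usize,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();
    let cfg = docrawl::CrawlConfig {
        base_url: args.url,
        output_dir: args.output,
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth: Some(args.depth), // assumes max_depth takes a usize
        rate_limit_per_sec: 2,
        follow_sitemaps: false,
        concurrency: 8,
        timeout: None,
        resume: false,
        config: docrawl::Config { host_only: true, skip_assets: true, ..Default::default() },
    };
    let stats = docrawl::crawl(cfg).await?;
    eprintln!("pages={} assets={}", stats.pages, stats.assets);
    Ok(())
}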
docrawl "https://example.com" --all # full same‑origin site
docrawl "https://example.com" --depth 2 # start + links + their links
docrawl "https://example.com" -o ./export # choose output root
docrawl "https://example.com/docs" --fast # quick smoke test (no assets, higher rate)
Options:
- url (positional): Absolute starting URL. Only same-origin links are followed by default.
- --all: Unlimited depth. Crawls the whole same-origin site (still honors robots).
- --depth <n>: Max link depth from the start page. Default is 10 when --all isn't set.
- -o, --output <path>: Output root; a host-named folder is created inside.
- --concurrency <n>: Number of parallel fetch workers (bounded). Default: 8.
- --rate <n>: Global requests per second. Default: 2.
- --timeout-minutes <n>: Graceful shutdown after N minutes. The crawler stops scheduling new work, drains active tasks, writes the manifest, and exits.
- --resume: Resume from the previous run using the persisted frontier (see Cache). Useful after a timeout or manual stop.
- --host-only: Restrict scope to the exact origin (scheme+host+port).
- --external-assets: Allow downloading images from other domains.
- --allow-svg: Permit saving SVG images.
- --no-assets: Skip image downloads (fastest).
- --max-pages <n>: Stop after writing this many pages.
- --selector <CSS>: Preferred CSS selector for content (repeatable).
- --exclude <REGEX>: Exclude URLs matching the regex (repeatable).
- --fast: Preset for quick crawls: raises --rate/--concurrency to at least 16 and implies --no-assets.

Depth is link-hop depth (not URL path depth). Hub-style homepages can expose many pages even at small depths.
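As a concrete illustration of link-hop counting (a standalone toy, not docrawl's crawler): a page linked directly from the start page is depth 1 even if its URL path is several segments deep.

// Toy breadth-first walk over an in-memory link graph to show how
// link-hop depth is counted, independent of URL path depth.
use std::collections::{HashMap, HashSet, VecDeque};

fn bfs_depths(start: &str, links: &HashMap<&str, Vec<&str>>, max_depth: usize) -> HashMap<String, usize> {
    let mut depth = HashMap::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([(start.to_string(), 0usize)]);
    while let Some((url, d)) = queue.pop_front() {
        if !seen.insert(url.clone()) || d > max_depth {
            continue;
        }
        depth.insert(url.clone(), d);
        for next in links.get(url.as_str()).into_iter().flatten() {
            queue.push_back((next.to_string(), d + 1));
        }
    }
    depth
}

fn main() {
    let links = HashMap::from([
        ("/", vec!["/guides/advanced/setup/"]),
        ("/guides/advanced/setup/", vec![]),
    ]);
    // "/guides/advanced/setup/" is depth 1 even though its path is three segments deep.
    println!("{:?}", bfs_depths("/", &links, 2));
}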
Place docrawl.config.json in the working directory or in the output root. All fields are optional.
{
"host_only": true,
"external_assets": false,
"allow_svg": false,
"max_pages": 500,
"selectors": [".theme-doc-markdown", ".article-content"],
"exclude_patterns": ["\\.pdf$", "/private/"]
}
- host_only: Limit scope to the exact origin (scheme+host+port). Default: same-domain allowed.
- external_assets: Allow images from outside the site's origin/domain. Default: false.
- allow_svg: Permit saving SVG images. Default: false.
- max_pages: Stop after writing this many pages.
- selectors: Preferred CSS selectors for the main content; tried before built-ins.
- exclude_patterns: Regex patterns; if a canonical URL matches any, it's skipped (see the check below).
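Since exclude_patterns are plain regexes matched against canonical URLs, you can sanity-check a pattern before adding it to the config. A standalone sketch using the regex crate (illustrative only, not docrawl's internal matcher):

// Quick, standalone check of an exclude pattern against candidate URLs.
use regex::Regex;

fn main() {
    let pattern = Regex::new(r"\.pdf$").expect("valid regex");
    for url in [
        "https://example.com/guide/",
        "https://example.com/files/manual.pdf",
    ] {
        println!("{url} -> excluded: {}", pattern.is_match(url));
    }
}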
Scope and behavior:
- Output is written to a host-named folder inside the output root (e.g., ./out/example.com/).
- --all overrides --depth.
- Subdomains are allowed by default; they are excluded when host_only is true.
- robots.txt is honored via a robotstxt matcher (agent-specific allow/deny).
- When --all is set, the crawl seeds from /sitemap.xml to front-load coverage (non-recursive for nested indexes).
- Content extraction prefers <main>, <article>, then <body>; you can supply selectors to override (illustrated below).
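The fallback order can be pictured with a small, hypothetical extractor: user-supplied selectors first, then main, article, body. A sketch using the scraper crate; docrawl's actual extraction logic may differ.

// Illustrative selector fallback (not docrawl's internal code): try the
// user-supplied selectors first, then <main>, <article>, <body>.
use scraper::{Html, Selector};

fn pick_content(html: &str, user_selectors: &[&str]) -> Option<String> {
    let doc = Html::parse_document(html);
    let candidates = user_selectors
        .iter()
        .copied()
        .chain(["main", "article", "body"]);
    for css in candidates {
        if let Ok(sel) = Selector::parse(css) {
            if let Some(node) = doc.select(&sel).next() {
                return Some(node.html());
            }
        }
    }
    None
}

fn main() {
    let html = "<html><body><main><h1>Docs</h1></main></body></html>";
    let content = pick_content(html, &[".theme-doc-markdown"]);
    println!("{}", content.unwrap_or_default());
}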
Sanitization and security:
- Risky elements (script, iframe, object, form, link, meta, etc.) are removed; risky URLs are normalized and neutralized.
- Detected issues are recorded in frontmatter under security_flags.
- Flagged pages can be marked quarantined: true.
- Raster images (png, jpeg/jpg, gif, webp, bmp) are downloaded by default. SVG and data: images are disabled unless allowed via config.

Example frontmatter with flags:
---
title: Example
source_url: https://example.com/page
fetched_at: 2025-01-01T00:00:00Z
quarantined: true
security_flags:
- llm_ignore_previous
- javascript_link
---
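Downstream tools can key off these fields. A minimal sketch of a consumer that splits the frontmatter off a saved page and skips quarantined ones; it assumes serde and serde_yaml as extra dependencies and models only the fields shown in the example above.

// Hypothetical consumer of a saved page: parse the YAML frontmatter and
// skip pages marked quarantined.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct FrontMatter {
    title: Option<String>,
    source_url: Option<String>,
    #[serde(default)]
    quarantined: bool,
    #[serde(default)]
    security_flags: Vec<String>,
}

fn parse_front_matter(doc: &str) -> Option<FrontMatter> {
    // Expect "---\n<yaml>\n---\n<markdown body>".
    let rest = doc.strip_prefix("---")?;
    let end = rest.find("\n---")?;
    serde_yaml::from_str(&rest[..end]).ok()
}

fn main() {
    let page = "---\ntitle: Example\nsource_url: https://example.com/page\nquarantined: true\nsecurity_flags:\n  - javascript_link\n---\nBody text";
    if let Some(fm) = parse_front_matter(page) {
        if fm.quarantined {
            println!("skipping {:?} ({:?}): {:?}", fm.title, fm.source_url, fm.security_flags);
        }
    }
}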
Output layout:
<output_root>/example.com/
├── index.md
├── guide/
│ └── index.md
├── assets/ # images saved under site paths
└── manifest.json
manifest.json records id, url, relative path, title, quarantined, security_flags, and timestamps for each saved page.
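A small, hypothetical reader for the manifest: the exact JSON shape isn't documented beyond the fields above, so this walks generic JSON with serde_json and prints any record marked quarantined. The "./out/example.com" path is just the example layout from this README.

// Hypothetical manifest reader; tolerant of either a top-level array of
// records or an object that wraps one.
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("./out/example.com/manifest.json")?;
    let manifest: Value = serde_json::from_str(&raw)?;
    let records: Vec<Value> = match manifest {
        Value::Array(items) => items,
        Value::Object(map) => map
            .into_iter()
            .find_map(|(_, v)| v.as_array().cloned())
            .unwrap_or_default(),
        _ => Vec::new(),
    };
    for rec in records {
        if rec.get("quarantined").and_then(Value::as_bool) == Some(true) {
            println!("quarantined: {:?}", rec.get("url"));
        }
    }
    Ok(())
}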
The visited-URL cache and crawl frontier are persisted in <output_root>/.docrawl_cache (a sled key–value store). Use --resume to continue from where you left off (e.g., after --timeout-minutes or Ctrl-C). If you don't use --resume, the persisted frontier is cleared at the start and new seeds are used.
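The same resume behavior is reachable from the library side by setting resume on the config shown earlier. A sketch that reuses that example's field layout; the types are assumptions carried over from it.

// Re-run a crawl from the persisted frontier instead of clearing it.
// Field layout copied from the library example above.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth: Some(2),
        rate_limit_per_sec: 2,
        follow_sitemaps: false,
        concurrency: 8,
        timeout: None,
        resume: true, // pick up the frontier left by a previous run
        config: docrawl::Config::default(),
    };
    let stats = docrawl::crawl(cfg).await?;
    eprintln!("resumed: pages={} assets={}", stats.pages, stats.assets);
    Ok(())
}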
docrawl honors HTTP_PROXY/HTTPS_PROXY if present. If neither is set, system proxy detection is disabled to avoid rare macOS System Configuration panics.
export HTTPS_PROXY=http://user:pass@host:port
export HTTP_PROXY=http://user:pass@host:port
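The described proxy policy can be approximated in your own code as follows: honor HTTP(S)_PROXY when set, otherwise disable proxying entirely so no system lookup happens. This is a sketch using reqwest, not docrawl's implementation.

// Build an HTTP client that follows the proxy policy described above.
use reqwest::{Client, Proxy};

fn build_client() -> reqwest::Result<Client> {
    let mut builder = Client::builder();
    let https = std::env::var("HTTPS_PROXY").ok();
    let http = std::env::var("HTTP_PROXY").ok();
    if https.is_none() && http.is_none() {
        builder = builder.no_proxy(); // skip system proxy detection
    } else {
        if let Some(p) = https {
            builder = builder.proxy(Proxy::https(p.as_str())?);
        }
        if let Some(p) = http {
            builder = builder.proxy(Proxy::http(p.as_str())?);
        }
    }
    builder.build()
}

fn main() -> reqwest::Result<()> {
    let _client = build_client()?;
    Ok(())
}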
Quickstart:
1. cargo build or cargo build --release
2. docrawl "https://example.com/docs" --depth 2
3. Add selectors if needed for your site.
4. Review flagged pages via their security_flags.