| Crates.io | mq-crawler |
| lib.rs | mq-crawler |
| version | 0.5.12 |
| created_at | 2025-11-24 14:41:24.813639+00 |
| updated_at | 2026-01-25 14:05:42.574311+00 |
| description | Directory crawler for batch Markdown file processing |
| homepage | https://mqlang.org/ |
| repository | https://github.com/harehare/mq |
| max_upload_size | |
| id | 1947949 |
| size | 145,068 |
A web crawler that fetches HTML content, converts it to Markdown, and processes it with mq queries.
Make web scraping and content extraction effortless with intelligent Markdown conversion:
brew install harehare/tap/mqcr
cargo install mq-crawler
git clone https://github.com/harehare/mq
cd mq
cargo build --release -p mq-crawler
# Crawl a website and output to stdout
mqcr https://example.com
# Save crawled content to directory
mqcr -o ./output https://example.com
# Crawl with custom delay (default: 0.5 seconds)
mqcr -d 2.0 https://example.com
# Limit crawl depth
mqcr --depth 2 https://example.com
# Extract only headings from crawled pages
mqcr -m '.h | select(contains("News"))' https://example.com
# Extract all code blocks
mqcr -m '.code' https://developer.example.com
# Extract and transform links
mqcr -m '.link | to_text()' https://example.com
# Crawl with 3 concurrent workers
mqcr -c 3 https://example.com
# High-speed crawling with 10 workers
mqcr -c 10 -d 0.1 https://example.com
# Use custom robots.txt file
mqcr --robots-path ./custom-robots.txt https://example.com
# Extract scripts as code blocks
mqcr --extract-scripts-as-code-blocks https://example.com
# Generate YAML front matter with metadata
mqcr --generate-front-matter https://example.com
# Use page title as H1 heading
mqcr --use-title-as-h1 https://example.com
# Combine multiple options
mqcr --generate-front-matter --use-title-as-h1 -o ./docs https://example.com
# Output as JSON
mqcr --format json https://example.com
# Output as text (default)
mqcr --format text https://example.com
For JavaScript-heavy sites, use WebDriver (requires Selenium):
# Start Selenium server first
# docker run -d -p 4444:4444 selenium/standalone-chrome
# Crawl with WebDriver
mqcr -U http://localhost:4444 https://spa-example.com
# Custom timeouts
mqcr -U http://localhost:4444 \
--page-load-timeout 60 \
--script-timeout 30 \
--implicit-timeout 10 \
https://example.com
A simple web crawler that fetches HTML, converts it to Markdown, and optionally processes it with an mq query
Usage: mqcr [OPTIONS] <URL>
Arguments:
<URL> The initial URL to start crawling from
Options:
-d, --crawl-delay <CRAWL_DELAY>
Delay (in seconds) between crawl requests [default: 0.5]
-c, --concurrency <CONCURRENCY>
Number of concurrent workers [default: 1]
--depth <DEPTH>
Maximum crawl depth (0 = only start URL, 1 = start URL + direct links)
--implicit-timeout <IMPLICIT_TIMEOUT>
Timeout for element finding [default: 5]
-m, --mq-query <MQ_QUERY>
Optional mq query to process the crawled Markdown
--page-load-timeout <PAGE_LOAD_TIMEOUT>
Timeout for loading a page [default: 30]
-o, --output <OUTPUT>
Output directory for markdown files (stdout if not provided)
--robots-path <ROBOTS_PATH>
Path to custom robots.txt file
--script-timeout <SCRIPT_TIMEOUT>
Timeout for executing scripts [default: 10]
-U, --webdriver-url <WEBDRIVER_URL>
WebDriver URL for browser-based crawling (e.g., http://localhost:4444)
--format <FORMAT>
Output format: text or json [default: text]
--extract-scripts-as-code-blocks
Extract <script> tags as code blocks
--generate-front-matter
Generate YAML front matter with metadata
--use-title-as-h1
Use page title as H1 heading
-h, --help
Print help
-V, --version
Print version
git clone https://github.com/harehare/mq
cd mq
cargo build --release -p mq-crawler
cargo test -p mq-crawler
Licensed under the MIT License.