| Crates.io | kodegen_tools_citescrape |
| lib.rs | kodegen_tools_citescrape |
| version | 0.10.21 |
| created_at | 2025-10-29 03:26:13.129656+00 |
| updated_at | 2026-01-02 15:09:23.100242+00 |
| description | KODEGEN.ᴀɪ: Memory-efficient, Blazing-Fast, MCP tools for code generation agents. |
| homepage | https://kodegen.ai |
| repository | https://github.com/cyrup-ai/kodegen-tools-citescrape |
| max_upload_size | |
| id | 1906015 |
| size | 5,777,709 |
Memory-efficient, Blazing-Fast MCP tools for code generation agents
kodegen-tools-citescrape is a high-performance web crawling and search toolkit designed specifically for AI coding agents. It provides Model Context Protocol (MCP) tools that enable agents to crawl websites with stealth browser automation, extract content as markdown, and perform full-text search on crawled data.
# Clone the repository
git clone https://github.com/cyrup-ai/kodegen-tools-citescrape.git
cd kodegen-tools-citescrape
# Build the project
cargo build --release
# Start the HTTP server (default port: 30445)
cargo run --release --bin kodegen-citescrape
The server exposes four MCP tools over HTTP transport and is typically managed by the kodegend daemon.
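Agents normally reach these tools through an MCP client rather than hand-crafted HTTP calls. As a rough sketch, assuming the standard MCP tools/call envelope (the HTTP route itself is handled by the client and kodegend), a call to the scrape_url tool looks like:
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape_url",
    "arguments": {
      "url": "https://docs.rs/tokio",
      "max_depth": 2
    }
  }
}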
use kodegen_tools_citescrape::{CrawlConfig, ChromiumoxideCrawler, Crawler};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Configure the crawler
    let config = CrawlConfig::builder()
        .start_url("https://docs.rs/tokio")?
        .storage_dir("./crawl_output")?
        .max_depth(3)
        .max_pages(100)
        .follow_external_links(false)
        .build();

    // Create and run crawler
    let crawler = ChromiumoxideCrawler::new(config);
    crawler.crawl().await?;

    Ok(())
}
The server provides four tools for AI agents:
scrape_url
Initiates a background web crawl with automatic search indexing.
Arguments:
- url (required): Starting URL to crawl
- output_dir (optional): Directory to save results (default: ${git_root}/.kodegen/citescrape or ~/.local/share/kodegen/citescrape)
- max_depth (optional): Maximum link depth (default: 3)
- max_pages (optional): Maximum pages to crawl (default: 100)
- follow_external_links (optional): Crawl external domains (default: false)
- enable_search (optional): Enable full-text indexing (default: false)
Returns:
- crawl_id: UUID for tracking the crawl
- output_dir: Path where results are saved
- status: Initial status ("running")
Example:
{
"url": "https://docs.rs/tokio",
"max_depth": 2,
"max_pages": 50,
"enable_search": true
}
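The immediate response contains only the tracking fields listed under Returns; a hedged sketch of its shape (the values here are placeholders):
{
  "crawl_id": "550e8400-e29b-41d4-a716-446655440000",
  "output_dir": "/home/user/.local/share/kodegen/citescrape",
  "status": "running"
}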
scrape_check_results
Retrieves markdown content from a crawl session.
Arguments:
- crawl_id (required): UUID from scrape_url
- offset (optional): Pagination offset (default: 0)
- limit (optional): Max results to return (default: 10)
- include_progress (optional): Include crawl progress stats (default: false)
Returns:
status: "running", "completed", or "failed"results: Array of markdown documents with metadatatotal_pages: Total pages crawledprogress (if requested): Crawl statisticsscrape_search_resultsPerforms full-text search on indexed crawl content.
scrape_search_results
Performs full-text search on indexed crawl content.
Arguments:
- crawl_id (required): UUID from scrape_url
- query (required): Search query string
- limit (optional): Max results (default: 10)
- search_type (optional): "markdown" or "plaintext" (default: "plaintext")
Returns:
- results: Ranked search results with snippets
- total_hits: Total matching documents
Example:
{
"crawl_id": "550e8400-e29b-41d4-a716-446655440000",
"query": "async runtime",
"limit": 5
}
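A hedged sketch of the corresponding response; total_hits and results come from the Returns list above, while the per-result field names (url, snippet) are assumptions for illustration only:
{
  "total_hits": 12,
  "results": [
    {
      "url": "https://docs.rs/tokio/latest/tokio/runtime/index.html",
      "snippet": "... the async runtime schedules tasks onto a pool of worker threads ..."
    }
  ]
}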
web_search
Executes a web search using a stealth browser.
Arguments:
- query (required): Search query
- engine (optional): "google", "bing", or "duckduckgo" (default: "google")
- max_results (optional): Maximum results (default: 10)
Returns:
- results: Array of search results with titles, URLs, and snippets
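Example (a hedged sketch built only from the documented arguments):
{
  "query": "rust async web crawler",
  "engine": "duckduckgo",
  "max_results": 5
}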
The codebase is organized into a few core modules:
- src/crawl_engine/: Multi-threaded crawler with rate limiting, circuit breakers, and domain concurrency control
- src/kromekover/: Browser stealth system that injects JavaScript to evade bot detection
- src/content_saver/: Pipeline for HTML preprocessing, markdown conversion, compression, and indexing
- src/search/: Tantivy-based dual-index system (markdown + plaintext)
- src/mcp/: Tool implementations and session management
The kromekover module provides advanced browser fingerprint evasion.
The CrawlConfig builder provides extensive customization:
let config = CrawlConfig::builder()
    .start_url("https://example.com")?
    .storage_dir("./output")?
    .max_depth(5)
    .max_pages(500)
    .follow_external_links(true)
    .rate_limit_delay_ms(1000)              // delay between requests, in milliseconds
    .max_concurrent_requests_per_domain(2)  // cap on simultaneous requests per domain
    .timeout_seconds(30)
    .enable_compression(true)
    .build();
Rate limiting is applied in three layers, combining the per-request delay and per-domain concurrency settings shown above with the crawl engine's circuit breakers.
# Development build
cargo build
# Release build
cargo build --release
# Check without building
cargo check
# Run all tests with nextest (recommended)
cargo nextest run
# Run specific test
cargo nextest run test_name
# Standard cargo test
cargo test
# Run with output
cargo test test_name -- --nocapture
# Basic crawl demo
cargo run --example citescrape_demo
# Interactive TUI crawler
cargo run --example direct_crawl_ratatui
# Web search example
cargo run --example direct_web_search
# Format code
cargo fmt
# Lint
cargo clippy
# Check all warnings
cargo clippy -- -W clippy::all
src/
├── browser_setup.rs # Chrome launching and stealth setup
├── config/ # Type-safe config builder
├── content_saver/ # HTML/markdown saving pipeline
├── crawl_engine/ # Core crawling logic
├── crawl_events/ # Progress event streaming
├── kromekover/ # Browser stealth evasion
├── mcp/ # MCP tool implementations
├── page_extractor/ # Content and link extraction
├── search/ # Tantivy full-text search
├── web_search/ # Browser manager for searches
└── main.rs # HTTP server entry point
Contributions are welcome! Please run cargo fmt and cargo clippy before submitting changes.
This project is dual-licensed; you may choose either license for your use.
Built with Tokio, chromiumoxide, and Tantivy, among other Rust crates.
Made with ❤️ by KODEGEN.ᴀɪ | Copyright © 2025 David Maple