| Crates.io | cadi-scraper |
| lib.rs | cadi-scraper |
| version | 2.0.0 |
| created_at | 2026-01-11 23:24:46.742587+00 |
| updated_at | 2026-01-12 06:25:34.479376+00 |
| description | CADI Scraper/Chunker utility for converting source code repos and file data into reusable CADI chunks |
| homepage | https://conflictingtheories.github.io/cadi |
| repository | https://github.com/ConflictingTheories/cadi |
| max_upload_size | |
| id | 2036705 |
| size | 175,451 |
CADI Scraper/Chunker utility for converting source code repos and file data into reusable CADI chunks.
cadi-scraper automatically analyzes source code projects and converts them into optimized, content-addressed chunks ready for distribution through CADI registries. It handles multiple programming languages, diverse file formats, and provides intelligent semantic chunking.
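Content addressing means a chunk's identifier is derived from its bytes, so identical content always maps to the same ID. The sketch below illustrates the idea only: `fnv1a_64`, `chunk_id`, and the `chunk:fnv1a64:` prefix are hypothetical names, and FNV-1a is a simple non-cryptographic stand-in, not necessarily the digest cadi-scraper uses (this page does not say which one it uses).

```rust
/// 64-bit FNV-1a hash. Deterministic but NOT cryptographic; used here only
/// as a stand-in for whatever digest a real registry relies on.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Derive a chunk identifier purely from the chunk's bytes.
/// The "chunk:fnv1a64:" prefix is a made-up format for illustration.
fn chunk_id(content: &[u8]) -> String {
    format!("chunk:fnv1a64:{:016x}", fnv1a_64(content))
}
```

Because the ID depends only on the bytes, two projects that contain the same function body would produce the same identifier, which is what makes chunks reusable across repositories.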
Add this to your Cargo.toml:
[dependencies]
cadi-scraper = "2.0"
use cadi_scraper::{Scraper, ScraperConfig, ScraperInput, ChunkingStrategy};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = ScraperConfig {
        chunking_strategy: ChunkingStrategy::Semantic,
        max_chunk_size: 50_000,
        ..Default::default()
    };
    let scraper = Scraper::new(config);
    let input = ScraperInput::LocalPath("./my-project".into());
    let output = scraper.scrape(input).await?;
    println!("Created {} chunks", output.chunks.len());
    println!("Total bytes: {}", output.statistics.total_bytes);
    Ok(())
}
# Install
cargo install cadi
# Scrape a project
cadi scrape ./my-project --strategy semantic --output ./chunks
# Publish to registry
cadi publish --registry https://registry.example.com \
  --auth-token TOKEN \
  --namespace myorg/project
# See all options
cadi scrape --help
Creates one chunk per file. Fast, simple, preserves file structure.
ChunkingStrategy::ByFile
Analyzes code structure and chunks at logical boundaries (functions, classes, modules).
ChunkingStrategy::Semantic
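As a rough illustration of chunking at logical boundaries, the sketch below starts a new chunk whenever a line begins a top-level Rust item. It is a deliberately naive, line-prefix heuristic under stated assumptions: `chunk_at_item_boundaries` and `ITEM_PREFIXES` are hypothetical names, and the crate itself may well use a real parser instead.

```rust
/// Split Rust source into chunks, starting a new chunk at each line that
/// begins a top-level item (matched by a simple prefix check at column 0).
fn chunk_at_item_boundaries(source: &str) -> Vec<String> {
    const ITEM_PREFIXES: [&str; 6] =
        ["fn ", "pub fn ", "struct ", "pub struct ", "impl ", "mod "];

    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for line in source.lines() {
        let starts_item = ITEM_PREFIXES.iter().any(|prefix| line.starts_with(prefix));
        // Close the chunk in progress before the next item begins.
        if starts_item && !current.trim().is_empty() {
            chunks.push(current.trim_end().to_string());
            current.clear();
        }
        current.push_str(line);
        current.push('\n');
    }
    // Flush whatever is left after the last line.
    if !current.trim().is_empty() {
        chunks.push(current.trim_end().to_string());
    }
    chunks
}
```

A known limitation of this heuristic: doc comments and attributes that precede an item end up in the previous chunk, which is one reason a syntax-aware approach is preferable in practice.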
Creates fixed-byte chunks, useful for uniform processing.
ChunkingStrategy::FixedSize
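Fixed-size chunking can be sketched with the standard library's `slice::chunks`. The function name `fixed_size_chunks` is hypothetical and this is not the crate's implementation, only an illustration of the idea.

```rust
/// Split `data` into consecutive chunks of at most `chunk_size` bytes.
/// The final chunk is shorter when the length is not an exact multiple.
fn fixed_size_chunks(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    data.chunks(chunk_size).collect()
}
```

Note that byte-based splitting can cut through a multi-byte UTF-8 character or the middle of a function, so the resulting chunks are uniform in size but not necessarily meaningful on their own.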
Creates parent chunks per file with children chunks for functions/classes.
ChunkingStrategy::Hierarchical
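The parent/child relationship can be pictured with a small data structure. Everything in this sketch is an assumption made for illustration: the `Chunk` struct, its fields, the `build_hierarchy` function, and the `file:<path>#<index>` ID format are hypothetical and do not reflect the crate's actual types.

```rust
/// A hypothetical chunk record: children point back at their parent's ID.
#[derive(Debug, PartialEq)]
struct Chunk {
    id: String,
    parent_id: Option<String>,
    content: String,
}

/// Build one parent chunk for the whole file plus one child per item.
/// `items` is assumed to be the already-extracted function/class bodies.
fn build_hierarchy(file_path: &str, file_content: &str, items: &[&str]) -> Vec<Chunk> {
    let parent_id = format!("file:{file_path}");
    let mut chunks = vec![Chunk {
        id: parent_id.clone(),
        parent_id: None,
        content: file_content.to_string(),
    }];
    for (index, item) in items.iter().enumerate() {
        chunks.push(Chunk {
            id: format!("{parent_id}#{index}"),
            parent_id: Some(parent_id.clone()),
            content: (*item).to_string(),
        });
    }
    chunks
}
```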
Creates chunks every N lines (default 100).
ChunkingStrategy::ByLineCount
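Line-count chunking is simple enough to sketch in a few lines. The function name `chunk_by_line_count` is hypothetical; passing 100 for `lines_per_chunk` mirrors the default mentioned above.

```rust
/// Group the lines of `source` into chunks of at most `lines_per_chunk` lines.
/// Lines within a chunk are re-joined with '\n'.
fn chunk_by_line_count(source: &str, lines_per_chunk: usize) -> Vec<String> {
    assert!(lines_per_chunk > 0, "lines_per_chunk must be non-zero");
    let lines: Vec<&str> = source.lines().collect();
    lines
        .chunks(lines_per_chunk)
        .map(|group| group.join("\n"))
        .collect()
}
```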
export CADI_CHUNKING_STRATEGY="semantic"
export CADI_MAX_CHUNK_SIZE="52428800" # 50MB
export CADI_INCLUDE_OVERLAP="true"
export CADI_EXTRACT_API_SURFACE="true"
export CADI_DETECT_LICENSES="true"
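For readers wondering how such variables might be consumed, here is a minimal sketch using only the standard library. The `EnvSettings` struct, the function names, and the fallback defaults are assumptions for illustration; the crate's real parsing logic and defaults may differ.

```rust
use std::env;

/// A hypothetical subset of settings read from the environment.
#[derive(Debug, PartialEq)]
struct EnvSettings {
    chunking_strategy: String,
    max_chunk_size: usize,
    include_overlap: bool,
}

/// Build settings from any key lookup, which keeps the logic testable
/// without touching the real process environment.
fn settings_from<F>(lookup: F) -> Result<EnvSettings, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Fallback values below are assumed defaults, not documented ones.
    let chunking_strategy =
        lookup("CADI_CHUNKING_STRATEGY").unwrap_or_else(|| "semantic".to_string());
    let max_chunk_size = match lookup("CADI_MAX_CHUNK_SIZE") {
        Some(raw) => raw
            .trim()
            .parse::<usize>()
            .map_err(|err| format!("CADI_MAX_CHUNK_SIZE: {err}"))?,
        None => 52_428_800, // 50 MiB
    };
    let include_overlap = lookup("CADI_INCLUDE_OVERLAP")
        .map(|value| value.eq_ignore_ascii_case("true"))
        .unwrap_or(false);

    Ok(EnvSettings { chunking_strategy, max_chunk_size, include_overlap })
}

/// Convenience wrapper that reads from the actual process environment.
fn settings_from_env() -> Result<EnvSettings, String> {
    settings_from(|key| env::var(key).ok())
}
```

Taking the lookup as a closure is a design choice for the sketch: tests can pass a fake lookup, while production code calls `settings_from_env()`.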
Create ~/.cadi/scraper-config.yaml:
chunking_strategy: semantic
max_chunk_size: 52428800
include_overlap: true
extract_api_surface: true
detect_licenses: true
languages:
  rust:
    enabled: true
    custom_patterns: []
  python:
    enabled: true
    custom_patterns: []
let config = ScraperConfig {
    chunking_strategy: ChunkingStrategy::Semantic,
    max_chunk_size: 50_000,
    include_overlap: true,
    hierarchy: true,
    extract_api: true,
    detect_licenses: true,
    ..Default::default()
};
Scraping produces ScraperOutput with:
pub struct ScraperOutput {
    pub chunks: Vec<ScrapedChunk>,      // Generated chunks
    pub manifest: Manifest,             // Dependency graph
    pub statistics: ScrapingStatistics, // Metrics
    pub errors: Vec<String>,            // Non-fatal errors
}
let mut config = ScraperConfig::default();
config.languages.insert("rust".to_string(), LanguageConfig {
    enabled: true,
    custom_patterns: vec![
        r"#\[derive\((.*?)\)\]".to_string(),
    ],
});
use cadi_registry::RegistryClient;

let output = scraper.scrape(input).await?;
let client = RegistryClient::new(registry_url, auth_token);
for chunk in output.chunks {
    client.publish_chunk(&chunk).await?;
}
let inputs = vec![
    ScraperInput::LocalPath("./project1".into()),
    ScraperInput::LocalPath("./project2".into()),
    ScraperInput::Url("https://github.com/user/repo".into()),
];
for input in inputs {
    let output = scraper.scrape(input).await?;
    // Process output...
}
Automatically detects:
Recognizes SPDX identifiers:
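By SPDX convention, a source file can declare its license in a header comment of the form `SPDX-License-Identifier: <expression>`. The sketch below scans for that tag; `find_spdx_identifier` is a hypothetical name, and this is only an assumption about how detection could work, not a description of the crate's license detector.

```rust
/// Return the license expression from the first line that carries an
/// `SPDX-License-Identifier:` tag, or `None` if no such line exists.
fn find_spdx_identifier(source: &str) -> Option<&str> {
    const TAG: &str = "SPDX-License-Identifier:";
    source.lines().find_map(|line| {
        let start = line.find(TAG)? + TAG.len();
        // Drop surrounding whitespace and a trailing block-comment closer.
        let expression = line[start..].trim().trim_end_matches("*/").trim();
        if expression.is_empty() {
            None
        } else {
            Some(expression)
        }
    })
}
```

The returned string is the raw expression (for example a single identifier or an `OR` combination); validating it against the SPDX license list is left out of this sketch.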
Typical performance on modern hardware:
use cadi_scraper::error::Error;

match scraper.scrape(input).await {
    Ok(output) => {
        if !output.errors.is_empty() {
            eprintln!("Warnings: {:?}", output.errors);
        }
    }
    Err(Error::InvalidInput(msg)) => eprintln!("Invalid input: {}", msg),
    Err(Error::Fetch(msg)) => eprintln!("Fetch failed: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
Part of the CADI ecosystem:
MIT License