omnivore-cli

Crates.io: omnivore-cli
Version: 0.2.0
Description: Universal web scraper and code extractor CLI - crawl websites, analyze repositories, build knowledge graphs
Homepage: https://ov.pranavkarra.me
Repository: https://github.com/Pranav-Karra-3301/omnivore
Documentation: https://docs.rs/omnivore-cli
Author: Pranav Karra (Pranav-Karra-3301)

README

🕷️ Omnivore - The Universal Web Scraper & Code Extractor

A powerful command-line tool for web scraping, code analysis, and knowledge extraction. Omnivore can crawl websites, extract code from Git repositories, and build comprehensive knowledge graphs from the data it collects.

📚 Full Documentation: ov.pranavkarra.me/docs
🌐 Website: ov.pranavkarra.me

Features

  • 🌐 Advanced Web Crawling: Multi-threaded crawling with configurable depth, politeness delays, and smart content extraction
  • 📊 Knowledge Graph Generation: Automatically build knowledge graphs from crawled content
  • 🔍 Code Repository Analysis: Extract and analyze code from any Git repository or local directory
  • 🤖 AI-Powered Extraction: Intelligent content extraction using OpenAI GPT models
  • 🎯 Smart Filtering: Automatically detect project types and apply intelligent file filtering
  • 📁 Multiple Output Formats: JSON, Markdown, HTML, CSV, and plain text
  • 🔧 Highly Configurable: Extensive configuration options via CLI flags or config files

Installation

From crates.io

cargo install omnivore-cli

From source

git clone https://github.com/Pranav-Karra-3301/omnivore.git
cd omnivore
cargo install --path omnivore-cli
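
Cargo installs binaries to ~/.cargo/bin by default, so if the omnivore command is not found after installation, make sure that directory is on your PATH. A minimal sketch, assuming the default Cargo install location:

```shell
# Add Cargo's default binary directory to PATH so the shell can
# find the freshly installed `omnivore` binary.
export PATH="$HOME/.cargo/bin:$PATH"
```

Add the export line to your shell profile (e.g. ~/.bashrc or ~/.zshrc) to make it permanent.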

Quick Start

Web Crawling

# Basic crawl with 5 workers and depth of 3
omnivore crawl https://example.com --workers 5 --depth 3

# Crawl with AI extraction
omnivore crawl https://example.com --ai-extract --ai-model gpt-4o-mini

# JavaScript-rendered content with browser mode
omnivore crawl https://example.com --browser --wait 2000

Code Repository Analysis

# Analyze a GitHub repository
omnivore git https://github.com/user/repo --output repo-analysis.txt

# Analyze local directory (even non-git directories)
omnivore git ./my-project --output project-code.txt

# Filter by file types
omnivore git ./my-project --only rs,toml,md --output rust-project.txt

# Exclude patterns
omnivore git ./my-project --exclude "test,*.log,tmp/*" --output filtered-code.txt

Configuration

# Interactive setup wizard
omnivore setup

# Set OpenAI API key for AI features
omnivore config set openai_api_key YOUR_API_KEY

# View current configuration
omnivore config show

Key Commands

omnivore crawl

Crawl websites with advanced options:

  • Multi-threaded crawling with a configurable worker count
  • robots.txt compliance and politeness delays
  • Extraction of structured data, metadata, and page content
  • Support for JavaScript-rendered pages (browser mode)
  • AI-powered intelligent extraction

omnivore git

Extract and analyze code repositories:

  • Smart project type detection (Next.js, Python, Rust, etc.)
  • Intelligent file filtering (skip node_modules, build artifacts, etc.)
  • Organized output with project structure and metadata
  • Support for both Git repositories and regular directories
  • Multiple output formats (text, JSON)

omnivore config

Manage configuration:

  • Set API keys for AI features
  • Configure default settings
  • Export/import configurations

Advanced Usage

AI-Powered Extraction

# Use GPT-4 for intelligent content extraction
omnivore crawl https://docs.example.com \
  --ai-extract \
  --ai-model gpt-4o \
  --ai-prompt "Extract API endpoints and their descriptions"

Browser Mode for JavaScript Sites

# Crawl JavaScript-heavy sites
omnivore crawl https://app.example.com \
  --browser \
  --wait 3000 \
  --screenshot \
  --interactive

Building Knowledge Graphs

# Generate a knowledge graph from crawled data
omnivore crawl https://wiki.example.com --output wiki.json
omnivore graph wiki.json --output knowledge-graph.db

Code Repository Analysis with Smart Defaults

# Omnivore automatically detects project type and applies smart filters
omnivore git https://github.com/vercel/next.js --output nextjs.txt

# Override smart defaults
omnivore git ./my-project --no-smart-filter --include "*.js,*.ts"

Configuration File

Create a .omnivore.toml file in your project or home directory:

[crawl]
default_workers = 10
default_depth = 3
respect_robots_txt = true
user_agent = "Omnivore/1.0"

[git]
smart_filter = true
max_file_size = 10485760  # 10MB
exclude_binary = true

[ai]
default_model = "gpt-4o-mini"
max_tokens = 2000
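
To bootstrap a project-local config, the settings above can be written out with a heredoc. A sketch using only the keys shown in the example (trim it to the sections you need):

```shell
# Write a starter .omnivore.toml in the current directory,
# using the configuration keys documented above.
cat > .omnivore.toml <<'EOF'
[crawl]
default_workers = 10
default_depth = 3
respect_robots_txt = true

[git]
smart_filter = true
exclude_binary = true
EOF
```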

Examples

Crawl documentation site and extract API references

omnivore crawl https://docs.rust-lang.org \
  --depth 3 \
  --include-pattern "/std/*" \
  --ai-extract \
  --ai-prompt "Extract function signatures and descriptions" \
  --output rust-std-api.json

Analyze a TypeScript project

omnivore git ./my-typescript-project \
  --only ts,tsx,json \
  --exclude "node_modules,dist,*.test.ts" \
  --output project-analysis.txt

Create a knowledge graph from a website

# Step 1: Crawl the website
omnivore crawl https://en.wikipedia.org/wiki/Artificial_intelligence \
  --depth 2 \
  --max-pages 100 \
  --output ai-wiki.json

# Step 2: Generate knowledge graph
omnivore graph ai-wiki.json \
  --output ai-knowledge.db \
  --extract-entities \
  --extract-relationships

Environment Variables

  • OMNIVORE_CONFIG_DIR: Custom configuration directory
  • OPENAI_API_KEY: OpenAI API key for AI features
  • OMNIVORE_USER_AGENT: Custom user agent for crawling
  • OMNIVORE_LOG_LEVEL: Logging level (debug, info, warn, error)
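
A hypothetical shell session wiring these variables up before a run (variable names are taken from the list above; the values and the config path are illustrative placeholders):

```shell
# Configure Omnivore through its environment variables.
# Replace YOUR_API_KEY with a real key before using AI features.
export OMNIVORE_CONFIG_DIR="$HOME/.config/omnivore"
export OMNIVORE_LOG_LEVEL="debug"
export OMNIVORE_USER_AGENT="MyCrawler/1.0"
export OPENAI_API_KEY="YOUR_API_KEY"
```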

Documentation

For comprehensive documentation, examples, and guides, visit ov.pranavkarra.me/docs.

Contributing

Contributions are welcome! Please check out our contributing guidelines.

License

This project is dual-licensed; you may use it under either of the licenses provided in the repository, at your option.
