| Crates.io | karkinos |
| lib.rs | karkinos |
| version | 0.0.1 |
| created_at | 2025-10-22 19:23:38.692724+00 |
| updated_at | 2025-10-22 19:23:38.692724+00 |
| description | Powerful and flexible web scraper with YAML configuration, supporting pagination, data transformations, caching, and multiple output formats |
| homepage | |
| repository | https://github.com/ggagosh/karkinos |
| max_upload_size | |
| id | 1896105 |
| size | 134,609 |
# Καρκινος

🦀🦀🦀 Powerful and flexible website scraper written in Rust 🦀🦀🦀

Inspired by scrape-it.
```sh
cargo install --path .
```
```sh
# Basic usage
main config.krk.yaml

# Save to file
main config.krk.yaml -o output.json

# Export to CSV
main config.krk.yaml -o output.csv -f csv
```
```yaml
config:
  url: https://example.com

data:
  title:
    selector: h1
  description:
    selector: .description
```
```yaml
config:
  # Single URL
  url: https://example.com

  # OR multiple URLs for batch scraping
  urls:
    - https://example.com/page1
    - https://example.com/page2

  # Custom headers
  headers:
    User-Agent: "Mozilla/5.0"
    Cookie: "session=abc123"

  # Timeout in seconds (default: 30)
  timeout: 60

  # Number of retry attempts (default: 0)
  retries: 3

  # Delay between requests in milliseconds (default: 0)
  delay: 1000

  # Proxy configuration
  proxy: http://proxy.example.com:8080

  # Cache configuration
  cacheDir: ./.cache
  useCache: true
```
```yaml
data:
  # Simple text extraction
  title:
    selector: h1

  # Extract from attribute
  image:
    selector: img.featured
    attr: src

  # Select nth element (0-indexed)
  firstParagraph:
    selector: p
    nth: 0

  # Default value if not found
  author:
    selector: .author
    default: "Unknown"

  # Disable trimming
  rawText:
    selector: .content
    trim: false
```
```yaml
data:
  # Extract using regex
  price:
    selector: .price
    regex: '\d+\.\d+'
    toNumber: true

  # Text replacement
  cleanTitle:
    selector: h1
    replace: ["Breaking: ", ""]

  # Case conversion
  upperTitle:
    selector: h1
    uppercase: true
  lowerTitle:
    selector: h1
    lowercase: true

  # Strip HTML tags
  cleanText:
    selector: .content
    stripHtml: true

  # Type conversion
  rating:
    selector: .rating
    toNumber: true
  isActive:
    selector: .status
    toBoolean: true
```
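Transformations can also be combined on a single field. The sketch below is illustrative only; the `discount` field and its selector are hypothetical, and it assumes transformations are applied in the order documented above (strip, then match, then convert):

```yaml
data:
  # Hypothetical field: strip markup, pull out the digits, then parse
  discount:
    selector: .badge      # e.g. "<b>Save 25%!</b>"
    stripHtml: true
    regex: '\d+'
    toNumber: true
```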
```yaml
data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      author:
        selector: .author
      tags:
        selector: .tag
        data:
          name:
            selector: span
```
Automatically scrape multiple pages:
```yaml
# Strategy 1: URL pattern with page numbers
config:
  url: https://example.com/products
  pagination:
    pagePattern: "?page={page}"
    startPage: 1
    endPage: 10
    stopOnEmpty: false
```

```yaml
# Strategy 2: Follow "next" links
config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 20
    stopOnEmpty: true
```

```yaml
# Strategy 3: Full URL pattern
config:
  url: https://example.com
  pagination:
    pagePattern: "https://example.com/search?q=rust&page={page}"
    startPage: 1
    maxPages: 5
```
Pagination Options:

- `pagePattern`: URL pattern with `{page}` placeholder
- `nextSelector`: CSS selector for "next page" link
- `startPage`: Starting page number (default: 1)
- `maxPages`: Maximum pages to scrape (0 = unlimited for `nextSelector`)
- `endPage`: Ending page number (for `pagePattern`)
- `stopOnEmpty`: Stop if no results found on page

```yaml
config:
  url: https://news.ycombinator.com

data:
  stories:
    selector: .athing
    data:
      title:
        selector: .titleline > a
      score:
        selector: .score
        toNumber: true
```
```yaml
config:
  urls:
    - https://example.com/page1
    - https://example.com/page2
  delay: 2000
  headers:
    User-Agent: "Karkinos/1.0"

data:
  title:
    selector: h1
  content:
    selector: article
```
```yaml
config:
  url: https://example.com/products

data:
  products:
    selector: .product
    data:
      name:
        selector: .product-name
        stripHtml: true
      price:
        selector: .price
        regex: '\d+\.\d+'
        toNumber: true
      inStock:
        selector: .availability
        toBoolean: true
```
```yaml
config:
  url: https://example.com
  cacheDir: ./.scrape-cache
  useCache: true
  timeout: 30
  retries: 2

data:
  content:
    selector: .main-content
```
```yaml
config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 10
    stopOnEmpty: true
  delay: 1000

data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      date:
        selector: .post-date
```
```sh
main config.krk.yaml -o output.json
main config.krk.yaml -o output.csv -f csv
```
Note: CSV output flattens simple fields. Nested arrays are JSON-encoded.
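As a purely hypothetical illustration (the field names and values are invented), a config extracting a scalar `title` plus a nested `tags` array might produce a row like this, with the array serialized as JSON and its quotes doubled per CSV quoting rules:

```csv
title,tags
"Example article","[{""name"":""rust""},{""name"":""scraping""}]"
```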
```sh
cargo run --bin gen
```
This creates krk-schema.json for configuration validation.
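If your editor runs the YAML language server (e.g. the VS Code YAML extension), one common way to get validation and completion from the generated schema is an inline modeline; the path below assumes `krk-schema.json` sits next to your config file:

```yaml
# yaml-language-server: $schema=./krk-schema.json
config:
  url: https://example.com

data:
  title:
    selector: h1
```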
```sh
cargo test
```
License: MIT