| Field | Value |
|-------|-------|
| Crates.io | crawly |
| lib.rs | crawly |
| version | 0.1.9 |
| source | src |
| created_at | 2023-08-11 11:35:22.37665 |
| updated_at | 2024-06-13 00:19:10.274725 |
| description | A lightweight async Web crawler in Rust, optimized for concurrent scraping while respecting `robots.txt` rules. |
| homepage | https://ai-chat.it |
| repository | https://github.com/aichat-bot/crawly |
| max_upload_size | |
| id | 941827 |
| size | 26,186 |
A lightweight and efficient web crawler in Rust, optimized for concurrent scraping while respecting `robots.txt` rules.
- `robots.txt`: Automatically fetches and adheres to website scraping guidelines (see the sample below).
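For illustration, a site's `robots.txt` might look like the following (a hypothetical file, not shipped with this crate); a compliant crawler is expected to skip everything under the disallowed path:

```text
User-agent: *
Disallow: /private/
Crawl-delay: 2
```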
Add `crawly` to your `Cargo.toml`:
```toml
[dependencies]
crawly = "^0.1"
```
A simple usage example:
```rust
use anyhow::Result;
use crawly::Crawler;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = Crawler::new()?;
    let results = crawler.crawl_url("https://example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}
```
For more refined control over the crawler's behavior, the `CrawlerBuilder` comes in handy:
```rust
use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_max_pages(100)
        .with_max_concurrent_requests(50)
        .with_rate_limit_wait_seconds(2)
        .with_robots(true)
        .build()?;

    let results = crawler.start("https://www.example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}
```
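The examples above only print the results; a natural next step is persisting them. Below is a minimal sketch that writes each crawled page to disk, assuming the crawl returns `(String, String)` pairs of URL and content as the loops above suggest (the `file_name_for` helper is hypothetical, not part of crawly):

```rust
use std::fs;

/// Hypothetical helper: turn a URL into a filesystem-safe file name.
fn file_name_for(url: &str) -> String {
    url.chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect::<String>()
        + ".html"
}

/// Write every crawled page into the `out/` directory, one file per URL.
fn save_results(results: &[(String, String)]) -> std::io::Result<()> {
    fs::create_dir_all("out")?;
    for (url, content) in results {
        fs::write(format!("out/{}", file_name_for(url)), content)?;
    }
    Ok(())
}
```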
This crate detects Cloudflare-hosted sites: if the `cf-mitigated` header is found in a response, the URL is skipped without raising an error.
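For illustration, here is roughly what such a header check could look like with `reqwest` (a sketch of the general technique, not the crate's actual internals):

```rust
use reqwest::Client;

/// Return true if the response carries Cloudflare's `cf-mitigated`
/// header, i.e. the request was challenged and should be skipped.
async fn is_cloudflare_mitigated(client: &Client, url: &str) -> reqwest::Result<bool> {
    let response = client.get(url).send().await?;
    Ok(response.headers().contains_key("cf-mitigated"))
}
```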
Every function is instrumented, and the crate also emits DEBUG-level messages to make the crawling flow easier to follow.
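Assuming the instrumentation uses the `tracing` ecosystem (a common choice for instrumented async Rust crates, though not confirmed here), those DEBUG messages can be surfaced by installing a subscriber before starting the crawl; this requires the `tracing` and `tracing-subscriber` crates:

```rust
use tracing::Level;

fn main() {
    // Install a subscriber that prints events up to DEBUG level,
    // making the crate's instrumentation visible on stdout.
    tracing_subscriber::fmt()
        .with_max_level(Level::DEBUG)
        .init();

    // ... build and run the crawler here ...
}
```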
Contributions, issues, and feature requests are welcome!
Feel free to check the issues page. You can also take a look at the contributing guide.
This project is MIT licensed.