# Crabler-Tokio - Web crawler for Crabs

[![CI][ci-badge]][ci-url]
[![Crates.io][crates-badge]][crates-url]
[![docs.rs][docs-badge]][docs-url]
[![MIT licensed][mit-badge]][mit-url]

[ci-badge]: https://github.com/matiu2/crabler-tokio/workflows/CI/badge.svg
[ci-url]: https://github.com/matiu2/crabler-tokio/actions
[crates-badge]: https://img.shields.io/crates/v/crabler-tokio.svg
[crates-url]: https://crates.io/crates/crabler-tokio
[docs-badge]: https://docs.rs/crabler-tokio/badge.svg
[docs-url]: https://docs.rs/crabler-tokio
[mit-badge]: https://img.shields.io/badge/license-MIT-blue.svg
[mit-url]: LICENSE

Asynchronous web scraper engine written in Rust.

Ported from the [crabler](https://crates.io/crates/crabler) crate; the original work and author can be found at [Gonzih/crabler](https://github.com/Gonzih/crabler).

The main difference between crabler and crabler-tokio is that we've swapped out these dependencies:

* async_std -> tokio
* surf -> reqwest

The main motivation for this is to be closer to pure Rust and drop the requirements for libcurl and libssl. According to `cargo tree`, the swap also trims the dependency tree:

* 488 dependencies for crabler
* 350 dependencies for crabler-tokio

Features:

* fully based on `tokio`
* derive macro based API
* struct-based API
* stateful scraper (structs can hold state; see the sketch at the end of this README)
* ability to download files
* ability to schedule navigation jobs in an async manner
* uses the `reqwest` crate for HTTP requests

## Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
crabler-tokio = "0.1.0"
```

The example below also uses `reqwest::Url` and `#[tokio::main]` directly, so you'll likely want `reqwest` and `tokio` in your `Cargo.toml` as well.

## Example

```rust,no_run
use std::path::Path;

use crabler_tokio::*;
use reqwest::Url;

#[derive(WebScraper)]
#[on_response(response_handler)]
#[on_html("a[href]", walk_handler)]
struct Scraper {}

impl Scraper {
    async fn response_handler(&self, response: Response) -> Result<()> {
        if response.url.ends_with(".png") && response.status == 200 {
            println!(
                "Finished downloading {} -> {:?}",
                response.url, response.download_destination
            );
        }
        Ok(())
    }

    async fn walk_handler(&self, mut response: Response, a: Element) -> Result<()> {
        if let Some(href) = a.attr("href") {
            // Create an absolute URL, resolving relative links against the page URL
            let url = Url::parse(&href)
                .unwrap_or_else(|_| Url::parse(&response.url).unwrap().join(&href).unwrap());

            // Attempt to download an image
            if href.ends_with(".png") {
                let image_name = url.path_segments().unwrap().last().unwrap();
                let p = Path::new("/tmp").join(image_name);
                let destination = p.to_string_lossy().to_string();

                if !p.exists() {
                    println!("Downloading {}", destination);
                    // Schedule the crawler to download the file to the destination.
                    // Downloading happens in the background; the await only waits
                    // for the job to be queued.
                    response.download_file(url.to_string(), destination).await?;
                } else {
                    println!("Skipping existing file {}", destination);
                }
            } else {
                // Or schedule the crawler to navigate to a given URL
                response.navigate(url.to_string()).await?;
            };
        }

        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let scraper = Scraper {};

    // Run the scraper starting from the given URL, using 20 worker threads
    scraper
        .run(Opts::new().with_urls(vec!["https://www.rust-lang.org/"]).with_threads(20))
        .await
}
```

## Sample project (for the original crabler crate)

[Gonzih/apod-nasa-scraper-rs](https://github.com/Gonzih/apod-nasa-scraper-rs/)
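
## Stateful scraper sketch

The "stateful scraper" feature means handler methods run on `&self`, so any state the struct carries is shared across handler invocations. Below is a minimal, hypothetical sketch using only the API shown in the example above: a scraper that counts the links it sees via an `AtomicUsize` (interior mutability, since handlers get `&self`). The struct name, handler name, and 100-link cutoff are illustrative, not part of the crate's API; holding an `Arc` clone in `main` lets us read the final count without assuming whether `run` borrows or consumes the scraper.

```rust,no_run
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

use crabler_tokio::*;

#[derive(WebScraper)]
#[on_html("a[href]", link_counter)]
struct CountingScraper {
    // Handlers only receive `&self`, so shared state needs interior mutability
    seen: Arc<AtomicUsize>,
}

impl CountingScraper {
    async fn link_counter(&self, mut response: Response, a: Element) -> Result<()> {
        if let Some(href) = a.attr("href") {
            // fetch_add returns the previous value, so add 1 for the current count
            let n = self.seen.fetch_add(1, Ordering::Relaxed) + 1;
            println!("link #{}: {}", n, href);

            // Keep crawling absolute links until we've seen 100 of them;
            // relative links are skipped here for brevity
            if n < 100 && href.starts_with("http") {
                response.navigate(href).await?;
            }
        }
        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let seen = Arc::new(AtomicUsize::new(0));
    let scraper = CountingScraper { seen: seen.clone() };

    scraper
        .run(Opts::new().with_urls(vec!["https://www.rust-lang.org/"]).with_threads(4))
        .await?;

    // The Arc clone lets us read the final count after the run completes
    println!("total links seen: {}", seen.load(Ordering::Relaxed));
    Ok(())
}
```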