crabler

Crates.io	crabler
lib.rs	crabler
version	0.1.28
created_at	2020-02-24 21:37:52.062167+00
updated_at	2022-04-01 19:02:19.744048+00
description	Web scraper for Crabs
homepage	https://github.com/Gonzih/crabler
repository	https://github.com/Gonzih/crabler
id	212121
size	24,488

Max Gonzih (Gonzih)

documentation

https://docs.rs/crabler

README

Crabler - Web crawler for Crabs

Asynchronous web scraper engine written in Rust.

Features:

Fully based on async-std
Derive-macro-based API
Struct-based API
Stateful scraper (structs can hold state)
Ability to download files
Ability to schedule navigation jobs asynchronously
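
To illustrate the "structs can hold state" feature: handlers in the example below take `&self`, so state that changes during a crawl presumably has to sit behind interior mutability. The following is a minimal, standard-library-only sketch of one way to do that. `VisitedLinks` and `mark_visited` are hypothetical names, not part of crabler, and holding such a value as a field of the scraper struct is an assumption about how it would be wired in.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical shared state for a scraper struct; not part of crabler itself.
#[derive(Clone, Default)]
struct VisitedLinks {
    seen: Arc<Mutex<HashSet<String>>>,
}

impl VisitedLinks {
    /// Records `url` and returns true if it had not been seen before.
    fn mark_visited(&self, url: &str) -> bool {
        // The lock is held only for the insert, never across an await point.
        let mut seen = self.seen.lock().expect("visited-set lock poisoned");
        seen.insert(url.to_string())
    }

    /// Number of distinct URLs recorded so far.
    fn len(&self) -> usize {
        self.seen.lock().expect("visited-set lock poisoned").len()
    }
}
```

A handler could call `mark_visited` before scheduling navigation and skip the link when it returns false, which would keep the crawler from revisiting the same page. Clones share the same underlying set, so the value can be handed to several workers.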

Example

extern crate crabler;

use std::path::Path;

use crabler::*;
use surf::Url;

#[derive(WebScraper)]
#[on_response(response_handler)]
#[on_html("a[href]", walk_handler)]
struct Scraper {}

impl Scraper {
    async fn response_handler(&self, response: Response) -> Result<()> {
        if response.url.ends_with(".png") && response.status == 200 {
            println!("Finished downloading {} -> {:?}", response.url, response.download_destination);
        }
        Ok(())
    }

    async fn walk_handler(&self, mut response: Response, a: Element) -> Result<()> {
        if let Some(href) = a.attr("href") {
            // Create absolute URL
            let url = Url::parse(&href)
                .unwrap_or_else(|_| Url::parse(&response.url).unwrap().join(&href).unwrap());

            // Attempt to download an image
            if href.ends_with(".png") {
                let image_name = url.path_segments().unwrap().last().unwrap();
                let p = Path::new("/tmp").join(image_name);
                let destination = p.to_string_lossy().to_string();

                if !p.exists() {
                    println!("Downloading {}", destination);
                    // Schedule the file download to the given destination.
                    // The download itself runs in the background; this await only
                    // waits for the job to be placed on the queue.
                    response.download_file(url.to_string(), destination).await?;
                } else {
                    println!("Skipping existing file {}", destination);
                }
            } else {
                // Otherwise, schedule the crawler to navigate to the URL
                response.navigate(url.to_string()).await?;
            }
        }

        Ok(())
    }
}

#[async_std::main]
async fn main() -> Result<()> {
    let scraper = Scraper {};

    // Run the scraper starting from the given URL, using 20 worker threads
    scraper.run(Opts::new().with_urls(vec!["https://www.rust-lang.org/"]).with_threads(20)).await
}
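
The example above derives the download destination with `url.path_segments().unwrap().last().unwrap()`, which panics for URLs that have no path segments and yields an empty file name for URLs ending in `/`. As a hedged alternative, the sketch below performs the same "last path segment joined onto a directory" step using only the standard library and returns `None` instead of panicking. `destination_for` is a hypothetical helper, not a crabler function, and it does not decode percent-encoded characters.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps a URL string to a file path inside `dir`,
/// using the last non-empty path segment as the file name.
fn destination_for(url: &str, dir: &Path) -> Option<PathBuf> {
    // Drop the query string and fragment, which are not part of the path.
    let without_suffix = url.split(|c: char| c == '?' || c == '#').next()?;
    // Skip the scheme so that the first '/' found below starts the path.
    let after_scheme = without_suffix
        .split_once("://")
        .map_or(without_suffix, |(_, rest)| rest);
    // Everything after the host is the path; a bare host has no file name.
    let (_, path) = after_scheme.split_once('/')?;
    let name = path
        .rsplit('/')
        .next()
        .filter(|segment| !segment.is_empty())?;
    // Reject names that could escape `dir`.
    if name == "." || name == ".." {
        return None;
    }
    Some(dir.join(name))
}
```

In `walk_handler`, a call such as `destination_for(&url.to_string(), Path::new("/tmp"))` could replace the two `unwrap` calls, with the `None` case simply skipping the link.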

Sample project

Gonzih/apod-nasa-scraper-rs

Commit count: 121
