voyager
=========================

[github](https://github.com/mattsse/voyager) [crates.io](https://crates.io/crates/voyager) [docs.rs](https://docs.rs/voyager) [build status](https://github.com/mattsse/voyager/actions?query=branch%3Amain)

With voyager you can easily extract structured data from websites. Write your own crawler/scraper with voyager, following a state machine model.

## Example

The examples use [tokio](https://tokio.rs/) as their runtime, so your `Cargo.toml` could look like this:

```toml
[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1.8", features = ["full"] }
futures = "0.3"
```

### Declare your own Scraper and model

```rust
// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
```

### Implement the `voyager::Scraper` trait

A `Scraper` consists of two associated types:

* `Output`, the type the scraper eventually produces
* `State`, the type the scraper can drag along across several requests that eventually lead to an `Output`

and the `scrape` callback, which is invoked after each received response.

Based on the state attached to the `response`, you can supply the crawler with new URLs to visit, with or without a state attached.

Scraping is done with [causal-agent/scraper](https://github.com/causal-agent/scraper).

```rust
impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }
                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }

        Ok(None)
    }
}
```

### Setup and collect all the output

Configure the crawler via a `CrawlerConfig`:

* allow/block list of domains
* delays between requests
* whether to respect the `robots.txt` rules

Feed your config and an instance of your scraper to the `Collector` that drives the `Crawler` and forwards the responses to your `Scraper`.

```rust
use voyager::scraper::Selector;
use voyager::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );

    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }

    Ok(())
}
```

See [examples](./examples) for more.
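The snippets above construct the scraper with `HackernewsScraper::default()` without showing that impl. A minimal sketch of what it could look like follows; the CSS selector strings are assumptions about the Hacker News markup chosen for illustration, not something the crate prescribes:

```rust
// A sketch of the `Default` impl the setup example relies on.
// The selector strings are illustrative assumptions about the
// Hacker News markup, not part of the voyager API.
impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
            // post rows carry their item id in the `id` attribute
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse(".hnuser").unwrap(),
            title_selector: Selector::parse(".titleline").unwrap(),
            comment_selector: Selector::parse(".comment").unwrap(),
            // how many index pages to paginate through
            max_page: 2,
        }
    }
}
```

`Selector::parse` returns a `Result`, so unwrapping once at construction time keeps selector errors out of the `scrape` callback.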
### Inject async calls

Sometimes it might be helpful to execute some other calls first, e.g. to fetch a token. You can submit `async` closures to the crawler to manually get a response and inject a state, or to drive a state to completion.

```rust
fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {
    // inject your custom crawl function that produces a `reqwest::Response` and
    // a `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth endpoint").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });

    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output { /*..*/ };
        Ok(Some(output))
    });

    Ok(None)
}
```

### Recover a state that got lost

If the crawler encounters an error due to a failed or disallowed http request, the error is reported as a `CrawlError`, which carries the last valid state. The error can then be downcast:

```rust
let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
    match output {
        Ok(post) => { /**/ }
        Err(err) => {
            // recover the state by downcasting the error
            if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
                let last_state = err.state();
            }
        }
    }
}
```

Licensed under either of these:

* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or https://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or https://opensource.org/licenses/MIT)