Crates.io | scraper-macros |
lib.rs | scraper-macros |
version | 0.2.0 |
source | src |
created_at | 2021-03-23 21:45:30.402093 |
updated_at | 2022-11-08 00:44:48.98262 |
description | Macros implementation of #[derive(Scraper)] |
homepage | https://github.com/Its-its/xpath-scraper |
repository | https://github.com/Its-its/xpath-scraper |
id | 372740 |
size | 9,888 |
Makes it easier to scrape websites with XPath. It currently uses my own XPath parser, which is incomplete, undocumented, and was originally written to teach myself about parsing.
A very simple example is shown below and is also available in the example folder:
use std::io::Cursor;

use scraper_macros::Scraper;
use scraper_main::{
    xpather,
    ConvertFromValue,
    ScraperMain
};

#[derive(Debug, Scraper)]
pub struct RedditList(
    // Uses XPATH to find the item containers
    #[scrape(xpath = r#"//div[contains(@class, "Post") and not(contains(@class, "promotedlink"))]"#)]
    Vec<RedditListItem>
);

#[derive(Debug, Scraper)]
pub struct RedditListItem {
    // URL of the post.
    #[scrape(xpath = r#".//a[@data-click-id="body"]/@href"#)]
    pub url: Option<String>,

    // Title of the post.
    #[scrape(xpath = r#".//a[@data-click-id="body"]/div/h3/text()"#)]
    pub title: Option<String>,

    // When it was posted.
    #[scrape(xpath = r#".//a[@data-click-id="timestamp"]/text()"#)]
    pub timestamp: Option<String>,

    // Number of comments.
    #[scrape(xpath = r#".//a[@data-click-id="comments"]/span/text()"#)]
    pub comment_count: Option<String>,

    // Vote count.
    #[scrape(xpath = r#"./div[1]/div/div/text()"#)]
    pub votes: Option<String>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Request the subreddit page.
    let resp = reqwest::get("https://www.reddit.com/r/nocontextpics/").await?;
    let data = resp.text().await?;

    // Parse the response body into a Document.
    let document = xpather::parse_doc(&mut Cursor::new(data));

    // Scrape the RedditList struct from the document.
    let list = RedditList::scrape(&document, None)?;

    // Print the scraped list.
    println!("{:#?}", list);

    Ok(())
}
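
Every field in the example above is scraped as an Option<String>, so numeric values such as votes and comment_count still have to be converted after scraping. The following is a minimal sketch of one way to do that using only the standard library. The helper name parse_count is hypothetical and not part of scraper-macros or scraper-main, and the input shapes it handles ("87", "1,234", "1.2k", "45 comments") are assumptions about what the page text might look like, not something the crate guarantees.

```rust
/// Hypothetical helper (not part of scraper-macros): converts a scraped
/// count such as "87", "1,234", "1.2k" or "45 comments" into a number.
/// Returns None when the text does not start with a recognisable count.
fn parse_count(raw: &str) -> Option<u64> {
    // Keep only the first whitespace-separated token,
    // e.g. "45" from "45 comments". Empty input yields None.
    let token = raw.split_whitespace().next()?;

    // Drop thousands separators such as the comma in "1,234".
    let token = token.replace(',', "");

    // Split off an optional magnitude suffix (assumed to be k or m).
    let (digits, multiplier) = match token.chars().last()? {
        'k' | 'K' => (&token[..token.len() - 1], 1_000.0),
        'm' | 'M' => (&token[..token.len() - 1], 1_000_000.0),
        _ => (token.as_str(), 1.0),
    };

    // Parse as a float so that fractional values like "1.2" are accepted.
    let value: f64 = digits.parse().ok()?;

    // Reject negative numbers and non-finite values such as "inf" or "NaN".
    if !value.is_finite() || value < 0.0 {
        return None;
    }

    Some((value * multiplier).round() as u64)
}
```

With a scraped item in hand, a call such as item.votes.as_deref().and_then(parse_count) would yield an Option<u64>, leaving None both when the field was missing and when its text was not a number.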