scraper-main

Crates.io: scraper-main
lib.rs: scraper-main
version: 0.3.1
source: src
created_at: 2021-03-23 21:37:16.792057
updated_at: 2023-01-07 05:47:25.897456
description: The core framework xpath parsing
homepage: https://github.com/Its-its/xpath-scraper
repository: https://github.com/Its-its/xpath-scraper
id: 372735
size: 7,171
Its (Its-its)

README

XPATH Scraper

Makes it easier to scrape websites with XPath. It currently uses my own XPath parser, which is incomplete, undocumented, and was originally written to teach myself about parsing.

A very simple example is shown below; it is also available in the example folder:

use std::io::Cursor;

use scraper_main::{
	xpather,
	ConvertFromValue,
	ScraperMain,
	Scraper,
};

#[derive(Debug, Scraper)]
pub struct RedditList(
	// Uses XPATH to find the item containers
	#[scrape(xpath = r#"//div[contains(@class, "Post") and not(contains(@class, "promotedlink"))]"#)]
	Vec<RedditListItem>
);


#[derive(Debug, Scraper)]
pub struct RedditListItem {
	// URL of the post
	#[scrape(xpath = r#".//a[@data-click-id="body"]/@href"#)]
	pub url: Option<String>,

	// Title of the post
	#[scrape(xpath = r#".//a[@data-click-id="body"]/div/h3/text()"#)]
	pub title: Option<String>,

	// When it was posted
	#[scrape(xpath = r#".//a[@data-click-id="timestamp"]/text()"#)]
	pub timestamp: Option<String>,

	// Number of comments.
	#[scrape(xpath = r#".//a[@data-click-id="comments"]/span/text()"#)]
	pub comment_count: Option<String>,

	// Vote count.
	#[scrape(xpath = r#"./div[1]/div/div/text()"#)]
	pub votes: Option<String>,
}


#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
	// Request subreddit
	let resp = reqwest::get("https://www.reddit.com/r/nocontextpics/").await?;

	// Read the response body as text.
	let data = resp.text().await?;

	// Parse the response body into a Document.
	let document = xpather::parse_doc(&mut Cursor::new(data));

	// Scrape RedditList struct.
	let list = RedditList::scrape(&document, None)?;

	// Output the scraped list.
	println!("{:#?}", list);

	Ok(())
}
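
Every field in the example is scraped as an `Option<String>`, so counts such as `votes` and `comment_count` arrive as raw text. The sketch below shows one possible way to turn such a value into a number after scraping. `parse_count` is a hypothetical helper, not part of scraper-main, and the formats it handles ("532", "1.2k", "15 comments") are assumptions about how the page renders counts, not something the crate guarantees.

```rust
/// Hypothetical helper (not part of scraper-main): converts a scraped count
/// such as "532", "1.2k" or "15 comments" into a number.
/// The "k"/"m" suffixes and the trailing label are assumptions about the
/// page's formatting.
fn parse_count(raw: &str) -> Option<u64> {
    // Keep only the first whitespace-separated token, e.g. "15" from "15 comments".
    let token = raw.split_whitespace().next()?.to_ascii_lowercase();

    // Split off an optional magnitude suffix.
    let (number, multiplier) = if let Some(n) = token.strip_suffix('k') {
        (n, 1_000.0)
    } else if let Some(n) = token.strip_suffix('m') {
        (n, 1_000_000.0)
    } else {
        (token.as_str(), 1.0)
    };

    // Reject anything that is not a finite, non-negative number.
    let value: f64 = number.parse().ok()?;
    if !value.is_finite() || value < 0.0 {
        return None;
    }
    Some((value * multiplier).round() as u64)
}

fn main() {
    // Stand-ins for scraped `votes` / `comment_count` field values.
    let votes: Option<String> = Some("1.2k".to_string());
    let comments: Option<String> = Some("15 comments".to_string());
    let missing: Option<String> = None;

    println!("{:?}", votes.as_deref().and_then(parse_count)); // Some(1200)
    println!("{:?}", comments.as_deref().and_then(parse_count)); // Some(15)
    println!("{:?}", missing.as_deref().and_then(parse_count)); // None
}
```

Returning `Option<u64>` keeps a missing or unparseable value (for example a hidden vote count) as `None` instead of failing the whole scrape.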
Commit count: 41