# Skyscraper - HTML scraping with XPath

[![Dependency Status](https://deps.rs/repo/github/James-LG/Skyscraper/status.svg)](https://deps.rs/repo/github/James-LG/Skyscraper)
[![License MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/James-LG/Skyscraper/blob/master/LICENSE)
[![Crates.io](https://img.shields.io/crates/v/skyscraper.svg)](https://crates.io/crates/skyscraper)
[![doc.rs](https://docs.rs/skyscraper/badge.svg)](https://docs.rs/skyscraper)

Rust library to scrape HTML documents with XPath expressions.

> This library is major-version 0 because there are still `todo!` calls for many XPath features.
> If you encounter one that you feel should be prioritized, open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).
>
> See the [Supported XPath Features](#supported-xpath-features) section for details.

## HTML Parsing

Skyscraper has its own HTML parser implementation. The parser outputs a tree structure that can be traversed manually with parent/child relationships.

### Example: Simple HTML Parsing

```rust
use skyscraper::html::{self, parse::ParseError};

fn main() -> Result<(), ParseError> {
    let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>"##;

    // Parse the HTML text into a document tree
    let document = html::parse(html_text)?;
    Ok(())
}
```

### Example: Traversing Parent/Child Relationships

```rust
use skyscraper::html::{self, parse::ParseError, DocumentNode};

fn main() -> Result<(), ParseError> {
    // Parse the HTML text into a document
    let text = r#"<parent><child/><child/></parent>"#;
    let document = html::parse(text)?;

    // Get the children of the root node
    let parent_node: DocumentNode = document.root_node;
    let children: Vec<DocumentNode> = parent_node.children(&document).collect();
    assert_eq!(2, children.len());

    // Get the parent of both child nodes
    let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
    let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");

    assert_eq!(parent_node, parent_of_child0);
    assert_eq!(parent_node, parent_of_child1);
    Ok(())
}
```

## XPath Expressions

Skyscraper can parse XPath strings and apply them to HTML documents.

Below is a basic XPath example. Please see the [docs](https://docs.rs/skyscraper/latest/skyscraper/xpath/index.html) for more examples.

```rust
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>"##;

    // Parse the HTML and build the tree that XPath expressions run against
    let document = html::parse(html_text)?;
    let xpath_item_tree = XpathItemTree::from(&document);

    // Parse the expression and apply it to the tree
    let xpath = xpath::parse("//div")?;
    let item_set = xpath.apply(&xpath_item_tree)?;

    assert_eq!(item_set.len(), 1);

    // Unwrap the single matching item down to its element data
    let mut items = item_set.into_iter();
    let item = items
        .next()
        .unwrap();

    let element = item
        .as_node()?
        .as_tree_node()?
        .data
        .as_element_node()?;

    assert_eq!(element.name, "div");
    Ok(())
}
```

### Supported XPath Features

Below is a non-exhaustive list of the features that are currently supported.

1. Basic XPath steps: `/html/body/div`, `//div/table//span`
1. Attribute selection: `//div/@class`
1. Text selection: `//div/text()`
1. Wildcard node selection: `//body/*`
1. Predicates:
    1. Attributes: `//div[@class='hi']`
    1. Indexing: `//div[1]`
1. Functions:
    1. `fn:root()`
    1. `contains(haystack, needle)`
1. Forward axes:
    1. Child: `child::*`
    1. Descendant: `descendant::*`
    1. Attribute: `attribute::*`
    1. DescendantOrSelf: `descendant-or-self::*`
    1. (more coming soon)
1. Reverse axes:
    1. Parent: `parent::*`
    1. (more coming soon)
1. Treat expressions: `/html treat as node()`

This should cover most XPath use cases. If your use case requires an unimplemented feature, please open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).