[![Workflow Status](https://github.com/tanakh/easy-scraper/workflows/Rust/badge.svg)](https://github.com/tanakh/easy-scraper/actions?query=workflow%3A%22Rust%22)

# easy-scraper

An HTML scraping library focused on ease of use.

In this library, matching patterns are described as HTML DOM trees.
You can write patterns intuitively and extract the desired content easily.

## Example

```rust
use easy_scraper::Pattern;

let doc = r#"
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
"#;

let pat = Pattern::new(r#"
<ul>
    <li>{{foo}}</li>
</ul>
"#).unwrap();

let ms = pat.matches(doc);

assert_eq!(ms.len(), 3);
assert_eq!(ms[0]["foo"], "1");
assert_eq!(ms[1]["foo"], "2");
assert_eq!(ms[2]["foo"], "3");
```

## Syntax

### DOM Tree

Any DOM tree is a valid pattern. You can write placeholders in DOM trees.

```html
<ul>
    <li>{{foo}}</li>
</ul>
```

A pattern matches if it is a subset of the document.

If the document is:

```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
```

these trees are subsets of it.

```html
<ul>
    <li>1</li>
</ul>
```

```html
<ul>
    <li>2</li>
</ul>
```

```html
<ul>
    <li>3</li>
</ul>
```

So the match result is

```json
[
    { "foo": "1" },
    { "foo": "2" },
    { "foo": "3" }
]
```

### Child

Child nodes are matched against any descendants because of the subset rule.

For example, this pattern

```html
  • {{id}}
```

matches this document.

```html
```

### Siblings

To avoid useless matches, siblings are restricted to match only consecutive children of the same parent.

For example, this pattern

```html
<ul>
    <li>{{foo}}</li>
    <li>{{bar}}</li>
</ul>
```

does not match this document.

```html
```

And for this document,

```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
```

the match results are:

```json
[
    { "foo": "1", "bar": "2" },
    { "foo": "2", "bar": "3" }
]
```

`{ "foo": "1", "bar": "3" }` is not included, because those nodes are not consecutive children.

You can allow other nodes between siblings by writing `...` in the pattern.

```html
<ul>
    <li>{{foo}}</li>
    ...
    <li>{{bar}}</li>
</ul>
```

The match result for this pattern is:

```json
[
    { "foo": "1", "bar": "2" },
    { "foo": "1", "bar": "3" },
    { "foo": "2", "bar": "3" }
]
```

If you want to match siblings as a subsequence instead of a consecutive substring, you can use the `subseq` pattern.

```html
    AAAaaa
    BBBbbb
    CCCccc
    DDDddd
    EEEeee
```

For this document,

```html
    AAA{{a}}
    BBB{{b}}
    DDD{{d}}
```

this pattern matches.

```json
[
    { "a": "aaa", "b": "bbb", "d": "ddd" }
]
```

### Attribute

You can specify attributes in patterns.
An attribute pattern matches when the pattern's attributes are a subset of the document's attributes.

This pattern

```html
    {{foo}}
```

matches this document.

```html
    Hello
```

You can also write placeholders in attributes.

```html
<a href="{{url}}">{{title}}</a>
```

For this document,

```html
<a href="https://www.google.com">Google</a>
<a href="https://www.yahoo.com">Yahoo</a>
```

the match result is:

```json
[
    { "url": "https://www.google.com", "title": "Google" },
    { "url": "https://www.yahoo.com", "title": "Yahoo" }
]
```

### Partial text-node pattern

You can write placeholders at arbitrary positions in a text node.

```html
```

For this document,

```html
```

the match result is:

```json
[
    { "a": "1", "b": "2" },
    { "a": "3", "b": "4" },
    { "a": "5", "b": "6" }
]
```

You can also write placeholders at arbitrary positions in attribute values.

```html
```

For this document,

```html
```

the match result is:

```json
[
    { "userid": "foo", "username": "Foo" },
    { "userid": "bar", "username": "Bar" },
    { "userid": "baz", "username": "Baz" }
]
```

### Whole subtree pattern

The pattern `{{var:*}}` matches a whole subtree as a string.

```html
    {{body:*}}
```

For this document,

```html
Hello hoge World
```

the match result is:

```json
[
    { "body": "HellohogeWorld" }
]
```

### White-space

White-space is ignored in almost all positions.

## Restrictions

* Whole sub-tree patterns must be the only element of the parent node.

This is valid:

```html
    {{foo:*}}
```

These are invalid:

```html
    hoge {{foo:*}}
```

```html