[![Workflow Status](https://github.com/tanakh/easy-scraper/workflows/Rust/badge.svg)](https://github.com/tanakh/easy-scraper/actions?query=workflow%3A%22Rust%22)
# easy-scraper
An HTML scraping library focused on ease of use.
In this library, matching patterns are described as HTML DOM trees.
You can write patterns intuitively and extract the desired content easily.
## Example
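```rust
use easy_scraper::Pattern;
// Document to scrape: a list with three items.
let doc = r#"
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
"#;
// The pattern is itself an HTML fragment; `{{foo}}` is a placeholder.
let pat = Pattern::new(r#"
<ul>
    <li>{{foo}}</li>
</ul>
"#).unwrap();
// One match per list item, each binding `foo` to the item's text.
let ms = pat.matches(doc);
assert_eq!(ms.len(), 3);
assert_eq!(ms[0]["foo"], "1");
assert_eq!(ms[1]["foo"], "2");
assert_eq!(ms[2]["foo"], "3");
```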
## Syntax
### DOM Tree
DOM trees are valid patterns. You can write placeholders in DOM trees.
```html
<ul>
    <li>{{foo}}</li>
</ul>
```
A pattern matches if it is a subset of the document.
If the document is:
```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
```
these trees are subsets of it:
```html
<ul>
    <li>1</li>
</ul>
```
```html
<ul>
    <li>2</li>
</ul>
```
```html
<ul>
    <li>3</li>
</ul>
```
So the match result is:
```json
[
{ "foo": "1" },
{ "foo": "2" },
{ "foo": "3" }
]
```
### Child
Because of the subset rule,
child nodes in a pattern match any descendants in the document.
For example, this pattern
```html
{{id}}
```
matches this document.
```html
```
### Siblings
To avoid useless matches,
siblings in a pattern are restricted to match
only consecutive children of the same parent.
For example, this pattern
```html
<ul>
    <li>{{foo}}</li>
    <li>{{bar}}</li>
</ul>
```
does not match this document.
```html
```
And for this document,
```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
```
match results are:
```json
[
{ "foo": "1", "bar": "2" },
{ "foo": "2", "bar": "3" }
]
```
`{ "foo": "1", "bar": "3" }` is not included, because those two are not consecutive children.
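The consecutive-sibling rule can be modelled without any HTML. The sketch below is not easy-scraper's implementation; it treats a parent's children as a plain slice of strings and pairs each child with the one immediately after it. The function name `consecutive_pairs` is hypothetical.

```rust
/// Pair every child with its immediate next sibling.
/// Models a pattern with two sibling placeholders (`foo`, then `bar`).
/// Illustration of the rule only, not the library's code.
fn consecutive_pairs<'a>(children: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    children
        .windows(2) // every run of two adjacent children
        .map(|w| (w[0], w[1]))
        .collect()
}
```

For the children `["1", "2", "3"]` this yields `("1", "2")` and `("2", "3")`. The pair `("1", "3")` never appears, because no window of two adjacent children contains both.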
You can allow other nodes between siblings by writing `...` in the pattern.
```html
<ul>
    <li>{{foo}}</li>
    ...
    <li>{{bar}}</li>
</ul>
```
Match result for this pattern is:
```json
[
{ "foo": "1", "bar": "2" },
{ "foo": "1", "bar": "3" },
{ "foo": "2", "bar": "3" }
]
```
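The effect of `...` between two sibling placeholders can be sketched the same way: instead of adjacent children, any earlier child may pair with any later one. This is a simplified model under that assumption, not the library's matcher, and `pairs_with_gap` is a hypothetical name.

```rust
/// Pair every child with every later sibling, in document order.
/// Models a pattern of the form `foo`, `...`, `bar`.
/// Illustration of the rule only, not the library's code.
fn pairs_with_gap<'a>(children: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut pairs = Vec::new();
    for (i, &first) in children.iter().enumerate() {
        // Any child after position `i` may play the role of `bar`.
        for &second in &children[i + 1..] {
            pairs.push((first, second));
        }
    }
    pairs
}
```

For `["1", "2", "3"]` this produces `("1", "2")`, `("1", "3")`, `("2", "3")`, the same three pairs as the match result above.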
If you want to match siblings as a subsequence instead of a consecutive run,
you can use the `subseq` pattern.
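```html
<table>
    <tr><th>AAA</th><td>aaa</td></tr>
    <tr><th>BBB</th><td>bbb</td></tr>
    <tr><th>CCC</th><td>ccc</td></tr>
    <tr><th>DDD</th><td>ddd</td></tr>
    <tr><th>EEE</th><td>eee</td></tr>
</table>
```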
For the document above,
```html
<table subseq>
    <tr><th>AAA</th><td>{{a}}</td></tr>
    <tr><th>BBB</th><td>{{b}}</td></tr>
    <tr><th>DDD</th><td>{{d}}</td></tr>
</table>
```
this pattern matches:
```json
[
{
"a": "aaa",
"b": "bbb",
"d": "ddd"
}
]
```
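Subsequence matching means the pattern's rows must appear in the document in the same order, but other rows may sit between them. A minimal sketch of that idea follows, assuming each row is a `(header, value)` pair and taking the first row that fits at each step. The name `match_subseq` is hypothetical, and the library itself may enumerate matches differently.

```rust
/// Find `wanted` headers in `rows` in order, skipping rows in between,
/// and return the value stored next to each header.
/// Returns `None` if the headers do not occur in that order.
fn match_subseq<'a>(rows: &[(&str, &'a str)], wanted: &[&str]) -> Option<Vec<&'a str>> {
    // A single iterator over the document rows: once a row is consumed,
    // later headers can only match rows that come after it.
    let mut remaining = rows.iter();
    let mut values = Vec::new();
    for &header in wanted {
        let row = remaining.find(|row| row.0 == header)?;
        values.push(row.1);
    }
    Some(values)
}
```

With the five rows `AAA` to `EEE` from the document above, asking for `["AAA", "BBB", "DDD"]` gives `Some(["aaa", "bbb", "ddd"])`, while `["BBB", "AAA"]` gives `None` because the order is reversed.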
### Attribute
You can specify attributes in patterns.
An attribute pattern matches when the pattern's attributes are a subset of the document's attributes.
This pattern
```html
<div class="attr1">{{foo}}</div>
```
matches this document.
```html
<div class="attr1 attr2">Hello</div>
```
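One plausible reading of the attribute subset rule can be sketched as follows. It assumes, without confirmation from the library's source, that every attribute named in the pattern must exist in the document and that each whitespace-separated token in the pattern's value must also appear in the document's value. The function name `attrs_subset` is hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Check whether `pattern` attributes are a subset of `doc` attributes.
/// Assumption: values are compared as sets of whitespace-separated tokens.
/// Illustration only, not the library's code.
fn attrs_subset(pattern: &HashMap<&str, &str>, doc: &HashMap<&str, &str>) -> bool {
    pattern.iter().all(|(name, pat_value)| match doc.get(*name) {
        Some(doc_value) => {
            let doc_tokens: HashSet<&str> = doc_value.split_whitespace().collect();
            // Every token the pattern asks for must be present in the document.
            pat_value.split_whitespace().all(|t| doc_tokens.contains(t))
        }
        // The document lacks an attribute that the pattern requires.
        None => false,
    })
}
```

Under this model a pattern attribute set is accepted when the document carries the same tokens plus extras, and rejected when the roles are swapped.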
You can also write placeholders in attributes.
```html
<a href="{{url}}">{{title}}</a>
```
Match result for
```html
<a href="https://www.google.com">Google</a>
<a href="https://www.yahoo.com">Yahoo</a>
```
this document is:
```json
[
{ "url": "https://www.google.com", "title": "Google" },
{ "url": "https://www.yahoo.com", "title": "Yahoo" }
]
```
### Partial text-node pattern
You can write placeholders at arbitrary positions in a text node.
```html
<ul>
    <li>A: {{a}}, B: {{b}}</li>
</ul>
```
Match result for
```html
<ul>
    <li>A: 1, B: 2</li>
    <li>A: 3, B: 4</li>
    <li>A: 5, B: 6</li>
</ul>
```
this document is:
```json
[
{ "a": "1", "b": "2" },
{ "a": "3", "b": "4" },
{ "a": "5", "b": "6" }
]
```
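To make the idea of a partial text-node pattern concrete, here is a small standalone sketch that matches a template such as `A: {{a}}, B: {{b}}` against a piece of text. It is a simplified model, not easy-scraper's matcher: each placeholder captures up to the next literal, whitespace is not normalized, and the name `match_text` is hypothetical.

```rust
/// Match `text` against `template`, where `{{name}}` marks a placeholder.
/// Returns the captured `(name, value)` pairs in template order,
/// or `None` if the literal parts of the template do not line up.
fn match_text(template: &str, text: &str) -> Option<Vec<(String, String)>> {
    let mut bindings = Vec::new();
    let mut tpl = template;
    let mut rest = text;
    loop {
        match tpl.find("{{") {
            // No placeholder left: the remaining literal must equal the remaining text.
            None => return if tpl == rest { Some(bindings) } else { None },
            Some(open) => {
                // The literal before the placeholder must be a prefix of the text.
                rest = rest.strip_prefix(&tpl[..open])?;
                let after_open = &tpl[open + 2..];
                let close = after_open.find("}}")?;
                let name = &after_open[..close];
                tpl = &after_open[close + 2..];
                // The placeholder captures everything up to the next literal,
                // or to the end of the text if no literal follows.
                let next_literal = match tpl.find("{{") {
                    Some(i) => &tpl[..i],
                    None => tpl,
                };
                let end = if next_literal.is_empty() {
                    rest.len()
                } else {
                    rest.find(next_literal)?
                };
                bindings.push((name.to_string(), rest[..end].to_string()));
                rest = &rest[end..];
            }
        }
    }
}
```

Matching the template `A: {{a}}, B: {{b}}` against the text `A: 1, B: 2` binds `a` to `1` and `b` to `2`; text that does not start with `A: ` yields `None`.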
You can also write partial placeholders inside attribute values.
```html
```
Match result for
```html
```
this document is:
```json
[
{ "userid": "foo", "username": "Foo" },
{ "userid": "bar", "username": "Bar" },
{ "userid": "baz", "username": "Baz" }
]
```
### Whole subtree pattern
The pattern `{{var:*}}` matches a whole sub-tree and captures it as a string.
```html
<div>{{body:*}}</div>
```
Match result for
```html
<div>
    Hello
    <span>hoge</span>
    World
</div>
```
this document is:
```json
[
{ "body": "Hello<span>hoge</span>World" }
]
```
### White-space
White-space is ignored in almost all positions.
## Restrictions
* A whole sub-tree pattern must be the only child of its parent node.
This is valid:
```html
<div>{{foo:*}}</div>
```
These are invalid:
```html
<div>hoge {{foo:*}}</div>
```
```html