Crates.io | web-grep |
lib.rs | web-grep |
version | 0.1.4 |
source | src |
created_at | 2021-01-18 15:25:49.484793 |
updated_at | 2021-02-21 07:12:00.055998 |
description | A Grep Tool for HTML or XML |
homepage | https://github.com/cympfh/web-grep |
id | 343546 |
size | 25,753 |
Grep for HTML or XML.
$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
# List the innerHTML of every <p>
$ cat << EOM | web-grep '<p>{}</p>'
<body>
<p>hello</p>
<div>
<p>world</p>
</div>
</body>
EOM
hello
world
# filtering with attributes
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
world
# The placeholder {} can also stand for an attribute value
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
here
This is just a CLI for an awesome library, tanakh/easy-scraper.
cargo install web-grep
$ web-grep <QUERY> [INPUT]
The QUERY is an HTML (XML) pattern.
A pattern is a valid HTML structure that contains placeholders for innerHTML or attribute values.
web-grep has several kinds of placeholders for different cases.
{}
If you need exactly one placeholder in the pattern, use {}.
<p>{}</p>
<p class="here">
<q>{}</q>
</p>
web-grep outputs every text that matches {}.
$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3
{n}
<a href="{1}">{2}</a>
web-grep outputs the matched texts for {1}, {2}, ... in order, separated by \t.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga hoge
The delimiter can be specified with -F.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
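Because the default output is tab-separated, each line can be split back into fields in a shell loop. The following is a minimal sketch under stated assumptions: format_links is a hypothetical helper (not part of web-grep), and the sample input is hard-coded to mimic the {1}, {2} output shown above rather than produced by running web-grep.

```shell
# format_links: hypothetical helper, not part of web-grep.
# Reads "text<TAB>href" lines on stdin and prints "text -> href".
format_links() {
  tab=$(printf '\t')
  # IFS is set to a tab only for the read builtin, so spaces inside a field survive.
  while IFS="$tab" read -r text href; do
    printf '%s -> %s\n' "$text" "$href"
  done
}

# Hard-coded sample standing in for: web-grep '<a href={2}>{1}</a>'
printf 'fuga\thoge\n' | format_links
```

In real use the printf line would be replaced by the web-grep pipeline. Note that a tab counts as IFS whitespace, so consecutive tabs collapse and an empty field would be lost; when a match can be empty, choosing a non-whitespace delimiter with -F is likely the safer option.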
{xxx}
<a href="{href}">{innerHTML}</a>
The output can be formatted as JSON with --json.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}
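The JSON form is convenient for downstream tools. The following is a minimal sketch that pulls one field out of the object with jq. It assumes jq is installed; hrefs_from_json is a hypothetical helper (not part of web-grep), and the input is the sample line copied from the --json example above, not live web-grep output. How web-grep lays out JSON for multiple matches is not shown in this document, so the sketch handles only the single-object case.

```shell
# hrefs_from_json: hypothetical helper, not part of web-grep.
# Reads JSON objects on stdin and prints the raw value of their "href" key.
hrefs_from_json() {
  jq -r '.href'
}

# Run the demo only when jq is available (assumption: jq is an external tool).
if command -v jq >/dev/null 2>&1; then
  # Sample line copied from the --json example above.
  echo '{"href":"hoge","html":"fuga"}' | hrefs_from_json
fi
```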