| Crates.io | web-grep |
| lib.rs | web-grep |
| version | 0.1.4 |
| created_at | 2021-01-18 15:25:49.484793+00 |
| updated_at | 2021-02-21 07:12:00.055998+00 |
| description | A Grep Tool for HTML or XML |
| homepage | https://github.com/cympfh/web-grep |
| repository | |
| max_upload_size | |
| id | 343546 |
| size | 25,753 |
Grep for HTML or XML.
$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
# List the innerHTML of every <p>
$ cat << EOM | web-grep '<p>{}</p>'
<body>
<p>hello</p>
<div>
<p>world</p>
</div>
</body>
EOM
hello
world
# filtering with attributes
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
world
# The placeholder {} can also capture an attribute value
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
here
This is just a CLI for an awesome library, tanakh/easy-scraper.
cargo install web-grep
$ web-grep <QUERY> [INPUT]
The QUERY is an HTML (XML) pattern.
A pattern is a valid HTML structure that contains placeholders for innerHTML or attribute values.
web-grep provides several kinds of placeholder for different cases.
{}
If you need exactly one placeholder in the pattern, use {}.
<p>{}</p>
<p class="here">
<q>{}</q>
</p>
web-grep prints every text matched by {}.
$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3
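For readers who want to see what this extraction amounts to, here is a rough Python sketch using only the standard library. It is not how web-grep works internally (web-grep delegates matching to easy-scraper); it only imitates the simple `<p>{}</p>` case for plain-text content, and `TagTextCollector` is a hypothetical name introduced for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each outermost occurrence of one tag.

    Hypothetical helper: a stand-in for the simple `<p>{}</p>` case only.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.buffer = []    # text pieces of the element being read
        self.matches = []   # one entry per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.matches.append("".join(self.buffer))

    def handle_data(self, data):
        # Only keep text that appears inside the tag of interest.
        if self.depth > 0:
            self.buffer.append(data)


collector = TagTextCollector("p")
collector.feed("<p>1</p><p>2</p><p>3</p>")
collector.close()
matches = collector.matches
print("\n".join(matches))
```

The sketch keeps only text nodes, so nested markup inside the tag is dropped rather than returned as innerHTML.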
{n}
<a href="{1}">{2}</a>
web-grep prints the texts matched by {1}, {2}, ... in ascending numeric order, separated by a tab (\t).
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga hoge
The delimiter can be specified with -F.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
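The ordering and delimiter rule can be sketched in a few lines of Python. This only illustrates the behaviour described above and is not web-grep's implementation; `format_row` and the `captures` dictionary are hypothetical names.

```python
from typing import Dict


def format_row(captures: Dict[int, str], delimiter: str = "\t") -> str:
    """Join captured values in ascending placeholder number."""
    return delimiter.join(captures[n] for n in sorted(captures))


# '<a href={2}>{1}</a>' applied to '<a href=hoge>fuga</a>' captures
# {2} = "hoge" and {1} = "fuga".
captures = {2: "hoge", 1: "fuga"}

row_tab = format_row(captures)          # default delimiter, like plain web-grep
row_space = format_row(captures, " ")   # like passing -F ' '
print(row_space)
```

The position of a placeholder in the pattern does not matter here; only its number decides the output column.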
{xxx}
<a href="{href}">{innerHTML}</a>
With --json, the output is formatted as JSON and each placeholder name becomes a key.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}
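When scripting around web-grep, the JSON line can be read with any JSON library. A minimal Python sketch, assuming the output line shown above has already been captured into a string; `line` is a hypothetical variable, and how several matches are emitted is not covered here.

```python
import json

# The single line printed by the --json example above.
line = '{"href":"hoge","html":"fuga"}'

record = json.loads(line)

# Keys are the placeholder names used in the pattern.
href = record["href"]
html = record["html"]
print(href, html)
```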