web-grep

Crates.io: web-grep
lib.rs: web-grep
version: 0.1.4
source: src
created_at: 2021-01-18 15:25:49.484793
updated_at: 2021-02-21 07:12:00.055998
description: A Grep Tool for HTML or XML
homepage: https://github.com/cympfh/web-grep
id: 343546
size: 25,753
cympfh

What is this?

Grep for HTML or XML.

$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
# List the innerHTML of every <p>
$ cat << EOM | web-grep '<p>{}</p>'
<body>
  <p>hello</p>
  <div>
    <p>world</p>
  </div>
</body>
EOM
hello
world
# Filter by attribute
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
  <p class="not-here">hello</p>
  <div>
    <p class="here">world</p>
  </div>
</body>
EOM
world
# The placeholder {} can also stand for an attribute value
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
  <p class="not-here">hello</p>
  <div>
    <p class="here">world</p>
  </div>
</body>
EOM
here

How does this work?

This is just a CLI for an awesome library, tanakh/easy-scraper.

Installation

  1. Install cargo
    • Recommended Way: Install rustup
  2. Then,
    • cargo install web-grep

Usage

$ web-grep <QUERY> [INPUT]

The QUERY is an HTML (XML) pattern.

A pattern is a valid HTML structure with placeholders in place of innerHTML or attribute values. web-grep provides several kinds of placeholder for different cases.
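
As a rough sketch of wrapping a query in a script, the function below feeds a file to web-grep on stdin, the same way the examples in this README pipe their input. The function name grep_titles and the <title> pattern are illustrative choices, not part of web-grep; the guard only checks that the web-grep binary is on PATH.

```shell
# grep_titles FILE: print the text of every <title> element in FILE.
# Hypothetical helper for illustration; it is not shipped with web-grep.
grep_titles() {
  # Fail early with a hint if the web-grep binary is not installed.
  if ! command -v web-grep >/dev/null 2>&1; then
    echo "web-grep is not installed; try: cargo install web-grep" >&2
    return 127
  fi
  # Pass the document on stdin, as the piped examples in this README do.
  web-grep '<title>{}</title>' < "$1"
}
```

Calling grep_titles page.html would then print one line per matching element, assuming page.html exists.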

Placeholders

Anonymous Placeholder {}

If you need exactly one placeholder in the pattern, use {}.

<p>{}</p>
<p class="here">
    <q>{}</q>
</p>

web-grep prints every text matched by {}, one per line.

$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3

Numbered Placeholders {n}

<a href="{1}">{2}</a>

For each match, web-grep prints the texts captured by {1}, {2}, ... in numeric order, separated by a tab (\t).

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga	hoge

The delimiter can be specified with -F.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
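
Delimited output like this is easy to consume line by line in a script. The sketch below splits the default tab-separated form into fields. To stay runnable without web-grep installed, sample_output is a stand-in that prints the same line the README shows for the {2}/{1} example; format_links and the variable names text and href are hypothetical.

```shell
# Stand-in for:
#   echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
# which the README shows printing "fuga<TAB>hoge".
sample_output() {
  printf 'fuga\thoge\n'
}

# Read tab-separated lines and label the two fields.
# {1} (the link text) comes first, {2} (the href) second.
format_links() {
  while IFS="$(printf '\t')" read -r text href; do
    printf 'text=%s href=%s\n' "$text" "$href"
  done
}

sample_output | format_links
```

In real use, the web-grep pipeline would replace sample_output. Because the shell treats tab as whitespace when splitting, an empty captured field would be collapsed; a non-whitespace delimiter set with -F avoids that.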

Named Placeholders {xxx}

<a href="{href}">{innerHTML}</a>

With --json, each match is printed as a JSON object keyed by placeholder name.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}