scwape

version: 0.1.5
source: src
created_at: 2021-07-27 21:49:52.554134
updated_at: 2021-07-30 20:19:34.728365
description: A tool to scrape the web via CSS selectors
homepage: https://github.com/tweoss/scraper
repository: https://github.com/tweoss/scraper
max_upload_size:
id: 428109
size: 58,475
(Tweoss)

Scwape

A command line tool in Rust to scrape data from websites via CSS selectors.

Install

cargo install scwape

Usage

# syntax: scwape <url> -s "#css-selector"
# get all elements with an href from Wikipedia's home page
scwape "https://www.wikipedia.org/" -s '[href]'
# get MDN's list of CSS selectors
scwape "https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors" -s '[href*="/en-US/docs/Web/CSS/"]'
# syntax: scwape <file> -s "#css-selector" -f "format specifier\n"
# get the headers in MDN's list of CSS selectors. When specifying a format, you will usually want to end it with a newline.
curl "https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors" > selector.html
scwape selector.html -s "h2" -f "\id: \text\n"

The default format is \text\n. Blank -f arguments, and any -f arguments beyond the number of selectors, are ignored. For example,

scwape <url_or_file> -s "#selector1" -s ".selector2" -f -f "format1\n" -f "format2\n" -f "format3\n"

is equivalent to

scwape <url_or_file> -s "#selector1" -s ".selector2" -f "format1\n" -f "format2\n"

The blank -f and the extra "format3\n" are ignored.
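The pairing rule above can be sketched in Rust. This is an illustration only, not scwape's actual source: the function name pair_formats is hypothetical, a blank -f is modelled as an empty string, and falling back to the default format for a selector with no format of its own is an assumption.

```rust
/// Pair each selector with a format string, in order.
/// Assumption: a blank `-f` is modelled here as an empty string.
/// Assumption: a selector with no format left falls back to the default.
fn pair_formats(selectors: &[&str], formats: &[&str]) -> Vec<(String, String)> {
    const DEFAULT_FORMAT: &str = r"\text\n";
    // Skip blank formats; formats left over after the last selector are never read.
    let mut given = formats.iter().copied().filter(|f| !f.is_empty());
    selectors
        .iter()
        .map(|s| (s.to_string(), given.next().unwrap_or(DEFAULT_FORMAT).to_string()))
        .collect()
}

fn main() {
    let pairs = pair_formats(
        &["#selector1", ".selector2"],
        &["", r"format1\n", r"format2\n", r"format3\n"],
    );
    for (selector, format) in &pairs {
        println!("{selector} -> {format}");
    }
}
```

With the arguments from the example, the sketch pairs #selector1 with format1\n and .selector2 with format2\n, and never reads format3\n.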

The possible format specifiers are:

  • Id (\id, the element id)
  • Name (\name, the element name)
  • Classes (\classes, the classes for the element)
  • Text (\text, the combined text of child nodes for the element)
  • Html (\html, the html of the element)
  • Attrs (\attrs, the attributes of the element)

\id will be replaced, but \\id, \\\id, and so on will not. Likewise for the other format specifiers.
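The escaping rule can be sketched in Rust as follows. This is a minimal illustration, not scwape's implementation: the render function and the fields map are hypothetical, and the sketch leaves sequences such as \n untouched rather than guessing how the tool handles them. A specifier is expanded only when it is preceded by exactly one backslash.

```rust
use std::collections::HashMap;

/// Expand `\id`, `\text`, etc. in `format` using values from `fields`.
/// A specifier preceded by two or more backslashes is left as written.
fn render(format: &str, fields: &HashMap<&str, String>) -> String {
    const NAMES: [&str; 6] = ["id", "name", "classes", "text", "html", "attrs"];
    let mut out = String::new();
    let mut rest = format;
    while let Some(pos) = rest.find('\\') {
        out.push_str(&rest[..pos]);
        let after = &rest[pos..];
        // Length of the run of consecutive backslashes starting here.
        let run = after.chars().take_while(|&c| c == '\\').count();
        let tail = &after[run..];
        let matched = NAMES.iter().find(|n| tail.starts_with(**n));
        match (run, matched) {
            (1, Some(name)) => {
                // Exactly one backslash: substitute (empty if the field is absent).
                out.push_str(fields.get(*name).map(String::as_str).unwrap_or(""));
                rest = &tail[name.len()..];
            }
            _ => {
                // Any other case: copy the backslashes through unchanged.
                out.push_str(&after[..run]);
                rest = tail;
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let mut fields = HashMap::new();
    fields.insert("id", String::from("intro"));
    fields.insert("text", String::from("Hello"));
    println!("{}", render(r"\id: \text", &fields)); // prints "intro: Hello"
    println!("{}", render(r"\\id stays", &fields)); // prints "\\id stays"
}
```

The sketch matches specifier names by prefix, so a hypothetical input such as \identity would expand the leading \id; whether scwape behaves the same way is not stated in this README.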

The -d (disparate) option prints the results for each selector independently. By default, the matching elements for all selectors are printed together, in the order they appear. With -d, the elements for the first selector are printed first, then those for the second, then the third, and so on.
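The difference between the two orderings can be sketched in Rust. This is an illustration under stated assumptions, not scwape's code: the Match struct and order function are hypothetical, and "the order they appear" is read here as position in the document.

```rust
/// A hypothetical matched element: where it sits in the document
/// and which selector (by index on the command line) matched it.
#[derive(Debug, Clone, PartialEq)]
struct Match {
    position: usize,
    selector: usize,
    text: &'static str,
}

/// Order matches for printing.
/// Default: by document position, selectors interleaved.
/// Disparate: grouped by selector, document order within each group.
fn order(mut matches: Vec<Match>, disparate: bool) -> Vec<Match> {
    if disparate {
        matches.sort_by_key(|m| (m.selector, m.position));
    } else {
        matches.sort_by_key(|m| m.position);
    }
    matches
}

fn main() {
    let matches = vec![
        Match { position: 0, selector: 1, text: "first .b" },
        Match { position: 1, selector: 0, text: "first #a" },
        Match { position: 2, selector: 1, text: "second .b" },
    ];
    println!("default:");
    for m in order(matches.clone(), false) {
        println!("  {}", m.text);
    }
    println!("disparate:");
    for m in order(matches, true) {
        println!("  {}", m.text);
    }
}
```

In the default mode the sketch prints the three elements in document order; with disparate set, the element matched by the first selector comes first, followed by both elements matched by the second.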

Completions

Fish and Bash shell completions are available on GitHub and are generated upon cargo build. To generate your own, select the appropriate shell in build.rs, then run cargo build. The shell completion will be available in the completions directory. The list of available shells is in clap's documentation.
