Crates.io       | waper
lib.rs          | waper
version         | 0.1.4
source          | src
created_at      | 2023-05-07 07:18:49.066687
updated_at      | 2023-08-12 13:39:19.848046
description     | A CLI tool to scrape HTML websites
homepage        |
repository      | https://github.com/nkitsaini/waper
max_upload_size |
id              | 858994
size            | 144,093
Waper is a CLI tool to scrape HTML websites. Here is a simple usage:
waper --seed-links "https://example.com/" --whitelist "https://example.com/.*" --whitelist "https://www.iana.org/domains/example"
This will scrape "https://example.com/" and save the HTML for each link found in a SQLite database named waper_out.sqlite.
cargo install waper
A CLI tool to scrape HTML websites
Usage: waper [OPTIONS]
waper <COMMAND>
Commands:
scrape This is also the default command, so it's optional to include in the args
completion Print shell completion script
help Print this message or the help of the given subcommand(s)
Options:
-w, --whitelist <WHITELIST>
Whitelist regexes: apart from the seeds, only URLs matching these will be scanned
-b, --blacklist <BLACKLIST>
Blacklist regexes: URLs matching these will never be scanned. By default nothing is blacklisted [default: a^]
-s, --seed-links <SEED_LINKS>
Links to start with
-o, --output-file <OUTPUT_FILE>
Sqlite output file [default: waper_out.sqlite]
-m, --max-parallel-requests <MAX_PARALLEL_REQUESTS>
Maximum number of parallel requests [default: 5]
-i, --include-db-links
Also include unprocessed links from the `links` table in the db, if present. Helpful when you want to continue scraping from a previously unfinished session
-v, --verbose
Enable verbose (debug) output
-h, --help
Print help
-V, --version
Print version
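The flags above compose. As an illustrative sketch (not from the project's docs), a run that resumes a previously unfinished session by re-using the same output file might look like this, using only the flags documented above:

```shell
# Resume scraping: point at the same output file and pass -i so that
# unprocessed links already recorded in its `links` table are picked up.
waper scrape \
  --seed-links "https://example.com/" \
  --whitelist "https://example.com/.*" \
  --output-file waper_out.sqlite \
  --include-db-links \
  --max-parallel-requests 10
```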
Data is stored in a SQLite database with the schema defined in ./sqls/INIT.sql. There are three tables:
results
: Stores the content of every request for which a response was received
errors
: Stores the error message for every request that could not be completed
links
: Stores the URLs of both visited and unvisited links
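To get a feel for the layout, here is a self-contained sketch that fabricates a database with the same three tables and inspects it with the sqlite3 CLI. The column sets shown are an assumption for illustration; the authoritative schema is in ./sqls/INIT.sql.

```shell
# Fabricate a stand-in database (illustrative columns only).
sqlite3 demo_waper.sqlite <<'SQL'
CREATE TABLE results (url TEXT, time TEXT, html TEXT);
CREATE TABLE errors  (url TEXT, time TEXT, error TEXT);
CREATE TABLE links   (url TEXT);
INSERT INTO results VALUES ('https://example.com/', '2023-05-07 06:47:33', '<html></html>');
INSERT INTO links   VALUES ('https://example.com/');
SQL

# The .tables and .schema dot-commands reveal what a database contains.
sqlite3 demo_waper.sqlite '.tables'
sqlite3 demo_waper.sqlite '.schema results'
sqlite3 demo_waper.sqlite 'select count(*) from results'
```

The same dot-commands work on a real waper_out.sqlite produced by a scrape.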
Results can be queried using any SQLite client. Example using the sqlite3 CLI:
$ sqlite3 waper_out.sqlite 'select url, time, length(html) from results'
https://example.com/|2023-05-07 06:47:33|1256
https://www.iana.org/domains/example|2023-05-07 06:47:39|80
For nicely formatted output you can tweak sqlite3 settings:
$ sqlite3 waper_out.sqlite '.headers on' '.mode column' 'select url, time, length(html) from results'
url time length(html)
------------------------------------ ------------------- ------------
https://example.com/ 2023-05-07 06:47:33 1256
https://www.iana.org/domains/example 2023-05-07 06:47:39 80
To quickly search through all the URLs you can use fzf:
sqlite3 waper_out.sqlite 'select url from links' | fzf
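Building on that, the URL picked with fzf can be fed back into a query to dump its stored HTML. The fixture below is only an assumption to make the snippet runnable without a prior scrape; with a real waper_out.sqlite, skip it and use the interactive fzf line instead.

```shell
# Fixture: a tiny stand-in results table (skip if waper_out.sqlite
# already exists from a real scrape).
sqlite3 waper_out.sqlite <<'SQL'
CREATE TABLE IF NOT EXISTS results (url TEXT, time TEXT, html TEXT);
INSERT INTO results VALUES ('https://example.com/', '2023-05-07 06:47:33', '<html><body>demo</body></html>');
SQL

# Interactive version:
#   url=$(sqlite3 waper_out.sqlite 'select url from links' | fzf)
# Non-interactive stand-in: take the first stored url.
url=$(sqlite3 waper_out.sqlite 'select url from results limit 1')
sqlite3 waper_out.sqlite "select html from results where url = '$url'"
```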
If you find any bugs or have feature suggestions, please file an issue on GitHub.