Crates.io | scrapyard |
lib.rs | scrapyard |
version | 0.3.1 |
source | src |
created_at | 2023-10-30 13:00:08.80948 |
updated_at | 2023-11-03 14:35:30.55415 |
description | Automatic web scraper and RSS generator library |
homepage | |
repository | https://github.com/siriusmart/scrapyard |
max_upload_size | |
id | 1018456 |
size | 83,600 |
Automatic web scraper and RSS generator library
Get started by creating an event loop.
use std::path::PathBuf;

use actix_web::{App, HttpServer}; // any long-running task works; actix-web is only an example
use scrapyard::Feeds; // adjust the import path if Feeds lives in a submodule

#[tokio::main]
async fn main() {
    // initialise values
    scrapyard::init(None).await;

    // load feeds from a config file,
    // or create a default config file
    let feeds_path = PathBuf::from("feeds.json");
    let feeds = Feeds::load_json(&feeds_path).await.unwrap_or_else(|| {
        let default = Feeds::default();
        default.save_json();
        default
    });

    // start the event loop, this will not block
    feeds.start_loop().await;

    // as long as the program is running,
    // the feeds will be updated regularly
    HttpServer::new(|| App::new())
        .bind(("0.0.0.0", 8080))
        .unwrap()
        .run()
        .await
        .unwrap();
}
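A sketch of the matching Cargo.toml dependencies, assuming actix-web 4 as the example server and the tokio features required by #[tokio::main]:
[dependencies]
scrapyard = "0.3"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
actix-web = "4"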
By default, config files can be found in ~/.config/scrapyard (Linux), /Users/[Username]/Library/Application Support/scrapyard (Mac) or C:\Users\[Username]\AppData\Roaming\scrapyard (Windows).
To change the config directory location, specify the path:
let config_path = PathBuf::from("/my/special/path");
scrapyard::init(Some(config_path)).await;
Here are all the options in the main configuration file, scrapyard.json.
{
    "store": String, // e.g. /home/user/.local/share/scrapyard/
    "max-retries": Number, // number of retries before giving up
    "request-timeout": Number, // number of seconds before giving up on a request
    "script-timeout": Number, // number of seconds before giving up on the extractor script
}
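For reference, a filled-in scrapyard.json might look like the following; the values are only illustrative, not the crate's actual defaults.
{
    "store": "/home/user/.local/share/scrapyard/",
    "max-retries": 3,
    "request-timeout": 10,
    "script-timeout": 30
}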
To add feeds, edit feeds.json.
{
    "origin": String, // origin of the feed
    "label": String, // text id of the feed
    "max-length": Number, // maximum number of items allowed in the feed
    "fetch-length": Number, // maximum number of items allowed to be fetched each interval
    "interval": Number, // number of seconds between fetches
    "idle-limit": Number, // number of seconds without requests to that feed before fetching stops
    "sort": Boolean, // whether to sort items by publish date
    "extractor": [String], // all command line args to run the extractor, e.g. ["node", "extractor.js"]
    "title": String, // displayed feed title
    "link": String, // displayed feed source url
    "description": String, // displayed feed description
    "fetch": Boolean // should the crate fetch the content, or let the script do it
}
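As a concrete illustration, a single feed entry might look like the one below; the values are hypothetical, and the surrounding file layout follows whatever Feeds::default() writes out.
{
    "origin": "https://example.com/blog",
    "label": "example-blog",
    "max-length": 100,
    "fetch-length": 20,
    "interval": 3600,
    "idle-limit": 86400,
    "sort": true,
    "extractor": ["node", "extractor.js"],
    "title": "Example Blog",
    "link": "https://example.com/blog",
    "description": "Posts from the example blog",
    "fetch": true
}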
You can also include additional fields from PseudoChannel to overwrite their default empty values.
Referencing the functions under FeedOption, there are two types of fetch functions.
Force fetching always requests a new copy of the feed, ignoring the fetch interval, while lazy fetching only fetches a new copy when the existing one is out of date. This is particularly relevant when the crate is used without the auto-fetch loop.
The extractor script must accept one command line argument and print a single JSON response to stdout (a normal console.log() in JS will do, you get the idea).
The argument specifies a file path, and that file contains the arguments for the scraper.
Input (the contents of the file passed as the argument):
{
    "url": String, // origin of the info fetched
    "webstr": String?, // response from the url, only present if feed.fetch = true
    "preexists": [PseudoItem], // don't output these again, to avoid duplication
    "lengthLeft": Number // maximum number of items left before the fetch-length quota is met
    // plus everything from the feed's entry in feeds.json
}
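For instance, an extractor for the hypothetical blog feed above might receive something like (values made up):
{
    "url": "https://example.com/blog",
    "webstr": "<!DOCTYPE html><html>...</html>",
    "preexists": [],
    "lengthLeft": 20,
    "label": "example-blog",
    "extractor": ["node", "extractor.js"]
    // ...plus the rest of the feed's fields from feeds.json
}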
Expected output:
{
    "items": [PseudoItem], // list of items extracted
    "continuation": String? // optionally, a url to continue fetching from next
}
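A minimal extractor sketch in Node.js, following the contract above; the item fields used here (title, link) are assumptions and should mirror whatever PseudoItem actually expects.
// extractor.js — a minimal sketch, not a full scraper
const fs = require("fs");

// the only argument is a path to a JSON file containing the input described above
const input = JSON.parse(fs.readFileSync(process.argv[2], "utf8"));

// remember links that were already emitted, to avoid duplicates
// (assumes PseudoItem carries a `link` field; adjust to the real field names)
const seen = new Set(input.preexists.map((item) => item.link));

const items = [];
// input.webstr holds the page body when feed.fetch = true;
// a real extractor would parse it properly (cheerio, regex, etc.)
for (const match of input.webstr.matchAll(/<a href="([^"]+)">([^<]+)<\/a>/g)) {
    if (items.length >= input.lengthLeft) break;
    if (seen.has(match[1])) continue;
    items.push({ title: match[2], link: match[1] }); // hypothetical PseudoItem fields
}

// print exactly one JSON response to stdout
console.log(JSON.stringify({ items }));
Point the feed's extractor field at it, e.g. ["node", "extractor.js"], and scrapyard will run it on each fetch.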
License: AGPL-3.0