# scrapyard

Automatic web scraper and RSS generator library

### Quickstart

Get started by creating an event loop.

```rust
use std::path::PathBuf;

use actix_web::{App, HttpServer}; // the HTTP server here is assumed to be actix-web; any server works
use scrapyard::Feeds;

#[tokio::main]
async fn main() {
    // initialise values
    scrapyard::init(None).await;

    // load feeds from a config file
    // or create a default config file
    let feeds_path = PathBuf::from("feeds.json");
    let feeds = Feeds::load_json(&feeds_path).await
        .unwrap_or_else(|| {
            let default = Feeds::default();
            default.save_json();
            default
        });

    // start the event loop, this will not block
    feeds.start_loop().await;

    // as long as the program is running
    // the feeds will be updated regularly
    HttpServer::new(|| App::new())
        .bind(("0.0.0.0", 8080)).unwrap()
        .run().await.unwrap();
}
```

### Configuration

By default, config files can be found in `~/.config/scrapyard` (Linux), `/Users/[Username]/Library/Application Support/scrapyard` (Mac) or `C:\Users\[Username]\AppData\Roaming\scrapyard` (Windows).

To change the config directory location, specify the path:

```rust
let config_path = PathBuf::from("/my/special/path");
scrapyard::init(Some(config_path)).await;
```

Here are all the options in the main configuration file `scrapyard.json`.

```json
{
    "store": String,           // e.g. /home/user/.local/share/scrapyard/
    "max-retries": Number,     // number of retries before giving up
    "request-timeout": Number, // number of seconds before giving up on a request
    "script-timeout": Number   // number of seconds before giving up on the extractor script
}
```

#### Adding feeds

To add feeds, edit `feeds.json`.

```json
{
    "origin": String,       // origin of the feed
    "label": String,        // text id of the feed
    "max-length": Number,   // maximum number of items allowed in the feed
    "fetch-length": Number, // maximum number of items allowed to be fetched each interval
    "interval": Number,     // number of seconds between fetches
    "idle-limit": Number,   // number of seconds without requests to that feed before fetching stops
    "sort": Boolean,        // whether to sort by publish date
    "extractor": [String],  // all command line args to run the extractor, e.g. ["node", "extractor.js"]
    "title": String,        // displayed feed title
    "link": String,         // displayed feed source url
    "description": String,  // displayed feed description
    "fetch": Boolean        // should the crate fetch the content, or let the script do it
}
```

You can also include additional fields from [PseudoChannel](https://docs.rs/scrapyard/latest/struct.PseudoChannel.html) to overwrite default empty values.

#### Getting feeds

Among the functions under [FeedOption](https://docs.rs/scrapyard/latest/struct.FeedOption.html), there are 2 types of fetch functions. **Force fetching** always requests a new copy of the feed, ignoring the fetch interval. **Lazy fetching** only fetches a new copy when the existing copy is out of date. The distinction is particularly relevant when used without the auto-fetch loop.
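To make the difference concrete, here is a minimal, self-contained sketch of the decision a lazy fetch makes. It is illustration only: `needs_refetch`, `last_fetched` and `force` are hypothetical names and not part of scrapyard's API; the real fetch functions are the ones documented under `FeedOption`.

```rust
use std::time::{Duration, Instant};

/// Illustration only: how a lazy fetch decides whether the cached copy
/// is still fresh. None of these names belong to scrapyard's API.
fn needs_refetch(last_fetched: Instant, interval: Duration, force: bool) -> bool {
    // a force fetch ignores the fetch interval entirely
    if force {
        return true;
    }
    // a lazy fetch only refetches once the cached copy is out of date
    last_fetched.elapsed() >= interval
}

fn main() {
    let last_fetched = Instant::now();
    let interval = Duration::from_secs(600); // matches `interval` in feeds.json

    assert!(needs_refetch(last_fetched, interval, true));   // force: always fetch
    assert!(!needs_refetch(last_fetched, interval, false)); // lazy: cached copy is still fresh
}
```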
#### Extractor scripts

An extractor script must accept 1 command line argument and print 1 JSON response to stdout; a normal `console.log()` in JS will do. The argument specifies a file path, and that file contains the arguments for the scraper.

Command line input (the contents of the file passed as the argument):

```json
{
    "url": String,             // origin of the info fetched
    "webstr": String?,         // response from the url, only if feed.fetch = true
    "preexists": [PseudoItem], // don't output these again to avoid duplication
    "lengthLeft": Number       // maximum length before the fetch-length quota is met
    // plus everything from the feed's entry in feeds.json
}
```

Expected output:

```json
{
    "items": [PseudoItem],   // list of items extracted
    "continuation": String?  // optionally continue fetching from this url next
}
```

License: AGPL-3.0