Documentation
How it Works
rsspls fetches each page specified by the configuration and extracts elements
from the page using CSS selectors. For example, selectors determine which
elements supply the title and content of each feed entry. The generated feeds
are written to an output directory. HTTP caching is used so that a feed is only
updated when the source page changes.
Supported Platforms
rsspls should work on all platforms supported by the Rust compiler,
including Linux, macOS, Windows, and BSD. Pre-compiled binaries are available
for common platforms. See the install page for details.
Usage
rsspls [OPTIONS] -o OUTPUT_DIR
OPTIONS:
-h, --help
Prints this help information
-c, --config
Specify the path to the configuration file.
$XDG_CONFIG_HOME/rsspls/feeds.toml is used if not supplied.
-o, --output
Directory to write generated feeds to.
-V, --version
Prints version information
FILES:
$XDG_CONFIG_HOME/rsspls/feeds.toml rsspls configuration file.
$XDG_CONFIG_HOME/rsspls Configuration directory.
$XDG_CACHE_HOME/rsspls Cache directory.
Note: XDG_CONFIG_HOME defaults to ~/.config, XDG_CACHE_HOME
defaults to ~/.cache.
Configuration
Unless specified via the --config command line option, rsspls reads its
configuration from one of the following paths:
- UNIX-like systems: $XDG_CONFIG_HOME/rsspls/feeds.toml, or
  ~/.config/rsspls/feeds.toml if XDG_CONFIG_HOME is unset.
- Windows: C:\Users\You\AppData\Roaming\rsspls\feeds.toml
The configuration file is in TOML format.
The parts of the page to extract for the feed are specified using CSS selectors.
Annotated Sample Configuration
The sample file below demonstrates all the parts of the configuration.
# The configuration must start with the [rsspls] section
[rsspls]
# Optional output directory to write the feeds to. If not specified it must be supplied via
# the --output command line option.
output = "/tmp"
# Optional proxy address. If specified, all requests will be routed through it.
# The address needs to be in the format: protocol://ip_address:port
# The supported protocols are: http, https, socks and socks5h.
# It can also be specified via the environment variable `http_proxy` or `HTTPS_PROXY`.
# The config file takes precedence, then the env vars in the above order.
# proxy = "socks5://10.64.0.1:1080"
# Next is the array of feeds, each one starts with [[feed]]
[[feed]]
# The title of the channel in the feed
title = "My Great RSS Feed"
# The output filename without the output directory to write this feed to.
# Note: this is a filename only, not a path. It should not contain slashes.
filename = "wezm.rss"
# Optional User-Agent header to be set for the HTTP request.
# user_agent = "Mozilla/5.0"
# The configuration for the feed
[feed.config]
# The URL of the web page to generate the feed from.
url = "https://www.wezm.net/"
# A CSS selector to select elements on the page that represent items in the feed.
item = "article"
# A CSS selector relative to `item` to an element that will supply the title for the item.
heading = "h3"
# A CSS selector relative to `item` to an element that will supply the link for the item.
# Note: This element must have a `href` attribute.
# Note: If not supplied, rsspls will attempt to use the heading selector for the link, for backwards
# compatibility with earlier versions. A message will be emitted in this case.
link = "h3 a"
# Optional CSS selector relative to `item` that will supply the content of the RSS item.
summary = ".post-body"
# Optional CSS selector relative to `item` that supplies media content (audio, video, image)
# to be added as an RSS enclosure.
# Note: The media URL must be given by the `src` or `href` attribute of the selected element.
# Note: Currently if the item does not match the media selector then it will be skipped.
# media = "figure img"
# Optional CSS selector relative to `item` that supplies the publication date of the RSS item.
date = "time"
# Alternatively for more control `date` can be specified as a table:
# [feed.config.date]
# selector = "time"
# # Optional type of value being parsed.
# # Defaults to DateTime, can also be Date if you're parsing a value without a time.
# type = "Date"
# # Format of the date to parse. See the following for the syntax:
# # https://time-rs.github.io/book/api/format-description.html
# format = "[day padding:none]/[month padding:none]/[year]" # will parse 1/2/1934 style dates
# A second example feed
[[feed]]
title = "Example Site"
filename = "example.rss"
[feed.config]
url = "https://example.com/"
item = "div"
heading = "a"
The first example above (for my blog WezM.net) matches HTML that looks like this:
<section class="posts-section">
<h2>Recent Posts</h2>
<article id="garage-door-monitor">
<h3><a href="https://www.wezm.net/v2/posts/2022/garage-door-monitor/">Monitoring My Garage Door With a Raspberry Pi, Rust, and a 13Mb Linux System</a></h3>
<div class="post-metadata">
<div class="date-published">
<time datetime="2022-04-20T06:38:27+10:00">20 April 2022</time>
</div>
</div>
<div class="post-body">
<p>I’ve accidentally left our garage door open a few times. To combat this I built
a monitor that sends an alert via Mattermost when the door has been left open
for more than 5 minutes. This turned out to be a super fun project. I used
parts on hand as much as possible, implemented the monitoring application in
Rust, and then built a stripped down Linux image to run it.
</p>
</div>
<a href="https://www.wezm.net/v2/posts/2022/garage-door-monitor/">Continue Reading →</a>
</article>
<article id="monospace-kobo-ereader">
<!-- another article -->
</article>
<!-- more articles -->
<a href="https://www.wezm.net/v2/posts/">View more posts →</a>
</section>
output
Optional output directory to write the feeds to. If not specified it must be
supplied via the --output command line option. The directory will be created if
it does not exist.
Tilde expansion is performed on the path in the config file. This allows you to
refer to the home directory of the user running rsspls. For example,
~/Documents/rsspls could be used to place the output in your Documents folder.
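As a minimal sketch of the tilde expansion described above, the [rsspls] section might look like this. The directory name is only an illustration.

```toml
[rsspls]
# "~" is expanded to the home directory of the user running rsspls.
# Documents/rsspls is an example path; it is created if it does not exist.
output = "~/Documents/rsspls"
```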
proxy
Optional proxy address. If specified, all requests will be routed through it.
The address needs to be in the format: protocol://ip_address:port
The supported protocols are: http, https, socks and socks5h.
The proxy for http and https requests can also be specified with the
environment variables http_proxy and HTTPS_PROXY respectively.
The config file takes precedence over environment variables.
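For illustration, a proxy entry following the format above could be written as shown here. The address 127.0.0.1:8080 is a placeholder chosen for this sketch, not a default.

```toml
[rsspls]
output = "/tmp"
# Placeholder address in protocol://ip_address:port form.
# When this key is set it takes precedence over http_proxy and HTTPS_PROXY.
proxy = "http://127.0.0.1:8080"
```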
feed.title
The title of the channel in the generated feed.
feed.filename
The output filename to write this feed to. Note: this is a filename only, not a path. It should not contain slashes. It will be written to the output directory.
feed.config.url
The URL of the web page to generate the feed from. The page at this address will be fetched and processed to turn it into a feed.
feed.config.item
A CSS selector to select elements on the page that represent items in the feed. The other CSS selectors match elements inside the elements that this selector matches.
feed.config.heading
A CSS selector relative to item to an element that will supply the title for
the item in the feed.
feed.config.link
A CSS selector relative to item to an element that will supply the
link for the item in the feed.
Note: This element must have a href attribute.
Note: If not supplied, rsspls will attempt to use the feed.config.heading
selector as the link element for backwards compatibility with earlier
versions. A warning message will be emitted in this case. It is
recommended to specify the link selector explicitly.
feed.config.summary
Optional CSS selector relative to item that will supply the content of the
RSS item. This value may be a single CSS selector, or an array of CSS
selectors.
The CSS selectors may also include a comma separated list of elements to match.
For example: summary = "p, blockquote" will match p or blockquote
elements, adding them to the RSS feed in the order they are encountered in the
HTML document.
The array form of summary allows the order of the matched elements to be
controlled, enabling elements to be added to the feed in a different order to
the source HTML document. For example, summary = ["p", "blockquote"] causes
rsspls to make a pass over the source HTML document, adding p elements to
the feed, followed by a pass adding blockquote elements to the feed.
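The two forms of summary can be compared side by side with a sketch like the one below. It defines two feeds over the same page; the titles, filenames, URL, and the "article" and "h2" selectors are hypothetical and would need to match the real page.

```toml
[rsspls]
output = "/tmp"

# Feed 1: comma separated selector. Matching p and blockquote elements
# are added in the order they appear in the HTML document.
[[feed]]
title = "Example (document order)"
filename = "example-document-order.rss"

[feed.config]
url = "https://example.com/"
item = "article"
heading = "h2"
link = "h2 a"
summary = "p, blockquote"

# Feed 2: array form. All p elements are added first, then all
# blockquote elements, regardless of their order in the page.
[[feed]]
title = "Example (paragraphs first)"
filename = "example-paragraphs-first.rss"

[feed.config]
url = "https://example.com/"
item = "article"
heading = "h2"
link = "h2 a"
summary = ["p", "blockquote"]
```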
feed.config.date
The optional date key in the configuration can be a string or a table. If it’s a
string then it’s used as a CSS selector relative to item to find the element
containing the date, and rsspls will attempt to automatically parse the value.
If automatic parsing fails you can manually specify the format using the table
form of date, which looks like this:
[feed.config.date]
selector = "time" # required
type = "Date"
format = "[day padding:none]/[month padding:none]/[year]" # will parse 1/2/1934 style dates
If the element matched by the date selector is a <time> element then
rsspls will first try to parse the value in the datetime attribute if
present. If the attribute is missing or the element is not a time element
then rsspls will use the supplied format or attempt automatic parsing of the
text content of the element.
feed.config.date.selector
CSS selector relative to item that supplies the publication date of
the RSS item.
feed.config.date.type
Optional type of value being parsed. Either Date or DateTime.
Use Date when you want to parse just a date. Use DateTime if you’re
parsing a date and time with the format. Defaults to DateTime.
feed.config.date.format
A format description specifying how to parse the date, using the syntax described on this page: https://time-rs.github.io/book/api/format-description.html
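As a further sketch of the table form, the fragment below targets a page that writes dates as text such as "20 April 2022" in an element without a datetime attribute. The feed details and the ".date" selector are hypothetical, and the format string is an assumption based on the format description syntax linked above; check it against that page before relying on it.

```toml
[[feed]]
title = "Example Site"
filename = "example.rss"

[feed.config]
url = "https://example.com/"
item = "article"
heading = "h2"
link = "h2 a"

[feed.config.date]
# Hypothetical element, e.g. <span class="date">20 April 2022</span>
selector = ".date"
# The value has no time component, so parse it as a Date.
type = "Date"
# Day without padding, full month name, four digit year.
format = "[day padding:none] [month repr:long] [year]"
```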
feed.config.media
Optional CSS selector relative to item that supplies media content (audio,
video, image) to be added as an RSS enclosure.
Note: The media URL must be given by the src or href attribute of the
selected element.
Note: Currently if the item does not match the media selector then it will be skipped.
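A feed that uses media might be configured along these lines. This is a sketch that assumes each item on a hypothetical page contains an audio element with a src attribute; the title, filename, URL, and selectors are placeholders.

```toml
[[feed]]
title = "Example Podcast"
filename = "podcast.rss"

[feed.config]
url = "https://example.com/episodes/"
item = "article"
heading = "h2"
link = "h2 a"
# The selected element must carry the media URL in its src or href attribute.
# Items without a matching element are currently skipped.
media = "audio"
```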
Hosting, Updating, and Subscribing
In order to have the feeds update you will need to arrange for rsspls
to be run periodically. You might do this with cron, systemd
timers, or the Windows equivalent.
To subscribe to feeds you can run rsspls locally and use a feed reader that
supports local file feeds. More commonly, rsspls is expected to run on a web
server that serves the directory the feeds are written to.
Logging
rsspls logs messages to stderr. Logging can be controlled by the RSSPLS_LOG
environment variable. Log level and target module can be controlled
according to the env_logger documentation. For example, to enable
debug logging for rsspls you would use:
RSSPLS_LOG=rsspls=debug
The supported log levels are:
error
warn
info
debug
trace
off (disable logging)
The default log level is info.
Caveats & Error Handling
rsspls just fetches and parses the HTML of the web page you specify. It does
not run JavaScript. If the website is entirely generated by JavaScript (such as
Twitter) then rsspls will not work.
If errors are encountered processing the page due to invalid selectors or
missing elements, an error message will be logged. If the error is
non-recoverable rsspls will exit with a non-zero exit status.
If an error is encountered processing an item for the feed, a warning will be
logged and processing will continue with the next item. rsspls will still
exit with success (0) in this case.
Caching
When websites respond with cache headers rsspls will make a conditional
request on subsequent runs and will not regenerate the feed if the server
responds with 304 Not Modified. Cache data is stored in
$XDG_CACHE_HOME/rsspls, which defaults to ~/.cache/rsspls on UNIX-like
systems or C:\Users\You\AppData\Local\rsspls on Windows.