article-extractor

Crates.io	article-extractor
lib.rs	article-extractor
version	1.0.3
source	src
created_at	2024-10-20 17:56:31.963201
updated_at	2024-10-30 16:06:33.106164
description	Extract articles from HTML.
homepage
repository	https://github.com/rijkvp/article-extractor
max_upload_size
id	1416414
size	1,282,003

Rijk van Putten (rijkvp)

documentation

README

article-extractor

This is a non-async fork of article_scraper containing only the article extraction functionality (it does not support web crawling).

It contains two ways of extracting articles from HTML:

1. Rust implementation of Full-Text RSS

This approach uses website-specific extraction rules, which gives fast and accurate results. The disadvantages are that every website needs its own extraction rule, and each rule has to be updated whenever the website changes.

A central repository of extraction rules and information about writing your own rules can be found here: ftr-site-config. Please consider contributing new rules or updates to it.

article_scraper embeds all the rules in the ftr-site-config repository for convenience. Custom and updated rules can be loaded from a user_configs path.
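To make the relationship between embedded and custom rules concrete, here is a minimal sketch using only the standard library. It is not this crate's API: `SiteRule`, `parse_rule` and `lookup` are hypothetical names, the parser handles only the `title` and `body` keys of the `key: value` line format used by ftr-site-config files, and the assumption that a user-supplied rule takes precedence over an embedded one for the same host is mine, not something the README states.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified rule: only the XPath expressions for title and body.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct SiteRule {
    pub title: Vec<String>,
    pub body: Vec<String>,
}

/// Parse `key: value` lines in the style of an ftr-site-config file.
/// Only `title` and `body` are kept; `#` comments and other keys are skipped.
pub fn parse_rule(text: &str) -> SiteRule {
    let mut rule = SiteRule::default();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split at the first ':' only, so colons inside the XPath are preserved.
        if let Some((key, value)) = line.split_once(':') {
            let value = value.trim().to_string();
            match key.trim() {
                "title" => rule.title.push(value),
                "body" => rule.body.push(value),
                _ => {}
            }
        }
    }
    rule
}

/// Look up the rule for a host. Assumption: user rules shadow embedded ones.
pub fn lookup<'a>(
    host: &str,
    user: &'a HashMap<String, SiteRule>,
    embedded: &'a HashMap<String, SiteRule>,
) -> Option<&'a SiteRule> {
    user.get(host).or_else(|| embedded.get(host))
}
```

In this sketch both maps are keyed by hostname, so placing an updated rule for a host in the user map replaces the embedded rule for that host without touching any other.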

2. Mozilla Readability

If the ftr-config-based extraction fails, the Mozilla Readability algorithm is used as a fallback. This re-implementation tries to mimic the original as closely as possible.
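The fallback order described above can be sketched as follows. This is an illustration of the control flow only, not this crate's implementation: `Extractor`, `extract_with_fallback`, `by_rule` and `generic` are hypothetical names, and the two stand-in extractors use plain substring search rather than real rule-based extraction or Readability. Treating an empty result as a failure is also an assumption.

```rust
/// Hypothetical extractor signature: the article text, or None if nothing was found.
pub type Extractor = fn(&str) -> Option<String>;

/// Try the site-specific extractor first; fall back to the generic one
/// when it returns nothing or only whitespace (assumption).
pub fn extract_with_fallback(
    html: &str,
    rule_based: Extractor,
    readability: Extractor,
) -> Option<String> {
    rule_based(html)
        .filter(|text| !text.trim().is_empty())
        .or_else(|| readability(html))
}

/// Helper: the text between the first `open` and the next `close`.
fn between<'a>(html: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = html.find(open)? + open.len();
    let end = html[start..].find(close)? + start;
    Some(&html[start..end])
}

/// Stand-in for a site rule such as `body: //article`.
pub fn by_rule(html: &str) -> Option<String> {
    between(html, "<article>", "</article>").map(|s| s.trim().to_string())
}

/// Placeholder for the generic algorithm: takes the first paragraph.
/// The real Readability algorithm scores candidate nodes instead.
pub fn generic(html: &str) -> Option<String> {
    between(html, "<p>", "</p>").map(|s| s.trim().to_string())
}
```

Calling `extract_with_fallback(html, by_rule, generic)` uses the generic extractor only when the rule-based one finds nothing, which keeps the fast, site-specific path as the default.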

Commit count: 314
