| Crates.io | article-extractor |
| lib.rs | article-extractor |
| version | 1.0.4 |
| created_at | 2024-10-20 17:56:31.963201+00 |
| updated_at | 2025-05-28 18:02:58.278584+00 |
| description | Extract articles from HTML. |
| homepage | |
| repository | https://github.com/rijkvp/article-extractor |
| max_upload_size | |
| id | 1416414 |
| size | 1,308,277 |
This is a non-async fork of article_scraper containing only the article extraction functionality (it does not support web crawling).
It contains two ways of extracting articles from HTML:
The first makes use of website-specific extraction rules, which has the advantage of fast and accurate results. The disadvantages are that the config needs to be updated whenever a website changes, and that a new extraction rule is needed for every website.
A central repository of extraction rules, along with information about writing your own rules, can be found in the ftr-site-config repository. Please consider contributing new rules or updates to it.
article_scraper embeds all the rules in the ftr-site-config repository for convenience. Custom and updated rules can be loaded from a user_configs path.
The second is the Mozilla Readability algorithm, which is used as a fallback when the ftr-site-config based extraction fails. This re-implementation tries to mimic the original as closely as possible.
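
The "site rule first, generic fallback second" flow described above can be illustrated with a minimal, self-contained sketch. It does not use this crate's API: `SiteRule`, `Method`, `extract_with_rule`, `extract_generic`, and `extract` are all hypothetical names invented for illustration, the rule format is a simplified stand-in for real ftr-site-config rules, and the "longest paragraph" heuristic is only a crude placeholder for Readability.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified site rule: the markers that delimit the article body.
/// Real ftr-site-config rules are more expressive than this.
struct SiteRule {
    start: &'static str,
    end: &'static str,
}

/// Which strategy produced the result.
#[derive(Debug, PartialEq)]
enum Method {
    SiteRule,
    GenericFallback,
}

/// Apply a site-specific rule: return the trimmed text between the two markers,
/// or None if either marker is missing.
fn extract_with_rule(html: &str, rule: &SiteRule) -> Option<String> {
    let from = html.find(rule.start)? + rule.start.len();
    let len = html[from..].find(rule.end)?;
    Some(html[from..from + len].trim().to_string())
}

/// Crude placeholder for a generic algorithm (NOT Readability):
/// pick the longest non-empty `<p>...</p>` body.
fn extract_generic(html: &str) -> Option<String> {
    html.split("<p>")
        .skip(1) // everything before the first <p> is not paragraph content
        .filter_map(|chunk| chunk.split("</p>").next())
        .map(|p| p.trim())
        .filter(|p| !p.is_empty())
        .max_by_key(|p| p.len())
        .map(str::to_string)
}

/// Try the rule registered for `host` first; fall back to the generic
/// heuristic when no rule exists or the rule does not match.
fn extract(host: &str, html: &str, rules: &HashMap<&str, SiteRule>) -> Option<(Method, String)> {
    if let Some(text) = rules.get(host).and_then(|rule| extract_with_rule(html, rule)) {
        return Some((Method::SiteRule, text));
    }
    extract_generic(html).map(|text| (Method::GenericFallback, text))
}
```

The sketch returns the strategy alongside the text so a caller can tell when a site rule has stopped matching, which is the maintenance cost noted above for rule-based extraction.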