| Crates.io | readability-rust |
| lib.rs | readability-rust |
| version | 0.1.0 |
| created_at | 2025-07-22 17:46:55.72464+00 |
| updated_at | 2025-07-22 17:46:55.72464+00 |
| description | A Rust port of Mozilla's Readability library for extracting article content from web pages |
| homepage | |
| repository | https://github.com/dreampuf/readability-rust |
| max_upload_size | |
| id | 1763813 |
| size | 29,181,173 |
A Rust port of Mozilla's Readability.js library for extracting readable content from web pages.
This library provides functionality to parse HTML documents and extract the main article content, removing navigation, ads, and other clutter to present clean, readable text.
Add this to your Cargo.toml:
[dependencies]
readability-rust = "0.1.0"
use readability_rust::Readability;
let html = r#"
<!DOCTYPE html>
<html>
<head>
<title>Sample Article</title>
<meta name="author" content="John Doe">
</head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the main content of the article...</p>
<p>More substantial content here...</p>
</article>
<aside>Sidebar content to be removed</aside>
</body>
</html>
"#;
let mut parser = Readability::new(html, None).unwrap();
if let Some(article) = parser.parse() {
    println!("Title: {:?}", article.title);
    println!("Author: {:?}", article.byline);
    println!("Content: {:?}", article.content);
    println!("Text Length: {:?}", article.length);
}
use readability_rust::{Readability, ReadabilityOptions};
let options = ReadabilityOptions {
    debug: true,
    char_threshold: 250,
    keep_classes: true,
    ..Default::default()
};
let mut parser = Readability::new(html, Some(options)).unwrap();
let article = parser.parse();
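The `..Default::default()` line in the options example is Rust's struct update syntax: the fields written out explicitly are overridden, and every remaining field takes its value from the `Default` implementation. The sketch below shows the mechanism on a hypothetical stand-in struct, `DemoOptions`, that mirrors the four documented option names. Its field types and default values are assumptions (the 500-character default is borrowed from the CLI help text), not the crate's actual definition.

```rust
// Hypothetical stand-in for `ReadabilityOptions`; field types and default
// values are assumptions made for this illustration only.
#[derive(Debug, Clone, PartialEq)]
struct DemoOptions {
    debug: bool,
    char_threshold: usize,
    keep_classes: bool,
    disable_json_ld: bool,
}

impl Default for DemoOptions {
    fn default() -> Self {
        DemoOptions {
            debug: false,
            char_threshold: 500, // assumed: the CLI's documented default
            keep_classes: false,
            disable_json_ld: false,
        }
    }
}

/// Overrides three fields and inherits the rest from `Default`.
fn custom_options() -> DemoOptions {
    DemoOptions {
        debug: true,
        char_threshold: 250,
        keep_classes: true,
        ..Default::default()
    }
}

fn main() {
    let options = custom_options();
    println!("debug: {}", options.debug);
    println!("char_threshold: {}", options.char_threshold);
    println!("keep_classes: {}", options.keep_classes);
    // Not set explicitly, so it comes from `Default`.
    println!("disable_json_ld: {}", options.disable_json_ld);
}
```

Because unlisted fields fall back to their defaults, code written this way should keep compiling if a later release adds new options, provided the struct's fields stay public.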
use readability_rust::is_probably_readerable;
let html = "<html><body><p>Short content</p></body></html>";
if is_probably_readerable(html, None) {
    println!("This page likely contains readable content");
} else {
    println!("This page may not have substantial content");
}
The library includes a command-line tool for processing HTML files:
cargo install readability-rust
# Process a local HTML file
readability-rust -i article.html
# Process from stdin
cat article.html | readability-rust
# Output as JSON
readability-rust -i article.html -f json
# Output as plain text
readability-rust -i article.html -f text
# Check if content is readable
readability-rust -i article.html --check
# Debug mode with verbose output
readability-rust -i article.html --debug
Usage: readability-rust [OPTIONS]

Options:
  -i, --input <FILE>        Input HTML file (use '-' for stdin)
  -o, --output <FILE>       Output file (default: stdout)
  -f, --format <FORMAT>     Output format [default: json] [possible values: json, text, html]
      --base-uri <URI>      Base URI for resolving relative URLs
      --debug               Enable debug output
      --check               Only check if content is readable
      --char-threshold <N>  Minimum character threshold [default: 500]
      --keep-classes        Keep CSS classes in output
      --disable-json-ld     Disable JSON-LD parsing
  -h, --help                Print help
  -V, --version             Print version
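To call the tool from a Rust program rather than from a shell, the documented flags can be passed through `std::process::Command`. This is a minimal sketch: `build_command` is a hypothetical helper, and running it assumes the `readability-rust` binary has been installed with `cargo install` and is on `PATH`.

```rust
use std::process::Command;

/// Builds (but does not run) a `readability-rust` invocation using only
/// flags listed in the CLI help text. `build_command` is a hypothetical
/// helper for this example, not part of the crate.
fn build_command(input: &str, format: &str, char_threshold: u32) -> Command {
    let mut cmd = Command::new("readability-rust");
    cmd.arg("-i")
        .arg(input)
        .arg("-f")
        .arg(format)
        .arg("--char-threshold")
        .arg(char_threshold.to_string());
    cmd
}

fn main() {
    let mut cmd = build_command("article.html", "text", 250);
    // Spawning fails cleanly if the binary is not installed.
    match cmd.output() {
        Ok(out) if out.status.success() => {
            print!("{}", String::from_utf8_lossy(&out.stdout));
        }
        Ok(out) => eprintln!("readability-rust exited with {}", out.status),
        Err(err) => eprintln!("could not start readability-rust: {}", err),
    }
}
```

Passing each flag and value as a separate argument avoids shell quoting problems with file names that contain spaces.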
Readability
The main parser struct for extracting content from HTML documents.

ReadabilityOptions
Configuration options for customizing parsing behavior:
- debug: Enable debug logging
- char_threshold: Minimum character count for content
- keep_classes: Preserve CSS classes in output
- disable_json_ld: Skip JSON-LD metadata parsing

Article
Represents extracted article content:
- title: Article title
- content: Cleaned HTML content
- text_content: Plain text content
- length: Content length in characters
- byline: Author information
- excerpt: Article excerpt/description
- site_name: Site name
- lang: Content language
- published_time: Publication date

is_probably_readerable(html: &str, options: Option<ReadabilityOptions>) -> bool
Determines if an HTML document likely contains readable content.
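For intuition about what a readability check of this kind measures: Mozilla's Readability.js (`isProbablyReaderable`) walks candidate nodes such as `<p>`, `<pre>` and `<article>`, ignores any whose trimmed text is shorter than a minimum length, and adds the square root of the excess length to a running score until it passes a threshold. The sketch below reproduces only that scoring rule on precomputed text lengths. `looks_readerable` is a hypothetical name, the constants 140 and 20 are the Readability.js defaults, and whether this crate uses the same rule and constants internally is an assumption.

```rust
/// Simplified sketch of the scoring rule in Readability.js's
/// `isProbablyReaderable`. `text_lengths` holds the trimmed text length of
/// each candidate node. Visibility and class-name filtering are omitted.
fn looks_readerable(text_lengths: &[usize], min_content_length: usize, min_score: f64) -> bool {
    let mut score = 0.0_f64;
    for &len in text_lengths {
        // Nodes shorter than the minimum contribute nothing.
        if len < min_content_length {
            continue;
        }
        score += ((len - min_content_length) as f64).sqrt();
        if score > min_score {
            return true;
        }
    }
    false
}

fn main() {
    // Readability.js defaults: minContentLength = 140, minScore = 20.
    let short_page = [13]; // "Short content" is 13 characters long
    let long_page = [400, 350, 600];
    println!("short page: {}", looks_readerable(&short_page, 140, 20.0)); // false
    println!("long page: {}", looks_readerable(&long_page, 140, 20.0)); // true
}
```

Under this rule a single 13-character paragraph can never pass, while two paragraphs of 400 and 350 characters already do, since sqrt(260) + sqrt(210) is roughly 30.6.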
This implementation follows Mozilla's Readability.js algorithm:
The library includes comprehensive tests covering:
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test categories
cargo test test_article_parsing
cargo test test_metadata_extraction
cargo test test_readability_assessment
This project includes the original Mozilla Readability.js library as a submodule for reference:
# Initialize the submodule
git submodule update --init --recursive
# View the original JavaScript implementation
ls mozilla-readability/
The original implementation can be found at: https://github.com/mozilla/readability
The Rust implementation provides significant performance benefits:
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
git clone https://github.com/dreampuf/readability-rust.git
cd readability-rust
git submodule update --init --recursive
cargo build
cargo test
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
The original Mozilla Readability.js library is also licensed under Apache License 2.0.