| | |
|---|---|
| Crates.io | dom-content-extraction |
| lib.rs | dom-content-extraction |
| version | 0.3.15 |
| created_at | 2023-03-25 06:56:05.164269+00 |
| updated_at | 2025-09-23 04:58:13.882493+00 |
| description | Rust implementation of Content extraction via text density paper |
| homepage | https://github.com/oiwn/dom-content-extraction |
| repository | https://github.com/oiwn/dom-content-extraction |
| max_upload_size | |
| id | 819964 |
| size | 216,335 |
A Rust library for extracting the main content from web pages using text density analysis. It implements the Content Extraction via Text Density (CETD) algorithm described in the paper of the same name by Fei Sun, Dandan Song, and Lejian Liao.
Web pages often contain a lot of peripheral content such as navigation menus, advertisements, footers, and sidebars, which makes it challenging to extract just the main content programmatically. This library addresses the problem by scoring DOM nodes by their text density and extracting the text-rich regions.
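To make the idea concrete, here is a deliberately simplified sketch of the char-to-tag density heuristic from the CETD paper. This is only an illustration of the core idea, not the crate's actual implementation:

```rust
// Simplified illustration of the CETD idea -- NOT the crate's internals.
// A node's text density is roughly the amount of text it contains divided
// by the number of tags under it, so dense article paragraphs score high
// and link-heavy navigation scores low.
fn text_density(char_count: usize, tag_count: usize) -> f64 {
    // Treat a tag count of zero as one to avoid division by zero.
    char_count as f64 / tag_count.max(1) as f64
}

fn main() {
    println!("article: {:.1}", text_density(800, 4)); // 200.0 -- keep
    println!("nav:     {:.1}", text_density(40, 10)); //   4.0 -- discard
}
```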
DOM Content Extraction includes Unicode support for handling multilingual content, ensuring accurate extraction from web pages in any language.
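For instance, a sketch using the same `get_content` API as the basic example below (a toy page; real-world pages need more text for the density statistics to be meaningful):

```rust
use dom_content_extraction::{get_content, scraper::Html};

fn main() {
    // Mixed-script page: density is computed over Unicode text, so
    // Cyrillic, CJK, Arabic, etc. are handled like any other content.
    let html = r#"<!DOCTYPE html><html><body>
        <nav>Меню | 메뉴 | القائمة</nav>
        <main><p>Пример основного текста страницы. Абзац достаточно
        длинный, чтобы набрать заметную текстовую плотность и быть
        распознанным как основное содержимое.</p></main>
    </body></html>"#;

    let document = Html::parse_document(html);
    println!("{}", get_content(&document).unwrap());
}
```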
MSRV is 1.85 due to the 2024 edition. Living on the edge!
Basic usage example:
```rust
use scraper::Html;
use dom_content_extraction::get_content;

fn main() {
    // `r##"..."##` is needed because the HTML contains `"#"`,
    // which would terminate a plain `r#"..."#` raw string.
    let html = r##"<!DOCTYPE html><html><body>
        <nav>Home | About</nav>
        <main>
            <article>
                <h1>Main Article</h1>
                <p>This is the primary content that contains enough text to maintain proper density metrics. The paragraph needs sufficient length to establish text-to-link ratio.</p>
                <p>Second paragraph adds more textual density to ensure the content extraction algorithm works correctly.</p>
                <a href="#">Related link</a>
            </article>
        </main>
        <footer>Copyright 2024</footer>
    </body></html>"##;

    let document = Html::parse_document(html);
    let content = get_content(&document).unwrap();
    println!("{}", content);
}
```
Add it with:

```bash
cargo add dom-content-extraction
```

or add it to your Cargo.toml:

```toml
dom-content-extraction = "0.3"
```

To enable markdown output support:

```toml
dom-content-extraction = { version = "0.3", features = ["markdown"] }
```
Read the docs: [dom-content-extraction documentation](https://docs.rs/dom-content-extraction)
```rust
use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html};

let html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>";
let document = Html::parse_document(html);

let mut dtree = DensityTree::from_document(&document)?;
dtree.calculate_density_sum()?;

// Extract as markdown
let markdown = extract_content_as_markdown(&dtree, &document)?;
println!("{}", markdown);
# Ok::<(), dom_content_extraction::DomExtractionError>(())
```
Check the examples.

This one extracts content from a generated "lorem ipsum" page:

```bash
cargo run --example check -- lorem-ipsum
```

This one prints the node with the highest density:

```bash
cargo run --example check -- test4
```

Extract content as markdown from the lorem ipsum page (requires the `markdown` feature):

```bash
cargo run --example check -- lorem-ipsum-markdown
```
There is a scoring example in which I'm trying to implement scoring. You will need to download the GoldenStandard and finalrun-input datasets from https://sigwac.org.uk/cleaneval/ and unpack the archives into the `data/` directory.

```bash
cargo run --example ce_score
```
As far as I can see, there is a problem opening some files:

```text
Error processing file 730: Failed to read file: "data/finalrun-input/730.html"

Caused by:
    stream did not contain valid UTF-8
```
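The failing files are presumably stored in legacy (non-UTF-8) encodings. One possible workaround, not currently part of the example, is to read the bytes and convert lossily:

```rust
use std::fs;
use std::io;

// Workaround sketch (an assumption, not the example's code): read raw
// bytes and convert lossily, replacing invalid sequences with U+FFFD
// instead of failing. A proper fix would detect the source encoding,
// e.g. with the `encoding_rs` crate.
fn read_html_lossy(path: &str) -> io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}
```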
But overall, extraction works pretty well:

```text
Overall Performance:
  Files processed: 370
  Average Precision: 0.87
  Average Recall: 0.82
  Average F1 Score: 0.75
```
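These metrics presumably compare extracted text against the gold standard at the token level. A minimal sketch of that kind of scoring (an illustrative guess, not `ce_score`'s actual code):

```rust
use std::collections::HashSet;

// Token-level precision/recall/F1 between extracted and gold text,
// computed over sets of whitespace-separated tokens.
fn f1_score(extracted: &str, gold: &str) -> (f64, f64, f64) {
    let ext: HashSet<&str> = extracted.split_whitespace().collect();
    let gld: HashSet<&str> = gold.split_whitespace().collect();
    let overlap = ext.intersection(&gld).count() as f64;

    let precision = overlap / ext.len().max(1) as f64;
    let recall = overlap / gld.len().max(1) as f64;
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```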
The crate includes a command-line binary tool, `dce` (DOM Content Extraction), for extracting the main content from HTML documents. It supports both local files and remote URLs as input sources.

The binary is included by default. You can install it using cargo:

```bash
cargo install dom-content-extraction
```
```text
dce [OPTIONS]

Options:
  -u, --url <URL>        URL to fetch HTML content from
  -f, --file <FILE>      Local HTML file to process
  -o, --output <FILE>    Output file (stdout if not specified)
      --format <FORMAT>  Output format [default: text] [possible values: text, markdown]
  -h, --help             Print help
  -V, --version          Print version
```

Note: Either `--url` or `--file` must be specified, but not both.
To extract content in markdown format, use the `--format markdown` option:

```bash
# Extract as markdown from URL
cargo run --bin dce -- --url "https://example.com" --format markdown

# Extract as markdown from file and save to output
cargo run --bin dce -- --file input.html --format markdown --output content.md
```
Note: Markdown output requires the `markdown` feature to be enabled.

Extract content from a URL and print to stdout:

```bash
dce --url "https://example.com/article"
```

Process a local HTML file and save the result to an output file:

```bash
dce --file input.html --output extracted.txt
```

Extract from a URL and save directly to a file:

```bash
dce --url "https://example.com/page" --output content.txt
```
The binary functionality requires the following additional dependencies:

- `clap`: command-line argument parsing
- `reqwest`: HTTP client for URL fetching
- `tempfile`: temporary file management
- `url`: URL parsing and validation
- `anyhow`: error handling
- `htmd`: HTML to markdown conversion (for the markdown feature)

These dependencies are only included when building with the default `cli` feature. The `markdown` feature requires the `htmd` dependency.
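If you only need the library API, these extra dependencies can presumably be skipped by disabling default features (assuming `cli` is the crate's only default feature, as described above):

```toml
[dependencies]
dom-content-extraction = { version = "0.3", default-features = false }
```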