Crates.io | dom-content-extraction |
lib.rs | dom-content-extraction |
version | 0.3.4 |
source | src |
created_at | 2023-03-25 06:56:05.164269 |
updated_at | 2024-11-14 07:39:48.276252 |
description | Rust implementation of Content extraction via text density paper |
homepage | https://github.com/oiwn/dom-content-extraction |
repository | https://github.com/oiwn/dom-content-extraction |
max_upload_size | |
id | 819964 |
size | 120,242 |
A Rust library for extracting main content from web pages using text density analysis. This is an implementation of the Content Extraction via Text Density (CETD) algorithm described in the paper by Fei Sun, Dandan Song and Lejian Liao:
Content Extraction via Text Density.
Web pages often contain a lot of peripheral content such as navigation menus, advertisements, footers, and sidebars, which makes it hard to extract just the main content programmatically. This library addresses that by scoring DOM nodes by text density and extracting the densest regions as the main content.
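The core idea of CETD is to give every DOM node a text density score. As a rough paraphrase of the paper (my summary, not the library's exact formula; the paper also defines a composite density that accounts for link text):

TD_i = C_i / T_i

where C_i is the number of text characters and T_i is the number of tags in the subtree rooted at node i. Content blocks pack many characters into few tags and score high; menus, footers, and link lists score low.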
Basic usage example:
use dom_content_extraction::{DensityTree, get_node_text};
use scraper::Html;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse the page with scraper; any scraper::Html document works here
    let html = "<html><body><nav>menu</nav><article><p>The main article text lives here.</p></article></body></html>";
    let document = Html::parse_document(html);
    let mut dtree = DensityTree::from_document(&document)?;

    // Get nodes sorted by text density
    let sorted_nodes = dtree.sorted_nodes();
    let densest_node = sorted_nodes.last().unwrap();

    // Extract text from the node with the highest density
    println!("{}", get_node_text(densest_node.node_id, &document)?);

    // For more accurate content extraction, compute density sums first
    dtree.calculate_density_sum()?;
    let main_content = dtree.extract_content(&document)?;
    println!("{}", main_content);
    Ok(())
}
Add it with:
cargo add dom-content-extraction
or add it to your Cargo.toml:
dom-content-extraction = "0.3"
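The usage example above also imports scraper::Html directly, so your dependencies section might look roughly like this (version numbers are illustrative, not pinned requirements):

[dependencies]
dom-content-extraction = "0.3"
scraper = "0.20"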
Read the docs!
dom-content-extraction documentation
Check the examples. This one extracts content from a generated "lorem ipsum" page:
cargo run --example check -- lorem-ipsum
There is also a scoring example, where I'm trying to implement scoring against CleanEval. You will need to download the GoldenStandard and finalrun-input datasets from:
https://sigwac.org.uk/cleaneval/
and unpack the archives into the data/ directory.
cargo run --example ce_score
As far as I can see, there is a problem opening some files:
Error processing file 730: Failed to read file: "data/finalrun-input/730.html"
Caused by:
stream did not contain valid UTF-8
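One possible workaround (just a sketch on my side, not something the example currently does) is to read the raw bytes and convert them lossily, rather than relying on std::fs::read_to_string, which fails on invalid UTF-8:

use std::fs;

// Read a file that may contain invalid UTF-8 and replace bad sequences
// with U+FFFD instead of returning an error.
fn read_lossy(path: &str) -> std::io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() -> std::io::Result<()> {
    // The path comes from the error message above.
    let html = read_lossy("data/finalrun-input/730.html")?;
    println!("read {} characters", html.chars().count());
    Ok(())
}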
But overall extraction works pretty well:
Overall Performance:
Files processed: 370
Average Precision: 0.87
Average Recall: 0.82
Average F1 Score: 0.75
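For context, precision, recall, and F1 here are standard retrieval-style metrics comparing the extracted text to the gold standard. A simplified token-level sketch (my own illustration; the actual ce_score example may compute these differently) looks like this:

use std::collections::HashSet;

// Token-level precision/recall/F1 between extracted text and a gold standard.
// Set-based overlap is a simplification: duplicate tokens are only counted once.
fn score(extracted: &str, gold: &str) -> (f64, f64, f64) {
    let ex: HashSet<&str> = extracted.split_whitespace().collect();
    let gd: HashSet<&str> = gold.split_whitespace().collect();
    let overlap = ex.intersection(&gd).count() as f64;

    let precision = if ex.is_empty() { 0.0 } else { overlap / ex.len() as f64 };
    let recall = if gd.is_empty() { 0.0 } else { overlap / gd.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}

fn main() {
    let (p, r, f1) = score("the main article text", "here is the main article text");
    println!("precision={p:.2} recall={r:.2} f1={f1:.2}");
}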