Crates.io | dom-content-extraction |
lib.rs | dom-content-extraction |
version | 0.3.2 |
source | src |
created_at | 2023-03-25 06:56:05.164269 |
updated_at | 2024-08-14 10:22:38.041631 |
description | Rust implementation of Content extraction via text density paper |
homepage | https://github.com/oiwn/dom-content-extraction |
repository | https://github.com/oiwn/dom-content-extraction |
id | 819964 |
size | 118,898 |
A Rust implementation of the paper by Fei Sun, Dandan Song and Lejian Liao:
Content Extraction via Text Density (CETD)
use dom_content_extraction::{DensityTree, get_node_text};
// Build the density tree from a parsed page.
let dtree = DensityTree::from_document(&document); // &scraper::Html
// Print the text of the last node in the sorted list.
let sorted_nodes = dtree.sorted_nodes();
let node_id = sorted_nodes.last().unwrap().node_id;
println!("{}", get_node_text(node_id, &document));
// Compute the density sums, then extract the main content.
dtree.calculate_density_sum();
let extracted_content = dtree.extract_content(&document);
println!("{}", extracted_content);
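To give a feel for what "text density" and "density sum" mean, here is a minimal sketch on a toy in-memory tree. It is not the crate's code: the `Node` type, the `leaf` helper and all method names are made up for illustration. It assumes the simple definition of text density (characters in a subtree divided by the number of tags in that subtree, with a tag count of 0 treated as 1) and counts only descendant tags; the composite density from the paper, which also accounts for link text, is left out.

```rust
// Toy DOM node. Hypothetical type, not part of dom-content-extraction.
struct Node {
    tag: &'static str,
    text: &'static str, // text owned directly by this node
    children: Vec<Node>,
}

impl Node {
    // Number of characters in the whole subtree.
    fn char_count(&self) -> usize {
        self.text.chars().count()
            + self.children.iter().map(Node::char_count).sum::<usize>()
    }

    // Number of tags below this node (assumption: descendants only).
    fn tag_count(&self) -> usize {
        self.children.iter().map(|c| 1 + c.tag_count()).sum()
    }

    // Text density: characters per tag, treating a tag count of 0 as 1.
    fn text_density(&self) -> f64 {
        self.char_count() as f64 / self.tag_count().max(1) as f64
    }

    // Density sum: sum of the text densities of the direct children.
    fn density_sum(&self) -> f64 {
        self.children.iter().map(Node::text_density).sum()
    }
}

fn leaf(tag: &'static str, text: &'static str) -> Node {
    Node { tag, text, children: Vec::new() }
}

fn main() {
    let article = Node {
        tag: "div",
        text: "",
        children: vec![
            leaf("p", "A long paragraph of real article text goes here."),
            leaf("p", "Another paragraph with plenty of characters in it."),
        ],
    };
    let nav = Node {
        tag: "ul",
        text: "",
        children: vec![leaf("li", "Home"), leaf("li", "About"), leaf("li", "Blog")],
    };
    for node in [&article, &nav].iter() {
        println!(
            "{}: density {:.1}, density sum {:.1}",
            node.tag,
            node.text_density(),
            node.density_sum()
        );
    }
}
```

The article-like block ends up with a much higher density and density sum than the navigation list, which is the signal a density-based extractor relies on.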
See the examples.
This one extracts content from a generated "lorem ipsum" page:
cargo run --example check -- lorem-ipsum
There is also a scoring example (a work in progress). You will need to download the GoldenStandard and finalrun-input datasets from:
https://sigwac.org.uk/cleaneval/
and unpack the archives into the data/ directory.
cargo run --example ce_score
As far as I can see, there is a problem opening some files:
Error processing file 730: Failed to read file: "data/finalrun-input/730.html"
Caused by:
stream did not contain valid UTF-8
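"stream did not contain valid UTF-8" is the error the Rust standard library returns when a file is read as a `String` but its bytes are not valid UTF-8. One possible workaround, sketched below, is to read the raw bytes and decode them lossily. This is only an assumption about how such files could be handled, not what the example currently does, and `read_html_lossy` is a hypothetical helper name. Lossy decoding replaces invalid bytes with U+FFFD, so text in another encoding is not recovered, only tolerated.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical helper (not part of dom-content-extraction): read a file as
// text even when it is not valid UTF-8. Invalid byte sequences are replaced
// with U+FFFD instead of causing an error.
fn read_html_lossy(path: &Path) -> io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() -> io::Result<()> {
    // Write a small file containing the byte 0xE9, which is not valid UTF-8 here.
    let path = std::env::temp_dir().join("lossy_demo.html");
    fs::write(&path, b"<p>caf\xE9</p>")?;

    // Reading it as a String fails...
    assert!(fs::read_to_string(&path).is_err());

    // ...while the lossy reader succeeds.
    let html = read_html_lossy(&path)?;
    println!("{}", html);

    fs::remove_file(&path)?;
    Ok(())
}
```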
But overall extraction works pretty well:
Overall Performance:
Files processed: 370
Average Precision: 0.87
Average Recall: 0.82
Average F1 Score: 0.75
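The average F1 above (0.75) is lower than the harmonic mean of the average precision and recall (about 0.84). One possible explanation, which is an assumption and not something confirmed by the example's code, is that F1 is computed per file and then averaged. The sketch below shows how the two numbers can differ. The `score` function is a hypothetical word-set scorer written for this illustration, and the three "files" are made-up strings.

```rust
use std::collections::HashSet;

// Hypothetical scorer: precision, recall and F1 of the extracted text
// against a reference text, using sets of whitespace-separated words.
fn score(extracted: &str, reference: &str) -> (f64, f64, f64) {
    let ext: HashSet<&str> = extracted.split_whitespace().collect();
    let gold: HashSet<&str> = reference.split_whitespace().collect();
    let overlap = ext.intersection(&gold).count() as f64;

    let precision = if ext.is_empty() { 0.0 } else { overlap / ext.len() as f64 };
    let recall = if gold.is_empty() { 0.0 } else { overlap / gold.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}

fn main() {
    // (extracted, reference) pairs standing in for three files.
    let files = [
        ("a b c d", "a b c d"), // exact match
        ("a x y z", "a"),       // extra words: low precision
        ("a", "a b c d"),       // missing words: low recall
    ];
    let scores: Vec<(f64, f64, f64)> = files.iter().map(|(e, r)| score(e, r)).collect();
    let n = scores.len() as f64;

    let avg_p = scores.iter().map(|s| s.0).sum::<f64>() / n;
    let avg_r = scores.iter().map(|s| s.1).sum::<f64>() / n;
    let avg_f1 = scores.iter().map(|s| s.2).sum::<f64>() / n;
    let f1_of_avgs = 2.0 * avg_p * avg_r / (avg_p + avg_r);

    println!("average precision: {:.2}", avg_p);
    println!("average recall:    {:.2}", avg_r);
    println!("average of per-file F1: {:.2}", avg_f1);
    println!("F1 of the averages:     {:.2}", f1_of_avgs);
}
```

Files that score well on only one of precision or recall pull the per-file F1 average down, while still contributing a high value to one of the other two averages.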