dom-content-extraction

Crates.io: dom-content-extraction
lib.rs: dom-content-extraction
version: 0.3.2
source: src
created_at: 2023-03-25 06:56:05.164269
updated_at: 2024-08-14 10:22:38.041631
description: Rust implementation of the "Content Extraction via Text Density" paper
homepage: https://github.com/oiwn/dom-content-extraction
repository: https://github.com/oiwn/dom-content-extraction
id: 819964
size: 118,898
oiwn (oiwn)

documentation

https://docs.rs/dom-content-extraction/latest/dom_content_extraction/

README

dom-content-extraction

A Rust implementation of the paper by Fei Sun, Dandan Song and Lejian Liao:

Content Extraction via Text Density (CETD)

use dom_content_extraction::{DensityTree, get_node_text};

let dtree = DensityTree::from_document(&document); // &scraper::Html 
let sorted_nodes = dtree.sorted_nodes();
let node_id = sorted_nodes.last().unwrap().node_id;

println!("{}", get_node_text(node_id, &document));

dtree.calculate_density_sum();
let extracted_content = dtree.extract_content(&document);

println!("{}", extracted_content);
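
For intuition about what the density tree measures, here is a minimal standard-library sketch of the basic text density idea from the paper: the number of characters in a subtree divided by the number of tags in it, with the density sum of a node being the sum of its children's densities. The `Node` type and its methods are hypothetical and exist only for illustration; they are not the crate's `DensityTree` API, and the sketch leaves out the paper's link-aware composite density.

```rust
/// A toy DOM node: an element with its own text and child elements.
/// Hypothetical type for illustration; not the crate's `DensityTree`.
struct Node {
    text: String,
    children: Vec<Node>,
}

impl Node {
    /// Characters in this subtree (own text plus all descendants).
    fn char_count(&self) -> usize {
        self.text.chars().count()
            + self.children.iter().map(Node::char_count).sum::<usize>()
    }

    /// Tags in this subtree, counted as the number of descendant elements.
    fn tag_count(&self) -> usize {
        self.children.iter().map(|c| 1 + c.tag_count()).sum()
    }

    /// Basic text density: characters per tag. A tag count of zero is
    /// treated as one so that leaf nodes do not divide by zero.
    fn density(&self) -> f64 {
        self.char_count() as f64 / self.tag_count().max(1) as f64
    }

    /// Density sum: the sum of the densities of the direct children.
    fn density_sum(&self) -> f64 {
        self.children.iter().map(Node::density).sum()
    }
}
```

Under these assumptions, a node whose children hold long runs of text with few nested tags gets a high density sum, which is why the usage above looks at the highest-ranked node.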

Run examples

Check examples.

This one extracts content from a generated "lorem ipsum" page:

cargo run --example check -- lorem-ipsum 

There is also a scoring example, where I'm trying to implement scoring. You will need to download the GoldenStandard and finalrun-input datasets from:

https://sigwac.org.uk/cleaneval/

and unpack the archives into the data/ directory.

cargo run --example ce_score

As far as I can see, there is a problem opening some files:

Error processing file 730: Failed to read file: "data/finalrun-input/730.html"

Caused by:
    stream did not contain valid UTF-8
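
One possible workaround, sketched here with the standard library only, is to read the raw bytes and decode them lossily, so that invalid sequences become the replacement character U+FFFD instead of aborting the read. The helper names `decode_lossy` and `read_html_lossy` are hypothetical and not part of the crate or its examples. If the affected files use a legacy encoding, proper encoding detection would preserve more of the text than this does.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Decode raw bytes as UTF-8, replacing invalid sequences with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Read a file that may not be valid UTF-8 (hypothetical helper).
/// I/O errors such as a missing file are still returned to the caller.
fn read_html_lossy(path: &Path) -> io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(decode_lossy(&bytes))
}
```

The resulting `String` could then be parsed as HTML in the usual way; characters that failed to decode will show up as U+FFFD in the extracted text.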

But overall extraction works pretty well:

Overall Performance:
  Files processed: 370
  Average Precision: 0.87
  Average Recall: 0.82
  Average F1 Score: 0.75  
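
To make the three numbers concrete, here is a minimal sketch of per-document precision, recall and F1, averaged over a set of documents. It compares sets of whitespace-separated tokens, which is an assumption made for illustration; the `ce_score` example may measure overlap differently, and the function names `score` and `average` are hypothetical.

```rust
use std::collections::HashSet;

/// Precision, recall and F1 for one document, comparing sets of
/// whitespace-separated tokens. Hypothetical metric for illustration.
fn score(extracted: &str, expected: &str) -> (f64, f64, f64) {
    let ext: HashSet<&str> = extracted.split_whitespace().collect();
    let exp: HashSet<&str> = expected.split_whitespace().collect();
    let overlap = ext.intersection(&exp).count() as f64;

    // Guard against empty inputs so we never divide by zero.
    let precision = if ext.is_empty() { 0.0 } else { overlap / ext.len() as f64 };
    let recall = if exp.is_empty() { 0.0 } else { overlap / exp.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}

/// Average each metric separately over all documents.
fn average(scores: &[(f64, f64, f64)]) -> (f64, f64, f64) {
    let n = scores.len().max(1) as f64;
    let (p, r, f) = scores
        .iter()
        .fold((0.0_f64, 0.0_f64, 0.0_f64), |acc, s| {
            (acc.0 + s.0, acc.1 + s.1, acc.2 + s.2)
        });
    (p / n, r / n, f / n)
}
```

Because F1 is computed per document and then averaged, the average F1 can be lower than both the average precision and the average recall: a document with high precision but low recall and another with the opposite profile each get a mediocre F1, while the two averages still look good.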

Read the documentation on docs.rs.

Desired features

  • implement proper scoring
  • create a real-world dataset
  • improve the algorithm