dom-content-extraction

Crates.io	dom-content-extraction
lib.rs	dom-content-extraction
version	0.3.4
source	src
created_at	2023-03-25 06:56:05.164269
updated_at	2024-11-14 07:39:48.276252
description	Rust implementation of Content extraction via text density paper
homepage	https://github.com/oiwn/dom-content-extraction
repository	https://github.com/oiwn/dom-content-extraction
max_upload_size
id	819964
size	120,242

oiwn (oiwn)

documentation

https://docs.rs/dom-content-extraction/latest/dom_content_extraction/

README

dom-content-extraction

A Rust library for extracting main content from web pages using text density analysis. This is an implementation of the Content Extraction via Text Density (CETD) algorithm described in the paper by Fei Sun, Dandan Song and Lejian Liao:

Content Extraction via Text Density.

What Problem Does This Solve?

Web pages often contain a lot of peripheral content like navigation menus, advertisements, footers, and sidebars. This makes it challenging to extract just the main content programmatically. This library helps solve this problem by:

Analyzing the text density patterns in HTML documents
Identifying content-rich sections versus navigational/peripheral elements
Extracting the main content while filtering out noise
Handling various HTML layouts and structures

Key Features

Build a density tree representing text distribution in the HTML document
Calculate composite text density using multiple metrics
Extract main content blocks based on density patterns
Support for nested HTML structures
Efficient processing of large documents
Error handling for malformed HTML

Usage

Basic usage example:

use dom_content_extraction::{DensityTree, get_node_text};

let dtree = DensityTree::from_document(&document)?; // Takes a scraper::Html document

// Get nodes sorted by text density
let sorted_nodes = dtree.sorted_nodes();
let densest_node = sorted_nodes.last().unwrap();

// Extract text from the node with highest density
println!("{}", get_node_text(densest_node.node_id, &document)?);

// For more accurate content extraction:
dtree.calculate_density_sum()?;
let main_content = dtree.extract_content(&document)?;
println!("{}", main_content);

Installation

Add it it with:

cargo add dom-content-extraction

or add to you Cargo.toml

dom-content-extraction = "0.3"

Documentation

Read the docs!

dom-content-extraction documentation

Run examples

Check examples.

This one will extract content from generated "lorem ipsum" page

cargo run --example check -- lorem-ipsum

There is scoring example i'm trying to implement scoring. You will need to download GoldenStandard and finalrun-input datasets from:

https://sigwac.org.uk/cleaneval/

and unpack archives into data/ directory.

cargo run --example ce_score

As far as i see there is problem opening some files:

Error processing file 730: Failed to read file: "data/finalrun-input/730.html"

Caused by:
    stream did not contain valid UTF-8

But overall extraction works pretty well:

Overall Performance:
  Files processed: 370
  Average Precision: 0.87
  Average Recall: 0.82
  Average F1 Score: 0.75

Read documentation on docs.rs

Desired features

implement normal scoring
create real world dataset

Commit count: 93

dom-content-extraction

documentation

README

dom-content-extraction

What Problem Does This Solve?

Key Features

Usage

Installation

Documentation

Run examples

Desired features

cargo fmt