mlscraper-rust

Crates.io	mlscraper-rust
lib.rs	mlscraper-rust
version	0.1.2
created_at	2023-05-22 13:28:10.524046+00
updated_at	2023-05-23 20:51:50.73527+00
description	Scrape structured data from HTML documents automatically
homepage
repository	https://github.com/hilbigan/mlscraper-rust
max_upload_size
id	870678
size	408,171

Aaron Hilbig (hilbigan)

documentation

README

mlscraper-rust

Generate CSS selectors for web scraping automagically

This project is inspired by the python package mlscraper, but uses a different, more scalable and configurable approach to achieve equally good results.

Example

This is a small example (the same as given by mlscraper) to demonstrate how mlscraper-rust generates short CSS selectors automatically.

You can run this example by running cargo run --release --example small in this directory.

All we have to do is to tell mlscraper-rust what values we expect to extract from the web page...

    let html = reqwest::blocking::get("http://quotes.toscrape.com/author/Albert-Einstein/")
        .expect("request") // Scrappy error handling for demonstration purposes
        .text()
        .expect("text");

    let result = mlscraper_rust::train(
        vec![html.as_str()],
        vec![
            AttributeBuilder::new("name")
                .values(&[Some("Albert Einstein")])
                .build(),

            AttributeBuilder::new("born")
                .values(&[Some("March 14, 1879")])
                .build(),
        ],
        Default::default(),
        1
    ).expect("training");

    println!("{:?}", result.selectors());

... and it outputs the best (i.e. most concise) selectors it was able to find:

{"born": .author-born-date, "name": h3}

We can now use the trained result object to scrape similar pages:

    let html = reqwest::blocking::get("http://quotes.toscrape.com/author/J-K-Rowling")
        .expect("request")
        .text()
        .expect("text");

    let dom = result.parse(&html)
        .expect("parse");

    result.attributes()
        .for_each(|attr| {
            println!("{attr}: {:?}", result.get_value(&dom, attr).ok().flatten())
        })

This prints:

born: Some("July 31, 1965")
name: Some("J.K. Rowling")

As with the original mlscraper, mlscraper-rust unleashes its full potential when providing multiple input files and multiple attribute values, for example:

    // ------- 8< ---------------------
    // ... excerpt from examples/big.rs
    let result = train(
        // Multiple input documents
        htmls.iter().map(|s| s.as_ref()).collect(),
        vec![
            // We expect this value to be "Defeat" on the first page, "Victory" 
            // on the second, etc.
            AttributeBuilder::new("team0result")
                .values(&[Some("Defeat"), Some("Victory"), Some("Victory")])
                .build(),
    // ------------------- >8 ---------

mlscraper-rust will automatically generate CSS selectors that work on all the input documents for all the provided values.

Advantages over mlscraper (Python)

Better performance: Instead of testing $O(2^n)$ possible selectors, we generate CSS selectors randomly and improve them iteratively using a basic fuzzing algorithm. See performance comparison below.
Smaller footprint: mlscraper (Python) was on occasion killed by oomkiller on my machine while processing a 30kb HTML file. Our implementation has no problems with many documents and attributes (although we could use some Multithreading) -- see examples/big.rs.
Proper handling of missing data: We allow values to be missing from some training examples and provide different strategies of handling these cases (see MissingDataStrategy).
Proper handling of duplicate data: If a value is present multiple time, you can control which elements should be preferrably selected (see MultipleMatchesStrategy).
Configurable data sources: You can define what should count as the "text" of an HTML tag.
Filtering: You can add custom filters to control what kind of CSS selectors are generated!

Performance Comparison

We compare mlscraper and mlscraper_rust's performance on two Amazon product pages (Apple iPhone, Samsung Galaxy) which have been downloaded to python_comparison/{amazon_iphone, amazon_galaxy}.html.

You can read the used benchmarking code in python_comparison/amazon.py (original mlscraper python library) and examples/amazon.rs (ours).

We compare the time each method takes for "training", i.e., generating suitable selectors. We use the average time of five runs.

Scraping Task	Time Original mlscraper	Time Ours	Speed-Up	Selector Original mlscraper	Selector Ours
Extract product name	1771 ms	25 ms	71x	`#landingImage`	`#landingImage` or `#comparison_image`
Extract product price	1122 ms	21 ms	53x	`#base-product-price`	`#base-product-price`
Name + price at once	6193 ms	34 ms	182x	as above	as above
Find "Add to Cart" button	? (> 5 min)	16 ms	-	-	`#comparison_add_to_cart_button3-announce`

Large-scale Example

All of these advantages are demonstrated in the large-scale example big.rs that you can run using cargo run --release --example big.

It scrapes various match data from leagueofgraphs.com.

mlscraper-rust offers a function to highlight what elements have been selected in the DOM with a red border. After letting the program run for a bit, this is the output for the "big" example:

Highlighted elements

Usage

In your project's Cargo.toml:

[dependencies]
mlscraper-rust = "0.1.2"

Optionally, add features = ["serde"] to enable (de)serialization of the TrainingResults using serde.

Commit count: 43

mlscraper-rust

documentation

README

mlscraper-rust

Generate CSS selectors for web scraping automagically

Example

Advantages over mlscraper (Python)

Performance Comparison

Large-scale Example

Usage

cargo fmt