extrablatt_v2

Crates.io: extrablatt_v2
lib.rs: extrablatt_v2
version: 0.5.0
created_at: 2025-11-22 14:02:44.110261+00
updated_at: 2025-12-15 11:22:46.340536+00
description: News, articles and text scraper
homepage:
repository: https://github.com/LdDl/extrablatt_v2
max_upload_size:
id: 1945379
size: 521,698
Dimitrii Lopanov (LdDl)

documentation

https://docs.rs/extrablatt_v2/

README

extrablatt_v2

Crates.io Documentation

This is a fork of the original "extrablatt" repository with some updated dependencies.

Customizable article scraping & curation library and CLI. Also runs in Wasm.

The original project partially supports WASM: there is a basic Wasm example (with some CORS limitations) at https://mattsse.github.io/extrablatt/.

Inspired by newspaper.

HTML scraping is done via select.rs.

Features

  • News URL identification
  • Text extraction
  • Top image extraction
  • All image extraction
  • Keyword extraction
  • Author extraction
  • Publishing date
  • References

Customizable for specific news sites/layouts via the Extractor trait.
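
The selection logic a custom Extractor encapsulates is plain select.rs code. Below is a rough, standalone sketch, not the trait itself (the HTML layout and the article-title class are made up for illustration; see the Extractor docs for the actual method signatures), showing the kind of predicate-based lookup such an implementation would perform:

use select::document::Document;
use select::predicate::{Class, Name, Predicate};

fn main() {
    // Hypothetical site-specific markup where the headline sits in <h1 class="article-title">.
    let html = r#"<html><body><h1 class="article-title">Breaking news</h1></body></html>"#;
    let doc = Document::from(html);

    // Find the first node matching the site-specific predicate and take its text.
    let title = doc
        .find(Name("h1").and(Class("article-title")))
        .next()
        .map(|node| node.text());

    println!("{:?}", title); // Some("Breaking news")
}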

Differences from the original extrablatt

  • Updated dependencies
  • More heuristics for extracting the article body, authors, and other data
  • Reorganized code structure
  • More references to newspaper4k ideas
  • Configurable number of threads
  • Proxy support - route requests through HTTP/HTTPS/SOCKS5 proxies if needed
  • I do not use the WASM or CLI parts in this fork, so they are mostly untouched and I can't guarantee they work as expected.

Documentation

Full Documentation https://docs.rs/extrablatt_v2

Example

Extract all articles from a news outlet.

use extrablatt_v2::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    let mut stream = site.into_stream();
    
    while let Some(article) = stream.next().await {
        if let Ok(article) = article {
            println!("article '{:?}'", article.content.title)
        } else {
            println!("{:?}", article);
        }
    }

    Ok(())
}
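
into_stream() yields a futures Stream, so the usual StreamExt combinators apply. A small variation on the example above that stops after the first few downloaded articles (the limit of 5 is arbitrary):

use extrablatt_v2::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    // Stop after 5 items instead of draining the whole site.
    let mut stream = site.into_stream().take(5);

    while let Some(article) = stream.next().await {
        match article {
            Ok(article) => println!("article '{:?}'", article.content.title),
            Err(err) => eprintln!("failed to fetch article: {:?}", err),
        }
    }

    Ok(())
}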

Proxy Support

Route all HTTP requests through a proxy server:

use extrablatt_v2::Extrablatt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?
        .proxy("http://127.0.0.1:8080")  // HTTP proxy
        // .proxy("socks5://127.0.0.1:1080")  // SOCKS5 proxy
        .build()
        .await?;

    // All requests now go through the proxy
    let mut stream = site.into_stream();
    // ...

    Ok(())
}

Supported proxy formats:

  • http://host:port - HTTP proxy
  • https://host:port - HTTPS proxy
  • socks5://host:port - SOCKS5 proxy
  • http://user:password@host:port - HTTP proxy with authentication
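
The proxy URL does not have to be hard-coded; it can be chosen at runtime. A minimal sketch, assuming the URL is passed via an environment variable of your own choosing (EXTRABLATT_PROXY is just a name for this example, not a variable the crate reads):

use extrablatt_v2::Extrablatt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pick up the proxy URL at runtime, e.g. EXTRABLATT_PROXY=socks5://127.0.0.1:1080
    let site = match std::env::var("EXTRABLATT_PROXY") {
        Ok(proxy_url) => Extrablatt::builder("https://some-news.com/")?
            .proxy(proxy_url.as_str())
            .build()
            .await?,
        Err(_) => Extrablatt::builder("https://some-news.com/")?.build().await?,
    };

    let mut stream = site.into_stream();
    // ...

    Ok(())
}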

Disabling System Proxy

By default, reqwest reads proxy settings from environment variables (HTTP_PROXY, HTTPS_PROXY, ALL_PROXY). To ignore system proxy settings and make direct connections, use .no_system_proxy():

use extrablatt_v2::Extrablatt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?
        .no_system_proxy()  // Ignore HTTP_PROXY/HTTPS_PROXY env vars
        .build()
        .await?;

    // Requests connect directly, ignoring system proxy
    Ok(())
}

Testing Proxy Manually

Use mitmproxy via Docker to verify requests go through the proxy:

# Terminal 1: Start mitmproxy
docker run --rm -it -p 8080:8080 mitmproxy/mitmproxy

# Terminal 2: Run the test example
cargo run --example proxy_manual_test -- http://127.0.0.1:8080

You should see the HTTP request appear in mitmproxy's console, proving traffic is routed through the proxy.

mitmproxy example

=== Proxy Test ===
Target URL: http://httpbin.org/ip
Proxy: Some("http://127.0.0.1:8080")
Configuring proxy: http://127.0.0.1:8080
Connecting...
SUCCESS: Connected through proxy!
If using mitmproxy, you should see the request in the proxy console.

Note: HTTPS requests through mitmproxy will fail with certificate errors (expected behavior - mitmproxy intercepts SSL). For testing, use HTTP URLs or configure your system to trust mitmproxy's CA certificate.
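
To double-check the proxy itself independently of extrablatt_v2, you can hit httpbin.org directly with reqwest (the HTTP client the crate uses under the hood). A minimal sketch, assuming reqwest and tokio are available as dev-dependencies:

use reqwest::{Client, Proxy};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Route every request of this client through the local mitmproxy instance.
    let client = Client::builder()
        .proxy(Proxy::all("http://127.0.0.1:8080")?)
        .build()?;

    // httpbin echoes the caller's IP; the request should also show up in mitmproxy's console.
    let body = client.get("http://httpbin.org/ip").send().await?.text().await?;
    println!("{}", body);

    Ok(())
}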

Command Line

Install

cargo install extrablatt_v2 --features="cli"

Usage

USAGE:
    extrablatt_v2 <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.

Extract a set of specific articles and store the result as JSON

extrablatt_v2 article "https://www.example.com/article1.html" "https://www.example.com/article2.html" -o "articles.json"

License

Licensed under either of these:
