| Field | Value |
| --- | --- |
| Crates.io | extrablatt_v2 |
| lib.rs | extrablatt_v2 |
| version | 0.5.0 |
| created_at | 2025-11-22 14:02:44.110261+00 |
| updated_at | 2025-12-15 11:22:46.340536+00 |
| description | News, articles and text scraper |
| homepage | |
| repository | https://github.com/LdDl/extrablatt_v2 |
| max_upload_size | |
| id | 1945379 |
| size | 521,698 |
This is a fork of the original [extrablatt](https://github.com/mattsse/extrablatt) repository with updated dependencies.
Customizable article scraping & curation library and CLI. Also runs in Wasm.
The original project has basic Wasm support; a demo (with some CORS limitations) is available at https://mattsse.github.io/extrablatt/.
Inspired by the Python [newspaper](https://github.com/codelucas/newspaper) library.
HTML scraping is done via select.rs.
Customizable for specific news sites/layouts via the `Extractor` trait (a rough sketch follows below).
Full documentation: https://docs.rs/extrablatt_v2
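
As a rough sketch of what a custom extractor could look like: the method name and signature below are assumptions based on the upstream project, and `OgTitleExtractor` is purely hypothetical, so verify against https://docs.rs/extrablatt_v2 before copying. The idea is to override one of the trait's default methods while keeping the rest:

```rust
use std::borrow::Cow;

use extrablatt_v2::Extractor;
use select::document::Document;
use select::predicate::Name;

/// Hypothetical extractor that prefers the Open Graph title tag.
struct OgTitleExtractor;

impl Extractor for OgTitleExtractor {
    // Assumed signature -- check the real trait definition on docs.rs.
    fn title<'a>(&self, doc: &'a Document) -> Option<Cow<'a, str>> {
        // Look for <meta property="og:title" content="..."> via select.rs.
        doc.find(Name("meta"))
            .filter(|node| node.attr("property") == Some("og:title"))
            .filter_map(|node| node.attr("content"))
            .map(Cow::Borrowed)
            .next()
    }
}
```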
Extract all articles from a news outlet:
```rust
use extrablatt_v2::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    // Stream every article discovered on the site.
    let mut stream = site.into_stream();
    while let Some(article) = stream.next().await {
        // A `match` is used here: the original `if let Ok(..) = article` with an
        // `else` branch that also reads `article` would not compile (moved value).
        match article {
            Ok(article) => println!("article '{:?}'", article.content.title),
            Err(err) => println!("{:?}", err),
        }
    }
    Ok(())
}
```
Route all HTTP requests through a proxy server:
```rust
use extrablatt_v2::Extrablatt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?
        .proxy("http://127.0.0.1:8080") // HTTP proxy
        // .proxy("socks5://127.0.0.1:1080") // SOCKS5 proxy
        .build()
        .await?;

    // All requests now go through the proxy.
    let mut stream = site.into_stream();
    // ...
    Ok(())
}
```
Supported proxy formats:

- `http://host:port` - HTTP proxy
- `https://host:port` - HTTPS proxy
- `socks5://host:port` - SOCKS5 proxy
- `http://user:password@host:port` - HTTP proxy with authentication

By default, reqwest reads proxy settings from the environment variables `HTTP_PROXY`, `HTTPS_PROXY`, and `ALL_PROXY` (see the shell sketch below).
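
Because of that, a proxy can often be supplied without any code changes. A minimal shell sketch (the binary name is a placeholder, not part of this crate):

```sh
# reqwest-based clients honor these by default, unless .no_system_proxy() is set.
export HTTP_PROXY=http://127.0.0.1:8080
export HTTPS_PROXY=http://127.0.0.1:8080
./your-scraper   # placeholder for any binary built on extrablatt_v2
```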
To ignore system proxy settings and make direct connections, use `.no_system_proxy()`:
```rust
use extrablatt_v2::Extrablatt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?
        .no_system_proxy() // ignore HTTP_PROXY/HTTPS_PROXY env vars
        .build()
        .await?;

    // Requests connect directly, ignoring any system proxy.
    Ok(())
}
```
Use mitmproxy via Docker to verify requests go through the proxy:
```sh
# Terminal 1: start mitmproxy
docker run --rm -it -p 8080:8080 mitmproxy/mitmproxy

# Terminal 2: run the test example
cargo run --example proxy_manual_test -- http://127.0.0.1:8080
```
You should see the HTTP request appear in mitmproxy's console, confirming that traffic is routed through the proxy. Expected output of the example:
```text
=== Proxy Test ===
Target URL: http://httpbin.org/ip
Proxy: Some("http://127.0.0.1:8080")
Configuring proxy: http://127.0.0.1:8080
Connecting...
SUCCESS: Connected through proxy!

If using mitmproxy, you should see the request in the proxy console.
```
Note: HTTPS requests through mitmproxy will fail with certificate errors (expected behavior: mitmproxy intercepts TLS). For testing, use HTTP URLs or configure your system to trust mitmproxy's CA certificate (a sketch follows below).
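
One way to make HTTPS testing work is to trust mitmproxy's CA in the system store. A natively installed mitmproxy writes its CA to `~/.mitmproxy/` on first run; with the Docker image above you would need to mount that directory out of the container. A Debian/Ubuntu sketch (paths vary by distribution):

```sh
# Add mitmproxy's CA certificate to the system trust store.
sudo cp ~/.mitmproxy/mitmproxy-ca-cert.pem /usr/local/share/ca-certificates/mitmproxy.crt
sudo update-ca-certificates
```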
Install the CLI:

```sh
cargo install extrablatt_v2 --features="cli"
```
```text
USAGE:
    extrablatt_v2 <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source
```
Extract specific articles and store them as JSON (the comma between URLs in the original example would have been passed to the shell as part of the first argument, so it is dropped here):

```sh
extrablatt_v2 article "https://www.example.com/article1.html" "https://www.example.com/article2.html" -o "articles.json"
```
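
The `site` subcommand presumably accepts the same output flag; this is an assumption, so check `extrablatt_v2 help site` for the actual options:

```sh
extrablatt_v2 site "https://some-news.com/" -o "all-articles.json"
```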
Licensed under either of these:

- Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (https://opensource.org/licenses/MIT)