webpage-info

Crates.io	webpage-info
lib.rs	webpage-info
version	1.0.1
created_at	2025-11-25 06:14:15.924173+00
updated_at	2025-11-25 06:25:27.717317+00
description	Modern library to extract webpage metadata: title, description, OpenGraph, Schema.org, links, and more
homepage
repository	https://github.com/robdotec/webpage-info
max_upload_size
id	1949250
size	139,727

rob.ec (robdotec)

documentation

https://docs.rs/webpage-info

README

webpage-info

A fast, safe Rust library for extracting metadata from web pages. Parses HTML to extract titles, descriptions, OpenGraph data, Schema.org JSON-LD, links, and more.

Features

Fast HTML parsing with cached CSS selectors (~92 MiB/s throughput)
OpenGraph metadata extraction (og:title, og:image, etc.)
Schema.org JSON-LD structured data parsing
Link extraction with URL resolution
Text content extraction excluding scripts and styles
Async HTTP fetching with security protections
SSRF protection blocks requests to private IPs by default
Resource limits prevent memory exhaustion attacks

Installation

[dependencies]
webpage-info = "1.0"

For HTML parsing only (no HTTP client):

[dependencies]
webpage-info = { version = "1.0", default-features = false }

Quick Start

Fetch and parse a URL

use webpage_info::WebpageInfo;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let info = WebpageInfo::fetch("https://example.com").await?;

    println!("Title: {:?}", info.html.title);
    println!("Description: {:?}", info.html.description);
    println!("OpenGraph: {:?}", info.html.opengraph.title);
    println!("Links: {}", info.html.links.len());

    Ok(())
}

Parse HTML directly

use webpage_info::HtmlInfo;

let html = r#"
    <html>
    <head>
        <title>My Page</title>
        <meta property="og:title" content="OpenGraph Title">
    </head>
    <body>
        <a href="/about">About</a>
    </body>
    </html>
"#;

let info = HtmlInfo::from_string(html, Some("https://example.com/"))?;

assert_eq!(info.title, Some("My Page".to_string()));
assert_eq!(info.opengraph.title, Some("OpenGraph Title".to_string()));
assert_eq!(info.links[0].url, "https://example.com/about");

Custom HTTP options

use webpage_info::{WebpageInfo, HttpOptions};
use std::time::Duration;

let options = HttpOptions::new()
    .timeout(Duration::from_secs(10))
    .max_body_size(5 * 1024 * 1024)  // 5 MB
    .user_agent("MyBot/1.0");

let info = WebpageInfo::fetch_with_options("https://example.com", options).await?;

Extracted Data

HtmlInfo

Field	Type	Description
`title`	`Option<String>`	Document title from `<title>` tag
`description`	`Option<String>`	Meta description
`language`	`Option<String>`	Language from `<html lang="...">`
`canonical_url`	`Option<String>`	Canonical URL from `<link rel="canonical">`
`feed_url`	`Option<String>`	RSS/Atom feed URL
`text_content`	`String`	Extracted text (scripts/styles excluded)
`meta`	`HashMap<String, String>`	All meta tags
`opengraph`	`Opengraph`	OpenGraph metadata
`schema_org`	`Vec<SchemaOrg>`	Schema.org JSON-LD data
`links`	`Vec<Link>`	All links in the document

OpenGraph

let og = &info.opengraph;
println!("Type: {:?}", og.og_type);      // "article", "website", etc.
println!("Title: {:?}", og.title);
println!("Description: {:?}", og.description);
println!("Site: {:?}", og.site_name);
println!("Images: {:?}", og.images);     // Vec<OpengraphMedia>
println!("Videos: {:?}", og.videos);

Schema.org

for schema in &info.schema_org {
    println!("Type: {}", schema.schema_type);  // "Article", "Product", etc.
    println!("Name: {:?}", schema.get_str("name"));
    println!("Full JSON: {}", schema.value);
}

Security

SSRF Protection

By default, requests to private/internal IP addresses are blocked:

Localhost (127.0.0.1, ::1)
Private networks (10.x, 172.16-31.x, 192.168.x)
Link-local (169.254.x - includes cloud metadata endpoints)
Internal hostnames (.local, .internal)

To disable (not recommended for user-supplied URLs):

let options = HttpOptions::new().block_private_ips(false);

Resource Limits

Default limits prevent resource exhaustion:

Limit	Default	Option
Response body	10 MB	`max_body_size()`
Links	10,000	-
Schema.org items	100	-
Text content	1 MB	-
OpenGraph media	100 each	-

Performance

Benchmarks on sample HTML (9KB document):

Operation	Time	Throughput
Full parse	~96 µs	92 MiB/s
1000 links	~725 µs	1.4M links/s
Text extraction	~59 µs	-
Schema.org (complex)	~6 µs	-

Run benchmarks:

cargo bench

Examples

# Fetch and display webpage info
cargo run --example fetch_example

License

MIT License - see LICENSE for details.

Commit count: 0

webpage-info

documentation

README

webpage-info

Features

Installation

Quick Start

Fetch and parse a URL

Parse HTML directly

Custom HTTP options

Extracted Data

HtmlInfo

OpenGraph

Schema.org

Security

SSRF Protection

Resource Limits

Performance

Examples

License

cargo fmt