| Crates.io | webpage-info |
| lib.rs | webpage-info |
| version | 1.0.1 |
| created_at | 2025-11-25 06:14:15.924173+00 |
| updated_at | 2025-11-25 06:25:27.717317+00 |
| description | Modern library to extract webpage metadata: title, description, OpenGraph, Schema.org, links, and more |
| homepage | |
| repository | https://github.com/robdotec/webpage-info |
| max_upload_size | |
| id | 1949250 |
| size | 139,727 |
A fast, safe Rust library for extracting metadata from web pages. Parses HTML to extract titles, descriptions, OpenGraph data, Schema.org JSON-LD, links, and more.
Add the crate to your `Cargo.toml`:

```toml
[dependencies]
webpage-info = "1.0"
```

For HTML parsing only (no HTTP client), disable the default features:

```toml
[dependencies]
webpage-info = { version = "1.0", default-features = false }
```
Fetch a page and inspect its metadata:

```rust
use webpage_info::WebpageInfo;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let info = WebpageInfo::fetch("https://example.com").await?;

    println!("Title: {:?}", info.html.title);
    println!("Description: {:?}", info.html.description);
    println!("OpenGraph: {:?}", info.html.opengraph.title);
    println!("Links: {}", info.html.links.len());

    Ok(())
}
```
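If you need the individual links rather than just the count, `info.html.links` can be iterated directly. A minimal sketch, assuming only that each `Link` exposes the string-like `url` field used in the parsing example below:

```rust
// Walk the extracted links; `url` is the same field the HTML-parsing
// example asserts on later in this README.
for link in &info.html.links {
    println!("Link: {}", link.url);
}
```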
To parse HTML you already have in memory, use `HtmlInfo` directly:

```rust
use webpage_info::HtmlInfo;

let html = r#"
<html>
  <head>
    <title>My Page</title>
    <meta property="og:title" content="OpenGraph Title">
  </head>
  <body>
    <a href="/about">About</a>
  </body>
</html>
"#;

let info = HtmlInfo::from_string(html, Some("https://example.com/"))?;

assert_eq!(info.title, Some("My Page".to_string()));
assert_eq!(info.opengraph.title, Some("OpenGraph Title".to_string()));
assert_eq!(info.links[0].url, "https://example.com/about");
```
Configure timeouts, body-size limits, and the user agent with `HttpOptions`:

```rust
use webpage_info::{WebpageInfo, HttpOptions};
use std::time::Duration;

let options = HttpOptions::new()
    .timeout(Duration::from_secs(10))
    .max_body_size(5 * 1024 * 1024) // 5 MB
    .user_agent("MyBot/1.0");

let info = WebpageInfo::fetch_with_options("https://example.com", options).await?;
```
The parsed `HtmlInfo` exposes the following fields:

| Field | Type | Description |
|---|---|---|
| `title` | `Option<String>` | Document title from the `<title>` tag |
| `description` | `Option<String>` | Meta description |
| `language` | `Option<String>` | Language from `<html lang="...">` |
| `canonical_url` | `Option<String>` | Canonical URL from `<link rel="canonical">` |
| `feed_url` | `Option<String>` | RSS/Atom feed URL |
| `text_content` | `String` | Extracted text (scripts/styles excluded) |
| `meta` | `HashMap<String, String>` | All meta tags |
| `opengraph` | `Opengraph` | OpenGraph metadata |
| `schema_org` | `Vec<SchemaOrg>` | Schema.org JSON-LD data |
| `links` | `Vec<Link>` | All links in the document |
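A short sketch reading a few of these fields, given an `info: HtmlInfo` as in the parsing example above (field names and types are taken from the table; the output formatting is only illustrative):

```rust
// Optional string fields are unwrapped before printing.
if let Some(canonical) = &info.canonical_url {
    println!("Canonical URL: {canonical}");
}
if let Some(lang) = &info.language {
    println!("Language: {lang}");
}

// `meta` is a HashMap<String, String> holding every meta tag.
for (name, content) in &info.meta {
    println!("meta[{name}] = {content}");
}

println!("Extracted text: {} bytes", info.text_content.len());
```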
OpenGraph metadata is collected into the `opengraph` field:

```rust
let og = &info.opengraph;

println!("Type: {:?}", og.og_type); // "article", "website", etc.
println!("Title: {:?}", og.title);
println!("Description: {:?}", og.description);
println!("Site: {:?}", og.site_name);
println!("Images: {:?}", og.images); // Vec<OpengraphMedia>
println!("Videos: {:?}", og.videos);
```
Schema.org JSON-LD items end up in `schema_org`:

```rust
for schema in &info.schema_org {
    println!("Type: {}", schema.schema_type); // "Article", "Product", etc.
    println!("Name: {:?}", schema.get_str("name"));
    println!("Full JSON: {}", schema.value);
}
```
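To work with a single schema type, the items can be filtered. A sketch assuming `schema_type` is string-like (it is formatted with `{}` above, so `to_string()` is used to stay on safe ground):

```rust
// Count only the Article items among the extracted Schema.org data.
let article_count = info
    .schema_org
    .iter()
    .filter(|schema| schema.schema_type.to_string() == "Article")
    .count();
println!("Articles found: {article_count}");
```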
By default, requests to private/internal IP addresses are blocked:
- Loopback addresses (`127.0.0.1`, `::1`)
- Private ranges (`10.x`, `172.16-31.x`, `192.168.x`)
- Link-local addresses (`169.254.x` - includes cloud metadata endpoints)
- Local domains (`.local`, `.internal`)

To disable (not recommended for user-supplied URLs):
```rust
let options = HttpOptions::new().block_private_ips(false);
```
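With the defaults left in place, fetching one of the blocked addresses is expected to return an error instead of reaching the network. A hedged sketch (the concrete error type is not documented here, so it is only displayed):

```rust
// 169.254.169.254 is a typical cloud metadata endpoint and falls in the
// blocked link-local range.
match WebpageInfo::fetch("http://169.254.169.254/").await {
    Ok(_) => println!("unexpectedly allowed"),
    Err(err) => println!("blocked: {err}"),
}
```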
Default limits prevent resource exhaustion:
| Limit | Default | Option |
|---|---|---|
| Response body | 10 MB | max_body_size() |
| Links | 10,000 | - |
| Schema.org items | 100 | - |
| Text content | 1 MB | - |
| OpenGraph media | 100 each | - |
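Of these, only the response-body cap is exposed through `HttpOptions`. A sketch lowering it below the default:

```rust
use webpage_info::{WebpageInfo, HttpOptions};

// Cap response bodies at 1 MB instead of the default 10 MB.
let options = HttpOptions::new().max_body_size(1024 * 1024);
let info = WebpageInfo::fetch_with_options("https://example.com", options).await?;
```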
Benchmarks on sample HTML (9KB document):
| Operation | Time | Throughput |
|---|---|---|
| Full parse | ~96 µs | 92 MiB/s |
| 1000 links | ~725 µs | 1.4M links/s |
| Text extraction | ~59 µs | - |
| Schema.org (complex) | ~6 µs | - |
Run benchmarks:

```sh
cargo bench
```
To run the included example:

```sh
# Fetch and display webpage info
cargo run --example fetch_example
```
MIT License - see LICENSE for details.