# Webpage.rs [![crates.io](https://img.shields.io/crates/v/webpage.svg)](https://crates.io/crates/webpage) [![docs.rs](https://img.shields.io/docsrs/webpage)](https://docs.rs/webpage) _Small library to fetch info about a web page: title, description, language, HTTP info, links, RSS feeds, Opengraph, Schema.org, and more_ ## Usage ```rust use webpage::{Webpage, WebpageOptions}; let info = Webpage::from_url("http://www.rust-lang.org/en-US/", WebpageOptions::default()) .expect("Could not read from URL"); // the HTTP transfer info let http = info.http; assert_eq!(http.ip, "54.192.129.71".to_string()); assert!(http.headers[0].starts_with("HTTP")); assert!(http.body.starts_with("")); assert_eq!(http.url, "https://www.rust-lang.org/en-US/".to_string()); // followed redirects (HTTPS) assert_eq!(http.content_type, "text/html".to_string()); // the parsed HTML info let html = info.html; assert_eq!(html.title, Some("The Rust Programming Language".to_string())); assert_eq!(html.description, Some("A systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.".to_string())); assert_eq!(html.opengraph.og_type, "website".to_string()); ``` You can also get HTML info about local data: ```rust use webpage::HTML; let html = HTML::from_file("index.html", None); // or let html = HTML::from_string(input, None); ``` ## Features ### Serialization If you need to be able to serialize the data provided by the library using [serde](https://serde.rs/), you can include specify the `serde` *feature* while declaring your dependencies in `Cargo.toml`: ```toml webpage = { version = "2.0", features = ["serde"] } ``` ### No curl dependency The `curl` feature is enabled by default but is optional. This is useful if you do not need a HTTP client but already have the HTML data at hand. ## All fields ```rust pub struct Webpage { pub http: HTTP, // info about the HTTP transfer pub html: HTML, // info from the parsed HTML doc } pub struct HTTP { pub ip: String, pub transfer_time: Duration, pub redirect_count: u32, pub content_type: String, pub response_code: u32, pub headers: Vec, // raw headers from final request pub url: String, // effective url pub body: String, } pub struct HTML { pub title: Option, pub description: Option, pub url: Option, // canonical url pub feed: Option, // RSS feed typically pub language: Option, // as specified, not detected pub text_content: String, // all tags stripped from body pub links: Vec, // all links in the document pub meta: HashMap, // flattened down list of meta properties pub opengraph: Opengraph, pub schema_org: Vec, } pub struct Link { pub url: String, // resolved url of the link pub text: String, // anchor text } pub struct Opengraph { pub og_type: String, pub properties: HashMap, pub images: Vec, pub videos: Vec, pub audios: Vec, } // Facebook's Opengraph structured data pub struct OpengraphObject { pub url: String, pub properties: HashMap, } // Google's schema.org structured data pub struct SchemaOrg { pub schema_type: String, pub value: serde_json::Value, } ``` ## Options The following HTTP configurations are available: ```rust pub struct WebpageOptions { allow_insecure: false, follow_location: true, max_redirections: 5, timeout: Duration::from_secs(10), useragent: "Webpage - Rust crate - https://crates.io/crates/webpage".to_string(), headers: vec!["X-My-Header: 1234".to_string()], } // usage let mut options = WebpageOptions::default(); options.allow_insecure = true; let info = Webpage::from_url(&url, options).expect("Halp, could not fetch"); ```