xhtmlchardet

Crates.io	xhtmlchardet
lib.rs	xhtmlchardet
version	2.2.0
created_at	2015-08-31 08:29:15.902594+00
updated_at	2022-01-26 00:08:33.967717+00
description	Character set detection for XML and HTML
homepage	https://github.com/wezm/xhtmlchardet
repository	https://github.com/wezm/xhtmlchardet
id	2970
size	15,659

Wesley Moore (wezm)

documentation

https://docs.rs/xhtmlchardet/

README

xhtmlchardet

Basic character set detection for XML and HTML in Rust.

Minimum Supported Rust Version: 1.24.0

Example

extern crate xhtmlchardet;

use std::io::Cursor;

fn main() {
    // An XML document that names its encoding in the XML declaration.
    let text = b"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><channel><title>Example</title></channel>";
    // Wrap the bytes in a Cursor so that detect can read from them.
    let mut text_cursor = Cursor::new(text.to_vec());
    let detected_charsets: Vec<String> = xhtmlchardet::detect(&mut text_cursor, None).unwrap();
    assert_eq!(detected_charsets, vec!["iso-8859-1".to_string()]);
}

Rationale

I wrote a feed crawler that needed to determine the character set of fetched content so that it could be normalised to UTF-8. Initially I used the uchardet crate, but I encountered some situations where it misdetected the charset. I collected these edge cases into a test suite, then implemented this crate, which passes all of those tests. It uses a fairly naïve approach derived from Appendix F ("Autodetection of Character Encodings") of the XML specification.
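To give a feel for the Appendix F approach, here is a minimal standard-library sketch. It is not this crate's actual implementation: the helper names sniff_encoding and declared_encoding are hypothetical, and the 512-byte scan window and the lowercase encoding labels are assumptions made for illustration. The first function matches the byte signatures listed in Appendix F (byte order marks, then a leading "<" or "<?" in UTF-16 or UTF-32); the second reads the encoding pseudo-attribute from an ASCII-compatible XML declaration.

```rust
/// Hypothetical helper: guess an encoding family from the leading bytes,
/// using the signatures listed in Appendix F of the XML 1.0 specification.
fn sniff_encoding(bytes: &[u8]) -> Option<&'static str> {
    match bytes {
        // Byte order marks. The four-byte UTF-32 marks are tested first so
        // that FF FE 00 00 is not mistaken for the UTF-16LE mark FF FE.
        [0x00, 0x00, 0xFE, 0xFF, ..] => Some("utf-32be"),
        [0xFF, 0xFE, 0x00, 0x00, ..] => Some("utf-32le"),
        [0xFE, 0xFF, ..] => Some("utf-16be"),
        [0xFF, 0xFE, ..] => Some("utf-16le"),
        [0xEF, 0xBB, 0xBF, ..] => Some("utf-8"),
        // No mark: a leading "<" or "<?" encoded in a wide encoding.
        [0x00, 0x00, 0x00, 0x3C, ..] => Some("utf-32be"),
        [0x3C, 0x00, 0x00, 0x00, ..] => Some("utf-32le"),
        [0x00, 0x3C, 0x00, 0x3F, ..] => Some("utf-16be"),
        [0x3C, 0x00, 0x3F, 0x00, ..] => Some("utf-16le"),
        _ => None,
    }
}

/// Hypothetical helper: read the encoding named in an ASCII-compatible
/// XML declaration, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>.
fn declared_encoding(bytes: &[u8]) -> Option<String> {
    // Assumption: the declaration fits within the first 512 bytes.
    let window = &bytes[..bytes.len().min(512)];
    let head = String::from_utf8_lossy(window).into_owned();
    if !head.starts_with("<?xml") {
        return None;
    }
    // Only the declaration itself (up to the first "?>") is inspected.
    let decl = &head[..head.find("?>")?];
    let rest = &decl[decl.find("encoding=")? + "encoding=".len()..];
    // The value may be wrapped in either double or single quotes.
    let quote = rest.chars().next().filter(|c| *c == '"' || *c == '\'')?;
    let value = &rest[1..];
    let end = value.find(quote)?;
    Some(value[..end].to_ascii_lowercase())
}
```

A caller would typically try sniff_encoding first and fall back to declared_encoding when no signature matches. Real feeds also need handling for HTML meta tags and HTTP Content-Type hints, which this sketch leaves out.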

Commit count: 42
