version | 0.7.0 |
source | src |
created_at | 2021-11-24 20:15:09.110423 |
updated_at | 2024-10-30 15:20:38.967149 |
description | A WHATWG-compliant HTML5 tokenizer and tag soup parser. |
homepage | |
repository | https://github.com/untitaker/html5gum |
max_upload_size | |
id | 487275 |
size | 352,160 |
html5gum is a WHATWG-compliant HTML tokenizer.
```rust
use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title >hello world</title>";
let mut new_html = String::new();

// The tokenizer yields Result<Token, _>; for string input the error type is
// infallible, so flatten() simply unwraps each token.
for token in Tokenizer::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```
html5gum provides multiple kinds of APIs:

- an Emitter trait for maximum performance; see the custom_emitter.rs example.
- a callback-based API; see the callback_emitter.rs example.
- via the experimental tree-builder feature, integration with html5ever and scraper; see the scraper.rs example.

html5gum fully implements section 13.2.5 of the WHATWG HTML spec, i.e. it is able to tokenize HTML documents and passes html5lib's tokenizer test suite. Since it is just a tokenizer, this means:
- html5gum does not implement charset detection. This implementation takes and returns bytes, but assumes UTF-8. It recovers gracefully from invalid UTF-8.
- html5gum does not correct mis-nested tags.
- html5gum doesn't implement the DOM, and unfortunately in the HTML spec, constructing the DOM ("tree construction") influences how tokenization is done. For an example of the problems this causes, see this example code.
- html5gum does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future, see issue 21.

With those caveats in mind, html5gum can pretty much tokenize anything that browsers can. However, using the experimental tree-builder feature, html5gum can be integrated with html5ever and scraper. See the scraper.rs example.
Some of html5gum's dependencies, such as jetscii, can be disabled via crate features (see Cargo.toml).

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:
- use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For some (rather large) set of HTML input this works well (in particular, quick-xml can be configured to be very lenient about parsing errors) and parsing speed is stellar. But neither can parse all HTML. For my own usecase html5gum is about 2x slower than quick-xml.
- use html5ever's own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own usecase (10-15x slower than quick-xml).
- use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn't manage to get working for my usecase.
Why is this library called html5gum?

- G.U.M: Giant Unreadable Match-statement
- <insert "how it feels to chew 5 gum parse HTML" meme here>
Licensed under the MIT license, see ./LICENSE.