Crates.io | html5tokenizer |
lib.rs | html5tokenizer |
version | 0.5.2 |
source | src |
created_at | 2021-04-08 13:57:57.588536 |
updated_at | 2023-09-28 09:12:56.083386 |
description | An HTML5 tokenizer with code span support. |
repository | https://git.push-f.com/html5tokenizer |
id | 380863 |
size | 572,207 |
Spec-compliant HTML parsing requires both tokenization and tree construction.
While this crate implements a spec-compliant HTML tokenizer, it does not implement any
tree construction. Instead, it provides a NaiveParser
that may be used as follows:
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::Char(c) => {
            write!(new_html, "{c}").unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        Token::EndOfFile => {},
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
This library can provide source spans. For an example, see examples/spans.rs, which produces the following output:
note:
  ┌─ file.html:1:2
  │
1 │ <img src=example.jpg alt="some description">
  │  ^^^ ^^^ ^^^^^^^^^^^ ^^^  ^^^^^^^^^^^^^^^^ attr value
  │  │   │   │           │
  │  │   │   │           attr name
  │  │   │   attr value
  │  │   attr name
  │  tag name
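The header of that output reports a position as `file.html:1:2`, i.e. line 1, column 2. As a rough sketch of how such a position can be derived, the helper below converts a byte offset into a 1-based line and column. It assumes a span is expressed as byte offsets into the source string. That is an assumption made for illustration, and `line_col` is a hypothetical name that is not part of html5tokenizer; consult the crate documentation for the actual span type.

```rust
/// Hypothetical helper (not part of html5tokenizer): converts a byte offset
/// into a 1-based (line, column) pair. Columns are counted in characters,
/// not bytes, so multi-byte UTF-8 characters advance the column by one.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let mut line = 1;
    let mut col = 1;
    for (i, ch) in source.char_indices() {
        // Stop once we reach the character that starts at (or after) the offset.
        if i >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}
```

Under these assumptions, a tag name that starts at byte offset 1 of `<img src=example.jpg alt="some description">` maps to line 1, column 2, which matches the position shown in the header above.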
This crate does not yet implement tree construction
(which is necessary for spec-compliant HTML parsing).
This crate does not yet implement character encoding detection.
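Until encoding detection is available, callers have to decode input bytes themselves before tokenizing. The following is a minimal sketch of one small piece of that work, byte order mark (BOM) sniffing, using only the standard library. The names `BomEncoding`, `sniff_bom` and `decode_utf8_input` are hypothetical and not part of html5tokenizer, and the sketch deliberately ignores the rest of encoding detection (HTTP headers, `<meta charset>` prescanning, and so on).

```rust
/// Encodings that can be identified from a byte order mark.
/// Hypothetical type for illustration; not part of html5tokenizer.
#[derive(Debug, PartialEq)]
enum BomEncoding {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Returns the encoding indicated by a leading BOM, if there is one.
fn sniff_bom(bytes: &[u8]) -> Option<BomEncoding> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(BomEncoding::Utf8)
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(BomEncoding::Utf16Be)
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(BomEncoding::Utf16Le)
    } else {
        None
    }
}

/// Returns the input as a string slice if it is valid UTF-8, skipping a
/// UTF-8 BOM when present. UTF-16 input and invalid UTF-8 yield `None`;
/// decoding those is left to the caller in this sketch.
fn decode_utf8_input(bytes: &[u8]) -> Option<&str> {
    match sniff_bom(bytes) {
        Some(BomEncoding::Utf8) => std::str::from_utf8(&bytes[3..]).ok(),
        None => std::str::from_utf8(bytes).ok(),
        Some(_) => None,
    }
}
```

The resulting `&str` can then be handed to the tokenizer in the same way as the string literal in the NaiveParser example.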
The tokenizer passes the html5lib tokenizer test suite. The library is not yet fuzz tested.
html5tokenizer was forked from html5gum 0.2.1, which was created by Markus Unterwaditzer, who deserves major props for implementing all 80 (!) tokenizer states.
For details please refer to the changelog.
Licensed under the MIT license, see the LICENSE file.