| Crates.io | xhtml_parser |
| lib.rs | xhtml_parser |
| version | 0.2.10 |
| created_at | 2025-06-10 14:49:03.591889+00 |
| updated_at | 2025-07-15 15:06:29.666787+00 |
| description | Non-validating XHTML Tree-based parser. |
| homepage | |
| repository | https://github.com/turgu1/xhtml_parser |
| max_upload_size | |
| id | 1707190 |
| size | 8,247,038 |
This is a simple XML/XHTML parser that constructs a read-only, DOM-like tree structure from a Vec<u8> representation of an XML/XHTML file. The author uses it in embedded EPub reader applications.
Loosely based on the PUGIXML parsing method and structure that is described here, it is an in-place parser: all strings are kept in the received Vec<u8>, of which the parser takes ownership. The vector's content is modified to expand entities to their UTF-8 representation (in attribute values and PCData). The position index of elements is preserved in the vector. Tree nodes are kept to their minimum size for memory-constrained environments. A single pre-allocated vector contains all the nodes of the tree. Its maximum size depends on the xxx_node_count feature selected (see below).
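The in-place idea can be illustrated with a minimal sketch. This is not the crate's code: `decode` and `expand_in_place` are hypothetical names, only a handful of entities are handled, and the approach (a write cursor trailing a read cursor inside one text span) is an assumption about how such a parser can work. Because the UTF-8 form of a character is never longer than the reference that names it, the text shrinks inside its own span and nothing outside the span moves.

```rust
use std::ops::Range;

/// Decodes the text between '&' and ';' (hypothetical helper, small subset only).
fn decode(name: &[u8]) -> Option<char> {
    match name {
        b"amp" => Some('&'),
        b"lt" => Some('<'),
        b"gt" => Some('>'),
        b"apos" => Some('\''),
        b"quot" => Some('"'),
        // Hexadecimal reference such as &#xE9;
        [b'#', b'x', hex @ ..] => {
            let s = std::str::from_utf8(hex).ok()?;
            char::from_u32(u32::from_str_radix(s, 16).ok()?)
        }
        // Decimal reference such as &#233;
        [b'#', dec @ ..] => {
            let s = std::str::from_utf8(dec).ok()?;
            char::from_u32(s.parse::<u32>().ok()?)
        }
        _ => None,
    }
}

/// Expands references inside `span` in place and returns the shortened span.
/// Bytes outside `span` are never touched, so other positions stay valid.
fn expand_in_place(buf: &mut [u8], span: Range<usize>) -> Range<usize> {
    let (mut read, mut write) = (span.start, span.start);
    while read < span.end {
        if buf[read] == b'&' {
            // Offset of the closing ';' relative to `read`, if any.
            let semi = buf[read..span.end].iter().position(|&b| b == b';');
            if let Some(ch) = semi.and_then(|end| decode(&buf[read + 1..read + end])) {
                let mut tmp = [0u8; 4];
                let encoded = ch.encode_utf8(&mut tmp);
                let n = encoded.len();
                // The encoded form is never longer than the reference itself,
                // so `write` cannot overtake `read`.
                buf[write..write + n].copy_from_slice(encoded.as_bytes());
                write += n;
                read += semi.unwrap() + 1;
                continue;
            }
        }
        // Ordinary byte (or unknown reference): copy it through unchanged.
        buf[write] = buf[read];
        write += 1;
        read += 1;
    }
    span.start..write
}

fn main() {
    let mut buf = b"<p>a &lt; b &amp; &#xE9;</p>".to_vec();
    let text = 3..buf.len() - 4; // the PCData between <p> and </p>
    let text = expand_in_place(&mut buf, text);
    println!("{}", String::from_utf8_lossy(&buf[text])); // prints "a < b & é"
}
```

Unknown references are copied through unchanged in this sketch; how the crate itself treats them is not stated here.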
The parsing process is limited to normal tags, attributes, and PCData content. No processing instruction (<? .. ?>), comment (<!-- .. -->), CDATA (<![CDATA[ .. ]]>), DOCTYPE (<!DOCTYPE .. >), or DTD inside DOCTYPE ([ ... ]) is retrieved. Basic validation of the XHTML structure is performed to ensure content coherence.
unsafe construct.
namespace_removal feature).
&amp;, &lt;, &gt;, &apos;, and &quot;), Unicode numerical character references (&#xhhhh; and &#nnnn;), and XHTML-related entities (as described here) are translated to their UTF-8 representation (parse_escapes feature).
For performance comparison, a series of 20 runs was done with both PUGIXML (GNU C++) and this crate, using -O3 optimization and parsing the same 5.5 MB XML file containing 25K nodes and 25K attributes. The latest versions of PUGIXML and of this crate were used, with the default options. The values shown are the average of the summed durations and their standard deviation. Results may vary depending on the computer performance and many other aspects (system load, operating system, compiler versions, enabled options/features, data caching, etc.).
| | PUGIXML | XHTML_PARSER |
|---|---|---|
| Average Duration | 5856 µs | 3380 µs |
| Std Deviation | 266 µs | 88 µs |
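For readers who want to reproduce this kind of summary from their own timing runs, here is a minimal sketch of the arithmetic. It assumes the population form of the standard deviation (dividing by n); the text above does not say which form was used. The function name and the sample values in `main` are purely illustrative and are not measurements of either parser.

```rust
/// Returns (mean, population standard deviation) of the samples,
/// or None when there are no samples.
fn mean_and_std_dev(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Average squared distance from the mean (assumption: divide by n, not n - 1).
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some((mean, variance.sqrt()))
}

fn main() {
    // Illustrative values only, standing in for per-run durations in microseconds.
    let runs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    if let Some((mean, sd)) = mean_and_std_dev(&runs) {
        println!("mean = {mean} µs, std dev = {sd} µs");
    }
}
```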
Here is a table showing the effect that some feature combinations may have on the DOM-like structure sizes. The node and attribute structure sizes, in bytes, are shown separated by a '/', depending on the following features:
none, use_cstr, forward_only, use_cstr and forward_only combined.
xxxx_node_count, xxxx_attr_count, xxxx_xml_size.
| | none | use_cstr | forward_only | use_cstr & forward_only |
|---|---|---|---|---|
| small_node_count, small_attr_count, small_xml_size | 18 / 8 | 16 / 4 | 14 / 8 | 12 / 4 |
| small_node_count, small_attr_count, medium_xml_size | 24 / 16 | 20 / 8 | 20 / 16 | 16 / 8 |
| medium_node_count, medium_attr_count, medium_xml_size | 36 / 16 | 32 / 8 | 28 / 16 | 24 / 8 |
| medium_node_count, medium_attr_count, large_xml_size | 48 / 32 | 40 / 16 | 40 / 32 | 32 / 16 |
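The attribute column of the table can be reasoned about with a small sketch. The two structs below are hypothetical layouts, not the crate's actual types: one stores the name and value as index ranges, the other stores only a start index per string, as a null-terminated representation would allow. Under that assumption, `std::mem::size_of` gives 8, 16, and 32 bytes for 16-, 32-, and 64-bit ranges, and half of that for start indices only. Node sizes depend on fields not modelled here.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Hypothetical attribute record: name and value stored as index ranges.
#[allow(dead_code)]
struct AttrWithRanges<Idx> {
    name: Range<Idx>,
    value: Range<Idx>,
}

/// Hypothetical attribute record: only the start of each null-terminated
/// string is stored, so no end index is needed.
#[allow(dead_code)]
struct AttrWithStarts<Idx> {
    name: Idx,
    value: Idx,
}

fn main() {
    // Index widths standing in for small, medium, and large XML sizes.
    println!(
        "16-bit: {} / {}",
        size_of::<AttrWithRanges<u16>>(),
        size_of::<AttrWithStarts<u16>>()
    );
    println!(
        "32-bit: {} / {}",
        size_of::<AttrWithRanges<u32>>(),
        size_of::<AttrWithStarts<u32>>()
    );
    println!(
        "64-bit: {} / {}",
        size_of::<AttrWithRanges<u64>>(),
        size_of::<AttrWithStarts<u64>>()
    );
}
```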
The parser is open-source and can be freely used and modified under the terms of the MIT license.
default: Enables the default features of the parser.
namespace_removal: Enables removal of XML namespaces from tag names during parsing. Default is enabled.
parse_escapes: Enables parsing of character escape sequences (&..;) in PCData nodes. Default is enabled.
keep_ws_only_pcdata: All PCData nodes that are composed of whitespace only will be kept. Default is disabled.
trim_pcdata: Trims whitespace at the beginning and end of PCData nodes. Default is disabled.
small_node_count: Uses 16-bit indices for the nodes vector. Default is enabled.
medium_node_count: Uses 32-bit indices for the nodes vector. Default is disabled.
large_node_count: Uses 64-bit indices for the nodes vector. Default is disabled.
small_attr_count: Uses 16-bit indices for the attributes vector. Default is enabled.
medium_attr_count: Uses 32-bit indices for the attributes vector. Default is disabled.
large_attr_count: Uses 64-bit indices for the attributes vector. Default is disabled.
small_xml_size: Allows XML files up to 64KB in length. Default is disabled.
medium_xml_size: Allows XML files up to 4GB in length. Default is enabled.
large_xml_size: Allows XML files up to 16 exabytes in length. Default is disabled.
use_cstr: Uses an index into a null-terminated [u8] slice (C-style string) instead of a Range to represent string locations in the XML content. Default is disabled.
forward_only: Removes node information and methods that permit going backward in the node structure. Default is disabled.
all_features to get all features enabled under a single one, but without the following: xxxx_node_count, xxxx_attr_count, and xxxx_xml_size.
PCData.
Document::check_closing_tag() method located in the parser.rs file.
style tag allowed in github).
CStr retrieval methods for node names, attribute names and values, and PCData when the use_cstr feature is enabled.
New forward_only feature: This feature removes node information and methods that permit going backward in the node structure. It reduces the amount of memory required to keep the node structure, which is useful in memory-constrained contexts when backward traversal is not used. See the section on size effects for more information, whether or not the feature is combined with use_cstr.
Some code refactoring.
#[inline] adjustments for better performance.
small_attr_count, medium_attr_count, and large_attr_count features to use 16-, 32-, or 64-bit indices for the attributes vector, respectively. small_attr_count is the default value.
small_xml_size, medium_xml_size, and large_xml_size features to accept XML files with a maximum size of 64KB (16-bit indices), 4GB (32-bit indices), or 16 exabytes (64-bit indices), respectively. medium_xml_size is the default value.
Performance comparison revisited.
memchr for char search instead of .iter().position(). Can easily be changed through the parser::seach_char!() macro.
memchr crate is used without the std option.
use_cstr: By using indices into null-terminated [u8] slices instead of a range of indices (to keep the location of strings located in the XML document), this feature reduces the size of nodes to 20 bytes instead of 24 (17% gain in size for each node). For attributes, the size is reduced from 16 bytes to 8 bytes (50% gain in size for each attribute). This change optimizes the memory required to keep the XML DOM-like tree accessible, which is particularly beneficial for embedded applications where available memory is limited. Note that using this feature reduces the overall performance of the parser by approximately 5% to 10%.
all_features to get all features enabled under a single one, but without the following: small_node_count, medium_node_count, and large_node_count.
no_feature feature was removed.
&..;) are translated.
keep_ws_only_pcdata feature enabled, whitespace only nodes are created after a first element tag is encountered.
Chartype enum cleanup.
no_feature feature.
small_node_count, medium_node_count, and large_node_count features to use 16, 32, or 64-bit indices for the nodes vector, respectively. small_node_count is the default value.
<!DOCTYPE .. > and <![CDATA[ .. ]]> bypassing parser algorithm.
keep_ws_only_pcdata: all PCData nodes that are composed of whitespace only will be kept. Default is disabled.
trim_ws_pcdata: trim whitespaces at beginning and end of PCData nodes. Default is disabled.
parse_escapes feature to add attribute values that are parsed for escapes sequences when that feature is enabled.
parser method is no longer public outside of this crate.
Nodes iterator to access document nodes in the sequence of creation. Accessible through the Document::all_nodes(), Document::descendants() and Node::descendants() methods.
use declarations.
Added pub fn is(&self, name: &str) -> bool method to Attribute and Node modules.
Added pub use entries in lib.rs to simplify usage in calling applications. All examples and tests have been modified in accordance with this change.
Added Display trait definition for the ParseXmlError enum in the defs module.
Removed the position field of the node_info struct as the information is available through the range fields of the NodeType enum.
Initial release.
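The use_cstr entries above describe locating a string by a single index into a null-terminated byte buffer rather than by a range. A minimal sketch of that retrieval, using only the standard library, could look like the following. The function name `str_at` and the buffer contents are hypothetical, and this is not the crate's own accessor API.

```rust
use std::ffi::CStr;

/// Returns the UTF-8 string that starts at `start` and runs up to the next
/// NUL byte, or None if the index is out of range, no NUL follows, or the
/// bytes are not valid UTF-8.
fn str_at(buf: &[u8], start: usize) -> Option<&str> {
    let tail = buf.get(start..)?;
    CStr::from_bytes_until_nul(tail).ok()?.to_str().ok()
}

fn main() {
    // Hypothetical buffer after parsing: each string ends with a NUL byte,
    // so one start index per string is enough to find it again.
    let buf = b"body\0class\0main\0";
    println!("{:?}", str_at(buf, 0)); // prints Some("body")
    println!("{:?}", str_at(buf, 5)); // prints Some("class")
    println!("{:?}", str_at(buf, 11)); // prints Some("main")
}
```

The trade-off matches the one stated above: the end of each string has to be found by scanning for the terminator at retrieval time, which costs some speed in exchange for the smaller per-string footprint.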