Crates.io | xml_oxide |
lib.rs | xml_oxide |
version | 0.3.0 |
source | src |
created_at | 2017-03-04 23:59:29.329511 |
updated_at | 2021-12-09 22:04:40.201199 |
description | XML SAX parser implementation that parses any well-formed XML defined in the W3C Spec |
homepage | |
repository | https://github.com/fatihpense/rust_xml_oxide |
max_upload_size | |
id | 8815 |
size | 139,817 |
Rust XML parser implementation that parses any well-formed XML defined in the W3C Spec in a streaming way.
unsafe is used for the function std::str::from_utf8_unchecked. It is called on a slice of bytes that has already been checked to be a valid UTF-8 string with std::str::from_utf8. The performance saving has not been measured, though.
Switching from RefCell<Vec<u8>> to circular::Buffer passed Rust's borrow checks. I'm leaving this note as a reference. RefCell is used because Rust is too restrictive about mutable borrows in conditional loops. Hopefully, non-lexical lifetimes will get better over time.
namespace-aware=false option to parse otherwise valid XML 1.0 documents.
In this example, StartElement and EndElement events are counted. Note that you can find more examples under the tests directory.
StartElement events also include empty-element tags. Check the is_empty field to tell them apart.
A reference such as &amp; or &lt; comes in its own event (not in Characters).
use std::fs::File;
use xml_oxide::{sax::parser::Parser, sax::Event};

fn main() {
    println!("Starting...");

    let mut counter: usize = 0;
    let mut end_counter: usize = 0;

    let now = std::time::Instant::now();

    let f = File::open("./tests/xml_files/books.xml").unwrap();
    let mut p = Parser::from_reader(f);

    loop {
        let res = p.read_event();
        match res {
            Ok(event) => match event {
                Event::StartDocument => {}
                Event::EndDocument => {
                    break;
                }
                Event::StartElement(el) => {
                    // You can differentiate between a start tag and an empty-element tag.
                    if !el.is_empty {
                        counter += 1;
                        // Print every 10000th element name.
                        if counter % 10000 == 0 {
                            println!("%10000 start {}", el.name);
                        }
                    }
                }
                Event::EndElement(el) => {
                    end_counter += 1;
                    // Stop early once the closing "feed" tag is seen.
                    if el.name == "feed" {
                        break;
                    }
                }
                Event::Characters(_) => {}
                Event::Reference(_) => {}
                _ => {}
            },
            Err(err) => {
                println!("{}", err);
                break;
            }
        }
    }

    println!("Start event count:{}", counter);
    println!("End event count:{}", end_counter);

    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}
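Because references arrive in their own event rather than inside Characters, a consumer that wants plain text has to stitch the two back together. The following is only a rough sketch of that idea. TextEvent, resolve_predefined and collect_text are hypothetical names, the enum is a stand-in rather than xml_oxide's own type, and it assumes the reference event carries the raw text such as &amp;, which is not confirmed here.

```rust
// Stand-in for the two text-bearing events discussed above.
// Hypothetical: these are NOT xml_oxide's actual types.
enum TextEvent<'a> {
    Characters(&'a str),
    // Assumption: the raw reference text, e.g. "&amp;".
    Reference(&'a str),
}

// Resolve one of the five entities predefined by XML 1.0.
// Returns None for anything else (custom entities, character references).
fn resolve_predefined(raw: &str) -> Option<char> {
    match raw {
        "&amp;" => Some('&'),
        "&lt;" => Some('<'),
        "&gt;" => Some('>'),
        "&apos;" => Some('\''),
        "&quot;" => Some('"'),
        _ => None,
    }
}

// Concatenate character chunks and resolved references into one String.
// This also copes with text that the parser delivers in several chunks.
fn collect_text(events: &[TextEvent<'_>]) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            TextEvent::Characters(chunk) => out.push_str(chunk),
            TextEvent::Reference(raw) => match resolve_predefined(raw) {
                Some(c) => out.push(c),
                // Leave unknown references untouched instead of guessing.
                None => out.push_str(raw),
            },
        }
    }
    out
}

fn main() {
    let events = [
        TextEvent::Characters("Tom "),
        TextEvent::Reference("&amp;"),
        TextEvent::Characters(" Jerry"),
    ];
    println!("{}", collect_text(&events));
}
```

In a real loop the same accumulation would happen inside the Event::Characters and Event::Reference arms, with the buffer flushed when an EndElement arrives.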
In 2017, I tried to specify and implement a push parser interface like the Java SAX library. The idea was to provide an interface that could have multiple implementations in the community. It worked (albeit slowly), but the main problem was that a push parser is not ergonomic in Rust. After thinking about it for a long time and learning more about Rust, I decided to implement a pull parser. Currently, the SAX (pull) interface is just an enum and its behavior (like the possibility of splitting characters for each call).
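To make the ergonomics argument concrete, here is a toy comparison of the two styles. It is a sketch under loud assumptions: Event, Handler, PullParser and the helper functions are hypothetical and only replay a prepared list of events, so nothing here reflects how xml_oxide or xml_sax are actually implemented.

```rust
// Hypothetical event type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(&'static str),
    End(&'static str),
    Text(&'static str),
}

// Push style: the parser owns the loop and calls back into a handler,
// so all user state has to live inside a type implementing the trait.
trait Handler {
    fn on_event(&mut self, event: &Event);
}

fn push_parse(input: &[Event], handler: &mut dyn Handler) {
    for event in input {
        handler.on_event(event);
    }
}

struct DepthTracker {
    depth: usize,
    max_depth: usize,
}

impl Handler for DepthTracker {
    fn on_event(&mut self, event: &Event) {
        match event {
            Event::Start(_) => {
                self.depth += 1;
                self.max_depth = self.max_depth.max(self.depth);
            }
            Event::End(_) => self.depth = self.depth.saturating_sub(1),
            Event::Text(_) => {}
        }
    }
}

fn max_depth_push(input: &[Event]) -> usize {
    let mut tracker = DepthTracker { depth: 0, max_depth: 0 };
    push_parse(input, &mut tracker);
    tracker.max_depth
}

// Pull style: the caller owns the loop and asks for the next event,
// so state can stay in plain local variables and `break` works as usual.
struct PullParser<'a> {
    input: std::slice::Iter<'a, Event>,
}

impl<'a> PullParser<'a> {
    fn new(input: &'a [Event]) -> Self {
        PullParser { input: input.iter() }
    }

    fn read_event(&mut self) -> Option<&'a Event> {
        self.input.next()
    }
}

fn max_depth_pull(input: &[Event]) -> usize {
    let mut parser = PullParser::new(input);
    let (mut depth, mut max_depth) = (0usize, 0usize);
    while let Some(event) = parser.read_event() {
        match event {
            Event::Start(_) => {
                depth += 1;
                max_depth = max_depth.max(depth);
            }
            Event::End(_) => depth = depth.saturating_sub(1),
            Event::Text(_) => {}
        }
    }
    max_depth
}

// A prepared event list standing in for a parsed document.
fn sample() -> Vec<Event> {
    vec![
        Event::Start("a"),
        Event::Start("b"),
        Event::Text("hi"),
        Event::End("b"),
        Event::End("a"),
    ]
}

fn main() {
    let events = sample();
    println!("push: {}", max_depth_push(&events));
    println!("pull: {}", max_depth_pull(&events));
}
```

Both functions compute the same value, but the pull version needs no trait, no dedicated state struct, and no mutable borrow handed to the parser, which is roughly why the pull shape tends to sit better with Rust's ownership rules.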
If you want to use the xml_sax interface to implement another parser, we can discuss improving the interface. Currently, it is integrated into this crate.
The "Why a pull parser?" section in pulldown-cmark is a great explanation.
The current interface is inspired by quick-xml, xml-rs, and Java libraries.
nom is a great library. It is just a crystallized and better version of what you would naively do on a first try (I know). It also shows the power of composability in Rust.