| Crates.io | kindaxml |
| lib.rs | kindaxml |
| version | 0.1.0 |
| created_at | 2025-12-05 13:39:32.035961+00 |
| updated_at | 2025-12-05 13:39:32.035961+00 |
| description | Close-enough, XML-ish annotation parsing with deterministic recovery for LLM output. |
| homepage | https://github.com/soraxas/kindaxml |
| repository | https://github.com/soraxas/kindaxml |
| max_upload_size | |
| id | 1968214 |
| size | 78,395 |
kindaxml: close-enough, XML-ish markup for LLM output
KindaXML is an XML-inspired annotation DSL designed for LLM-generated text. It keeps the familiar <tag attr=...> shape, but the parser is tolerant: it recovers from missing end tags, missing quotes, and other common “almost XML” mistakes.
KindaXML is not XML (and not meant to be parsed by strict XML parsers). Think: well-formed-ish.
LLMs are good at emitting XML-like text, but strict XML breaks easily. KindaXML aims to be tolerant of that output while recovering from its mistakes deterministically.
KindaXML’s primary output is a stream of text segments, each optionally annotated:
[
{"text": "We shipped last week", "ann": [{"tag":"cite","attrs":{"id":"1"}}]},
{"text": ". ", "ann": []},
{"text": "Details", "ann": [{"tag":"note","attrs":{}}]}
]
KindaXML intentionally avoids deep nesting. In fact, it auto-closes open tags when the next tag begins, which keeps structures shallow and robust.
- Start tag: <tag ...>
- End tag: </tag>
- Self-closing tag: <tag .../>

Tag names match:
[A-Za-z][A-Za-z0-9_\-:.]*
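The pattern above can be checked without a regex engine. The following is a minimal sketch using only the standard library; `is_valid_tag_name` is a hypothetical helper written for illustration and is not part of the kindaxml API.

```rust
/// Returns true if `name` matches [A-Za-z][A-Za-z0-9_\-:.]*
/// (hypothetical helper for illustration, not a kindaxml function).
fn is_valid_tag_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        // The first character must be an ASCII letter.
        Some(first) if first.is_ascii_alphabetic() => {
            // Every remaining character is a letter, a digit, or one of _ - : .
            chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | ':' | '.'))
        }
        // Empty names and names starting with anything else are rejected.
        _ => false,
    }
}
```

Under this check, names such as `cite` or `ns:tag-1.a_b` are accepted, while `1tag`, an empty string, or a name containing a space are rejected.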
Supported forms:
- a="x"
- a='x'
- a=x (unquoted)
- a (boolean attribute; implies true)
- = is allowed.

A tag begins at < and ends at the first >.
If a quote starts inside the tag but never closes, it is implicitly closed at >.
Example:
<cite id='1,2>text</cite>
Parses as:
- tag = cite
- id = "1,2" (quote recovered)
- inner text = text

If a start tag is open and the parser encounters the next <something...>, the current tag is implicitly closed immediately before that next <.
This is the core rule that prevents runaway structures.
Example:
<A>hello <B>world</B>
<A> auto-closes before <B>.
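To make the auto-close rule concrete, here is a minimal sketch of a scanner that keeps at most one tag open at a time. It is not kindaxml's implementation: `auto_close_segments` is a hypothetical function, it reads only the tag name (no attributes), and it ignores self-closing tags and span strategies.

```rust
/// Splits `input` into (open tag name, text) pairs, keeping at most one tag
/// open at a time. Hypothetical sketch, not the kindaxml parser.
fn auto_close_segments(input: &str) -> Vec<(Option<String>, String)> {
    let mut segments = Vec::new();
    let mut open: Option<String> = None; // the single currently open tag, if any
    let mut rest = input;

    while let Some(lt) = rest.find('<') {
        // Text before '<' belongs to whichever tag is currently open.
        if lt > 0 {
            segments.push((open.clone(), rest[..lt].to_string()));
        }
        let after = &rest[lt + 1..];
        let Some(gt) = after.find('>') else {
            // No closing '>': keep the remainder as plain text.
            segments.push((open.clone(), rest[lt..].to_string()));
            return segments;
        };
        let inside = &after[..gt];
        if inside.starts_with('/') {
            // End tag: nothing stays open (a stray end tag is simply dropped).
            open = None;
        } else {
            // Start tag: overwriting `open` implicitly closes the previous tag.
            let name = inside.split_whitespace().next().unwrap_or("");
            open = Some(name.to_string());
        }
        rest = &after[gt + 1..];
    }
    if !rest.is_empty() {
        segments.push((open, rest.to_string()));
    }
    segments
}
```

For the input `<A>hello <B>world</B>`, this sketch attributes `hello ` to `A` and `world` to `B`, because the start of `<B>` replaces `A` as the open tag.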
If a tag never closes, it’s recovered according to its configured span strategy (below).
<tag .../> is treated as a marker annotation at that position (or optionally “annotate next token”, configurable).
KindaXML is annotation-first. Each recognized tag can be configured with a span strategy:
- inline (normal XML-ish): if <tag> ... </tag> is present, annotate the inner range.
- retro_line (great for citations): if <cite ...> is unclosed, annotate the text on the current line before the tag (from the last emitted newline to the tag start), optionally trimming punctuation/whitespace.
Example:
We shipped last week <cite id=1>.
The cite attaches to We shipped last week (not the punctuation).
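The retro_line span can be computed from the tag's byte offset alone. The sketch below is an illustration under assumptions, not the crate's code: `retro_line_span` is a hypothetical name, and "punctuation" is taken to mean ASCII punctuation.

```rust
/// Given the full text and the byte offset where an unclosed tag starts,
/// returns the (start, end) byte range of the text on the same line before
/// the tag, with surrounding whitespace and trailing punctuation trimmed.
/// Hypothetical helper for illustration.
fn retro_line_span(text: &str, tag_start: usize) -> (usize, usize) {
    let before = &text[..tag_start];
    // The line starts just after the last newline, or at the beginning.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let line = &before[line_start..];
    // Drop trailing whitespace and ASCII punctuation before the tag.
    let without_tail =
        line.trim_end_matches(|c: char| c.is_whitespace() || c.is_ascii_punctuation());
    // Drop leading whitespace so the span starts at the first word.
    let core = without_tail.trim_start();
    let start = line_start + (without_tail.len() - core.len());
    (start, start + core.len())
}
```

Applied to `We shipped last week <cite id=1>.` with the offset of `<`, the returned range covers `We shipped last week`. The trailing `.` comes after the tag, so it is outside the span.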
- forward_until_tag: annotate from the end of <tag ...> to the next tag start.
- forward_until_newline: annotate until newline.
- forward_next_token: annotate the next token/word.
- noop: ignore tag if unclosed (marker-only tags).

You instruct the LLM to use a whitelist of recognized tags, but the parser can handle unknown tags in one of three modes:
- strip (default-friendly): drop unknown tag markup, keep inner text
- passthrough: keep unknown tags as literal text
- treat_as_text: don’t parse unknown tags at all; treat <...> as text

KindaXML can support XML’s CDATA form:
<![CDATA[]]>
Inside CDATA, nothing is parsed as tags.
Example:
<note><![CDATA[
Use < and > freely here. Even <fake tags>.
]]></note>
If ]]> is missing, CDATA runs to end-of-document (recovered).
(If you prefer simpler escaping, you can also support \< and \> as literals.)
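The CDATA recovery rule (run to end-of-document when `]]>` is missing) can be sketched as a small splitter. `split_cdata` is a hypothetical helper for illustration, not a kindaxml function; it handles only the first CDATA section in the input.

```rust
/// Splits `input` around its first CDATA section and returns
/// (text before, CDATA body, text after). If the closing "]]>" is missing,
/// the body runs to the end of the input. Returns None when there is no
/// CDATA section. Hypothetical helper for illustration.
fn split_cdata(input: &str) -> Option<(&str, &str, &str)> {
    const OPEN: &str = "<![CDATA[";
    const CLOSE: &str = "]]>";

    let start = input.find(OPEN)?;
    let tail = &input[start + OPEN.len()..];
    match tail.find(CLOSE) {
        // Well-formed section: the body ends at the first "]]>".
        Some(end) => Some((&input[..start], &tail[..end], &tail[end + CLOSE.len()..])),
        // Recovery: no terminator, so the body is everything that remains.
        None => Some((&input[..start], tail, "")),
    }
}
```

The body is returned verbatim, so anything inside it that looks like a tag stays literal text.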
use kindaxml::{parse, ParserConfig, UnknownMode};

fn main() {
    // Start from the defaults and whitelist the tags the LLM may emit.
    let mut cfg = ParserConfig::default();
    cfg.recognized_tags = ["cite", "note"].into_iter().map(String::from).collect();
    cfg.case_sensitive_tags = false;
    cfg.unknown_mode = UnknownMode::Strip;

    let input = "We shipped <cite id=1>last week</cite>.";
    let parsed = parse(input, &cfg);

    // Each segment is a run of text plus the annotations that cover it.
    for segment in parsed.segments {
        println!("{:?} -> {:?}", segment.text, segment.annotations);
    }
}
ParserConfig exposes toggles for unknown tags, per-tag recovery strategies, case sensitivity, punctuation trimming, and auto-close behavior. The default config is conservative and strips unknown tags.
Run the demo with cargo run --example basic to see the original snippets alongside their parsed segments and markers.
Input:
We shipped <cite id="1">last week</cite>.
Output (conceptual):
- We shipped (no annotations)
- last week (annotated: cite{id=1})
- . (no annotations)

Input:
We shipped last week <cite id=1>.
Output:
- We shipped last week (annotated: cite{id=1})
- .

Input:
<cite id='1, 2>Evidence</cite>
Recovered as id="1, 2".
Input:
alpha <note>bravo <cite id=9> charlie
- <note> auto-closes before <cite ...>
- <cite> is unclosed and recovered by its strategy

KindaXML is not a DOM language. If you try to nest, the “auto-close on next tag” rule will flatten it.
Bad idea:
<A>outer <B>inner</B> outer</A>
KindaXML outcome: <A> likely ends before <B>, and </A> may become stray.
Guidance: don’t nest; prefer sibling tags.
Example:
<tag a="x y z b=2>
KindaXML will recover by closing the quote at > and treat the entire remaining text as part of a. This is intentional: recovery is bounded to the tag.
Guidance: keep attributes simple; use CDATA for messy text.
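The quote recovery described above can be illustrated with a small attribute parser that works on the text between `<` and `>`. This is a sketch under assumptions, not kindaxml's parser: `parse_tag_inner` is a hypothetical name, boolean attributes are stored as the string "true", and whitespace around `=` is not handled.

```rust
/// Parses the text between '<' and '>' into a tag name and its attributes.
/// An opening quote with no matching close is treated as closed at the end
/// of the tag. Hypothetical helper for illustration.
fn parse_tag_inner(inner: &str) -> (String, Vec<(String, String)>) {
    let inner = inner.trim();
    let name_end = inner.find(char::is_whitespace).unwrap_or(inner.len());
    let name = inner[..name_end].to_string();
    let mut rest = inner[name_end..].trim_start();
    let mut attrs = Vec::new();

    while !rest.is_empty() {
        // The key runs up to '=' or the next whitespace.
        let key_end = rest
            .find(|c: char| c == '=' || c.is_whitespace())
            .unwrap_or(rest.len());
        let key = rest[..key_end].to_string();
        rest = &rest[key_end..];

        let value = if let Some(after_eq) = rest.strip_prefix('=') {
            match after_eq.chars().next() {
                // Quoted value: read up to the matching quote.
                Some(q @ ('"' | '\'')) => {
                    let body = &after_eq[1..];
                    match body.find(q) {
                        Some(end) => {
                            rest = &body[end + 1..];
                            body[..end].to_string()
                        }
                        // Recovery: no closing quote, so the value takes
                        // everything up to the end of the tag.
                        None => {
                            rest = "";
                            body.to_string()
                        }
                    }
                }
                // Unquoted value: read up to the next whitespace.
                _ => {
                    let end = after_eq.find(char::is_whitespace).unwrap_or(after_eq.len());
                    rest = &after_eq[end..];
                    after_eq[..end].to_string()
                }
            }
        } else {
            // Bare attribute: boolean, implies true.
            "true".to_string()
        };
        attrs.push((key, value));
        rest = rest.trim_start();
    }
    (name, attrs)
}
```

With the inner text `tag a="x y z b=2`, the sketch yields a single attribute `a` whose value is `x y z b=2`, which shows why an unclosed quote swallows every later attribute in the same tag.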
Because auto-close flattens structure, you may get stray </tag>. By default, recognized stray end tags are dropped; unknown ones can be passed through (configurable).
Tell the model:
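- Use only the whitelisted tags: <cite> <note> <todo> <risk> ...
- Attach citations after the statement: ... statement <cite id=1>.
- Wrap text that contains < or > in CDATA: <![CDATA[ ... ]]>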