kindaxml

Crates.io: kindaxml
lib.rs: kindaxml
version: 0.1.0
created_at: 2025-12-05 13:39:32.035961+00
updated_at: 2025-12-05 13:39:32.035961+00
description: Close-enough, XML-ish annotation parsing with deterministic recovery for LLM output.
homepage: https://github.com/soraxas/kindaxml
repository: https://github.com/soraxas/kindaxml
id: 1968214
size: 78,395
author: Tin Lai (soraxas)


KindaXML (kindaxml) — close-enough, XML-ish markup for LLM output

KindaXML is an XML-inspired annotation DSL designed for LLM-generated text. It keeps the familiar <tag attr=...> shape, but the parser is tolerant: it recovers from missing end tags, missing quotes, and other common “almost XML” mistakes.

KindaXML is not XML (and not meant to be parsed by strict XML parsers). Think: well-formed-ish.


Why KindaXML?

LLMs are good at emitting XML-like text, but strict XML breaks easily. KindaXML aims to be:

  • LLM-friendly: angle brackets and attributes feel natural in prompts.
  • Deterministic recovery: malformed input still produces predictable output.
  • Annotation-first: tags annotate spans of text rather than building a complex DOM.
  • Configurable: recognized tags are whitelisted, unknown tags can be stripped or preserved.

Design: Annotation DSL (Option A) + a pinch of “blocks”

KindaXML’s primary output is a stream of text segments, each optionally annotated:

[
  {"text": "We shipped last week", "ann": [{"tag":"cite","attrs":{"id":"1"}}]},
  {"text": ". ", "ann": []},
  {"text": "Details", "ann": [{"tag":"note","attrs":{}}]}
]
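
The segment stream above maps naturally onto two small Rust types. The following is only a sketch of that shape: the names Annotation, Segment, plain_text, and example_segments are hypothetical and mirror the JSON keys, not the crate's actual definitions.

```rust
use std::collections::BTreeMap;

/// One annotation attached to a span of text (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Annotation {
    pub tag: String,
    pub attrs: BTreeMap<String, String>,
}

/// A run of text plus the annotations that cover it (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub ann: Vec<Annotation>,
}

/// Concatenate the text of all segments, ignoring annotations.
pub fn plain_text(segments: &[Segment]) -> String {
    segments.iter().map(|s| s.text.as_str()).collect()
}

/// Build the three-segment example from the JSON listing.
pub fn example_segments() -> Vec<Segment> {
    let cite = Annotation {
        tag: "cite".to_string(),
        attrs: BTreeMap::from([("id".to_string(), "1".to_string())]),
    };
    let note = Annotation {
        tag: "note".to_string(),
        attrs: BTreeMap::new(),
    };
    vec![
        Segment { text: "We shipped last week".to_string(), ann: vec![cite] },
        Segment { text: ". ".to_string(), ann: vec![] },
        Segment { text: "Details".to_string(), ann: vec![note] },
    ]
}
```

Because the output is a flat list, recovering the unannotated text is a single pass over the segments, with no tree traversal.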

KindaXML intentionally avoids deep nesting. In fact, it auto-closes open tags when the next tag begins, which keeps structures shallow and robust.


Syntax overview

Tags

  • Start tag: <tag ...>
  • End tag: </tag>
  • Self-closing tag: <tag .../>

Tag names match:

[A-Za-z][A-Za-z0-9_\-:.]*
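
The pattern can be checked without a regex dependency. This is a minimal sketch using only the standard library; is_valid_tag_name is a hypothetical helper, not a function exported by the crate.

```rust
/// Returns true if `name` matches [A-Za-z][A-Za-z0-9_\-:.]*
pub fn is_valid_tag_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first character must be an ASCII letter.
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() => {}
        _ => return false,
    }
    // The remaining characters may be letters, digits, or one of _ - : .
    chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | ':' | '.'))
}
```

Under this rule cite and ns:tag-1.x are valid names, while 1tag and the empty string are not.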

Attributes

Supported forms:

  • a="x"
  • a='x'
  • a=x (unquoted)
  • a (boolean attribute; implies true)
  • Whitespace around = is allowed.
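
To make the four attribute forms concrete, here is a small standard-library sketch of an attribute scanner. It is an illustration under assumptions, not the crate's parser: parse_attrs is a hypothetical name, the input is assumed to be the text between the tag name and the closing > (with any trailing / already removed), and a boolean attribute is stored as the string "true".

```rust
use std::collections::BTreeMap;

/// Parse `a="x"`, `a='x'`, `a=x`, and bare `a` forms from a tag's attribute text.
pub fn parse_attrs(src: &str) -> BTreeMap<String, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut attrs = BTreeMap::new();
    let mut i = 0;
    while i < chars.len() {
        // Skip whitespace between attributes.
        while i < chars.len() && chars[i].is_whitespace() {
            i += 1;
        }
        // Read the attribute name up to whitespace or '='.
        let start = i;
        while i < chars.len() && !chars[i].is_whitespace() && chars[i] != '=' {
            i += 1;
        }
        let name: String = chars[start..i].iter().collect();
        // Look past optional whitespace for '='.
        let mut j = i;
        while j < chars.len() && chars[j].is_whitespace() {
            j += 1;
        }
        if j < chars.len() && chars[j] == '=' {
            j += 1;
            while j < chars.len() && chars[j].is_whitespace() {
                j += 1;
            }
            let value: String = if j < chars.len() && (chars[j] == '"' || chars[j] == '\'') {
                // Quoted value: read to the matching quote, or to the end if it never closes.
                let quote = chars[j];
                j += 1;
                let vstart = j;
                while j < chars.len() && chars[j] != quote {
                    j += 1;
                }
                let v = chars[vstart..j].iter().collect();
                if j < chars.len() {
                    j += 1; // skip the closing quote
                }
                v
            } else {
                // Unquoted value: read to the next whitespace.
                let vstart = j;
                while j < chars.len() && !chars[j].is_whitespace() {
                    j += 1;
                }
                chars[vstart..j].iter().collect()
            };
            if !name.is_empty() {
                attrs.insert(name, value);
            }
            i = j;
        } else if !name.is_empty() {
            // Bare attribute: implies true.
            attrs.insert(name, "true".to_string());
        }
    }
    attrs
}
```

For example, parse_attrs("id=\"1\" kind='x y' n = 3 flag") yields four entries, with flag mapped to "true".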

Parsing rules (the “close enough” part)

1) Tag boundary detection

A tag begins at < and ends at the first >.

If a quote starts inside the tag but never closes, it is implicitly closed at >.

Example:

<cite id='1,2>text</cite>

Parses as:

  • tag = cite
  • id = "1,2" (quote recovered)
  • inner text = text
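
The boundary rule and the quote recovery can be sketched in a few lines of standard-library Rust. The names RawTag, split_tag, and recover_quoted are hypothetical and exist only for this illustration; the crate's internals may be organized differently.

```rust
/// The pieces of a tag found at the start of some input (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct RawTag<'a> {
    pub name: &'a str,
    pub attr_src: &'a str,
    pub rest: &'a str,
}

/// Split `<name attrs>rest` at the first '>', regardless of open quotes.
pub fn split_tag(input: &str) -> Option<RawTag<'_>> {
    let body = input.strip_prefix('<')?;
    let end = body.find('>')?; // the tag always ends at the first '>'
    let inner = &body[..end];
    let name_len = inner.find(char::is_whitespace).unwrap_or(inner.len());
    Some(RawTag {
        name: &inner[..name_len],
        attr_src: inner[name_len..].trim(),
        rest: &body[end + 1..],
    })
}

/// Strip the quotes from a raw attribute value; a missing closing quote is tolerated.
pub fn recover_quoted(raw: &str) -> &str {
    let mut chars = raw.chars();
    let quote = match chars.next() {
        Some(c) if c == '"' || c == '\'' => c,
        _ => return raw, // unquoted value: nothing to recover
    };
    let inner = chars.as_str();
    inner.strip_suffix(quote).unwrap_or(inner)
}
```

Because the tag is cut at the first > before attributes are examined, an unclosed quote can never consume text outside the tag.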

2) Auto-close on encountering another tag

If a start tag is open and the parser encounters the next <something...>, the current tag is implicitly closed immediately before that next <.

This is the core rule that prevents runaway structures.

Example:

<A>hello <B>world</B>

<A> auto-closes before <B>.
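
One way to picture the rule is a parser that tracks at most one open tag. The sketch below assumes the input has already been tokenized and that an open tag annotates the text that follows it; Token and flatten are hypothetical names, and the real crate may instead apply a per-tag span strategy to the auto-closed tag.

```rust
/// A pre-tokenized piece of input (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
    Text(&'a str),
}

/// Pair each text run with the single tag open at that point, if any.
pub fn flatten<'a>(tokens: &[Token<'a>]) -> Vec<(&'a str, Option<&'a str>)> {
    let mut out = Vec::new();
    let mut open: Option<&'a str> = None;
    for tok in tokens {
        match *tok {
            // A new start tag implicitly closes whatever was open.
            Token::Open(name) => open = Some(name),
            // A matching end tag closes the open tag; a stray one is dropped.
            Token::Close(name) => {
                if open == Some(name) {
                    open = None;
                }
            }
            Token::Text(t) => out.push((t, open)),
        }
    }
    out
}
```

With a single slot for the open tag there is no stack to unwind, so nesting depth never exceeds one.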

3) Missing end tags are tolerated

If a tag never closes, it’s recovered according to its configured span strategy (below).

4) Self-closing tags

By default, <tag .../> is treated as a marker annotation at that position. It can be configured to annotate the next token instead.


Span strategies (how KindaXML decides what a tag annotates)

KindaXML is annotation-first. Each recognized tag can be configured with a span strategy:

inline (normal XML-ish)

If <tag> ... </tag> is present, annotate the inner range.

retro_line (great for citations)

If <cite ...> is unclosed, annotate the text on the current line before the tag (from the last newline to the tag start), optionally trimming trailing punctuation and whitespace.

Example:

We shipped last week <cite id=1>.

The cite attaches to We shipped last week (not the punctuation).
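
The retro_line span can be computed from the byte offset of the tag alone. This is a sketch under assumptions: retro_line_span is a hypothetical helper, tag_start is the byte offset of the tag's <, and trimming removes surrounding whitespace plus trailing ASCII punctuation.

```rust
/// Byte range of the text a retro_line tag annotates: the current line up to the tag.
pub fn retro_line_span(text: &str, tag_start: usize) -> (usize, usize) {
    let before = &text[..tag_start];
    // The line starts just after the last newline, or at the start of the text.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let line = &before[line_start..];
    // Trim trailing whitespace and punctuation so the span ends on a word.
    let end = line_start
        + line
            .trim_end_matches(|c: char| c.is_whitespace() || c.is_ascii_punctuation())
            .len();
    // Skip leading whitespace on the line.
    let start = line_start + (line.len() - line.trim_start().len());
    (start.min(end), end)
}
```

Only text before the tag is considered, so the full stop that follows the tag in the example stays unannotated.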

Other useful strategies (optional)

  • forward_until_tag: annotate from the end of <tag ...> to the next tag start.
  • forward_until_newline: annotate until newline.
  • forward_next_token: annotate the next token/word.
  • noop: ignore tag if unclosed (marker-only tags).
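
The forward strategies differ only in where the annotated span stops. The sketch below expresses that over the text following the tag's >; SpanStrategy and forward_span are hypothetical names for illustration, and the newline variant here stops at the newline only, as the list describes.

```rust
/// Recovery strategies for an unclosed tag (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanStrategy {
    ForwardUntilTag,
    ForwardUntilNewline,
    ForwardNextToken,
    Noop,
}

/// Given the text after the tag's '>', return the slice the tag annotates.
pub fn forward_span(rest: &str, strategy: SpanStrategy) -> &str {
    match strategy {
        // Everything up to the next '<', or to the end of input.
        SpanStrategy::ForwardUntilTag => &rest[..rest.find('<').unwrap_or(rest.len())],
        // Everything up to the next newline, or to the end of input.
        SpanStrategy::ForwardUntilNewline => &rest[..rest.find('\n').unwrap_or(rest.len())],
        // The next whitespace-delimited word.
        SpanStrategy::ForwardNextToken => rest.split_whitespace().next().unwrap_or(""),
        // Marker-only: annotate nothing.
        SpanStrategy::Noop => "",
    }
}
```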

Unknown tags

Prompts should restrict the LLM to a whitelist of recognized tags. When an unknown tag appears anyway, the parser handles it in one of three modes:

  • strip (the default): drop unknown tag markup, keep inner text
  • passthrough: keep unknown tags as literal text
  • treat_as_text: don’t parse unknown tags at all; treat <...> as text
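
The strip mode is the easiest to illustrate in isolation. The sketch below removes the markup of unknown tags from a string and leaves recognized tags and inner text in place. It rests on assumptions: strip_unknown_tags is a hypothetical helper, names are compared case-insensitively, and anything between < and > that does not start with an ASCII letter is kept as literal text.

```rust
/// Drop the markup of tags not in `recognized`, keeping their inner text.
pub fn strip_unknown_tags(input: &str, recognized: &[&str]) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(lt) = rest.find('<') {
        out.push_str(&rest[..lt]);
        let after = &rest[lt..];
        match after.find('>') {
            Some(gt) => {
                let markup = &after[..=gt];
                // Extract the tag name from "<name ...>", "</name>", or "<name/>".
                let name = markup[1..gt]
                    .trim_start_matches('/')
                    .split(|c: char| c.is_whitespace() || c == '/')
                    .next()
                    .unwrap_or("");
                let looks_like_tag = name.chars().next().map_or(false, |c| c.is_ascii_alphabetic());
                let known = recognized.iter().any(|r| r.eq_ignore_ascii_case(name));
                // Keep recognized tags and non-tag text such as "1 < 2 > 0".
                if !looks_like_tag || known {
                    out.push_str(markup);
                }
                rest = &after[gt + 1..];
            }
            None => {
                // A '<' with no closing '>' is plain text.
                out.push_str(after);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

A passthrough mode would keep the unknown markup in the output instead of dropping it.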

Escaping / literal text (CDATA support)

KindaXML can support XML’s CDATA form:

  • Start: <![CDATA[
  • End: ]]>

Inside CDATA, nothing is parsed as tags.

Example:

<note><![CDATA[
Use < and > freely here. Even <fake tags>.
]]></note>

If ]]> is missing, CDATA runs to end-of-document (recovered).
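
The CDATA rule, including recovery when the terminator is missing, fits in one small function. This is a standard-library sketch; split_cdata is a hypothetical name, not part of the crate's API.

```rust
/// If `input` starts with a CDATA section, return (literal text, remaining input).
pub fn split_cdata(input: &str) -> Option<(&str, &str)> {
    const OPEN: &str = "<![CDATA[";
    const CLOSE: &str = "]]>";
    let body = input.strip_prefix(OPEN)?;
    match body.find(CLOSE) {
        // Normal case: the literal text ends at the first "]]>".
        Some(end) => Some((&body[..end], &body[end + CLOSE.len()..])),
        // Recovery: no terminator, so the section runs to the end of the document.
        None => Some((body, "")),
    }
}
```

The returned literal text is never scanned for tags, so something like <fake tags> inside it stays untouched.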

(If you prefer simpler escaping, you can also support \< and \> as literals.)


Using the Rust crate

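use kindaxml::{parse, ParserConfig, UnknownMode};

fn main() {
    // Start from the defaults, then whitelist the tags the LLM may emit.
    let mut cfg = ParserConfig::default();
    cfg.recognized_tags = ["cite", "note"].into_iter().map(String::from).collect();
    cfg.case_sensitive_tags = false; // match tag names case-insensitively
    cfg.unknown_mode = UnknownMode::Strip; // drop unknown markup, keep its inner text

    let input = "We shipped <cite id=1>last week</cite>.";
    let parsed = parse(input, &cfg);

    // Each segment is a run of text plus the annotations that cover it.
    for segment in parsed.segments {
        println!("{:?} -> {:?}", segment.text, segment.annotations);
    }
}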

ParserConfig exposes toggles for unknown tags, per-tag recovery strategies, case sensitivity, punctuation trimming, and auto-close behavior. The default config is conservative and strips unknown tags.


Examples

Run the demo with cargo run --example basic to see the original snippets alongside their parsed segments and markers.

Closed tag (inline span)

Input:

We shipped <cite id="1">last week</cite>.

Output (conceptual):

  • We shipped (no annotations)
  • last week (annotated: cite{id=1})
  • . (no annotations)

Unclosed cite (retro_line)

Input:

We shipped last week <cite id=1>.

Output:

  • We shipped last week (annotated: cite{id=1})
  • .
  • (tag removed)

Broken quote recovery

Input:

<cite id='1, 2>Evidence</cite>

Recovered as id="1, 2": the unclosed quote is implicitly closed at >.

Auto-close on next tag

Input:

alpha <note>bravo <cite id=9> charlie

Output:

  • <note> auto-closes before <cite ...>
  • <cite> is unclosed and recovered by its strategy

Failure cases / limitations (by design)

Nesting will not behave like XML

KindaXML is not a DOM language. If you try to nest, the “auto-close on next tag” rule will flatten it.

Bad idea:

<A>outer <B>inner</B> outer</A>

KindaXML outcome: <A> likely ends before <B>, and </A> may become stray.

Guidance: don’t nest; prefer sibling tags.

Attribute ambiguity in severely malformed tags

Example:

<tag a="x y z b=2>

KindaXML recovers by closing the quote at > and treats everything after the opening quote as the value of a, so a = "x y z b=2" and no separate b attribute is produced. This is intentional: recovery is bounded to the tag.

Guidance: keep attributes simple; use CDATA for messy text.

Stray end tags

Because auto-close flattens structure, you may get stray </tag>. By default, recognized stray end tags are dropped; unknown ones can be passed through (configurable).


Recommended prompting style for LLMs

Tell the model:

  • Use only these tags: <cite> <note> <todo> <risk> ... (whitelist)
  • Do not nest tags
  • Prefer postfix citations: ... statement <cite id=1>.
  • Use CDATA for code or text containing < or >: <![CDATA[ ... ]]>
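
These instructions can be generated from the same whitelist that configures the parser, so the prompt and the parser cannot drift apart. The helper below is a hypothetical sketch; the exact wording of the prompt is an assumption and should be tuned for the model in use.

```rust
/// Build prompt instructions for a given tag whitelist (wording is illustrative).
pub fn build_tag_instructions(tags: &[&str]) -> String {
    // Render the whitelist as "<cite> <note> ...".
    let list = tags
        .iter()
        .map(|t| format!("<{t}>"))
        .collect::<Vec<_>>()
        .join(" ");
    format!(
        "Use only these tags: {list}\n\
         Do not nest tags.\n\
         Prefer postfix citations, e.g. statement <cite id=1>.\n\
         Wrap code or text containing < or > in <![CDATA[ ... ]]>."
    )
}
```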