kindaxml

Crates.io	kindaxml
lib.rs	kindaxml
version	0.1.0
created_at	2025-12-05 13:39:32.035961+00
updated_at	2025-12-05 13:39:32.035961+00
description	Close-enough, XML-ish annotation parsing with deterministic recovery for LLM output.
homepage	https://github.com/soraxas/kindaxml
repository	https://github.com/soraxas/kindaxml
id	1968214
size	78,395

Tin Lai (soraxas)

KindaXML (`kindaxml`) — close-enough, XML-ish markup for LLM output

KindaXML is an XML-inspired annotation DSL designed for LLM-generated text. It keeps the familiar <tag attr=...> shape, but the parser is tolerant: it recovers from missing end tags, missing quotes, and other common “almost XML” mistakes.

KindaXML is not XML (and not meant to be parsed by strict XML parsers). Think: well-formed-ish.

Why KindaXML?

LLMs are good at emitting XML-like text, but strict XML breaks easily. KindaXML aims to be:

LLM-friendly: angle brackets and attributes feel natural in prompts.
Deterministic recovery: malformed input still produces predictable output.
Annotation-first: tags annotate spans of text rather than building a complex DOM.
Configurable: recognized tags are whitelisted; unknown tags can be stripped or preserved.

Design: Annotation DSL (Option A) + a pinch of “blocks”

KindaXML’s primary output is a stream of text segments, each optionally annotated:

[
  {"text": "We shipped last week", "ann": [{"tag":"cite","attrs":{"id":"1"}}]},
  {"text": ". ", "ann": []},
  {"text": "Details", "ann": [{"tag":"note","attrs":{}}]}
]
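A segment stream like the one above could be modeled in Rust roughly as follows. This is only a sketch: the type names `Segment` and `Annotation` and the two helper functions are assumptions made for illustration, not the crate's actual definitions.

```rust
use std::collections::BTreeMap;

/// One annotation attached to a span of text (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Annotation {
    pub tag: String,
    pub attrs: BTreeMap<String, String>,
}

/// A run of text plus the annotations that cover it (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub annotations: Vec<Annotation>,
}

/// Concatenate the text of every segment, ignoring all annotations.
pub fn plain_text(segments: &[Segment]) -> String {
    segments.iter().map(|s| s.text.as_str()).collect()
}

/// Collect the text of every segment that carries an annotation named `tag`.
pub fn texts_tagged<'a>(segments: &'a [Segment], tag: &str) -> Vec<&'a str> {
    segments
        .iter()
        .filter(|s| s.annotations.iter().any(|a| a.tag == tag))
        .map(|s| s.text.as_str())
        .collect()
}
```

Because the output is a flat list rather than a tree, recovering the plain text or pulling out every cited span is a single pass over the segments.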

KindaXML intentionally avoids deep nesting. In fact, it auto-closes open tags when the next tag begins, which keeps structures shallow and robust.

Syntax overview

Attributes

Supported forms:

a="x"
a='x'
a=x (unquoted)
a (boolean attribute; implies true)
Whitespace around = is allowed.
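To make the four attribute forms concrete, here is a minimal sketch of a tolerant attribute reader. The function `parse_attrs` is hypothetical and is not the crate's parser; it assumes its input is the text between the tag name and the closing >, and it stores a boolean attribute as the string "true".

```rust
use std::collections::BTreeMap;

/// Parse the attribute portion of a tag, accepting a="x", a='x', a=x and
/// bare boolean names. An unterminated quote runs to the end of the input.
pub fn parse_attrs(src: &str) -> BTreeMap<String, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut attrs = BTreeMap::new();
    let mut i = 0;

    while i < chars.len() {
        // Skip whitespace between attributes.
        while i < chars.len() && chars[i].is_whitespace() {
            i += 1;
        }
        // Read the attribute name.
        let start = i;
        while i < chars.len() && !chars[i].is_whitespace() && chars[i] != '=' {
            i += 1;
        }
        let name: String = chars[start..i].iter().collect();

        // Look past optional whitespace for an `=`.
        let mut j = i;
        while j < chars.len() && chars[j].is_whitespace() {
            j += 1;
        }
        if j >= chars.len() || chars[j] != '=' {
            // No `=`: a boolean attribute that implies true.
            if !name.is_empty() {
                attrs.insert(name, "true".to_string());
            }
            continue;
        }
        // Skip the `=` and any whitespace after it.
        i = j + 1;
        while i < chars.len() && chars[i].is_whitespace() {
            i += 1;
        }
        // Read a quoted or an unquoted value.
        let value: String = if i < chars.len() && (chars[i] == '"' || chars[i] == '\'') {
            let quote = chars[i];
            i += 1;
            let vstart = i;
            while i < chars.len() && chars[i] != quote {
                i += 1;
            }
            let v: String = chars[vstart..i].iter().collect();
            i += 1; // step over the closing quote (or past the end if it is missing)
            v
        } else {
            let vstart = i;
            while i < chars.len() && !chars[i].is_whitespace() {
                i += 1;
            }
            chars[vstart..i].iter().collect()
        };
        if !name.is_empty() {
            attrs.insert(name, value);
        }
    }
    attrs
}
```

With this sketch, an unterminated quote simply swallows the rest of the attribute text, which keeps the recovery bounded to the tag.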

Parsing rules (the “close enough” part)

1) Tag boundary detection

A tag begins at < and ends at the first >.

If a quote starts inside the tag but never closes, it is implicitly closed at >.

Example:

<cite id='1,2>text</cite>

Parses as:

tag = cite
id = "1,2" (quote recovered)
inner text = text
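The boundary rule can be sketched in a few lines. The names `RawTag` and `scan_tag` below are hypothetical and only illustrate the rule as stated; the sketch handles start tags and leaves attribute parsing to a later step.

```rust
/// The pieces of one start tag, borrowed from the input (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct RawTag<'a> {
    pub name: &'a str,
    /// Raw attribute text between the tag name and `>`.
    pub attr_src: &'a str,
    /// Byte offset just past the closing `>`.
    pub end: usize,
}

/// Scan the tag that starts at byte offset `lt`, which must point at `<`.
/// The tag ends at the first `>`, even when a quote is still open.
pub fn scan_tag(input: &str, lt: usize) -> Option<RawTag<'_>> {
    let rest = input.get(lt..)?;
    if !rest.starts_with('<') {
        return None;
    }
    // No `>` at all means there is no tag to recover here.
    let gt = rest.find('>')?;
    let body = rest[1..gt].trim_start();
    let name_len = body.find(char::is_whitespace).unwrap_or(body.len());
    Some(RawTag {
        name: &body[..name_len],
        attr_src: body[name_len..].trim(),
        end: lt + gt + 1,
    })
}
```

Because the scan never looks inside quotes, a missing closing quote cannot make the tag swallow the text that follows it.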

2) Auto-close on encountering another tag

If a start tag is open and the parser encounters the next <something...>, the current tag is implicitly closed immediately before that next <.

This is the core rule that prevents runaway structures.

Example:

<A>hello <B>world</B>

<A> auto-closes before <B>.
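One way to picture the auto-close rule is a scanner that keeps at most one tag open at a time. The function `flatten` below is a hypothetical sketch, not the crate's implementation: it ignores attributes, treats every <...> as a tag, and drops stray end tags.

```rust
/// Split `input` into (text, enclosing tag) pairs with at most one tag open
/// at a time. A new start tag implicitly closes the one that is open.
pub fn flatten(input: &str) -> Vec<(String, Option<String>)> {
    let mut out = Vec::new();
    let mut open: Option<String> = None;
    let mut rest = input;

    while !rest.is_empty() {
        // Emit the text up to the next `<` under the currently open tag.
        let lt = rest.find('<').unwrap_or(rest.len());
        if lt > 0 {
            out.push((rest[..lt].to_string(), open.clone()));
        }
        rest = &rest[lt..];
        if rest.is_empty() {
            break;
        }
        // A tag runs to the first `>`; without one, keep the remainder as text.
        let gt = match rest.find('>') {
            Some(i) => i,
            None => {
                out.push((rest.to_string(), open.clone()));
                break;
            }
        };
        let body = rest[1..gt].trim();
        if let Some(name) = body.strip_prefix('/') {
            // End tag: close only when it matches; stray end tags are dropped.
            if open.as_deref() == Some(name.trim()) {
                open = None;
            }
        } else {
            // Start tag: replacing `open` is the implicit auto-close.
            let name = body.split_whitespace().next().unwrap_or("");
            open = Some(name.to_string());
        }
        rest = &rest[gt + 1..];
    }
    out
}
```

In this sketch, the example input yields "hello " under A and "world" under B, so the output never nests.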

3) Missing end tags are tolerated

If a tag never closes, it’s recovered according to its configured span strategy (below).

4) Self-closing tags

<tag .../> is treated as a marker annotation at that position. It can optionally be configured to annotate the next token instead.

Span strategies (how KindaXML decides what a tag annotates)

KindaXML is annotation-first. Each recognized tag can be configured with a span strategy:

`inline` (normal XML-ish)

If <tag> ... </tag> is present, annotate the inner range.

`retro_line` (great for citations)

If <cite ...> is unclosed, annotate the text on the current line before the tag (from the last newline to the tag start), optionally trimming punctuation and whitespace.

Example:

We shipped last week <cite id=1>.

The cite attaches to We shipped last week (not the punctuation).
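The span computation for this strategy can be sketched as below. The function `retro_line_span` is hypothetical, and it assumes trimming is enabled: surrounding whitespace and trailing ASCII punctuation before the tag are excluded from the span.

```rust
use std::ops::Range;

/// Return the byte range that an unclosed tag starting at `tag_start` would
/// annotate: from the previous newline up to the tag, with surrounding
/// whitespace and trailing ASCII punctuation trimmed.
/// `tag_start` must be a valid char boundary inside `text`.
pub fn retro_line_span(text: &str, tag_start: usize) -> Range<usize> {
    let before = &text[..tag_start];
    // The current line starts just after the last newline, or at offset 0.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let line = &before[line_start..];
    // Drop trailing whitespace and punctuation, then leading whitespace.
    let head = line.trim_end_matches(|c: char| c.is_whitespace() || c.is_ascii_punctuation());
    let core = head.trim_start();
    let end = line_start + head.len();
    let start = end - core.len();
    start..end
}
```

Slicing the input with the returned range gives the annotated text; a line that holds only whitespace or punctuation yields an empty range.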

Other useful strategies (optional)

forward_until_tag: annotate from the end of <tag ...> to the next tag start.
forward_until_newline: annotate until newline.
forward_next_token: annotate the next token/word.
noop: ignore tag if unclosed (marker-only tags).
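The three forward strategies differ only in where the span stops. A minimal sketch, assuming a hypothetical `Forward` enum and `forward_span` function (neither is taken from the crate), with a token defined here as a run of non-whitespace characters that also stops at <:

```rust
use std::ops::Range;

/// Where a forward-looking span stops (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Forward {
    UntilTag,
    UntilNewline,
    NextToken,
}

/// Return the byte range annotated by a tag whose markup ends at `tag_end`.
/// `tag_end` must be a valid char boundary inside `text`.
pub fn forward_span(text: &str, tag_end: usize, how: Forward) -> Range<usize> {
    let rest = &text[tag_end..];
    match how {
        // Everything up to the start of the next tag, or to end of input.
        Forward::UntilTag => {
            let len = rest.find('<').unwrap_or(rest.len());
            tag_end..tag_end + len
        }
        // Everything up to the end of the current line.
        Forward::UntilNewline => {
            let len = rest.find('\n').unwrap_or(rest.len());
            tag_end..tag_end + len
        }
        // Skip leading whitespace, then take one whitespace-delimited token.
        Forward::NextToken => {
            let start = tag_end + (rest.len() - rest.trim_start().len());
            let token = &text[start..];
            let len = token
                .find(|c: char| c.is_whitespace() || c == '<')
                .unwrap_or(token.len());
            start..start + len
        }
    }
}
```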

Unknown tags

You instruct the LLM to use a whitelist of recognized tags, but the parser can handle unknown tags in one of three modes:

strip (the default): drop unknown tag markup, keep inner text
passthrough: keep unknown tags as literal text
treat_as_text: don’t parse unknown tags at all; treat <...> as text
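As an illustration of strip mode only, the sketch below removes the markup of tags that are not on the whitelist and keeps everything else. The function `strip_unknown` is hypothetical and works on plain strings rather than on the crate's segments; matching tag names case-insensitively is also an assumption.

```rust
/// Remove the markup of tags whose name is not in `recognized`, keeping the
/// text between them. Recognized tags are copied through unchanged.
pub fn strip_unknown(input: &str, recognized: &[&str]) -> String {
    let mut out = String::new();
    let mut rest = input;
    loop {
        // Copy plain text up to the next `<`.
        let lt = match rest.find('<') {
            Some(i) => i,
            None => break,
        };
        out.push_str(&rest[..lt]);
        // A `<` with no `>` after it is kept as literal text.
        let gt = match rest[lt..].find('>') {
            Some(i) => lt + i,
            None => {
                rest = &rest[lt..];
                break;
            }
        };
        // Extract the tag name from start, end and self-closing tags.
        let body = rest[lt + 1..gt].trim();
        let name = body
            .trim_start_matches('/')
            .split_whitespace()
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if recognized.iter().any(|r| r.eq_ignore_ascii_case(name)) {
            out.push_str(&rest[lt..=gt]);
        }
        rest = &rest[gt + 1..];
    }
    out.push_str(rest);
    out
}
```

The other two modes would leave the unknown markup in the output as text instead of dropping it.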

Escaping / literal text (CDATA support)

KindaXML can support XML’s CDATA form:

Start: <![CDATA[
End: ]]>

Inside CDATA, nothing is parsed as tags.

Example:

<note><![CDATA[
Use < and > freely here. Even <fake tags>.
]]></note>

If ]]> is missing, CDATA runs to end-of-document (recovered).
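CDATA handling can run as a pre-pass that separates literal chunks from text that still needs tag scanning. The function `split_cdata` below is a hypothetical sketch of that idea, including the recovery for a missing ]]>.

```rust
/// Split `input` into (text, is_literal) chunks. Text inside
/// `<![CDATA[ ... ]]>` is returned with `is_literal = true` and must not be
/// scanned for tags. A missing `]]>` extends the section to end of input.
pub fn split_cdata(input: &str) -> Vec<(&str, bool)> {
    const OPEN: &str = "<![CDATA[";
    const CLOSE: &str = "]]>";
    let mut chunks = Vec::new();
    let mut rest = input;

    while let Some(start) = rest.find(OPEN) {
        // Text before the CDATA section is ordinary, parseable text.
        if start > 0 {
            chunks.push((&rest[..start], false));
        }
        let body = &rest[start + OPEN.len()..];
        match body.find(CLOSE) {
            Some(end) => {
                chunks.push((&body[..end], true));
                rest = &body[end + CLOSE.len()..];
            }
            None => {
                // Recovery: an unterminated section runs to end of document.
                chunks.push((body, true));
                rest = "";
            }
        }
    }
    if !rest.is_empty() {
        chunks.push((rest, false));
    }
    chunks
}
```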

(If you prefer simpler escaping, you can also support \< and \> as literals.)

Using the Rust crate

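use kindaxml::{parse, ParserConfig, UnknownMode};

fn main() {
    // Start from the defaults, then whitelist the tags the parser should recognize.
    let mut cfg = ParserConfig::default();
    cfg.recognized_tags = ["cite", "note"].into_iter().map(String::from).collect();
    // Match tag names case-insensitively and drop the markup of unknown tags.
    cfg.case_sensitive_tags = false;
    cfg.unknown_mode = UnknownMode::Strip;

    let input = "We shipped <cite id=1>last week</cite>.";
    let parsed = parse(input, &cfg);

    // Each segment is a run of text plus the annotations that cover it.
    for segment in parsed.segments {
        println!("{:?} -> {:?}", segment.text, segment.annotations);
    }
}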

ParserConfig exposes toggles for unknown tags, per-tag recovery strategies, case sensitivity, punctuation trimming, and auto-close behavior. The default config is conservative and strips unknown tags.

Examples

Run the demo with cargo run --example basic to see the original snippets alongside their parsed segments and markers.

Closed tag (inline span)

Input:

We shipped <cite id="1">last week</cite>.

Output (conceptual):

We shipped (no annotations)
last week (annotated: cite{id=1})
. (no annotations)

Unclosed cite (retro_line)

Input:

We shipped last week <cite id=1>.

Output:

We shipped last week (annotated: cite{id=1})
. (no annotations)
(the <cite> markup itself is removed)

Broken quote recovery

Input:

<cite id='1, 2>Evidence</cite>

Recovered as id="1, 2": the unclosed quote is closed at >.

Auto-close on next tag

Input:

alpha <note>bravo <cite id=9> charlie

<note> auto-closes before <cite ...>
<cite> is unclosed and recovered by its strategy

Failure cases / limitations (by design)

Nesting will not behave like XML

KindaXML is not a DOM language. If you try to nest, the “auto-close on next tag” rule will flatten it.

Bad idea:

<A>outer <B>inner</B> outer</A>

KindaXML outcome: <A> likely ends before <B>, and </A> may become stray.

Guidance: don’t nest; prefer sibling tags.

Attribute ambiguity in severely malformed tags

Example:

<tag a="x y z b=2>

KindaXML recovers by closing the quote at > and treats the rest of the tag (x y z b=2) as the value of a, so b is never parsed as a separate attribute. This is intentional: recovery is bounded to the tag.

Guidance: keep attributes simple; use CDATA for messy text.

Stray end tags

Because auto-close flattens structure, you may get stray </tag>. By default, recognized stray end tags are dropped; unknown ones can be passed through (configurable).

Recommended prompting style for LLMs

Tell the model:

Use only these tags: <cite> <note> <todo> <risk> ... (whitelist)
Do not nest tags
Prefer postfix citations: ... statement <cite id=1>.
Use CDATA for code or text containing < or >: <![CDATA[ ... ]]>
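If the whitelist lives in code, the instructions can be generated from it so that the prompt and the parser configuration stay in sync. The helper below is a hypothetical sketch; the function name and the exact wording of the prompt are assumptions.

```rust
/// Build a short instruction block for an LLM from a tag whitelist.
pub fn kindaxml_instructions(tags: &[&str]) -> String {
    // Render the whitelist as `<cite> <note> ...`.
    let list = tags
        .iter()
        .map(|t| format!("<{t}>"))
        .collect::<Vec<_>>()
        .join(" ");
    format!(
        "Use only these tags: {list}\n\
         Do not nest tags.\n\
         Prefer postfix citations: ... statement <cite id=1>.\n\
         Use CDATA for code or text containing < or >: <![CDATA[ ... ]]>"
    )
}
```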
