| Crates.io | html-mumu |
| lib.rs | html-mumu |
| version | 0.2.0-rc.4 |
| created_at | 2025-08-15 15:49:55.35547+00 |
| updated_at | 2025-10-04 16:58:59.611345+00 |
| description | HTML manipulation and tools plugin for the Lava/MuMu language |
| homepage | https://lava.nu11.uk |
| repository | https://gitlab.com/tofo/html-mumu |
| max_upload_size | |
| id | 1797021 |
| size | 191,589 |
Version: 0.2.0-rc.1
Repository: gitlab.com/tofo/html-mumu
License: MIT OR Apache-2.0 (dual)
html-mumu is a lean HTML toolkit for the Lava/MuMu ecosystem. It focuses on practical scraping/transformation tasks without pulling in a full DOM parser. Most operations use tolerant regular expressions and simple heuristics so they’re fast, dependency-light, and easy to compose in MuMu flows.
- Predicates (matches, contains_text, has_attr, is_article, domain_is).
- Streaming stages (select, split, links, tables, extract_text_stage, from_string).
- Table parsing: converts <table> markup into MuMu arrays.
- Small dependency set: regex and once_cell. Compatible with native and wasm32 targets.
Design trade-off: This crate prioritizes good-enough HTML handling with predictable performance. For highly irregular markup, nested/invalid HTML, or complex CSS selectors, a full parser would be more robust.
Most functions accept either:
- an HTML string, or
- a node (Value::KeyedArray) with common fields: outer_html, inner_html, text, tag, attrs (keyed map), base_url.
Where needed, functions try to coerce the input to HTML using:
- SingleString(s) → s
- StrArray([s]) → s
- KeyedArray → prefer outer_html/inner_html/text
Return conventions:
- Missing values return _ (e.g., an absent attribute or canonical URL).
- Predicates return Bool.
- Collections return StrArray or MixedArray (for 2-D tables).
- Streams end with NO_MORE_DATA.
Selectors supported by select/split/each_attr:
- tag: matches elements by tag name (case-insensitive).
- .class: matches tokens in the class attribute.
- #id: matches exact id.
These are best-effort and regex-based; they do not implement the full CSS spec.
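To make the three selector forms concrete, here is a minimal std-only Rust sketch of how such matching could work. It is not the crate's implementation (which is regex-based); the names `Selector`, `parse_selector`, and `selector_matches` are hypothetical, and the element is assumed to be already reduced to its tag name plus optional class and id attributes.

```rust
/// A minimal selector: tag name, `.class`, or `#id` (hypothetical type).
#[derive(Debug, PartialEq)]
enum Selector {
    Tag(String),
    Class(String),
    Id(String),
}

/// Parse `tag`, `.class`, or `#id`; returns None when the name is empty.
fn parse_selector(s: &str) -> Option<Selector> {
    let s = s.trim();
    if let Some(class) = s.strip_prefix('.') {
        if class.is_empty() { None } else { Some(Selector::Class(class.to_string())) }
    } else if let Some(id) = s.strip_prefix('#') {
        if id.is_empty() { None } else { Some(Selector::Id(id.to_string())) }
    } else if s.is_empty() {
        None
    } else {
        // Tag names are compared case-insensitively, so store them lowercased.
        Some(Selector::Tag(s.to_ascii_lowercase()))
    }
}

/// Test an element described by its tag name and optional class/id attributes.
fn selector_matches(sel: &Selector, tag: &str, class: Option<&str>, id: Option<&str>) -> bool {
    match sel {
        Selector::Tag(t) => tag.eq_ignore_ascii_case(t),
        // `.class` matches one whitespace-separated token, not a substring.
        Selector::Class(c) => class.map_or(false, |v| v.split_whitespace().any(|tok| tok == c.as_str())),
        Selector::Id(i) => id == Some(i.as_str()),
    }
}
```

Note that `.post` matches `class="card post"` but `.pos` does not, because class matching is token-based.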
html:from_string(value) -> transform<string>
One-shot source that yields the given HTML once, then ends.
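The "yield once, then end" behavior can be pictured with a small pull-based sketch in Rust. This is an illustration only: `Pull` and `OneShot` are hypothetical names, and the real stage works with core-mumu values rather than plain strings.

```rust
/// What a pull-based stage hands back on each call (hypothetical type).
#[derive(Debug, PartialEq)]
enum Pull {
    Item(String),
    NoMoreData,
}

/// Hypothetical one-shot source: holds the HTML until the first pull.
struct OneShot {
    pending: Option<String>,
}

impl OneShot {
    fn new(html: &str) -> Self {
        OneShot { pending: Some(html.to_string()) }
    }

    /// The first call yields the stored HTML; every later call reports the end.
    fn pull(&mut self) -> Pull {
        match self.pending.take() {
            Some(html) => Pull::Item(html),
            None => Pull::NoMoreData,
        }
    }
}
```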
html:select(selector, source) -> transform<string>
For each upstream item, emits outer HTML of all elements matching selector.
html:split(selector, source) -> transform<string>
Same engine as select; intended semantically for document chunking.
html:each_attr(selector, name, source) -> transform<string>
For each element matching selector, emits the attribute value name (if present).
html:links(source) -> transform<string>
Emits all anchor hrefs found upstream. If a base URL is known, links are resolved to absolute.
html:tables(source) -> transform<string>
Emits outer HTML of each <table> found upstream.
html:extract_text_stage(source) -> transform<string>
Emits visible text for each upstream item (tags/scripts/styles removed, whitespace normalized).
Partial application & _: Stages support building up arguments iteratively. Supplying _ for a slot defers it.
html:extract_text(value) -> string
Visible text (script/style/noscript stripped; tags removed; whitespace collapsed).
Alias: html:text.
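The three steps named above (drop script/style/noscript, remove tags, collapse whitespace) can be sketched in std-only Rust as follows. This is an approximation under stated assumptions, not the crate's code: the crate uses regular expressions, whereas this sketch uses plain string scanning, and the helper names `drop_blocks`, `remove_tags`, and `extract_text` are chosen for illustration.

```rust
/// Remove every `<name ...>...</name>` block, matching the tag name case-insensitively.
fn drop_blocks(html: &str, name: &str) -> String {
    // ASCII lowercasing keeps byte offsets aligned with the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", name);
    let close = format!("</{}", name);
    let mut out = String::new();
    let mut pos = 0;
    while let Some(start) = lower[pos..].find(&open).map(|i| i + pos) {
        out.push_str(&html[pos..start]);
        // Skip past the closing tag; an unclosed block swallows the rest of the input.
        pos = match lower[start..].find(&close).map(|i| i + start) {
            Some(c) => lower[c..].find('>').map_or(html.len(), |i| c + i + 1),
            None => html.len(),
        };
    }
    out.push_str(&html[pos..]);
    out
}

/// Replace each `<...>` tag with a single space (so adjacent blocks do not fuse).
fn remove_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(ch),
            _ => {}
        }
    }
    out
}

/// Visible text: non-visible blocks dropped, tags removed, whitespace collapsed.
fn extract_text(html: &str) -> String {
    let mut cleaned = html.to_string();
    for name in &["script", "style", "noscript"] {
        cleaned = drop_blocks(&cleaned, name);
    }
    remove_tags(&cleaned).split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Replacing tags with a space is a deliberate simplification: it keeps `<p>a</p><p>b</p>` from becoming `ab`, at the cost of splitting words that contain inline tags.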
html:inner_html(value) -> string | _
Node inner_html if present; else heuristically strip the outer tag; else node text or raw string.
html:outer_html(value) -> string | _
Node outer_html if present; else inner_html or text; else returns the given string.
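This fallback order mirrors the input coercion rules (SingleString, StrArray, KeyedArray). A minimal Rust sketch of that coercion, assuming a simplified `Value` enum that stands in for the real core-mumu type (the actual variants carry richer data, and `coerce_to_html` is a hypothetical name):

```rust
use std::collections::HashMap;

/// Simplified stand-in for the MuMu value variants (hypothetical shape).
enum Value {
    SingleString(String),
    StrArray(Vec<String>),
    KeyedArray(HashMap<String, String>),
}

/// Coerce a value to HTML: strings pass through, single-element arrays unwrap,
/// and nodes prefer `outer_html`, then `inner_html`, then `text`.
fn coerce_to_html(value: &Value) -> Option<&str> {
    match value {
        Value::SingleString(s) => Some(s.as_str()),
        Value::StrArray(items) if items.len() == 1 => Some(items[0].as_str()),
        // Assumption: arrays of any other length are not coerced in this sketch.
        Value::StrArray(_) => None,
        Value::KeyedArray(node) => ["outer_html", "inner_html", "text"]
            .iter()
            .find_map(|key| node.get(*key).map(|s| s.as_str())),
    }
}
```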
html:attr(name, value) -> string | _
Attribute from node attrs[name] or opening tag of a string element.
html:has_attr(name, value[, expected]) -> bool
true if attribute exists (or equals expected if provided).
html:matches(selector, value) -> bool
Checks whether value matches a minimal selector (tag / .class / #id).
html:contains_text(pattern, value) -> bool
Case-insensitive substring test against visible text.
html:is_article(value) -> bool
Heuristic check: prefers <article>…</article> or sufficiently long paragraph blocks.
html:absolute_url(href, base_like) -> string
Resolve href against base_like, where base can be a URL string, a node with base_url, or HTML containing <base href="…">.
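As a rough illustration of this kind of resolution, the following std-only Rust sketch handles absolute http/https links, protocol-relative links, root-relative paths, and relative paths with `.` / `..` normalization. It assumes the base is already a plain URL string (extracting it from a node or a `<base>` tag is out of scope), ignores query strings and fragments, and uses hypothetical function names; the crate's own resolver may differ in detail.

```rust
/// Resolve `href` against an absolute `base` URL string.
/// Returns None when `base` has no `scheme://` prefix.
fn absolute_url(href: &str, base: &str) -> Option<String> {
    // Already absolute: return unchanged.
    if href.starts_with("http://") || href.starts_with("https://") {
        return Some(href.to_string());
    }
    let scheme_end = base.find("://")?;
    let scheme = &base[..scheme_end];
    let rest = &base[scheme_end + 3..];
    let (host, base_path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    // Protocol-relative: reuse only the base scheme.
    if let Some(tail) = href.strip_prefix("//") {
        return Some(format!("{}://{}", scheme, tail));
    }
    // Root-relative replaces the whole path; otherwise resolve against the base directory.
    let joined = if href.starts_with('/') {
        href.to_string()
    } else {
        let dir_end = base_path.rfind('/').map_or(0, |i| i + 1);
        format!("{}{}", &base_path[..dir_end], href)
    };
    Some(format!("{}://{}{}", scheme, host, normalize_dots(&joined)))
}

/// Collapse `.` and `..` segments in an absolute path.
fn normalize_dots(path: &str) -> String {
    let segments: Vec<&str> = path.split('/').skip(1).collect();
    let mut stack: Vec<&str> = Vec::new();
    for (i, seg) in segments.iter().enumerate() {
        let last = i + 1 == segments.len();
        match *seg {
            // A trailing dot segment leaves a trailing slash behind.
            "." => {
                if last { stack.push(""); }
            }
            ".." => {
                stack.pop();
                if last { stack.push(""); }
            }
            s => stack.push(s),
        }
    }
    format!("/{}", stack.join("/"))
}
```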
html:domain_is(domain, value) -> bool
Extracts a likely URL from value (node base_url, attrs.href, a string URL, or <base href>); compares domains (case-insensitive, strips www.).
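The comparison step (case-insensitive, with a leading `www.` ignored) could look like the std-only Rust sketch below. It covers only the case where the value is already a URL string; pulling the URL out of a node or a `<base href>` is omitted, IPv6 hosts are not handled, and `host_of` / `strip_www` are hypothetical helper names.

```rust
/// Extract the lowercased host from a URL string, dropping scheme, userinfo, port, and path.
fn host_of(url: &str) -> Option<String> {
    let rest = &url[url.find("://")? + 3..];
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host = authority.rsplit('@').next()?.split(':').next()?;
    if host.is_empty() { None } else { Some(host.to_ascii_lowercase()) }
}

/// Remove one leading `www.` label, if present.
fn strip_www(host: &str) -> &str {
    host.strip_prefix("www.").unwrap_or(host)
}

/// True when the URL's host equals `domain`, ignoring case and a leading `www.`.
fn domain_is(domain: &str, url: &str) -> bool {
    let wanted = domain.to_ascii_lowercase();
    match host_of(url) {
        Some(host) => strip_www(&host) == strip_www(&wanted),
        None => false,
    }
}
```

In this sketch the comparison is exact after normalization, so `blog.example.com` does not count as `example.com`; whether the crate treats subdomains the same way is not stated in the source.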
html:canonical_url(value) -> string | _
Extracts <link rel="canonical" href="…">.
html:meta(name_or_property, value) -> string | _
Reads <meta name="…"> or <meta property="…"> and returns its content.
html:jsonld(value) -> StrArray
Returns raw JSON strings from <script type="application/ld+json"> blocks.
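A minimal std-only Rust sketch of this kind of extraction is shown below. It is an assumption-laden illustration rather than the crate's code: it scans for `<script` tags by hand instead of using a regex, treats any opening tag that mentions `application/ld+json` as a match, and returns the trimmed body without validating the JSON. The name `jsonld_blocks` is hypothetical.

```rust
/// Collect the raw bodies of `<script type="application/ld+json">` blocks.
fn jsonld_blocks(html: &str) -> Vec<String> {
    // ASCII lowercasing keeps byte offsets aligned with the original string.
    let lower = html.to_ascii_lowercase();
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(open) = lower[pos..].find("<script").map(|i| i + pos) {
        // End of the opening tag; stop on truncated markup.
        let tag_end = match lower[open..].find('>') {
            Some(i) => open + i,
            None => break,
        };
        // Start of the matching close tag.
        let close = match lower[tag_end..].find("</script") {
            Some(i) => tag_end + i,
            None => break,
        };
        if lower[open..tag_end].contains("application/ld+json") {
            out.push(html[tag_end + 1..close].trim().to_string());
        }
        pos = close + "</script".len();
    }
    out
}
```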
html:next_href(value[, base_like]) -> string | _
Finds “next page” links via common patterns (rel="next", class~="next", or suggestive link text like “next”, “older”, »). Returns an absolute URL if base is known.
html:table_to_2d(value) -> MixedArray(StrArray[])
Converts a <table> in value into a 2-D structure (rows as StrArray). Designed for simple tables.
html:strip_tags(value [, allowed]) -> string
Strips tags from value, keeping any listed in allowed (array or comma-separated string). script/style/noscript are always dropped.
html:absolute_url and streaming html:links can form absolute URLs using:
- a node's base_url
- <base href="…"> in the HTML
- an explicit base_like parameter
The resolver supports:
- http/https schemes
- protocol-relative URLs (//host/path)
- root-relative paths (/path) and relative paths with . / .. normalization
Errors and end of stream:
- Missing values return _ instead of error (e.g., absent attribute/meta).
- Streams signal completion with NO_MORE_DATA.
- AGAIN is reserved for future async waits (not used in this crate).
Performance:
- Patterns use the regex crate (regex = "1.11"), compiled once via once_cell and reused.
- Prefer narrowing with select/split first, then mapping.
Build targets:
- Native: exports Cargo_lock. Calling extend("html") in Lava/MuMu hosts registers all functions.
- wasm32: does not export Cargo_lock. Call register_all(interp) from your host.
- Depends on core-mumu = 0.9.0-rc.3 and selects host/wasm features via target-specific dependency sections.
Feature flags in this crate (host, web) are markers for ecosystem parity; target-specific core-mumu features actually control host vs wasm behavior.
# Visible text from a snippet
html:extract_text("<p>Hello <b>world</b></p>") # → "Hello world"
# Does a node/string contain text?
html:contains_text("privacy", html_value) # → true/false
# Resolve a link against a base
html:absolute_url("../page/2", "https://example.com/blog/1")
# Stream all links (already resolved if base is known)
links = html:links(html:from_string(page_html))
links() # → "https://example.com/a"
links() # → "https://example.com/b"
# ... then NO_MORE_DATA
- src/share/…: small, pure helpers (selectors, URL utils, text stripping, table parsing, readability-ish, pagination).
- src/register/…: one MuMu function per file, each responsible for registering a single public symbol.
- src/lib.rs: wires up register_all and the Cargo_lock entrypoint for native builds.
Compatibility:
- Requires core-mumu 0.9.0-rc.3.
- Supports wasm32 (no crossterm/libloading in wasm).
- Use extend("html") on native; call register_all(interp) on wasm/static.
Issues and merge requests are welcome at https://gitlab.com/tofo/html-mumu.
Please keep changes small and additive; the crate values predictable behavior and low dependency surface.
Acknowledgements
- core-mumu and related plugins for patterns around dynamic registration and Flow stages.
- regex and once_cell maintainers for foundational crates used here.
Licensed under either of:
- MIT license
- Apache License, Version 2.0
at your option.
See the repository for the full text of each license.