html-mumu

Crates.io: html-mumu
lib.rs: html-mumu
version: 0.2.0-rc.4
created_at: 2025-08-15 15:49:55.35547+00
updated_at: 2025-10-04 16:58:59.611345+00
description: HTML manipulation and tools plugin for the Lava/MuMu language
homepage: https://lava.nu11.uk
repository: https://gitlab.com/tofo/html-mumu
id: 1797021
size: 191,589
owner: justifiedmumu


html-mumu — HTML helpers for the Lava/MuMu language

Version: 0.2.0-rc.1
Repository: gitlab.com/tofo/html-mumu
License: MIT OR Apache-2.0 (dual)

html-mumu is a lean HTML toolkit for the Lava/MuMu ecosystem. It focuses on practical scraping/transformation tasks without pulling in a full DOM parser. Most operations use tolerant regular expressions and simple heuristics so they’re fast, dependency-light, and easy to compose in MuMu flows.


Highlights

  • Mapper functions that turn HTML (or node-like values) into strings or structured MuMu values (e.g., text extraction, attributes, metadata).
  • Predicate functions for quick filtering (matches, contains_text, has_attr, is_article, domain_is).
  • Streaming stages that integrate with Flow-style pipelines (select, split, links, tables, extract_text_stage, from_string).
  • URL utilities & metadata: resolve relative links, read canonical URL/meta tags, extract JSON-LD, and detect next-page links.
  • Table helpers to coerce simple <table> markup into MuMu arrays.
  • No heavy deps: uses regex and once_cell. Compatible with native and wasm32 targets.

Design trade-off: This crate prioritizes good-enough HTML handling with predictable performance. For highly irregular markup, deeply nested or invalid HTML, or complex CSS selectors, a full parser would be more robust.


Data model & inputs

Most functions accept either:

  • a raw HTML string, or
  • a node-like MuMu value (Value::KeyedArray) with common fields:
    • outer_html, inner_html, text, tag, attrs (keyed map), base_url.

Where needed, functions try to coerce the input to HTML using:

  • SingleString(s) → s
  • StrArray([s]) → s
  • KeyedArray → prefer outer_html/inner_html/text
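
The coercion order above can be sketched in plain Rust. This is an illustration only, not the crate's code: the Value enum here is a hypothetical stand-in for the core-mumu type, and coerce_to_html is an assumed helper name.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the MuMu value type (the real one lives in core-mumu).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    SingleString(String),
    StrArray(Vec<String>),
    KeyedArray(BTreeMap<String, Value>),
}

/// Coerce an input to an HTML string, following the order described above.
/// Returns None when no usable string is found.
fn coerce_to_html(value: &Value) -> Option<String> {
    match value {
        Value::SingleString(s) => Some(s.clone()),
        // Only a one-element array maps to a plain string.
        Value::StrArray(items) if items.len() == 1 => Some(items[0].clone()),
        // Node-like values: prefer outer_html, then inner_html, then text.
        Value::KeyedArray(fields) => ["outer_html", "inner_html", "text"]
            .iter()
            .find_map(|key| match fields.get(*key) {
                Some(Value::SingleString(s)) => Some(s.clone()),
                _ => None,
            }),
        _ => None,
    }
}
```

With a node that carries both inner_html and text, this sketch returns the inner_html value because it comes earlier in the preference list.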

Return conventions:

  • Missing value → the MuMu placeholder _ (e.g., an absent attribute or canonical URL).
  • Predicates → Bool.
  • Collections → StrArray or MixedArray (for 2-D tables).
  • Stages → a zero-argument transform function that yields one item per tick and ends with NO_MORE_DATA.
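
As a rough model of the stage convention, the sketch below builds zero-argument closures that yield one item per call and then keep reporting the end of the stream. The Tick enum, stage_from_items, and map_stage are hypothetical names chosen for illustration; the real NO_MORE_DATA signal is defined by the MuMu runtime, not by this code.

```rust
/// Hypothetical result of one tick; stands in for "item or NO_MORE_DATA".
#[derive(Debug, PartialEq)]
enum Tick {
    Item(String),
    NoMoreData,
}

/// Source stage: yields each item once, then NoMoreData on every later call.
fn stage_from_items(items: Vec<String>) -> impl FnMut() -> Tick {
    let mut iter = items.into_iter();
    move || match iter.next() {
        Some(item) => Tick::Item(item),
        None => Tick::NoMoreData,
    }
}

/// Downstream stage: pulls one upstream item per tick and maps it.
fn map_stage<S, F>(mut source: S, f: F) -> impl FnMut() -> Tick
where
    S: FnMut() -> Tick,
    F: Fn(&str) -> String,
{
    move || match source() {
        Tick::Item(s) => Tick::Item(f(&s)),
        Tick::NoMoreData => Tick::NoMoreData,
    }
}
```

Because each stage only pulls from its upstream when it is called, nothing is computed until the consumer asks for the next item.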

Selectors (minimal CSS-like)

Supported by select/split/each_attr:

  • tag — matches elements by tag name (case-insensitive).
  • .class — matches tokens in the class attribute.
  • #id — matches exact id.

These are best-effort and regex-based; they do not implement the full CSS spec.
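
To make the three selector forms concrete, here is a minimal std-only sketch that tests a single element string against tag, .class, or #id. It is not the crate's engine (which is regex-based); tag_name, attr_value, and matches_selector are assumed names, and the attribute lookup only handles double-quoted values with lower-case attribute names.

```rust
/// Lower-cased tag name of the first opening tag, if any.
fn tag_name(html: &str) -> Option<String> {
    let rest = html.trim_start().strip_prefix('<')?;
    let name: String = rest
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric())
        .collect();
    if name.is_empty() {
        None
    } else {
        Some(name.to_ascii_lowercase())
    }
}

/// Double-quoted attribute value from the first opening tag (simplified lookup).
fn attr_value(html: &str, name: &str) -> Option<String> {
    let open_tag = &html[..html.find('>')?];
    let needle = format!(" {}=\"", name);
    let start = open_tag.find(&needle)? + needle.len();
    let end = open_tag[start..].find('"')? + start;
    Some(open_tag[start..end].to_string())
}

/// Match `tag`, `.class`, or `#id` against one element string.
fn matches_selector(selector: &str, html: &str) -> bool {
    if let Some(class) = selector.strip_prefix('.') {
        // Class selectors match whole whitespace-separated tokens.
        attr_value(html, "class").map_or(false, |v| v.split_whitespace().any(|t| t == class))
    } else if let Some(id) = selector.strip_prefix('#') {
        attr_value(html, "id").as_deref() == Some(id)
    } else {
        // Tag names compare case-insensitively.
        let wanted = selector.to_ascii_lowercase();
        tag_name(html).as_deref() == Some(wanted.as_str())
    }
}
```

Note that .feat does not match class="post featured" here, since class matching is per token rather than per substring.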


Function reference

Sources & stages (streaming)

  • html:from_string(value) -> transform<string>
    One-shot source that yields the given HTML once, then ends.

  • html:select(selector, source) -> transform<string>
    For each upstream item, emits outer HTML of all elements matching selector.

  • html:split(selector, source) -> transform<string>
    Same engine as select; intended semantically for document chunking.

  • html:each_attr(selector, name, source) -> transform<string>
    For each element matching selector, emits the value of the attribute name (if present).

  • html:links(source) -> transform<string>
    Emits all anchor hrefs found upstream. If a base URL is known, links are resolved to absolute.

  • html:tables(source) -> transform<string>
    Emits outer HTML of each <table> found upstream.

  • html:extract_text_stage(source) -> transform<string>
    Emits visible text for each upstream item (tags/scripts/styles removed, whitespace normalized).

Partial application & _: Stages support building up arguments iteratively. Supplying _ for a slot defers it.

Mappers & predicates

  • html:extract_text(value) -> string
    Visible text (script/style/noscript stripped; tags removed; whitespace collapsed).
    Alias: html:text.

  • html:inner_html(value) -> string | _
    Node inner_html if present; else heuristically strip the outer tag; else node text or raw string.

  • html:outer_html(value) -> string | _
    Node outer_html if present; else inner_html or text; else returns the given string.

  • html:attr(name, value) -> string | _
    Attribute from node attrs[name] or opening tag of a string element.

  • html:has_attr(name, value[, expected]) -> bool
    true if attribute exists (or equals expected if provided).

  • html:matches(selector, value) -> bool
    Checks whether value matches a minimal selector (tag / .class / #id).

  • html:contains_text(pattern, value) -> bool
    Case-insensitive substring test against visible text.

  • html:is_article(value) -> bool
    Heuristic check: prefers <article>…</article> or sufficiently long paragraph blocks.
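
The visible-text behaviour described for html:extract_text can be approximated with a small scanner. The sketch below is a std-only illustration under simplifying assumptions, not the crate's regex-based implementation: it treats every tag as a word boundary, does not decode entities, and only skips a script/style/noscript block when its closing tag is present.

```rust
/// Sketch of visible-text extraction: drop script/style/noscript blocks,
/// remove tags, and collapse whitespace.
fn extract_text(html: &str) -> String {
    const HIDDEN: [&str; 3] = ["script", "style", "noscript"];
    // ASCII-only lowering keeps byte offsets aligned with `html`.
    let lower = html.to_ascii_lowercase();
    let mut out = String::new();
    let mut pos = 0;
    while let Some(offset) = html[pos..].find('<') {
        let open = pos + offset;
        out.push_str(&html[pos..open]);
        // For hidden elements, jump ahead to the closing tag before skipping it.
        let mut skip_from = open;
        for tag in HIDDEN {
            if lower[open + 1..].starts_with(tag) {
                if let Some(at) = lower[open..].find(&format!("</{}", tag)) {
                    skip_from = open + at;
                }
            }
        }
        // Skip to the end of the tag; an unterminated tag swallows the rest.
        pos = match html[skip_from..].find('>') {
            Some(end) => skip_from + end + 1,
            None => html.len(),
        };
        out.push(' '); // tags act as word boundaries in this sketch
    }
    out.push_str(&html[pos..]);
    // Collapse runs of whitespace and trim both ends.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

For the snippet "<p>Hello <b>world</b></p>" this sketch produces "Hello world", the same result the quick example lists for html:extract_text.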

URLs & metadata

  • html:absolute_url(href, base_like) -> string
    Resolve href against base_like, where base_like can be a URL string, a node with base_url, or HTML containing <base href="…">.

  • html:domain_is(domain, value) -> bool
    Extracts a likely URL from value (node base_url, attrs.href, a string URL, or <base href>); compares domains (case-insensitive, strips www.).

  • html:canonical_url(value) -> string | _
    Extracts <link rel="canonical" href="…">.

  • html:meta(name_or_property, value) -> string | _
    Reads <meta name="…"> or <meta property="…"> and returns its content.

  • html:jsonld(value) -> StrArray
    Returns raw JSON strings from <script type="application/ld+json"> blocks.

  • html:next_href(value[, base_like]) -> string | _
    Finds “next page” links via common patterns (rel="next", class~="next", or suggestive link text like “next”, “older”, »). Returns an absolute URL if base is known.
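
The domain comparison in html:domain_is (case-insensitive, with a leading www. stripped) can be sketched as follows. This is an assumption-laden illustration: normalized_host and domain_is are hypothetical Rust helpers, the input is assumed to already be an http(s) URL string, and the comparison is an exact host match (whether the crate also accepts subdomains is not stated above).

```rust
/// Host of an http(s) URL, lower-cased and without a leading "www.".
fn normalized_host(url: &str) -> Option<String> {
    let lower = url.to_ascii_lowercase();
    let rest = lower
        .strip_prefix("https://")
        .or_else(|| lower.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?.split(':').next()?;
    let host = host.strip_prefix("www.").unwrap_or(host);
    if host.is_empty() {
        None
    } else {
        Some(host.to_string())
    }
}

/// Compare a wanted domain with the host of `url`.
fn domain_is(domain: &str, url: &str) -> bool {
    let wanted = domain.to_ascii_lowercase();
    let wanted = wanted.strip_prefix("www.").unwrap_or(wanted.as_str());
    normalized_host(url).as_deref() == Some(wanted)
}
```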

Tables

  • html:table_to_2d(value) -> MixedArray(StrArray[])
    Converts the first table in value into a 2-D structure (rows as StrArray). Designed for simple tables.

Tag stripping

  • html:strip_tags(value [, allowed]) -> string
    With 1 arg: remove all tags from value.
    With 2 args: keep only tags listed in allowed (array or comma-separated string). script/style/noscript are always dropped.

URL resolution details

html:absolute_url and streaming html:links can form absolute URLs using:

  • A node’s base_url
  • A <base href="…"> in the HTML
  • A provided base_like parameter

The resolver supports:

  • http/https schemes
  • Protocol-relative URLs (//host/path)
  • Root-relative (/path) and relative paths with . / .. normalization
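
The resolution rules listed above can be sketched in std-only Rust. This is a simplified illustration, not the crate's resolver: absolute_url here is a hypothetical free function that takes a plain base URL string, ignores the base's query and fragment, and does not special-case other schemes such as mailto: in href.

```rust
/// Resolve `href` against an absolute http(s) `base`.
/// Returns None when the base is not an http(s) URL.
fn absolute_url(href: &str, base: &str) -> Option<String> {
    // Already absolute: return unchanged.
    if href.starts_with("http://") || href.starts_with("https://") {
        return Some(href.to_string());
    }
    let (scheme, rest) = base.split_once("://")?;
    if scheme != "http" && scheme != "https" {
        return None;
    }
    // Protocol-relative: reuse the base scheme.
    if let Some(protocol_relative) = href.strip_prefix("//") {
        return Some(format!("{}://{}", scheme, protocol_relative));
    }
    // Ignore the base's query and fragment, then split host from path.
    let rest = rest.split(|c: char| c == '?' || c == '#').next()?;
    let (host, base_path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    // Root-relative paths replace the base path; others join its directory.
    let joined = if href.starts_with('/') {
        href.to_string()
    } else {
        let dir_end = base_path.rfind('/').map_or(0, |i| i + 1);
        format!("{}{}", &base_path[..dir_end], href)
    };
    // Normalize "." and ".." segments.
    let parts: Vec<&str> = joined.split('/').skip(1).collect();
    let mut segments: Vec<&str> = Vec::new();
    for (i, part) in parts.iter().enumerate() {
        let is_last = i + 1 == parts.len();
        match *part {
            "." | ".." => {
                if *part == ".." {
                    segments.pop();
                }
                // A trailing dot segment leaves a directory-style trailing slash.
                if is_last {
                    segments.push("");
                }
            }
            other => segments.push(other),
        }
    }
    Some(format!("{}://{}/{}", scheme, host, segments.join("/")))
}
```

Under these rules, resolving "../page/2" against "https://example.com/blog/1" gives "https://example.com/page/2": the base directory is /blog/, and the .. segment removes blog before page/2 is appended.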

Error handling & signals

  • Arity/type errors → descriptive error strings.
  • Missing data → returns the placeholder _ instead of error (e.g., absent attribute/meta).
  • Stages:
    • Yield one item per call.
    • End of stream → NO_MORE_DATA.
    • Non-blocking by design; AGAIN is reserved for future async waits (not used in this crate).

Performance notes

  • All selectors and many utilities are regex-based (regex = "1.11"); patterns are compiled once via once_cell and reused.
  • Functions aim to be allocation-aware, but streaming HTML or extremely large documents can still be expensive—prefer narrowing with select/split first, then mapping.

Build targets & integration

  • Native (non-wasm): exports a dynamic loader entrypoint named Cargo_lock. Calling extend("html") in Lava/MuMu hosts registers all functions.
  • wasm32: does not export Cargo_lock. Call register_all(interp) from your host.
  • The crate depends on core-mumu = 0.9.0-rc.3 and selects host/wasm features via target-specific dependency sections.

Feature flags in this crate (host, web) are markers for ecosystem parity; target-specific core-mumu features actually control host vs wasm behavior.


Quick examples

# Visible text from a snippet
html:extract_text("<p>Hello <b>world</b></p>")  # → "Hello world"

# Does a node/string contain text?
html:contains_text("privacy", html_value)       # → true/false

# Resolve a link against a base
html:absolute_url("../page/2", "https://example.com/blog/1")

# Stream all links (already resolved if base is known)
links = html:links(html:from_string(page_html))
links()  # → "https://example.com/a"
links()  # → "https://example.com/b"
# ... then NO_MORE_DATA

Directory layout (high level)

  • src/share/… — small, pure helpers (selectors, URL utils, text stripping, table parsing, readability-ish, pagination).
  • src/register/… - one MuMu function per file, each responsible for registering a single public symbol.
  • src/lib.rs — wires up register_all and the Cargo_lock entrypoint for native builds.

Compatibility

  • MuMu/Core: designed for core-mumu 0.9.0-rc.3.
  • Platforms: native and wasm32 (no crossterm/libloading in wasm).
  • Host loaders: dynamic loading via extend("html") on native; call register_all(interp) on wasm/static.

Contributing

Issues and merge requests are welcome at https://gitlab.com/tofo/html-mumu.
Please keep changes small and additive; the crate values predictable behavior and low dependency surface.

Acknowledgements

  • Tom Fotheringham and the MuMu/Lava community for design and stewardship across the plugin ecosystem.
  • Contributors to core-mumu and related plugins for patterns around dynamic registration and Flow stages.
  • The Rust regex and once_cell maintainers for foundational crates used here.

License

Licensed under either of:

  • MIT license
  • Apache-2.0 license

at your option.

See the repository for the full text of each license.
