html2json

Crates.iohtml2json
lib.rshtml2json
version0.5.7
created_at2026-01-01 05:56:41.365683+00
updated_at2026-01-25 17:32:17.511429+00
descriptionHTML to JSON extractor
homepage
repositoryhttps://github.com/qretaio/html2json
max_upload_size
id2015808
size188,749
Dinesh Bhattarai (dineshdb)

documentation

README

html2json

A Rust port of cheerio-json-mapper.


Overview

  • Input: HTML source + Extractor spec (JSON)
  • Output: JSON matching the structure defined in the spec
  • Available as: Rust crate, CLI tool, and WebAssembly npm package

Installation

npm / WebAssembly

npm install @qretaio/html2json

From crates.io (Rust)

cargo install html2json --features cli

From source

cargo install --path . --features cli
# or from a git repository
cargo install --git https://github.com/qretaio/html2json --features cli

Using just

just install

Usage

JavaScript / TypeScript

import { extract } from "@qretaio/html2json";

const html = `
  <article class="post">
    <h2>My Article</h2>
    <p class="author">John Doe</p>
    <div class="tags">
      <span>rust</span>
      <span>wasm</span>
    </div>
  </article>
`;

const spec = {
  title: "h2",
  author: ".author",
  tags: [
    {
      $: ".tags span",
      name: "$",
    },
  ],
};

const result = await extract(html, spec);
console.log(result);
// {
//   "title": "My Article",
//   "author": "John Doe",
//   "tags": [{"name": "rust"}, {"name": "wasm"}]
// }

CLI

# Extract from file
html2json examples/hn.html --spec examples/hn.json

# Extract from stdin (pipe from curl)
curl -s https://news.ycombinator.com/ | html2json --spec examples/hn.json

# Extract from stdin (pipe from cat)
cat examples/hn.html | html2json --spec examples/hn.json

# Check output matches expected JSON (useful for testing/CI)
html2json examples/hn.html --spec examples/hn.json --check expected.json

CLI Options

  • --spec, -s <FILE> - Path to JSON extractor spec file (required)
  • --check, -c <FILE> - Compare output against expected JSON file. Exits with 0 if match, 1 if differ (with colored diff).

Spec Format

The spec is a JSON object where each key defines an output field and each value defines a CSS selector to extract that field.

Basic Selectors

{
  "title": "h1",
  "description": "p.description"
}

Attributes

{
  "link": "a.main | attr:href",
  "image": "img.hero | attr:src"
}

Pipes (Transformations)

{
  "title": "h1 | trim",
  "slug": "h1 | lower | regex:\\s+-",
  "price": ".price | regex:\\$(\\d+\\.\\d+) | parseAs:int"
}

Available pipes:

  • trim - Trim whitespace
  • lower - Convert to lowercase
  • upper - Convert to uppercase
  • substr:start:end - Extract substring
  • regex:pattern - Regex capture (first group)
  • parseAs:int - Parse as integer
  • parseAs:float - Parse as float
  • attr:name - Get attribute value
  • void - Extract from void elements, useful for extracting xml

Collections (Arrays)

{
  "items": [
    {
      "$": ".item",
      "title": "h2",
      "description": "p"
    }
  ]
}

Scoping ($ selector)

{
  "$": "article",
  "title": "h1",
  "paragraphs": ["p"]
}

Fallback Operators (||)

{
  "title": "h1.main || h1.fallback || h1"
}

Optional Fields (?)

{
  "title": "h1",
  "description?": "p.description"
}

Optional fields that return null are removed from the output.

LICENSE

MIT

Commit count: 18

cargo fmt