genson-cli

Crates.io	genson-cli
lib.rs	genson-cli
version	0.4.2
created_at	2025-08-20 15:12:35.045149+00
updated_at	2025-09-25 21:54:23.320873+00
description	Command-line interface for JSON schema inference
homepage	https://github.com/lmmx/polars-genson
repository	https://github.com/lmmx/polars-genson
id	1803541
size	88,921

Louis Maddox (lmmx)

documentation

https://docs.rs/genson-cli/

README

Genson CLI

A command-line tool for JSON schema inference with support for both regular JSON and NDJSON (newline-delimited JSON).

Built on top of genson-core, this CLI tool provides a simple yet powerful interface for generating JSON schemas from JSON data files or standard input.

It was written mainly for testing, but it may also be useful in its own right as a standalone binary for schema inference.

Installation

cargo binstall genson-cli

Alternatively, run cargo install genson-cli if you prefer to build from source.

Usage

Basic Examples

# From a JSON file
genson-cli data.json

# From standard input
echo '{"name": "Alice", "age": 30}' | genson-cli

# From stdin with multiple JSON objects
cat multiple-objects.json | genson-cli

NDJSON Support

# Process newline-delimited JSON
genson-cli --ndjson data.jsonl

# From stdin
cat events.ndjson | genson-cli --ndjson
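To make the NDJSON convention concrete, here is a minimal Python sketch of how such input can be split into records: one JSON value per non-empty line. It only illustrates the format and is not genson-cli's own parser; the function name parse_ndjson is made up for this example.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON: one JSON value per non-empty line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records (an assumption of this sketch)
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report which line failed so bad records are easy to locate
            raise ValueError(f"line {line_number}: {exc}") from exc
    return records


sample = '{"event": "login"}\n\n{"event": "logout", "duration": 3600}\n'
records = parse_ndjson(sample)
print(len(records))
```

Each line is parsed independently, so a malformed record can be reported by line number without affecting the others.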

Array Handling

# Treat top-level arrays as object streams (default)
genson-cli data.json

# Preserve array structure
genson-cli --no-ignore-array array-data.json

Command Line Options

genson-cli - JSON schema inference tool

USAGE:
    genson-cli [OPTIONS] [FILE]

ARGS:
    <FILE>    Input JSON file (reads from stdin if not provided)

OPTIONS:
    -h, --help            Print this help message
    --no-ignore-array     Don't treat top-level arrays as object streams
    --ndjson              Treat input as newline-delimited JSON
    --avro                Output Avro schema instead of JSON Schema
    --normalise           Normalise the input data against the inferred schema
    --coerce-strings      Coerce numeric/boolean strings to schema type during normalisation
    --keep-empty          Keep empty arrays/maps instead of turning them into nulls
    --map-threshold <N>   Treat objects with >N keys as map candidates (default 20)
    --force-type k:v,...  Force field(s) to 'map' or 'record'
                          Example: --force-type labels:map,claims:record
    --map-encoding <mode> Choose map encoding (mapping|entries|kv)
                          mapping = Avro/JSON object (shared dict)
                          entries = list of single-entry objects (individual dicts)
                          kv      = list of {key,value} objects
    --wrap-root <field>   Wrap top-level schema under this required field

EXAMPLES:
    genson-cli data.json
    echo '{"name": "test"}' | genson-cli
    genson-cli --ndjson multi-line.jsonl
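The three --map-encoding modes are easier to compare side by side. The Python sketch below shows one plausible reading of the descriptions in the help text, applied to a small map. It is an illustration, not the tool's code: the helper name encode_map is hypothetical, and the literal field names "key" and "value" in the kv form are assumed from the {key,value} wording above.

```python
def encode_map(mapping, mode):
    """Re-encode a dict in one of the three shapes described by --map-encoding."""
    if mode == "mapping":
        # A single shared object: keys map directly to values
        return dict(mapping)
    if mode == "entries":
        # A list of single-entry objects, one per key
        return [{key: value} for key, value in mapping.items()]
    if mode == "kv":
        # A list of objects with explicit key and value fields (names assumed)
        return [{"key": key, "value": value} for key, value in mapping.items()]
    raise ValueError(f"unknown map encoding: {mode!r}")


labels = {"en": "Hello", "fr": "Bonjour"}
for mode in ("mapping", "entries", "kv"):
    print(mode, encode_map(labels, mode))
```

The kv form gives every element the same two fields, which tends to suit columnar tools; the mapping form stays closest to the original JSON.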

Normalisation

Normalisation rewrites raw JSON data so that every record conforms to a single inferred Avro schema. This is especially useful when input data is jagged, inconsistent, or comes from semi-structured sources.

Features:

Converts empty arrays/maps to null (default), or preserves them with --keep-empty.
Ensures missing keys are present with null values.
Handles unions (e.g. ["null", "string"] where values may be either).
Optionally coerces numeric/boolean strings into real types (--coerce-strings).
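The first two rules can be sketched in a few lines of Python. This is a simplified illustration of the behaviour described above, not the tool's implementation: the function name normalise_record is hypothetical, and the field list is passed in by hand where the real tool would take it from the inferred schema.

```python
def normalise_record(record, fields, keep_empty=False):
    """Return a copy of record containing every field in fields.

    Missing keys become None; empty lists and dicts become None
    unless keep_empty is set.
    """
    normalised = {}
    for name in fields:
        value = record.get(name)  # a missing key yields None
        if not keep_empty and isinstance(value, (list, dict)) and not value:
            value = None  # empty array/map collapses to null
        normalised[name] = value
    return normalised


fields = ["id", "labels", "aliases"]
print(normalise_record({"id": "Q1", "labels": {}}, fields))
print(normalise_record({"id": "Q1", "labels": {}}, fields, keep_empty=True))
```

After normalisation every record has the same set of keys, which is what lets a jagged input be loaded as a regular table.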

Examples

Simple Object Schema

Input:

{"name": "Alice", "age": 30, "active": true}

Command:

echo '{"name": "Alice", "age": 30, "active": true}' | genson-cli

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "active": {
      "type": "boolean"
    }
  },
  "required": [
    "age",
    "active", 
    "name"
  ]
}

Avro Schema

echo '{"name": "Alice", "age": 30, "active": true}' | genson-cli --avro

Output:

{
  "type": "record",
  "name": "document",
  "namespace": "genson",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "active",
      "type": "boolean"
    }
  ]
}
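Comparing the two outputs for the same input suggests a direct correspondence between JSON Schema types and Avro types for these scalar fields. The Python sketch below reproduces that correspondence for the three types that appear in this example only. It is not the tool's converter: the name to_avro_record is hypothetical, and nested objects, arrays and optional fields are deliberately left out.

```python
# Only the scalar types that appear in the example above
JSON_TO_AVRO = {"string": "string", "integer": "int", "boolean": "boolean"}


def to_avro_record(schema, name="document", namespace="genson"):
    """Convert a flat JSON Schema object (scalar properties only) to an Avro record."""
    fields = [
        {"name": key, "type": JSON_TO_AVRO[prop["type"]]}
        for key, prop in schema["properties"].items()
    ]
    return {"type": "record", "name": name, "namespace": namespace, "fields": fields}


json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "active": {"type": "boolean"},
    },
}
avro = to_avro_record(json_schema)
print([field["type"] for field in avro["fields"]])
```

The defaults for name and namespace are copied from the Avro output shown above.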

Multiple Objects Schema

Input file (users.json):

{"name": "Alice", "age": 30, "scores": [95, 87]}
{"name": "Bob", "age": 25, "city": "NYC", "active": true}
{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}

Command:

genson-cli users.json

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "type": "array",
      "items": {
        "type": "integer"
      }
    },
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": ["role"]
    }
  },
  "required": ["age", "name"]
}
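The "required" list in this output contains exactly the keys that occur in every input object; all other keys still appear under "properties" but are optional. A short Python check over the same three records shows the pattern. It mirrors the output above and is not taken from the tool's source.

```python
records = [
    {"name": "Alice", "age": 30, "scores": [95, 87]},
    {"name": "Bob", "age": 25, "city": "NYC", "active": True},
    {"name": "Charlie", "age": 35, "metadata": {"role": "admin"}},
]

key_sets = [set(record) for record in records]

# Keys present in every record are required; the rest are optional
required = sorted(set.intersection(*key_sets))
optional = sorted(set.union(*key_sets) - set(required))

print("required:", required)
print("optional:", optional)
```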

NDJSON Processing

Input file (events.ndjson):

{"event": "login", "user": "alice", "timestamp": "2024-01-01T10:00:00Z"}
{"event": "logout", "user": "alice", "timestamp": "2024-01-01T11:00:00Z", "duration": 3600}
{"event": "login", "user": "bob", "timestamp": "2024-01-01T10:30:00Z", "ip": "192.168.1.100"}

Command:

genson-cli --ndjson events.ndjson

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "event": {
      "type": "string"
    },
    "user": {
      "type": "string"
    },
    "timestamp": {
      "type": "string"
    },
    "duration": {
      "type": "integer"
    },
    "ip": {
      "type": "string"
    }
  },
  "required": ["event", "timestamp", "user"]
}

Array Schema

Input file (array.json):

[
  {"id": 1, "name": "Product A"},
  {"id": 2, "name": "Product B", "category": "electronics"}
]

Command (treat as object stream - default):

genson-cli array.json

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "id": {
      "type": "integer"
    },
    "name": {
      "type": "string"
    },
    "category": {
      "type": "string"
    }
  },
  "required": ["id", "name"]
}

Command (preserve array structure):

genson-cli --no-ignore-array array.json

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "string"
      },
      "category": {
        "type": "string"
      }
    },
    "required": ["id", "name"]
  }
}
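The difference between the two modes comes down to what counts as one document. The Python sketch below illustrates that idea: by default a top-level array is unpacked into its elements, while the preserving mode keeps the array as a single document. The function name split_documents and its ignore_array parameter are hypothetical and only echo the --no-ignore-array flag; this is not the tool's code.

```python
import json


def split_documents(text, ignore_array=True):
    """Return the list of documents a schema would be inferred from."""
    value = json.loads(text)
    if ignore_array and isinstance(value, list):
        # Default: each array element is treated as its own document
        return list(value)
    # Otherwise the whole value, array included, is a single document
    return [value]


text = '[{"id": 1, "name": "Product A"}, {"id": 2, "name": "Product B"}]'
print(len(split_documents(text)))
print(len(split_documents(text, ignore_array=False)))
```

With the array unpacked, the inferred schema describes the elements ("type": "object"); with the array preserved, it describes the container ("type": "array").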

Empty Values

Input (empty.json):

{"id": "Q1", "labels": {}}
{"id": "Q2", "labels": {"en": "Hello"}}

Command:

genson-cli --ndjson --normalise empty.json

Output:

{"id": "Q1", "labels": null}
{"id": "Q2", "labels": {"en": "Hello"}}

String Coercion

Input (stringy.json):

{"id": "42", "active": "true"}
{"id": 7, "active": false}

Command (default):

genson-cli --ndjson --normalise stringy.json

Output (no coercion; the string values are not converted and come out as null):

{"id": null, "active": null}
{"id": 7, "active": false}

Command (with coercion):

genson-cli --ndjson --normalise --coerce-strings stringy.json

Output:

{"id": 42, "active": true}
{"id": 7, "active": false}
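The coercion step can be pictured as a small type-directed conversion. The following Python sketch handles the two cases from this example, integer and boolean strings. It is an assumption-laden illustration rather than the tool's logic: the function name coerce_string is hypothetical, and the choice to return unparseable strings unchanged is made here for simplicity, not documented behaviour.

```python
import re


def coerce_string(value, target):
    """Convert a string to the target type ('integer' or 'boolean') if it parses cleanly."""
    if not isinstance(value, str):
        return value  # already typed, nothing to do
    if target == "integer" and re.fullmatch(r"-?[0-9]+", value):
        return int(value)
    if target == "boolean" and value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value  # assumption of this sketch: leave anything else untouched


row = {"id": "42", "active": "true"}
schema_types = {"id": "integer", "active": "boolean"}
coerced = {key: coerce_string(value, schema_types[key]) for key, value in row.items()}
print(coerced)
```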

Error Handling

The CLI provides clear error messages for common issues:

Invalid JSON

$ echo '{"invalid": json}' | genson-cli
Error: Invalid JSON input at index 1: expected value at line 1 column 13 - JSON: {"invalid": json}

File Not Found

$ genson-cli nonexistent.json
Error: No such file or directory (os error 2)

Empty Input

$ echo '' | genson-cli
Error: No JSON strings provided

Performance

Parallel Processing: Automatically uses multiple cores for large datasets
Memory Efficient: Streams large files without loading everything into memory
Fast Parsing: Uses SIMD-accelerated JSON parsing where available

For a 100MB NDJSON file with 1M records:

Processing time: ~5-10 seconds (depending on CPU cores)
Memory usage: <100MB (constant regardless of file size)
Schema accuracy: 100% type detection

Integration

The CLI tool is part of the larger polars-genson ecosystem:

genson-core: Core Rust library
polars-genson: Python plugin for Polars
polars-jsonschema-bridge: Type conversion utilities

Use Cases

Data Analysis Pipeline

# Extract schema from API responses
curl https://api.example.com/users | genson-cli > users-schema.json

# Process log files
genson-cli --ndjson application.log > log-schema.json

# Inspect the data structure (list the top-level property names)
genson-cli data.json | jq '.properties | keys'

Schema-Driven Development

# Generate schema for documentation
genson-cli sample-data.json > api-schema.json

# Validate API responses match expected schema
# (combine with tools like ajv-cli for validation)

Data Migration

# Understand structure of legacy data
genson-cli legacy-export.json > legacy-schema.json

# Compare schemas between different data sources
diff <(genson-cli source1.json) <(genson-cli source2.json)
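A textual diff works, but it is sensitive to key order and formatting. When only the structure matters, a small structural comparison can be easier to read. The Python sketch below compares the top-level properties of two JSON Schemas that have already been loaded as dicts; the function name diff_properties and the shape of its result are made up for this example.

```python
def diff_properties(first, second):
    """Compare the top-level properties of two JSON Schema objects."""
    props_a = first.get("properties", {})
    props_b = second.get("properties", {})
    shared = set(props_a) & set(props_b)
    return {
        "only_in_first": sorted(set(props_a) - set(props_b)),
        "only_in_second": sorted(set(props_b) - set(props_a)),
        # Shared names whose declared type differs between the two schemas
        "type_changed": sorted(
            name for name in shared
            if props_a[name].get("type") != props_b[name].get("type")
        ),
    }


source1 = {"properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}
source2 = {"properties": {"id": {"type": "string"}, "city": {"type": "string"}}}
report = diff_properties(source1, source2)
print(report)
```

In practice the two dicts would be loaded with json.load from the schema files written by the commands above.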

Advanced Usage

Processing Large Files

For very large NDJSON files, consider splitting the input into chunks and inferring one schema per chunk:

# Process large file in chunks
split -l 10000 large-file.ndjson chunk_
for chunk in chunk_*; do
    genson-cli --ndjson "$chunk" > "schema_${chunk}.json"
done

# Merge resulting schemas (requires additional tooling)
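As one possible form of that additional tooling, the Python sketch below merges per-chunk schemas under strong simplifying assumptions: every schema is a flat object whose properties carry only a "type", a field stays required only if every chunk requires it, and conflicting types are combined into a type list. The function name merge_flat_schemas is hypothetical, and nested objects and array items are not handled.

```python
def merge_flat_schemas(schemas):
    """Merge flat JSON Schema objects: union of properties, intersection of required."""
    seen_types = {}
    required = None
    for schema in schemas:
        for name, prop in schema.get("properties", {}).items():
            declared = prop["type"]
            types = [declared] if isinstance(declared, str) else declared
            seen_types.setdefault(name, set()).update(types)
        chunk_required = set(schema.get("required", []))
        required = chunk_required if required is None else required & chunk_required
    return {
        "type": "object",
        "properties": {
            # A single type stays a string; several become a sorted list
            name: {"type": next(iter(types)) if len(types) == 1 else sorted(types)}
            for name, types in seen_types.items()
        },
        "required": sorted(required or ()),
    }


chunk_a = {"type": "object",
           "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
           "required": ["id", "name"]}
chunk_b = {"type": "object",
           "properties": {"id": {"type": "string"}, "ip": {"type": "string"}},
           "required": ["id", "ip"]}
merged = merge_flat_schemas([chunk_a, chunk_b])
print(merged["required"])
```

The chunk schemas would normally be read with json.load from the schema_chunk_* files produced by the loop above.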

Custom Schema URIs

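The generated JSON Schema carries a $schema URI. The options listed above do not include a flag for changing it, so the example here uses the default: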

# Default: http://json-schema.org/schema#
genson-cli data.json

# The schema URI is automatically included in output

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

Licensed under the MIT License. See LICENSE (https://github.com/lmmx/polars-genson/blob/master/LICENSE) for details.

Commit count: 166
