| field | value |
|-------|-------|
| Crates.io | genson-core |
| lib.rs | genson-core |
| version | 0.4.1 |
| created_at | 2025-08-20 13:47:56.287013+00 |
| updated_at | 2025-09-25 21:54:03.50531+00 |
| description | Core library for JSON schema inference using genson-rs |
| homepage | https://github.com/lmmx/polars-genson |
| repository | https://github.com/lmmx/polars-genson |
| id | 1803448 |
| size | 311,969 |
Fast and robust Rust library for JSON schema inference: it pre-validates JSON to avoid panics and handles errors properly. It adapts the genson-rs library's SIMD parallelism, first checking each string with `serde_json` in a streaming pass that allocates no values.
This is the core library that powers both the genson-cli command-line tool and the polars-genson Python plugin. It includes a vendored and enhanced version of the genson-rs library with added safety features and comprehensive error handling.
Add this to your `Cargo.toml`:

```toml
[dependencies]
genson-core = "0.4.1"
```
Note: if you include `serde_json` in your dependencies but don't activate its `preserve_order` feature, `genson-core` schema properties will not be in insertion order. This may be an unwelcome surprise!

```toml
serde_json = { version = "1.0", features = ["preserve_order"] }
```
```rust
use genson_core::infer_json_schema;

fn main() -> Result<(), String> {
    let json_strings = vec![
        r#"{"name": "Alice", "age": 30, "scores": [95, 87]}"#.to_string(),
        r#"{"name": "Bob", "age": 25, "city": "NYC", "active": true}"#.to_string(),
    ];

    let result = infer_json_schema(&json_strings, None)?;

    println!("Processed {} JSON objects", result.processed_count);
    println!(
        "Schema: {}",
        serde_json::to_string_pretty(&result.schema).map_err(|e| e.to_string())?
    );
    Ok(())
}
```
```rust
use genson_core::{infer_json_schema, SchemaInferenceConfig};

let config = SchemaInferenceConfig {
    ignore_outer_array: true,             // Treat top-level arrays as streams of objects
    delimiter: Some(b'\n'),               // Enable NDJSON processing
    schema_uri: Some("AUTO".to_string()), // Auto-detect schema URI
    ..Default::default()
};

let result = infer_json_schema(&json_strings, Some(config))?;
```
```rust
let ndjson_data = vec![
    r#"
{"user": "alice", "action": "login"}
{"user": "bob", "action": "logout"}
{"user": "charlie", "action": "login", "ip": "192.168.1.1"}
"#.to_string()
];

let config = SchemaInferenceConfig {
    delimiter: Some(b'\n'), // Enable NDJSON mode
    ..Default::default()
};

let result = infer_json_schema(&ndjson_data, Some(config))?;
```
For more control over the schema building process:

```rust
use genson_core::genson_rs::{get_builder, build_json_schema, BuildConfig};

let mut builder = get_builder(Some("https://json-schema.org/draft/2020-12/schema"));
let build_config = BuildConfig {
    delimiter: None,
    ignore_outer_array: true,
};

let mut json_bytes = br#"{"field": "value"}"#.to_vec();
let schema = build_json_schema(&mut builder, &mut json_bytes, &build_config);
let final_schema = builder.to_schema();
```
In addition to inferring schemas, `genson-core` can normalise arbitrary JSON values against an Avro schema. This is useful when working with jagged or heterogeneous data where rows may encode the same field in different ways:

- Empty containers (`[]` and `{}`) become `null` by default (configurable).
- Strings can optionally be coerced to typed values (`"42"` → `42`, `"true"` → `true`).

```rust
use genson_core::normalise::{normalise_value, NormaliseConfig};
use serde_json::json;

let schema = json!({
    "type": "record",
    "name": "doc",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "labels", "type": {"type": "map", "values": "string"}}
    ]
});

let cfg = NormaliseConfig::default();
let input = json!({"id": 42, "labels": {}});
let normalised = normalise_value(input, &schema, &cfg);

// The empty map becomes null under the default config
assert_eq!(normalised, json!({"id": 42, "labels": null}));
```
`NormaliseConfig` lets you control behaviour:

```rust
let cfg = NormaliseConfig {
    empty_as_null: true,  // [] and {} become null (default)
    coerce_string: false, // "42" in an int field becomes null rather than being coerced to 42 (default)
};
```
Input values:

```json
{"id": 7, "labels": {"en": "Hello"}}
{"id": "42", "labels": {}}
```

Normalised (default):

```json
{"id": 7, "labels": {"en": "Hello"}}
{"id": null, "labels": null}
```

Normalised (with `coerce_string = true`):

```json
{"id": 7, "labels": {"en": "Hello"}}
{"id": 42, "labels": null}
```
## Parallel Processing

The library automatically uses parallel processing for sufficiently large inputs, via the vendored genson-rs parser.
## Memory Optimisation

The library has been put together so as to avoid panics; if a panic does occur anyway, it is caught. This catch-all was left in place after the initial panic problem was solved and should not trigger in practice, since the JSON is always pre-validated with `serde_json` and panics only ever occurred on invalid JSON. Please report any panicking examples you find, along with the JSON that caused them if possible.
The library provides comprehensive error handling that catches and converts internal panics into proper error messages:

```rust
let invalid_json = vec![r#"{"invalid": json}"#.to_string()];

match infer_json_schema(&invalid_json, None) {
    Ok(result) => println!("Success: {:?}", result),
    Err(error) => {
        // Will contain a descriptive error message instead of panicking
        eprintln!("JSON parsing failed: {}", error);
    }
}
```
Error messages include:
The library accurately infers:

- Primitive types: `string`, `number`, `integer`, `boolean`, `null`
- Complex types: `object`, `array`
This library uses `OrderMap` to preserve the original field ordering from JSON input:

```rust
// Input: {"z": 1, "b": 2, "a": 3}
// Output schema will maintain z -> b -> a ordering
```
When processing multiple JSON objects, schemas are intelligently merged:

```rust
// Object 1: {"name": "Alice", "age": 30}
// Object 2: {"name": "Bob", "city": "NYC"}
// Merged schema: name (required), age (optional), city (optional)
```
This crate is designed as the foundation for:

- genson-cli, the command-line tool
- polars-genson, the Polars Python plugin
This crate is part of the polars-genson project. See the main repository for the contribution and development docs.
Licensed under the MIT License. See [LICENSE](https://github.com/lmmx/polars-genson/blob/master/LICENSE) for details.
Contains vendored and adapted code from the Apache 2.0 licensed genson-rs crate.