# JSN

_A queryable, streaming, JSON pull-parser with low allocation overhead._

- **Pull parser?**: The parser is implemented as an iterator that emits tokens.
- **Streaming?**: The JSON document being parsed is never fully loaded into memory. It is read & validated byte by byte. This makes it ideal for dealing with large JSON documents.
- **Queryable?**: You can configure the parser to only emit & allocate tokens for the parts of the input you are interested in.

JSON is expected to conform to [RFC 8259](https://datatracker.ietf.org/doc/html/rfc8259). However, [newline-delimited JSON](https://github.com/ndjson/ndjson-spec) and [concatenated JSON](https://en.wikipedia.org/wiki/JSON_streaming#Concatenated_JSON) formats are also supported.

Input can come from any source that implements the `Read` trait (e.g. a file, byte slice, network socket, etc.).

## Basic Usage

```rust
use jsn::{TokenReader, mask::*, Format};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Two JSON documents back to back, hence `Format::Concatenated` below.
    let data = r#"
    {
        "name": "John Doe",
        "age": 43,
        "nicknames": [ "joe" ],
        "phone": {
            "carrier": "Verizon",
            "numbers": [ "+44 1234567", "+44 2345678" ]
        }
    }
    {
        "name": "Jane Doe",
        "age": 32,
        "nicknames": [ "J" ],
        "phone": {
            "carrier": "AT&T",
            "numbers": ["+33 38339"]
        }
    }
    "#;

    // Select the first entry under "numbers", plus every "name" and "age".
    let mask = key("numbers").and(index(0))
        .or(key("name"))
        .or(key("age"));

    let mut iter = TokenReader::new(data.as_bytes())
        .with_mask(mask)
        .with_format(Format::Concatenated)
        .into_iter();

    // Only tokens matching the mask are emitted, in document order.
    assert_eq!(iter.next().unwrap()?, "John Doe");
    assert_eq!(iter.next().unwrap()?, 43);
    assert_eq!(iter.next().unwrap()?, "+44 1234567");
    assert_eq!(iter.next().unwrap()?, "Jane Doe");
    assert_eq!(iter.next().unwrap()?, 32);
    assert_eq!(iter.next().unwrap()?, "+33 38339");
    assert_eq!(iter.next(), None);

    Ok(())
}
```

## Quick Explanation

Like traditional streaming parsers, the parser emits JSON tokens. The twist is that you can query them in a "fun" way. The best analogy is [bitmasks](https://stackoverflow.com/questions/10493411/what-is-bit-masking).
If you can use a bitwise `AND` to extract a bit pattern:

```text
input   : 0101 0101
AND
bitmask : 0000 1111
=
pattern : 0000 0101
```

Why can't you use a bitwise `AND` to extract a JSON token pattern?

```text
input     : { "hello": { "name" : "world" } }
AND
json mask : {something that extracts a "hello" key}
=
pattern   : _ ________ { "name" : "world" } _
```

That `{something that extracts a "hello" key}` is what this crate provides.

## Memory Footprint

`jsn` allows you to select the parts of your JSON that are of interest. What you do with those parts and how long you keep them in memory is up to you.

To illustrate this, I'll use the Valgrind DHAT tool to profile the heap memory usage of two similar programs. Both programs read & extract keys from a JSON file. I'll be using the sf-city-lots JSON file (189 MB) from [here](https://raw.githubusercontent.com/zemirco/sf-city-lots-json/33c27c137784a96d0fbd7f329dceda6cc7f49fa3/citylots.json).

- `examples/store-tokens.rs`: This program keeps the extracted tokens in a `Vec`.
- `examples/print-tokens.rs`: This program prints the tokens as they are encountered.

```shell
valgrind --tool=dhat ./target/profiling/examples/store-tokens ~/downloads/citylots.json
# ==1146722== Total:     13,823,524 bytes in 196,541 blocks
# ==1146722== At t-gmax: 7,529,044 bytes in 196,515 blocks
```

```shell
valgrind --tool=dhat ./target/profiling/examples/print-tokens ~/downloads/citylots.json
# ==1152944== Total:     1,240,708 bytes in 196,524 blocks
# ==1152944== At t-gmax: 9,367 bytes in 9 blocks
```

The first number (Total) is the total amount of heap memory that was allocated by the program during its execution. The second number (At t-gmax) is the maximum amount of allocated memory at any one time during execution.

Unsurprisingly, `store-tokens.rs` has a higher footprint. Yet, the crate's utility is still obvious because the total memory allocated (13 MB) is still an order of magnitude less than the size of the file (189 MB).

Things get better when you can operate immediately on tokens as they are yielded (i.e. you do not accumulate them). Not only do you allocate less in total, but your peak footprint is much, much smaller. `print-tokens.rs` ripped through the file while using at most about 9 KB of heap memory at any one time (the 9,367 bytes reported at t-gmax).
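
The difference between accumulating and consuming on the fly can be sketched with the standard library alone. This is only an illustration of the pattern, not code from `jsn` or its examples: lines read from a `Read` source stand in for tokens, and the function names `longest_by_storing` and `longest_by_streaming` are made up for this sketch.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Accumulating consumer (hypothetical helper): every item stays alive in a
/// `Vec` until the function returns, so peak memory grows with the input.
fn longest_by_storing<R: Read>(input: R) -> io::Result<usize> {
    let items: Vec<String> = BufReader::new(input)
        .lines()
        .collect::<io::Result<_>>()?;
    Ok(items.iter().map(|s| s.len()).max().unwrap_or(0))
}

/// Streaming consumer (hypothetical helper): each item is dropped before the
/// next one is read, so only one item is held at a time.
fn longest_by_streaming<R: Read>(input: R) -> io::Result<usize> {
    let mut longest = 0usize;
    for item in BufReader::new(input).lines() {
        longest = longest.max(item?.len());
    }
    Ok(longest)
}

fn main() -> io::Result<()> {
    // Lines stand in for tokens here; a byte slice implements `Read`.
    let data = "John Doe\n43\n+44 1234567\n";
    let stored = longest_by_storing(data.as_bytes())?;
    let streamed = longest_by_streaming(data.as_bytes())?;
    // Same answer either way; only the memory profile differs.
    assert_eq!(stored, streamed);
    println!("longest item: {streamed} bytes");
    Ok(())
}
```

Both functions compute the same result. The streaming version is the shape to aim for when the work per token (printing, counting, forwarding) does not need earlier tokens to stay around.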