file_to_json

Crates.io: file_to_json
lib.rs: file_to_json
version: 0.1.6
created_at: 2025-11-08 16:16:42.437533+00
updated_at: 2025-11-25 01:15:16.310614+00
description: Convert arbitrary text-based files into JSON using local parsers and an OpenRouter-powered fallback.
repository: https://github.com/tomtang/file_to_json
id: 1923019
size: 8,358,921
Tom Tang (shiba4life)

file_to_json

file_to_json is a Rust library that converts arbitrary text-based files into JSON. It understands a set of common structured formats locally (CSV, JSON, YAML, TOML) and falls back to an OpenRouter-hosted LLM for any formats it does not recognise.

Features

  • Local parsers for CSV, JSON, YAML, and TOML.
  • Automatic PDF text extraction before calling the LLM.
  • OpenRouter LLM fallback (default text model: anthropic/claude-3.7-sonnet).
  • Automatic chunking for large text payloads to stay within LLM limits.
  • Safeguards against sending large or non-UTF-8 payloads to the LLM.
  • Vision-aware fallback for common image formats (JPEG/PNG/GIF/WebP) that captions images via OpenRouter and emits structured metadata.
  • Simple API returning serde_json::Value.
  • Configurable fallback strategies for large files (chunking or code generation).

Installation

Add the crate to your project:

cargo add file_to_json --git https://github.com/your-org/file_to_json

(Replace the repository URL with where you host the crate.)

For Contributors

This repository uses Git LFS to manage large example files. After cloning, you'll need to:

  1. Install Git LFS: brew install git-lfs (macOS) or see git-lfs.github.com
  2. Initialize: git lfs install
  3. Pull large files: git lfs pull

See examples/README.md for more details.

Usage

use file_to_json::{Converter, FallbackStrategy, OpenRouterConfig};
use std::time::Duration;

fn main() -> Result<(), file_to_json::ConvertError> {
    // Settings for the OpenRouter fallback; local parsers do not call the API.
    let config = OpenRouterConfig {
        api_key: "sk-or-...".to_string(),
        model: "anthropic/claude-3.7-sonnet".to_string(),
        timeout: Duration::from_secs(60),
        fallback_strategy: FallbackStrategy::Chunked,
        vision_model: Some("anthropic/claude-3.7-sonnet".to_string()),
        max_image_bytes: 5 * 1024 * 1024, // 5 MiB
    };

    let converter = Converter::new(config)?;
    // CSV is one of the locally parsed formats.
    let value = converter.convert_path("data/sample.csv")?;
    // `value` is a serde_json::Value; pretty-print it.
    println!("{}", serde_json::to_string_pretty(&value)?);
    Ok(())
}

Configuration

The OpenRouterConfig struct accepts the following fields:

  • api_key – required. Your OpenRouter API key.
  • model – optional. Defaults to anthropic/claude-3.7-sonnet.
  • timeout – optional. Request timeout duration. Defaults to 60 seconds.
  • fallback_strategy – optional. FallbackStrategy::Chunked (default) or FallbackStrategy::CodeGeneration.
  • vision_model – optional. Defaults to anthropic/claude-3.5-sonnet. Must support image inputs and JSON output.
  • max_image_bytes – optional. Maximum size (bytes) of image payloads; defaults to 5242880 (5 MiB).
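
The defaults listed above can be collected in one place. The following is a minimal sketch using stand-in types: `ConfigSketch`, `FallbackStrategySketch`, and `with_defaults` are hypothetical names introduced for illustration and are not part of the crate's API, and the field types (for example `usize` for `max_image_bytes`) are assumptions. Only the default values themselves come from this README.

```rust
use std::time::Duration;

/// Stand-in for the crate's `FallbackStrategy` (illustrative only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FallbackStrategySketch {
    Chunked,
    CodeGeneration,
}

/// Stand-in mirroring the documented `OpenRouterConfig` fields.
/// Field types are assumptions made for this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct ConfigSketch {
    api_key: String,
    model: String,
    timeout: Duration,
    fallback_strategy: FallbackStrategySketch,
    vision_model: Option<String>,
    max_image_bytes: usize,
}

#[allow(dead_code)]
impl ConfigSketch {
    /// Takes the one required value and fills in the documented defaults.
    fn with_defaults(api_key: &str) -> Self {
        ConfigSketch {
            api_key: api_key.to_string(),
            model: "anthropic/claude-3.7-sonnet".to_string(),
            timeout: Duration::from_secs(60),
            fallback_strategy: FallbackStrategySketch::Chunked,
            vision_model: Some("anthropic/claude-3.5-sonnet".to_string()),
            max_image_bytes: 5 * 1024 * 1024, // 5242880 bytes (5 MiB)
        }
    }
}
```

Calling `ConfigSketch::with_defaults("sk-or-...")` yields a value whose fields match the defaults in the list; individual fields can then be overridden with struct update syntax.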

Behaviour

  1. If the file extension is recognised, the crate parses it locally.
  2. If the file looks like a supported image (JPEG/PNG/GIF/WebP) it is base64-encoded and sent to the configured vision model, which is prompted to return JSON metadata containing a summary, tags, objects, dominant_colors, and confidence.
  3. Otherwise it sends the UTF-8 content (after extracting text for PDFs) to OpenRouter. For inputs that exceed 128 KiB, the configured fallback strategy determines how to proceed:
    • chunked (FallbackStrategy::Chunked, the default): splits the input into ≤128 KiB segments, converts each chunk, and merges the returned JSON (arrays concatenated, objects shallow-merged, mixed types wrapped in an array). Works best when each chunk shares a compatible structure.
    • code (FallbackStrategy::CodeGeneration): sends the first, middle, and last 10 lines to the model, asks for Python 3 code that can parse the full file, writes that code to a temporary script, and executes it locally to produce JSON (requires python3 on the PATH).
  4. The result is returned as serde_json::Value.
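
The chunked strategy in step 3 can be sketched with the standard library alone. This is not the crate's implementation: `Json` is a small stand-in for `serde_json::Value`, and `split_chunks` and `merge` are hypothetical helpers. Cutting at newlines, letting later object keys overwrite earlier ones, and returning a single result unchanged are assumptions; the README only states the 128 KiB limit and the three merge rules.

```rust
use std::collections::BTreeMap;

/// Segment limit stated in the README: 128 KiB.
#[allow(dead_code)]
const MAX_CHUNK_BYTES: usize = 128 * 1024;

/// Minimal JSON value for this sketch (the crate returns `serde_json::Value`).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Splits `input` into pieces of at most `max_bytes` bytes, preferring to cut
/// just after a newline and never cutting inside a UTF-8 character.
#[allow(dead_code)]
fn split_chunks(input: &str, max_bytes: usize) -> Vec<&str> {
    assert!(max_bytes >= 4, "limit must fit any UTF-8 character");
    let mut chunks = Vec::new();
    let mut rest = input;
    while rest.len() > max_bytes {
        // Back off to the nearest character boundary at or below the limit.
        let mut cut = max_bytes;
        while !rest.is_char_boundary(cut) {
            cut -= 1;
        }
        // Prefer ending the chunk right after the last newline, if any.
        if let Some(pos) = rest[..cut].rfind('\n') {
            cut = pos + 1;
        }
        let (head, tail) = rest.split_at(cut);
        chunks.push(head);
        rest = tail;
    }
    if !rest.is_empty() {
        chunks.push(rest);
    }
    chunks
}

/// Merges per-chunk results: all arrays are concatenated, all objects are
/// shallow-merged (later keys win, an assumption), anything else is wrapped
/// in one array. A single result is returned as-is (also an assumption).
#[allow(dead_code)]
fn merge(mut values: Vec<Json>) -> Json {
    if values.len() == 1 {
        return values.remove(0);
    }
    if values.iter().all(|v| matches!(v, Json::Array(_))) {
        let mut out = Vec::new();
        for v in values {
            if let Json::Array(items) = v {
                out.extend(items);
            }
        }
        Json::Array(out)
    } else if values.iter().all(|v| matches!(v, Json::Object(_))) {
        let mut out = BTreeMap::new();
        for v in values {
            if let Json::Object(map) = v {
                out.extend(map);
            }
        }
        Json::Object(out)
    } else {
        Json::Array(values)
    }
}
```

In a real pipeline each piece from `split_chunks(text, MAX_CHUNK_BYTES)` would be converted by the model and the results passed to `merge`. The shallow merge shows why chunks with incompatible structures give poor results: nested values under the same key are replaced rather than combined.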

Binary files are rejected unless they are supported images (handled by the vision model), can be converted to UTF-8 text (e.g. PDFs via the built-in extractor), or can be handled by the code-generation fallback.
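
The routing described in this section can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not the crate's code: `Route` and `route` are hypothetical names, the decision is made from the file extension plus a UTF-8 validity check, the `.yml` alias is assumed, and the code-generation escape hatch for binary files is left out.

```rust
use std::path::Path;

/// Where a file would be sent (hypothetical enum for this sketch).
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Parsed locally; the payload names the format.
    LocalParser(&'static str),
    /// Base64-encoded and sent to the vision model.
    VisionModel,
    /// Text is extracted first, then sent to the text model.
    PdfTextThenLlm,
    /// Valid UTF-8 in an unrecognised format, sent to the text model.
    TextLlm,
    /// Binary content that none of the paths above can handle.
    Rejected,
}

/// Picks a route from the extension, falling back to a UTF-8 check.
#[allow(dead_code)]
fn route(path: &Path, bytes: &[u8]) -> Route {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();
    match ext.as_str() {
        "csv" => Route::LocalParser("csv"),
        "json" => Route::LocalParser("json"),
        // Treating `.yml` as YAML is an assumption of this sketch.
        "yaml" | "yml" => Route::LocalParser("yaml"),
        "toml" => Route::LocalParser("toml"),
        "jpg" | "jpeg" | "png" | "gif" | "webp" => Route::VisionModel,
        "pdf" => Route::PdfTextThenLlm,
        // Anything else must at least be valid UTF-8 to reach the LLM.
        _ if std::str::from_utf8(bytes).is_ok() => Route::TextLlm,
        _ => Route::Rejected,
    }
}
```

For example, `route(Path::new("data/sample.csv"), bytes)` selects the local CSV parser without inspecting the bytes, while an unknown extension with invalid UTF-8 content is rejected.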

Example: image captioning

Running the bundled example on a JPEG:

cargo run --example convert -- ./examples/data/einstein.jpg <API_KEY>

produces structured JSON similar to:

{
  "summary": "A black and white portrait of an elderly person with wild white hair.",
  "tags": ["portrait", "black and white", "historical"],
  "objects": ["face", "hair", "jacket"],
  "dominant_colors": ["black", "white", "grey"],
  "confidence": 0.98
}
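
Before an image reaches the vision model it has to be recognised as a supported format and checked against max_image_bytes. The README does not say how the crate decides that a file "looks like" an image, so the sketch below assumes magic-byte sniffing with the well-known signatures for JPEG, PNG, GIF, and WebP. `ImageKind`, `sniff_image`, and `accept_image` are hypothetical names.

```rust
/// Supported image formats (hypothetical enum for this sketch).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ImageKind {
    Jpeg,
    Png,
    Gif,
    WebP,
}

/// Identifies an image by its leading signature bytes.
#[allow(dead_code)]
fn sniff_image(bytes: &[u8]) -> Option<ImageKind> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageKind::Jpeg)
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(ImageKind::Png)
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some(ImageKind::Gif)
    } else if bytes.len() >= 12 && &bytes[..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
        Some(ImageKind::WebP)
    } else {
        None
    }
}

/// Applies the size limit first, then the format check.
#[allow(dead_code)]
fn accept_image(bytes: &[u8], max_image_bytes: usize) -> Result<ImageKind, String> {
    if bytes.len() > max_image_bytes {
        return Err(format!(
            "image is {} bytes, limit is {}",
            bytes.len(),
            max_image_bytes
        ));
    }
    sniff_image(bytes).ok_or_else(|| "not a supported image format".to_string())
}
```

Checking the size before sniffing keeps oversized payloads from being processed at all; with the documented default, `accept_image(&bytes, 5 * 1024 * 1024)` would reject anything above 5 MiB.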

Testing

cargo test

License

This project is distributed under the terms of the MIT license.
