csv_deserializer

Crates.io: csv_deserializer
lib.rs: csv_deserializer
version: 0.1.4
created_at: 2025-12-24 17:48:22.033136+00
updated_at: 2025-12-30 18:35:51.489602+00
description: A rust library to load a csv in a rusty typed way, all values are converted to the most appropriate enum
repository: https://github.com/AliothCancer/csv_deserializer
id: 2003625
size: 46,374
(AliothCancer)

What is this repo?

This repo contains a Rust binary (main.rs) that translates a CSV table into Rust types. Every column is converted into a Vec of an enum representing all the unique types found in that column. If a column is of String type, every unique string is deserialized as an enum variant (see the iris dataset example).

  • The binary writes all the generated Rust code to stdout, so it can easily be redirected to a file from the terminal.

Bin installation

  1. Clone the repo
git clone https://github.com/AliothCancer/csv_deserializer.git

A folder called csv_deserializer will be created.

  2. Move inside that folder
cd csv_deserializer
  3. Compile the project
cargo build --release
  4. Copy the binary into a local bin directory
  • Assuming ~/.local/bin:
    • is in $PATH (bash)
    • is in $env.PATH (nushell)
cp target/release/csv_deserializer ~/.local/bin

Bin usage

❯ csv_deserializer -h
Usage: csv_deserializer [OPTIONS] --input-file <input_file>

Options:
  -i, --input-file <input_file>
  -n, --null-values <a,b,..>
  -h, --help                     Print help
  -V, --version                  Print version

Note on null values:

  • --null-values is an optional comma-separated list of strings that are converted to the Null variant, which every generated enum has

Lib Usage Guide

There are two structs that represent the CSV file as Rust types:

#[derive(Debug)]
pub struct CsvDataset<'a> {
    pub names: Vec<ColName>,
    pub values: Vec<Vec<CsvAny>>,
    pub null_values: NullValues<'a>,
    pub info: Vec<ColumnInfo>,
}
  1. CsvDataset is defined in lib.rs. It can also be used on its own to easily load a CSV. Every CSV "cell" is stored as a CsvAny:
#[derive(Debug, PartialEq, PartialOrd, Clone)]
pub enum CsvAny {
    Str(String),
    Int(i64),
    Float(f64),
    Null,  // to represent null values
    Empty, // if it is just empty
}
  2. CsvDataFrame is generated by the binary of this crate, so it is available only after you put the generated Rust code in an .rs file and declare it as a module. Its exact structure depends on the CSV file you passed, i.e. the names of the columns and the unique values of each column. (See the iris example as a reference for the structure of this type.)
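As a rough illustration of how a raw cell could end up in one of the CsvAny variants listed above, here is a minimal sketch. The enum mirrors the definition shown in this README; the function classify_cell, its signature, and the order of the checks (empty, then null strings, then i64, then f64) are assumptions made for illustration and are not taken from the crate's source.

```rust
// Local copy of the CsvAny enum shown in this README, so the sketch is self-contained.
#[derive(Debug, PartialEq, PartialOrd, Clone)]
pub enum CsvAny {
    Str(String),
    Int(i64),
    Float(f64),
    Null,
    Empty,
}

// Hypothetical helper (not part of the crate): classify one raw CSV cell.
fn classify_cell(raw: &str, null_values: &[&str]) -> CsvAny {
    if raw.is_empty() {
        return CsvAny::Empty;
    }
    // Assumption: user-supplied null strings are checked before any number parsing.
    if null_values.contains(&raw) {
        return CsvAny::Null;
    }
    // Try the narrower integer type first, then fall back to float.
    if let Ok(i) = raw.parse::<i64>() {
        return CsvAny::Int(i);
    }
    if let Ok(f) = raw.parse::<f64>() {
        return CsvAny::Float(f);
    }
    // Anything that is not a number is kept as a string.
    CsvAny::Str(raw.to_string())
}

fn main() {
    let nulls = ["NA"];
    for raw in ["5.1", "42", "NA", "", "Iris-setosa"] {
        println!("{:?} -> {:?}", raw, classify_cell(raw, &nulls));
    }
}
```

Note that in this sketch the standard f64 parser also accepts strings such as "NaN" and "inf"; whether the crate treats those as floats is not stated here.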

To use this library for generating and utilizing a typed Rust interface for your CSV files, follow these steps:

1. Loading the Dataset

First, load your CSV file using a csv::Reader. You then create a CsvDataset by providing the reader and specifying which strings should be treated as null values.

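use std::fs::File;
// CsvDataset and NullValues come from this crate

// Open the file and build a reader that treats the first row as headers
let file = File::open("iris.csv")?;
let rdr = csv::ReaderBuilder::new()
    .has_headers(true)
    .from_reader(file);

// "NA" cells will be treated as null values
let dataset = CsvDataset::new(rdr, NullValues(&["NA"]));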

2. Generating Rust Code

Use the csv_deserializer CLI to generate the Rust code for a specific CSV file. The binary prints all the generated code to stdout, so you can redirect its output to a file from the command line to save it.

3. Using the Generated Code

Once the code is saved into a file (e.g., iris.rs), you can import it into your project. To work with the typed data, initialize a CsvDataFrame type by passing the CsvDataset you created earlier.

mod iris;
use iris::*;

let df = CsvDataFrame::new(&dataset);

4. Iris Dataset ETL Example

    // Build a reader for the csv file
    let path = "iris.csv";
    let file = File::open(path)?;
    let rdr = csv::ReaderBuilder::new()
        .has_headers(true)
        .from_reader(file);

    // Build the CsvDataset from the reader and the null values
    let dataset = CsvDataset::new(rdr, NullValues(vec!["NA"]));

    // The iris.rs file is generated with the csv_deserializer binary

    // Inside iris.rs, CsvDataFrame is the main struct
    // that contains all the data
    let df = CsvDataFrame::new(&dataset);

    // Do ETL work in a type-safe way. This sometimes costs some
    // flexibility, so you can always fall back to CsvDataset, which
    // uses CsvAny as the type for every cell

    // The column wrapper, CsvColumn, can be destructured with if let
    if let CsvColumn::target(target) = &df.target
        && let CsvColumn::petal_length_cm(_pet_length) = &df.petal_length_cm
    {
        target.iter().for_each(|x| match x {
            target::Iris_setosa => todo!(),
            target::Iris_versicolor => todo!(),
            target::Iris_virginica => todo!(),
            target::Null => todo!(),
        });
    }

    // Can use a list of all columns
    // make sure to use completion
    // for match arms
    for col in df.get_columns() {
        match col {
            CsvColumn::sepal_length_cm(sepal_length_cms) => todo!(),
            CsvColumn::sepal_width_cm(sepal_width_cms) => todo!(),
            CsvColumn::petal_length_cm(petal_length_cms) => todo!(),
            CsvColumn::petal_width_cm(petal_width_cms) => todo!(),
            CsvColumn::target(targets) => todo!(),
        }
    }

More info

Name sanitization and Type Recognition: Categorical vs Numerical

Sanitization is achieved by converting any number or special character into strings that can be used in the generated code. The function that does this, sanitize_identifier, is in sanitizer.rs.
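To give an idea of what this kind of sanitization involves, here is a minimal sketch that reproduces the two mappings visible in this README ("sepal length (cm)" to sepal_length_cm and "Iris-setosa" to Iris_setosa). The function sanitize_sketch is hypothetical and is not the crate's sanitize_identifier; its rules (collapse non-alphanumeric runs into one underscore, trim underscores at both ends, prefix an underscore when the name starts with a digit, use the placeholder _empty for empty results) are assumptions.

```rust
// Hypothetical sketch, NOT the crate's actual `sanitize_identifier`.
fn sanitize_sketch(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        if ch.is_ascii_alphanumeric() {
            out.push(ch);
        } else if !out.ends_with('_') {
            // Collapse any run of non-alphanumeric characters into one underscore.
            out.push('_');
        }
    }
    // Drop underscores left over at either end, e.g. from a closing parenthesis.
    let trimmed = out.trim_matches('_');
    match trimmed.chars().next() {
        // Assumption: nothing usable left, fall back to a placeholder name.
        None => "_empty".to_string(),
        // Rust identifiers cannot start with a digit, so prefix an underscore.
        Some(c) if c.is_ascii_digit() => format!("_{trimmed}"),
        Some(_) => trimmed.to_string(),
    }
}

fn main() {
    for raw in ["sepal length (cm)", "Iris-setosa", "5.1"] {
        println!("{raw:?} -> {}", sanitize_sketch(raw));
    }
}
```

The real function may treat digits and special characters differently (for example by spelling them out), so check sanitizer.rs for the actual behaviour.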

The library identifies types by attempting to parse each raw CSV value.

  • Numerical: If a value parses as an i64, it is treated as an Int; if it parses as an f64, it is treated as a Float. For example, for the sepal length (cm) column of the iris dataset, the resulting type is:
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum sepal_length_cm {
    Float(f64),
    Null,
}
// Also implement from string
impl std::str::FromStr for sepal_length_cm {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let f = s.parse::<f64>().unwrap();
        Ok(sepal_length_cm::Float(f))
    }
}
  • Categorical: Values that cannot be parsed as numbers are treated as Str. The generated Rust code for a column of string values looks like this (example from the iris dataset):
create_enum!(target;
"Iris-setosa" => Iris_setosa,
"Iris-versicolor" => Iris_versicolor,
"Iris-virginica" => Iris_virginica,
Null,
);

The create_enum macro is syntactic sugar for associating raw strings with the typed enum variants.

  • Metadata: ColumnInfo tracks the count of these types and stores unique variants to facilitate categorical Enum generation.
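For readers curious about how a macro like create_enum could work, the following is a hypothetical macro_rules sketch that accepts the same invocation syntax shown above. It is not the crate's macro: the derives, the FromStr implementation, and the choice to return an error for unknown strings (rather than mapping them to Null) are all assumptions.

```rust
// Hypothetical sketch of a create_enum-style macro; the crate's real macro may differ.
macro_rules! create_enum {
    ($name:ident; $($raw:literal => $variant:ident,)+ Null $(,)?) => {
        #[allow(non_camel_case_types)]
        #[derive(Debug, Clone, Copy, PartialEq)]
        pub enum $name {
            $($variant,)+
            Null,
        }

        impl std::str::FromStr for $name {
            type Err = String;

            fn from_str(s: &str) -> Result<Self, Self::Err> {
                match s {
                    // One match arm per raw string / variant pair.
                    $($raw => Ok($name::$variant),)+
                    // Assumption: unknown strings are an error in this sketch.
                    other => Err(format!(
                        "unknown value for {}: {}",
                        stringify!($name),
                        other
                    )),
                }
            }
        }
    };
}

create_enum!(target;
    "Iris-setosa" => Iris_setosa,
    "Iris-versicolor" => Iris_versicolor,
    "Iris-virginica" => Iris_virginica,
    Null,
);

fn main() {
    let parsed: target = "Iris-setosa".parse().expect("known variant");
    println!("{:?}", parsed);
    println!("{:?}", "Iris-unknown".parse::<target>());
}
```

Keeping the raw string next to its variant in a single list means the enum definition and the string mapping cannot drift apart, which is presumably the motivation for using a macro here.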

Main structure of the generated code

This is the example for the iris dataset:

sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa

Rust generated code:

#[derive(Debug)]
pub enum CsvColumn {
    sepal_length_cm(Vec<sepal_length_cm>),
    sepal_width_cm(Vec<sepal_width_cm>),
    petal_length_cm(Vec<petal_length_cm>),
    petal_width_cm(Vec<petal_width_cm>),
    target(Vec<target>),
}

pub struct CsvDataFrame {
    pub sepal_length_cm: CsvColumn,
    pub sepal_width_cm: CsvColumn,
    pub petal_length_cm: CsvColumn,
    pub petal_width_cm: CsvColumn,
    pub target: CsvColumn,
}

Each enum used to represent a CSV value has a Null variant.
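Because every column enum carries a Null variant, code that aggregates a column has to decide what to do with nulls. The sketch below shows one option, skipping them, using a local copy of the sepal_length_cm enum shown earlier. The helper mean_ignoring_nulls is hypothetical and not part of the generated code.

```rust
// Local copy of the generated enum shown in this README.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum sepal_length_cm {
    Float(f64),
    Null,
}

// Hypothetical helper: mean of a column, skipping Null cells.
// Returns None when the column has no non-null values.
fn mean_ignoring_nulls(col: &[sepal_length_cm]) -> Option<f64> {
    let (sum, count) = col.iter().fold((0.0_f64, 0usize), |(s, n), v| match v {
        sepal_length_cm::Float(f) => (s + f, n + 1),
        sepal_length_cm::Null => (s, n),
    });
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

fn main() {
    let col = vec![
        sepal_length_cm::Float(5.1),
        sepal_length_cm::Null,
        sepal_length_cm::Float(4.9),
    ];
    match mean_ignoring_nulls(&col) {
        Some(mean) => println!("mean = {mean:.3}"),
        None => println!("no non-null values"),
    }
}
```

Since the match is exhaustive, adding a new variant to the enum would make this function fail to compile until the new case is handled, which is the kind of type safety the generated code is meant to provide.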
