fastcsv

Crates.iofastcsv
lib.rsfastcsv
version0.2.2
created_at2025-11-12 13:35:54.817736+00
updated_at2025-11-12 14:04:16.561698+00
descriptionA fast SIMD parser for CSV files as defined by RFC 4180, based on simdcsv (C++)
homepage
repositoryhttps://github.com/jagtesh/simdcsv
max_upload_size
id1929368
size1,945,316
Jag Chadha (jagtesh)

documentation

README

simdcsv

A fast SIMD parser for CSV files as defined by RFC 4180.

Now written in Rust! This project has been migrated from C++ to Rust to leverage memory safety, better cross-platform support, and LLVM's powerful vectorization capabilities.

Features

  • High Performance: Utilizes SIMD intrinsics (AVX2 on x86_64, NEON on ARM) for fast CSV parsing
  • RFC 4180 Compliant: Correctly handles quoted fields, escaped quotes, and standard CSV delimiters
  • Cross-Platform: Supports Linux, macOS, and Windows on x86_64 and ARM architectures
  • Memory Safe: Written in Rust with zero-cost abstractions
  • LLVM Optimized: Uses #[inline(always)] hints and target feature attributes for optimal code generation

Building

Prerequisites

  • Rust 1.70 or later (install from rustup.rs)
  • A CPU with SIMD support (AVX2 for x86_64, NEON for ARM)

Build Instructions

# Build release version with native CPU optimizations
cargo build --release

# The binary will be at target/release/simdcsv

The project automatically detects your CPU architecture and enables appropriate SIMD features via .cargo/config.toml.

Usage

# Parse a CSV file
./target/release/simdcsv <file.csv>

# Verbose output with statistics
./target/release/simdcsv -v <file.csv>

# Dump parsed field positions
./target/release/simdcsv -d <file.csv>

# Run with custom iteration count for benchmarking
./target/release/simdcsv -i 1000 <file.csv>

Examples

# Parse the included example files
./target/release/simdcsv examples/nfl.csv
./target/release/simdcsv examples/EDW.TEST_CAL_DT.csv

Performance

On modern x86_64 CPUs with AVX2 support, simdcsv achieves approximately 3.9 GB/s throughput parsing RFC 4180-compliant CSV files, which is 71% of the C++ baseline performance using a fully safe Rust implementation with no unsafe code in the hot path.

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test
cargo test test_parse_simple_csv

Architecture

The parsing algorithm follows a similar approach to simdjson:

  1. Read in a CSV file into a buffer - as per usual, the buffer will be cache-line-aligned and padded so that even an exuberantly long SIMD read in a unrolled loop can safely happen without having to worry about unsafe reads.

  2. Identification of CSV fields. This process will be considerably simpler, as unlike simdjson, we will not have to a implement a complex grammar.

a) We need to identify where are quotes are first - this ensures that escaped commas and CR-LF pairs are not treated as separators. Since RFC 4180 defines our quote convention as using "" for an escaped quote in all circumstances where they appear, and otherwise pairing quotes at the start and end of a field, this means that our quote detection code from simjson (see https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/ for a write-up) will allow us to identify all regions where we are 'inside' a quote quite easily.

The "edges" that we will identify here are relatively complex as we will nominally leave and reenter a quoted field every time we encounter a doubled-quote. So for example,

,"foo""bar,",

encountered in a field will cause us to 'leave and renenter' our quoted field between the 'foo' and the 'bar'. However, this will have no real effect on the main point of this pass, which is to identify unescaped commas and CR-LF sequences.

  1. Comma and CR-LF detection.

We need to then scan for commas and CR-LF pairs. This is relatively simple and the only new wrinkle on SIMD scanning techniques in simdjson is the fact that we have to detect a CR followed by a LF.

At this point, we can identify all our actual delimiters. There may be additional passes to be done in the SIMD domain, but it's possible that we might at this stage do a bits-to-indexes transform and start working on our CSV document as a series of indexes into our data in a 2-dimensional (at least nominally) array.

Other tasks that need to happen:

  • We should validate that the things that appear as "textdata" within the fields are valid ASCII as per the standard.
  • UTF validation is not covered by RFC 4180 but will surely be a necessity.
  • Numbers that appear within fields will likely need to be converted to integer or floating point values
  • The escaped text will need to be converted (in situ or in newly allocated storage) into unescaped variants
  • It should be possible to parse only some columns, without incurring much of a price for skipping the other columns.

SIMD Implementation

The Rust implementation leverages LLVM's vectorization capabilities through:

Target Features

  • AVX2 (x86_64): Used for 256-bit SIMD operations with _mm256_* intrinsics
  • PCLMULQDQ (x86_64): Carryless multiplication for efficient quote detection
  • NEON (ARM): 128-bit SIMD operations with vld1q_* and vceqq_* intrinsics

Optimization Techniques

  • #[inline(always)] attributes on hot path functions to encourage inlining
  • #[target_feature] attributes to enable instruction set extensions
  • Runtime feature detection with is_x86_feature_detected!() for CPU capability checking
  • Buffered processing (4-chunk buffering) for better instruction pipelining
  • Prefetching with _mm_prefetch to reduce cache misses
  • Explicit loop unrolling in bit-flattening routines

Build Configuration

The .cargo/config.toml automatically sets -C target-cpu=native to enable all available CPU features at compile time.

Migration from C++ to Rust

The codebase has been migrated from C++ to Rust with the following improvements:

Benefits

  • Memory Safety: No manual memory management, automatic cleanup via RAII (Drop trait)
  • Cross-Platform: Better platform abstraction through Rust's standard library
  • Modern Tooling: Cargo for dependency management, testing, and building
  • Error Handling: Type-safe error handling with Result types
  • Zero-Cost Abstractions: Rust's abstractions compile to efficient code

Original C++ Implementation

The original C++ implementation is available in the git history prior to commit d23361a. The Rust implementation maintains the same algorithmic approach while leveraging Rust's memory safety guarantees.

Performance Comparison

  • C++ baseline: ~5.5 GB/s on x86_64 with AVX2
  • Rust implementation: ~3.9 GB/s on x86_64 with AVX2 (71% of C++ performance)
  • Fully safe implementation using chunked allocation strategy - no unsafe code in hot paths
  • The performance gap is a reasonable tradeoff for complete memory safety

References

Ge, Chang and Li, Yinan and Eilebrecht, Eric and Chandramouli, Badrish and Kossmann, Donald, Speculative Distributed CSV Data Parsing for Big Data Analytics, SIGMOD 2019.

Mühlbauer, T., Rödiger, W., Seilbeck, R., Reiser, A., Kemper, A., & Neumann, T. (2013). Instant loading for main memory databases. Proceedings of the VLDB Endowment, 6(14), 1702-1713.

License

MIT

Commit count: 0

cargo fmt