| Crates.io | fastcsv |
| lib.rs | fastcsv |
| version | 0.2.2 |
| created_at | 2025-11-12 13:35:54.817736+00 |
| updated_at | 2025-11-12 14:04:16.561698+00 |
| description | A fast SIMD parser for CSV files as defined by RFC 4180, based on simdcsv (C++) |
| homepage | |
| repository | https://github.com/jagtesh/simdcsv |
| max_upload_size | |
| id | 1929368 |
| size | 1,945,316 |
A fast SIMD parser for CSV files as defined by RFC 4180.
Now written in Rust! This project has been migrated from C++ to Rust to leverage memory safety, better cross-platform support, and LLVM's powerful vectorization capabilities.
#[inline(always)] hints and target feature attributes for optimal code generation
# Build release version with native CPU optimizations
cargo build --release
# The binary will be at target/release/simdcsv
The project automatically detects your CPU architecture and enables appropriate SIMD features via .cargo/config.toml.
# Parse a CSV file
./target/release/simdcsv <file.csv>
# Verbose output with statistics
./target/release/simdcsv -v <file.csv>
# Dump parsed field positions
./target/release/simdcsv -d <file.csv>
# Run with custom iteration count for benchmarking
./target/release/simdcsv -i 1000 <file.csv>
# Parse the included example files
./target/release/simdcsv examples/nfl.csv
./target/release/simdcsv examples/EDW.TEST_CAL_DT.csv
On modern x86_64 CPUs with AVX2 support, simdcsv achieves approximately 3.9 GB/s throughput parsing RFC 4180-compliant CSV files. That is 71% of the C++ baseline performance, achieved with a fully safe Rust implementation with no unsafe code in the hot path.
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific test
cargo test test_parse_simple_csv
The parsing algorithm follows a similar approach to simdjson:
Read a CSV file into a buffer. As usual, the buffer will be cache-line-aligned and padded so that even an exuberantly long SIMD read in an unrolled loop can safely happen without having to worry about unsafe reads.
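A minimal sketch of the padding idea, not taken from the crate: the names `SIMD_PADDING`, `BLOCK`, `padded_copy` and `block_at` are hypothetical, and the 64-byte padding is an assumed value. The sketch does not cache-line-align the allocation (a plain `Vec<u8>` gives no such guarantee); it only shows why trailing padding lets a full-width read start at any valid offset.

```rust
/// Assumed padding size: one 64-byte cache line (hypothetical value).
const SIMD_PADDING: usize = 64;
/// Assumed width of one SIMD read in bytes (an AVX2 register is 32 bytes).
const BLOCK: usize = 32;

/// Copies `data` into a new buffer followed by `SIMD_PADDING` zero bytes.
/// Returns the padded buffer and the logical (unpadded) length.
/// Zero is not a quote, comma, CR or LF, so the padding never looks structural.
fn padded_copy(data: &[u8]) -> (Vec<u8>, usize) {
    let mut buf = Vec::with_capacity(data.len() + SIMD_PADDING);
    buf.extend_from_slice(data);
    buf.resize(data.len() + SIMD_PADDING, 0);
    (buf, data.len())
}

/// Returns the `BLOCK` bytes starting at `offset`.
/// For any `offset` below the logical length this stays inside the padded
/// buffer, because `BLOCK <= SIMD_PADDING`.
fn block_at(buf: &[u8], offset: usize) -> &[u8] {
    &buf[offset..offset + BLOCK]
}
```

With this layout, calling `block_at` on the very last data byte still returns a full block: the first byte is real data and the rest is zero padding.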
Identification of CSV fields. This process will be considerably simpler, as unlike simdjson, we will not have to implement a complex grammar.
a) We need to identify where quotes are first - this ensures that escaped commas and CR-LF pairs are not treated as separators. Since RFC 4180 defines our quote convention as using "" for an escaped quote in all circumstances where they appear, and otherwise pairing quotes at the start and end of a field, this means that our quote detection code from simdjson (see https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/ for a write-up) will allow us to identify all regions where we are 'inside' a quote quite easily.
The "edges" that we will identify here are relatively complex as we will nominally leave and reenter a quoted field every time we encounter a doubled-quote. So for example,
,"foo""bar,",
encountered in a field will cause us to 'leave and reenter' our quoted field between the 'foo' and the 'bar'. However, this will have no real effect on the main point of this pass, which is to identify unescaped commas and CR-LF sequences.
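The quote pass can be illustrated with a portable sketch. The linked write-up uses a carry-less multiply (PCLMULQDQ) by an all-ones constant; the code below swaps in an equivalent shift-and-XOR prefix computation so it runs on any target. It is not the crate's code, and the names `byte_mask`, `prefix_xor` and `quoted_region_mask` are hypothetical. Blocks are assumed to be 64 bytes, one bit per byte.

```rust
/// Returns a bitmask with bit `i` set when byte `i` of `block` equals `needle`.
/// A SIMD implementation would do this with a compare and a movemask.
fn byte_mask(block: &[u8; 64], needle: u8) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if b == needle {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Prefix XOR: bit `i` of the result is the XOR of bits `0..=i` of `bits`.
/// This gives the same result as a carry-less multiply by all-ones.
fn prefix_xor(mut bits: u64) -> u64 {
    bits ^= bits << 1;
    bits ^= bits << 2;
    bits ^= bits << 4;
    bits ^= bits << 8;
    bits ^= bits << 16;
    bits ^= bits << 32;
    bits
}

/// Computes the "inside quotes" mask for one 64-byte block.
/// `prev_inside` carries the quote state across blocks: it is true when the
/// previous block ended inside a quoted field, and is updated on return.
fn quoted_region_mask(block: &[u8; 64], prev_inside: &mut bool) -> u64 {
    let quotes = byte_mask(block, b'"');
    let mut inside = prefix_xor(quotes);
    if *prev_inside {
        // Starting inside a quoted field flips the parity of every position.
        inside = !inside;
    }
    *prev_inside = (inside >> 63) & 1 == 1;
    inside
}
```

Run on the example above, the doubled quote clears and then sets the mask again between 'foo' and 'bar', and the comma after 'bar' still falls inside the quoted region, so `byte_mask(block, b',') & !inside` keeps only the two outer commas.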
We then need to scan for commas and CR-LF pairs. This is relatively simple; the only new wrinkle relative to the SIMD scanning techniques in simdjson is that we have to detect a CR followed by an LF.
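As a rough sketch of this scan (again hypothetical names, scalar mask building, and 64-byte blocks as assumptions), the CR-followed-by-LF check reduces to shifting the CR mask by one bit and ANDing it with the LF mask. A carry flag handles a CR that is the last byte of the previous block.

```rust
/// Returns a bitmask with bit `i` set when byte `i` of `block` equals `needle`.
fn byte_mask(block: &[u8; 64], needle: u8) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if b == needle {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Returns `(field_separators, record_separators)` for one 64-byte block.
/// `inside_quotes` is the quoted-region mask for the same block.
/// `prev_ended_with_cr` is true when the previous block's last byte was a CR;
/// it is updated for the next call.
/// Record separators are reported at the position of the LF of each CR-LF pair.
fn separator_masks(
    block: &[u8; 64],
    inside_quotes: u64,
    prev_ended_with_cr: &mut bool,
) -> (u64, u64) {
    let commas = byte_mask(block, b',');
    let cr = byte_mask(block, b'\r');
    let lf = byte_mask(block, b'\n');

    // Bit `i` is set when byte `i - 1` is a CR (bit 0 comes from the carry).
    let cr_before = (cr << 1) | u64::from(*prev_ended_with_cr);
    let crlf = lf & cr_before;
    *prev_ended_with_cr = cr >> 63 == 1;

    (commas & !inside_quotes, crlf & !inside_quotes)
}
```

Reporting the pair at the LF position is a design choice of this sketch; reporting it at the CR instead would work equally well as long as later stages agree on the convention.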
At this point, we can identify all our actual delimiters. There may be additional passes to be done in the SIMD domain, but it's possible that we might at this stage do a bits-to-indexes transform and start working on our CSV document as a series of indexes into our data in a 2-dimensional (at least nominally) array.
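One common way to do the bits-to-indexes transform is to repeatedly take the lowest set bit of the mask. The sketch below assumes 64-bit masks with one bit per input byte; `push_indexes` is a hypothetical name, not the crate's API.

```rust
/// Appends the absolute byte offset of every set bit in `mask` to `out`,
/// in increasing order. `base` is the offset of the block's first byte
/// within the whole document.
fn push_indexes(mut mask: u64, base: usize, out: &mut Vec<usize>) {
    while mask != 0 {
        // Position of the lowest set bit within the block.
        out.push(base + mask.trailing_zeros() as usize);
        // Clear the lowest set bit.
        mask &= mask - 1;
    }
}
```

Calling this once per block with the separator masks yields a flat, sorted list of delimiter offsets, which later stages can treat as the nominal 2-dimensional array of field boundaries.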
Other tasks that need to happen:
The Rust implementation leverages LLVM's vectorization capabilities through:
_mm256_* intrinsics
vld1q_* and vceqq_* intrinsics
#[inline(always)] attributes on hot path functions to encourage inlining
#[target_feature] attributes to enable instruction set extensions
is_x86_feature_detected!() for CPU capability checking
_mm_prefetch to reduce cache misses
The .cargo/config.toml automatically sets -C target-cpu=native to enable all available CPU features at compile time.
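To show how these pieces usually fit together, here is a hedged sketch of runtime dispatch between an AVX2 path and a scalar fallback. It is not the crate's code: the function names are hypothetical, the 32-byte block size is an assumption, and unlike the safe hot path described above, this sketch calls the intrinsics directly and therefore needs `unsafe`.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable fallback: bit `i` is set when byte `i` of `block` is a comma.
fn comma_mask_scalar(block: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in block.iter().enumerate() {
        if b == b',' {
            mask |= 1u32 << i;
        }
    }
    mask
}

/// AVX2 version: compares all 32 bytes at once and packs the result into bits.
/// Must only be called on CPUs that support AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn comma_mask_avx2(block: &[u8; 32]) -> u32 {
    unsafe {
        // Unaligned load, so `block` needs no particular alignment.
        let bytes = _mm256_loadu_si256(block.as_ptr() as *const __m256i);
        let commas = _mm256_set1_epi8(b',' as i8);
        let eq = _mm256_cmpeq_epi8(bytes, commas);
        _mm256_movemask_epi8(eq) as u32
    }
}

/// Picks the AVX2 path when the CPU reports support, else the scalar path.
fn comma_mask(block: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { comma_mask_avx2(block) };
        }
    }
    comma_mask_scalar(block)
}
```

Both paths return the same mask, so callers only ever see `comma_mask`; on non-x86_64 targets the AVX2 function is compiled out entirely.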
The codebase has been migrated from C++ to Rust with the following improvements:
The original C++ implementation is available in the git history prior to commit d23361a. The Rust implementation maintains the same algorithmic approach while leveraging Rust's memory safety guarantees.
Ge, Chang and Li, Yinan and Eilebrecht, Eric and Chandramouli, Badrish and Kossmann, Donald, Speculative Distributed CSV Data Parsing for Big Data Analytics, SIGMOD 2019.
Mühlbauer, T., Rödiger, W., Seilbeck, R., Reiser, A., Kemper, A., & Neumann, T. (2013). Instant loading for main memory databases. Proceedings of the VLDB Endowment, 6(14), 1702-1713.
MIT