datafusion-datasource-orc

Crates.iodatafusion-datasource-orc
lib.rsdatafusion-datasource-orc
version0.0.1
created_at2026-01-12 13:41:43.496057+00
updated_at2026-01-12 13:41:43.496057+00
descriptionORC file format support for Apache DataFusion
homepage
repositoryhttps://github.com/suxiaogang223/datafusion-datasource-orc
max_upload_size
id2037728
size233,033
Socrates (suxiaogang223)

documentation

README

DataFusion ORC Datasource

License Rust crates.io

A DataFusion extension providing ORC (Optimized Row Columnar) file format support for Apache DataFusion.

Overview

datafusion-datasource-orc adds comprehensive ORC file format support to Apache DataFusion, enabling efficient query execution on ORC data through predicate pushdown, column projection, and async I/O.

Built on top of orc-rust, it implements DataFusion's file format abstraction traits (FileFormat, FileSource, FileOpener) to provide a seamless experience similar to DataFusion's built-in Parquet support.

Features

  • Schema Inference - Automatically infer table schema from ORC files
  • Statistics Extraction - Extract file-level statistics for query optimization
  • Predicate Pushdown - Filter data at stripe level using ORC row indexes
  • Column Projection - Push down column selection to read only needed data
  • Async I/O - Non-blocking reads with support for S3, GCS, Azure, and local filesystems
  • Multi-file Schema Merging - Seamlessly query across multiple ORC files

Installation

Add to your Cargo.toml:

[dependencies]
datafusion-datasource-orc = "0.0.1"
datafusion = "51"

Quick Start

use datafusion::prelude::*;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion_datasource_orc::OrcFormat;
use std::sync::Arc;

#[tokio::main]
async fn main() -> datafusion_common::Result<()> {
    // Create a SessionContext
    let ctx = SessionContext::new();

    // Configure listing options with ORC format
    let listing_options = ListingOptions::new(Arc::new(OrcFormat::default()))
        .with_file_extension(".orc");

    // Create a listing table URL
    let table_path = ListingTableUrl::parse("file:///path/to/orc/files/")?;

    // Register the table
    let config = ListingTableConfig::new(table_path)
        .with_listing_options(listing_options);
    let table = ListingTable::try_new(config)?;
    ctx.register_table("my_table", Arc::new(table))?;

    // Execute query with predicate pushdown
    let df = ctx.sql("SELECT * FROM my_table WHERE id > 100").await?;
    df.show().await?;

    Ok(())
}

Configuration

Configure ORC reading behavior via format options:

use datafusion_datasource_orc::{OrcFormatFactory, OrcFormatOptions, OrcReadOptions};

let read_options = OrcReadOptions::default()
    .with_batch_size(16384)              // Rows per batch
    .with_pushdown_predicate(true)        // Enable predicate pushdown
    .with_metadata_size_hint(1_048_576);  // Metadata buffer hint

let format_options = OrcFormatOptions { read: read_options };
let orc_factory = OrcFormatFactory::new_with_options(format_options);

let ctx = SessionContext::new();
ctx.register_file_format("orc", Arc::new(orc_factory))?;

Format Options

Option Type Default Description
orc.batch_size usize 1024 Number of rows per RecordBatch
orc.pushdown_predicate bool true Enable/disable predicate pushdown
orc.metadata_size_hint usize 32768 Metadata allocation hint in bytes

Type Support

ORC Type Arrow Type Status
BOOLEAN Boolean
TINYINT Int8
SMALLINT Int16
INT Int32
BIGINT Int64
FLOAT Float32
DOUBLE Float64
STRING String
BINARY Binary
TIMESTAMP Timestamp
LIST List
MAP Map
STRUCT Struct
DECIMAL Decimal128
DATE Date32
VARCHAR String
CHAR String

Architecture

SQL Query
    ↓
DataFusion Logical Plan
    ↓
DataFusion Physical Plan
    ↓
OrcFormat.create_physical_plan()
    ↓
DataSourceExec (using OrcSource)
    ↓
OrcOpener.open()
    ↓
orc-rust ArrowReader
    ↓
Arrow RecordBatch Stream

Core Components

  • OrcFormat - Implements FileFormat trait, provides schema inference and statistics
  • OrcSource - Implements FileSource trait, handles predicate pushdown
  • OrcOpener - Implements FileOpener trait, manages file opening and stream creation
  • ObjectStoreChunkReader - Bridges DataFusion's object_store to orc-rust's reader

Development

Building

cargo build

Testing

# Run all tests
cargo test

# Run specific test module
cargo test --test basic_reading
cargo test --test predicate_pushdown

Benchmarks

# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench --bench orc_query_sql -- full_table_scan

TODO

  • Additional ORC type support (STRUCT, DECIMAL, DATE, VARCHAR, CHAR)
  • Enhanced error handling and edge case coverage
  • Write support (query results to ORC format)
  • Compression options configuration
  • Performance optimizations
  • Comprehensive integration tests

Related Projects

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

Built on top of the excellent orc-rust library and inspired by DataFusion's Parquet implementation.

Commit count: 25

cargo fmt