Crates.io | datafusion-objectstore-hdfs |
lib.rs | datafusion-objectstore-hdfs |
version | 0.1.4 |
source | src |
created_at | 2022-09-21 05:58:24.843993 |
updated_at | 2023-07-20 06:57:34.054295 |
description | A hdfs object store implemented the object store |
homepage | |
repository | https://github.com/datafusion-contrib/datafusion-objectstore-hdfs |
max_upload_size | |
id | 670708 |
size | 30,548 |
HDFS as a remote ObjectStore for DataFusion.
This crate introduces HadoopFileSystem as a remote ObjectStore, which makes it possible to query files stored on HDFS.
For HDFS access, we rely on the fs-hdfs library. That library only provides Rust FFI bindings for libhdfs, which can be compiled from a set of C files provided by the official Hadoop community.
Since libhdfs is itself just a C interface wrapper, and the real HDFS access is implemented by a set of Java jars, this crate needs the Hadoop client jars and a JRE in order to work.
Install Java.
Specify and export JAVA_HOME.
To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. Currently, we support Hadoop-2 and Hadoop-3.
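Unpack the downloaded Hadoop distribution. For example, suppose the folder is /opt/hadoop. Then set the following environment variables (DYLD_LIBRARY_PATH is the macOS dynamic-library search path; the Linux equivalent is LD_LIBRARY_PATH):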
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export DYLD_LIBRARY_PATH=$JAVA_HOME/jre/lib/server
export CLASSPATH=$CLASSPATH:`hadoop classpath --glob`
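Because the JVM and the Hadoop jars are located through these variables at run time, a missing one tends to surface only as a late failure inside libhdfs. The following is a minimal pre-flight sketch, not part of this crate: `required_vars` and `missing_vars` are hypothetical helper names, and the choice of LD_LIBRARY_PATH on non-macOS platforms is an assumption based on the usual Linux loader convention.

```rust
use std::env;

/// Names of the variables exported in the setup steps above.
/// Assumption: macOS uses DYLD_LIBRARY_PATH, other Unix-like systems LD_LIBRARY_PATH.
fn required_vars() -> Vec<&'static str> {
    let loader_var = if cfg!(target_os = "macos") {
        "DYLD_LIBRARY_PATH"
    } else {
        "LD_LIBRARY_PATH"
    };
    vec!["JAVA_HOME", "HADOOP_HOME", "CLASSPATH", loader_var]
}

/// Returns every name in `names` for which `lookup` yields nothing or only whitespace.
/// `lookup` is injected so the check can be exercised without touching the real environment.
fn missing_vars<F>(names: &[&str], lookup: F) -> Vec<String>
where
    F: Fn(&str) -> Option<String>,
{
    names
        .iter()
        .copied()
        .filter(|&name| lookup(name).map_or(true, |value| value.trim().is_empty()))
        .map(String::from)
        .collect()
}

fn main() {
    // Check the real process environment.
    let missing = missing_vars(&required_vars(), |name| env::var(name).ok());
    if missing.is_empty() {
        println!("HDFS client environment looks complete");
    } else {
        eprintln!("missing environment variables: {}", missing.join(", "));
    }
}
```

Running a check like this before registering the object store gives a readable message instead of a linker or JVM start-up error. It only verifies that the variables are set, not that the paths they point to are valid.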
Suppose there is an HDFS directory,
let hdfs_file_uri = "hdfs://localhost:8020/testing/tpch_1g/parquet/line_item";
that contains a list of Parquet files. We can then query these files as follows:
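use std::sync::Arc;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use url::Url;
// HadoopFileSystem is provided by this crate.
// Register the HDFS object store for the "hdfs" URL scheme.
let ctx = SessionContext::new();
let url = Url::parse("hdfs://").unwrap();
ctx.runtime_env().register_object_store(&url, Arc::new(HadoopFileSystem));
// Register the Parquet directory as a table.
let table_name = "line_item";
println!(
    "Register table {} with parquet file {}",
    table_name, hdfs_file_uri
);
ctx.register_parquet(table_name, &hdfs_file_uri, ParquetReadOptions::default()).await?;
// Run the query; `.await` and `?` require an async function that returns a Result.
let sql = "SELECT count(*) FROM line_item";
let result = ctx.sql(sql).await?.collect().await?;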
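The table path used above combines an `hdfs://` scheme, an authority (typically the NameNode host and port), and a directory path. The following std-only sketch shows how such a URI breaks down. `split_hdfs_uri` is a hypothetical helper for illustration, not an API of this crate or of DataFusion, which perform their own URL parsing.

```rust
/// Splits an `hdfs://` URI into `(authority, path)`.
/// Hypothetical helper for illustration only; returns `None` for other schemes.
fn split_hdfs_uri(uri: &str) -> Option<(&str, &str)> {
    let rest = uri.strip_prefix("hdfs://")?;
    match rest.find('/') {
        // Everything before the first slash is the authority, the remainder is the path.
        Some(idx) => Some((&rest[..idx], &rest[idx..])),
        // No path component: treat it as the root directory.
        None => Some((rest, "/")),
    }
}

fn main() {
    let uri = "hdfs://localhost:8020/testing/tpch_1g/parquet/line_item";
    if let Some((authority, path)) = split_hdfs_uri(uri) {
        println!("authority: {authority}, path: {path}");
    }
}
```

For the example URI this yields the authority `localhost:8020` and the path `/testing/tpch_1g/parquet/line_item`.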
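To run the tests, initialize the git submodules and then run cargo test:
git submodule update --init --recursive
cargo test
During testing, an HDFS cluster is mocked and started automatically.
To build and test with the hdfs3 features enabled instead of the default features: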
cargo build --no-default-features --features datafusion-objectstore-hdfs/hdfs3,datafusion-objectstore-hdfs-testing/hdfs3,datafusion-hdfs-examples/hdfs3
cargo test --no-default-features --features datafusion-objectstore-hdfs/hdfs3,datafusion-objectstore-hdfs-testing/hdfs3,datafusion-hdfs-examples/hdfs3
Run the ballista-sql test with:
cargo run --bin ballista-sql --no-default-features --features hdfs3