Crates.io | icechunk-python |
lib.rs | icechunk-python |
version | 0.1.0-alpha.1 |
source | src |
created_at | 2024-10-10 22:19:51.728251 |
updated_at | 2024-10-10 22:19:51.728251 |
description | Transactional storage engine for Zarr designed for use on cloud object storage |
homepage | https://github.com/earth-mover/icechunk |
repository | https://github.com/earth-mover/icechunk |
max_upload_size | |
id | 1404659 |
size | 1,419,076 |
Icechunk is a transactional storage engine for Zarr designed for use on cloud object storage.
Let's break down what that means:
The core entity in Icechunk is a store. A store is defined as a Zarr hierarchy containing one or more Arrays and Groups. The most common scenario is for an Icechunk store to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. However, formally a store can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. Users of Icechunk should aim to scope their stores only to related arrays and groups that require consistent transactional updates.
Icechunk aspires to support the following core requirements for stores:
This Icechunk project consists of three main parts:
All of this is open source, licensed under the Apache 2.0 license.
The Rust implementation is a solid foundation for creating bindings in any language. We encourage adopters to collaborate on the Rust implementation, rather than reimplementing Icechunk from scratch in other languages.
We encourage collaborators from the broader community to contribute to Icechunk. Governance of the project will be managed by Earthmover PBC.
We recommend using Icechunk from Python, together with the Zarr-Python library
!!! warning "Icechunk is a very new project." It is not recommended for production use at this time. These instructions are aimed at Icechunk developers and curious early adopters.
Icechunk is currently designed to support the Zarr V3 Specification. Using it today requires installing the [still unreleased] Zarr Python V3 branch.
To set up an Icechunk development environment, follow these steps
Activate your preferred virtual environment (here we use virtualenv
):
python3 -m venv .venv
source .venv/bin/activate
Alternatively, create a conda environment
mamba create -n icechunk rust python=3.12
conda activate icechunk
Install maturin
:
pip install maturin
Build the project in dev mode:
cd icechunk-python/
maturin develop
or build the project in editable mode:
cd icechunk-python/
pip install -e icechunk@.
[!WARNING] This only makes the python source code editable, the rust will need to be recompiled when it changes
Once you have everything installed, here's an example of how to use Icechunk.
from icechunk import IcechunkStore, StorageConfig
from zarr import Array, Group
# Example using memory store
storage = StorageConfig.memory("test")
store = await IcechunkStore.open(storage=storage, mode='r+')
# Example using file store
storage = StorageConfig.filesystem("/path/to/root")
store = await IcechunkStore.open(storage=storage, mode='r+')
# Example using S3
s3_storage = StorageConfig.s3_from_env(bucket="icechunk-test", prefix="oscar-demo-repository")
store = await IcechunkStore.open(storage=storage, mode='r+')
You will need docker compose
and (optionally) just
.
Once those are installed, first switch to the icechunk root directory, then start up a local minio server:
docker compose up -d
Use just
to conveniently run a test
just test
This is just an alias for
cargo test --all
!!! tip For other aliases see Justfile.
Every update to an Icechunk store creates a new snapshot with a unique ID. Icechunk users must organize their updates into groups of related operations called transactions. For example, appending a new time slice to mutliple arrays should be done as single transactions, comprising the following steps
While the transaction is in progress, none of these changes will be visible to other users of the store. Once the transaction is committed, a new snapshot is generated. Readers can only see and use committed snapshots.
Additionally, snapshots occur in a specific linear (i.e. serializable) order within branch.
A branch is a mutable reference to a snapshot--a pointer that maps the branch name to a snapshot ID.
The default branch is main
.
Every commit to the main branch updates this reference.
Icechunk's design protects against the race condition in which two uncoordinated sessions attempt to update the branch at the same time; only one can succeed.
Finally, Icechunk defines tags--immutable references to snapshot. Tags are appropriate for publishing specific releases of a repository or for any application which requires a persistent, immutable identifier to the store state.
[!NOTE] For more detailed explanation, have a look at the Icechunk spec
Zarr itself works by storing both metadata and chunk data into a abstract store according to a specified system of "keys". For example, a 2D Zarr array called myarray, within a group called mygroup, would generate the following keys.
mygroup/zarr.json
mygroup/myarray/zarr.json
mygroup/myarray/c/0/0
mygroup/myarray/c/0/1
In standard regular Zarr stores, these key map directly to filenames in a filesystem or object keys in an object storage system. When writing data, a Zarr implementation will create these keys and populate them with data. When modifying existing arrays or groups, a Zarr implementation will potentially overwrite existing keys with new data.
This is generally not a problem, as long there is only one person or process coordinating access to the data. However, when multiple uncoordinated readers and writers attempt to access the same Zarr data at the same time, various consistency problems problems emerge. These consistency problems can occur in both file storage and object storage; they are particularly severe in a cloud setting where Zarr is being used as an active store for data that are frequently changed while also being read.
With Icechunk, we keep the same core Zarr data model, but add a layer of indirection between the Zarr keys and the on-disk storage. The Icechunk library translates between the Zarr keys and the actual on-disk data given the particular context of the user's state. Icechunk defines a series of interconnected metadata and data files that together enable efficient isolated reading and writing of metadata and chunks. Once written, these files are immutable. Icechunk keeps track of every single chunk explicitly in a "chunk manifest".
flowchart TD
zarr-python[Zarr Library] <-- key / value--> icechunk[Icechunk Library]
icechunk <-- data / metadata files --> storage[(Object Storage)]
Why not just use Iceberg directly?
Iceberg and all other "table formats" (Delta, Hudi, LanceDB, etc.) are based on tabular data model. This data model cannot accommodate large, multidimensional arrays (tensors) in a general, scalable way.
Is Icechunk part of Zarr?
Formally, no. Icechunk is a separate specification from Zarr. However, it is designed to interoperate closely with Zarr. In the future, we may propose a more formal integration between the Zarr spec and Icechunk spec if helpful. For now, keeping them separate allows us to evolve Icechunk quickly while maintaining the stability and backwards compatibility of the Zarr data model.
Icechunk's was inspired by several existing projects and formats, most notably