lance-datagen

Crates.io	lance-datagen
lib.rs	lance-datagen
version	1.0.1
created_at	2023-10-31 18:53:52.865696+00
updated_at	2025-12-30 21:38:10.947641+00
description	A columnar data format that is 100x faster than Parquet for random access.
homepage
repository	https://github.com/lance-format/lance
max_upload_size
id	1020258
size	171,546

Lance Community (lance-community)

documentation

README

The Open Lakehouse Format for Multimodal AI
High-performance vector search, full-text search, random access, and feature engineering capabilities for the lakehouse.
Compatible with Pandas, DuckDB, Polars, PyArrow, Ray, Spark, and more integrations on the way.

Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that together allow you to build a complete lakehouse on top of object storage to power your AI workflows. Lance is perfect for:

Building search engines and feature stores with hybrid search capabilities.
Large-scale ML training requiring high-performance I/O and random access.
Storing, querying, and managing multimodal data including images, videos, audio, text, and embeddings.

The key features of Lance include:

Expressive hybrid search: Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.
Lightning-fast random access: 100x faster than Parquet or Iceberg for random access without sacrificing scan performance.
Native multimodal data support: Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.
Data evolution: Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.
Zero-copy versioning: ACID transactions, time travel, and automatic versioning without needing extra infrastructure.
Rich ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

For more details, see the full Lance format specification.

Tip: Lance is in active development, and we welcome contributions. Please see our contributing guide for more information.

Quick Start

Installation

pip install pylance

To install a preview release:

pip install --pre --extra-index-url https://pypi.fury.io/lance-format/ pylance

Note: Versions prior to 1.0.0-beta.4 are available at https://pypi.fury.io/lancedb/pylance

Tip: Preview releases are published more often than full releases and contain the latest features and bug fixes. They receive the same level of testing as full releases, and we guarantee they will remain published and available for download for at least 6 months. When you want to pin to a specific version, prefer a stable release.

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()
df

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np

nvecs = 1000000
ndims = 128
# Each .fvecs record is a 4-byte int32 dimension header followed by ndims
# little-endian float32 values, so read everything as float32 and drop the
# header column.
raw = np.fromfile("sift/sift_base.fvecs", dtype="<f4")
data = raw.reshape((nvecs, ndims + 1))[:, 1:]
dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
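
For reference, the `.fvecs` files in the SIFT archive store each vector as a 4-byte little-endian integer (the dimension) followed by that many float32 values. The following is a minimal, standard-library-only sketch of that layout. The `read_fvecs` helper is hypothetical (it is not part of Lance), and it runs on a tiny synthetic buffer rather than the real download.

```python
import struct


def read_fvecs(buf: bytes) -> list[list[float]]:
    """Parse an in-memory .fvecs buffer into a list of float vectors."""
    vectors = []
    offset = 0
    while offset < len(buf):
        # 4-byte little-endian int32: the dimension of the next vector.
        (dim,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        # `dim` little-endian float32 components follow the header.
        vectors.append(list(struct.unpack_from(f"<{dim}f", buf, offset)))
        offset += 4 * dim
    return vectors


# Synthetic stand-in for the real file: two 3-dimensional vectors.
sample = struct.pack("<i3f", 3, 1.0, 2.0, 3.0) + struct.pack("<i3f", 3, 4.0, 5.0, 6.0)
vectors = read_fvecs(sample)
print(vectors)
```

Because every record carries its own header, a 128-dimensional vector occupies 4 + 4 * 128 bytes on disk, which is why the header has to be skipped for each vector and not just once at the start of the file.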

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ",
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
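
To give a feel for what the two parameters mean, here is a rough sizing sketch. It assumes float32 input vectors and one byte (8 bits) per product-quantization code; the 8-bit code size is an assumption for illustration, not something stated above, and the variable names other than `num_partitions` and `num_sub_vectors` are made up for this example.

```python
ndims = 128            # SIFT vector dimensionality
nvecs = 1_000_000      # number of vectors in the dataset
num_partitions = 256   # IVF: number of cells the vectors are grouped into
num_sub_vectors = 16   # PQ: number of slices each vector is split into
bits_per_code = 8      # assumption: one byte per PQ code

# PQ splits each vector into equal slices, so ndims must divide evenly.
assert ndims % num_sub_vectors == 0
dims_per_sub_vector = ndims // num_sub_vectors

raw_bytes = ndims * 4                            # float32 storage per vector
pq_bytes = num_sub_vectors * bits_per_code // 8  # compressed storage per vector
compression_ratio = raw_bytes // pq_bytes
avg_partition_size = nvecs / num_partitions      # if vectors spread evenly

print(dims_per_sub_vector, raw_bytes, pq_bytes, compression_ratio, avg_partition_size)
```

Under these assumptions, each 512-byte vector is represented by a 16-byte code, and a query that probes a single partition only has to compare against a few thousand codes instead of the full million vectors.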

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]
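
Conceptually, the `nearest` query asks for the `k` vectors closest to `q`. A brute-force version of that lookup can be sketched in plain Python as below. This is only an illustration of the idea, assuming squared Euclidean (L2) distance; it is not how Lance executes the query, and `squared_l2` and `exact_top_k` are hypothetical helper names.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(vectors, query, k):
    """Return the indices of the k vectors closest to `query`, nearest first."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest_ids = exact_top_k(vectors, query=[0.9, 0.1], k=2)
print(nearest_ids)
```

An exhaustive scan like this touches every vector for every query, which is the cost an index is meant to avoid.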

Directory structure

Directory	Description
rust	Core Rust implementation
python	Python bindings (PyO3)
java	Java bindings (JNI)
docs	Documentation source

Benchmarks

Vector search

We benchmarked vector search on the SIFT dataset, using 1M vectors of 128 dimensions.

For 100 randomly sampled query vectors, we get an average response time of <1 ms (on a 2023 M2 MacBook Air).

Approximate nearest neighbor (ANN) search always involves a trade-off between recall and performance.
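
One common way to put a number on the recall side of that trade-off is recall@k: the fraction of the true k nearest neighbors that the approximate search returned. The sketch below is a generic illustration with made-up row IDs, not the measurement procedure used for the figures above; `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors found by the approximate search."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


exact_ids = [7, 2, 9, 4, 1]    # ground truth, e.g. from an exhaustive scan
approx_ids = [7, 9, 2, 8, 3]   # what an approximate index might return
recall = recall_at_k(approx_ids, exact_ids)
print(recall)
```

Averaging this value over many queries, alongside the measured latency, gives the two axes of the trade-off.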

Vs. Parquet

We created a Lance dataset from the Oxford Pet dataset to run some preliminary performance tests comparing Lance with Parquet and with the raw image and XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and the raw files.

Why Lance for AI/ML workflows?

The machine learning development cycle involves multiple stages:

graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;

Traditional lakehouse formats were designed for SQL analytics and struggle with AI/ML workloads that require:

Vector search for similarity and semantic retrieval
Fast random access for sampling and interactive exploration
Multimodal data storage (images, videos, audio alongside embeddings)
Data evolution for feature engineering without full table rewrites
Hybrid search combining vectors, full-text, and SQL predicates

While existing formats (Parquet, Iceberg, Delta Lake) excel at SQL analytics, they require additional specialized systems for AI capabilities. Lance brings these AI-first features directly into the lakehouse format.

A comparison of different formats across ML development stages:

	Lance	Parquet & ORC	JSON & XML	TFRecord	Database	Warehouse
Analytics	Fast	Fast	Slow	Slow	Decent	Fast
Feature Engineering	Fast	Fast	Decent	Slow	Decent	Good
Training	Fast	Decent	Slow	Fast	N/A	N/A
Exploration	Fast	Slow	Fast	Slow	Fast	Decent
Infra Support	Rich	Rich	Decent	Limited	Rich	Rich
