Crates.io | connector_arrow |
lib.rs | connector_arrow |
version | 0.6.0 |
source | src |
created_at | 2024-02-05 09:45:22.354995 |
updated_at | 2024-10-15 18:09:22.309217 |
description | Load data from databases to Apache Arrow, the fastest way. |
homepage | |
repository | https://github.com/aljazerzen/connector_arrow |
max_upload_size | |
id | 1127434 |
size | 257,236 |
A flexible database client that converts data into Apache Arrow format across various databases.
This is achieved by defining API traits (e.g. `Connection`) and implementing them for objects used by various database client crates (e.g. `rusqlite::Connection`). `connector_arrow` treats databases as "data stores for Arrow format", aligning with the philosophy of data interoperability. It can also append `arrow::record_batch::RecordBatch` into database tables.

It is based on ConnectorX, but focuses on being a Rust library instead of a Python library. It is similar to ADBC, but written in pure, safe Rust, without the need for dynamic linking of C libraries.
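A minimal usage sketch (assuming the `src_sqlite` feature is enabled and that the crate's `query_one` convenience helper is available; exact APIs may differ between versions):

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A plain rusqlite connection; with the `src_sqlite` feature enabled,
    // connector_arrow implements its API traits for this type directly.
    let mut conn = rusqlite::Connection::open_in_memory()?;

    // Run a query and collect the result as Arrow record batches.
    let batches = connector_arrow::query_one(&mut conn, "SELECT 1 AS a")?;

    for batch in &batches {
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```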
| RDBMS | SQLite | DuckDB | PostgreSQL | MySQL | Microsoft SQL Server |
|---|---|---|---|---|---|
| feature | src_sqlite | src_duckdb | src_postgres | src_mysql | src_tiberius |
| dependency | rusqlite | duckdb | postgres | mysql | tiberius |
| query | x | x | x | x | x |
| query params | x | x | x | x | |
| schema get | x | x | x | x | x |
| schema edit | x | x | x | x | x |
| append | x | x | x | x | x |
| roundtrip: null & bool | x | x | x | x | x |
| roundtrip: int | x | x | x | x | x |
| roundtrip: uint | x | x | x | x | x |
| roundtrip: float | x | x | x | x | x |
| roundtrip: decimal | x | x | x | x | |
| roundtrip: timestamp | x | x | x | x | x |
| roundtrip: date | x | x | | | |
| roundtrip: time | x | x | | | |
| roundtrip: duration | x | x | | | |
| roundtrip: interval | | | | | |
| roundtrip: utf8 | x | x | x | x | x |
| roundtrip: binary | x | x | x | x | |
| roundtrip: empty | x | x | x | x | |
| containers | | | | | |
| binary fallback | x | x | | | |
None of the sources are enabled by default; use features to enable them.
Converting relational data to and from Apache Arrow comes with an inherent problem: the type system of a database does not map to the Arrow type system one-to-one. In practice this means that:

- when querying data, multiple database types may be mapped into a single Arrow type (e.g. both `VARCHAR` and `TEXT` will be mapped into `LargeUtf8`),
- when pushing data, multiple Arrow types may be mapped into a single database type (e.g. both `Utf8` and `LargeUtf8` will be mapped into `TEXT`).

As a consequence, a roundtrip of pushing data to a database and querying it back must convert at least some types. We call this process "type coercion"; it is documented by `Connection::coerce_type`.
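To make the idea concrete, here is a hypothetical sketch of what such a mapping might look like; the actual signature of `Connection::coerce_type` may differ, and the entries below only reflect coercions mentioned in this document:

```rust
use arrow::datatypes::DataType;

/// Hypothetical coercion table in the spirit of `Connection::coerce_type`:
/// `None` means the type survives a roundtrip unchanged, `Some(ty)` means
/// a roundtrip returns `ty` instead of the original type.
fn coerce_type(ty: &DataType) -> Option<DataType> {
    match ty {
        // Unsigned integers are widened (see the discussion below).
        DataType::UInt8 => Some(DataType::Int16),
        DataType::UInt16 => Some(DataType::Int32),
        DataType::UInt32 => Some(DataType::Int64),
        DataType::UInt64 => Some(DataType::Decimal128(20, 0)),
        // In PostgreSQL, Utf8 is pushed as TEXT, which is read back as LargeUtf8.
        DataType::Utf8 => Some(DataType::LargeUtf8),
        _ => None,
    }
}
```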
When designing the mapping, we prefer coercions that preserve information over coercions that preserve the size of the type.

For example, most databases don't support storing `UInt8`. To get around that, we could:

- coerce to `Int8` by bounding the value to the range 0..127, which would lose information,
- coerce to `Int8` by subtracting 128 (i.e. flipping the most significant bit and reinterpreting the bytes as `Int8`). This coercion would be efficient, as the new type would not be any larger than the original, but it would be confusing, as value 0 would be converted to -128 and value 255 to 127,
- coerce to `Int16`. This uses more space, but it does not lose information or change the meaning of the stored value.

So `connector_arrow` coerces `UInt8 => Int16`, `UInt16 => Int32`, `UInt32 => Int64` and `UInt64 => Decimal128(20, 0)`.
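A quick demonstration (using plain arrow-rs, not `connector_arrow` itself) that the chosen widening preserves every `UInt8` value exactly:

```rust
use arrow::array::{Array, Int16Array, UInt8Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    // The edges of the UInt8 range, including values that Int8 could not hold.
    let unsigned = UInt8Array::from(vec![0u8, 127, 128, 255]);

    // Casting to the wider Int16 keeps every value and its meaning intact.
    let coerced = cast(&unsigned, &DataType::Int16)?;
    let coerced = coerced.as_any().downcast_ref::<Int16Array>().unwrap();
    assert_eq!(coerced.values().to_vec(), vec![0i16, 127, 128, 255]);
    Ok(())
}
```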
Another problem when converting between two type systems is a mismatch between static and dynamic type parameters. For example, the Arrow type `Decimal128(20, 4)` defines precision and scale statically, in the schema, which implies that they must be the same for all values in the array. On the other hand, the PostgreSQL type `NUMERIC` has dynamic precision and scale, which means that each value may have a different pair of parameters. PostgreSQL does allow specifying the parameters statically with `NUMERIC(20, 4)`, but that is only available for column definitions and not for query results. Even when selecting directly from a table, the result will only carry the information that the column is `NUMERIC`.
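For illustration, this is how the static parameters appear in arrow-rs: precision and scale are attached to the array as a whole, not to individual values (a sketch using plain arrow-rs):

```rust
use arrow::array::{Array, Decimal128Array};
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Unscaled raw values; with scale 4, 12345678 represents 1234.5678.
    let array = Decimal128Array::from(vec![12345678_i128, 5000000])
        .with_precision_and_scale(20, 4)?;

    // Every value in the array shares the same Decimal128(20, 4) type.
    assert_eq!(array.data_type(), &DataType::Decimal128(20, 4));
    println!("{}", array.value_as_string(0)); // prints "1234.5678"
    Ok(())
}
```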
This problem is even more prevalent with SQLite, which has a fully dynamic type system. This means that any table or result column may contain multiple different types. It is possible to declare table column types, but that information is not validated at all (i.e. you could set the type to `PEANUT_BUTTER(42)`) and is only used to determine the type affinity of the column.
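The following sketch (using rusqlite directly) shows both behaviors: the bogus declared type is accepted, and a single column stores values of three different dynamic types:

```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;

    // The declared type is not validated; it only determines type affinity.
    conn.execute("CREATE TABLE t (x PEANUT_BUTTER(42))", [])?;

    // The same column accepts values of different dynamic types.
    conn.execute("INSERT INTO t VALUES (1), ('hello'), (3.14)", [])?;

    // typeof() reports the per-value type, not the declared column type.
    let mut stmt = conn.prepare("SELECT typeof(x) FROM t")?;
    let types: Vec<String> = stmt
        .query_map([], |row| row.get(0))?
        .collect::<rusqlite::Result<_>>()?;
    assert_eq!(types, ["integer", "text", "real"]);
    Ok(())
}
```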
This problem can be solved in the following ways:

1. Convert to a type without static parameters. For `NUMERIC`, that would mean conversion to a decimal textual representation, encoded as `Utf8`. This is generally slow and inconvenient to work with.
2. Infer the type parameters from the first batch of data and silently convert any subsequent values that do not fit.
3. Infer the type parameters from the first batch of data and reject any subsequent values that do not fit.

At the moment, `connector_arrow` does not have a common way of solving this problem. The connector for SQLite uses option 2, and the other connectors don't support types with dynamic type parameters.
The preferred way of solving the problem is option 3: infer the type from the first batch and reject non-uniform types. This option will result in more errors being surfaced to users. We justify this decision with the observation that, because conversion between the two type systems is non-trivial, silent conversion into a different type would mean unpredictable downstream behavior, which would require user attention anyway. By catching the problem early, we have the option to provide informative error messages and hints on how to specify the result type explicitly.
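A simplified, hypothetical sketch of option 3 for `NUMERIC` values (not `connector_arrow`'s actual implementation): infer precision and scale from the first batch, then reject any later value that does not fit, with a hint toward specifying the type explicitly:

```rust
/// Infer (precision, scale) from the decimal strings in the first batch.
/// Simplified: precision counts all digits, scale counts fractional digits.
fn infer_params(first_batch: &[&str]) -> (u8, i8) {
    let mut precision = 1u8;
    let mut scale = 0i8;
    for v in first_batch {
        precision = precision.max(v.chars().filter(|c| c.is_ascii_digit()).count() as u8);
        scale = scale.max(v.split('.').nth(1).map_or(0, |frac| frac.len() as i8));
    }
    (precision, scale)
}

/// Reject values from later batches that do not fit the inferred type.
fn check_fits(value: &str, (precision, scale): (u8, i8)) -> Result<(), String> {
    let digits = value.chars().filter(|c| c.is_ascii_digit()).count() as u8;
    let frac = value.split('.').nth(1).map_or(0, |f| f.len() as i8);
    if digits > precision || frac > scale {
        // Fail early, with a hint on how to avoid the error.
        return Err(format!(
            "value {value} does not fit inferred NUMERIC({precision}, {scale}); \
             cast the column explicitly, e.g. `SELECT col::NUMERIC(38, 10)`"
        ));
    }
    Ok(())
}

fn main() {
    let params = infer_params(&["1234.5678", "99.99"]); // -> (8, 4)
    assert!(check_fits("12.3456", params).is_ok());
    assert!(check_fits("0.123456789", params).is_err()); // scale 9 > 4
}
```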