connector_arrow

Crates.ioconnector_arrow
lib.rsconnector_arrow
version0.6.0
sourcesrc
created_at2024-02-05 09:45:22.354995
updated_at2024-10-15 18:09:22.309217
descriptionLoad data from databases to Apache Arrow, the fastest way.
homepage
repositoryhttps://github.com/aljazerzen/connector_arrow
max_upload_size
id1127434
size257,236
Aljaž Mur Eržen (aljazerzen)

documentation

https://docs.rs/connector_arrow

README

Connector Arrow

A flexible database client that converts data into Apache Arrow format across various databases.

This is achieved by defining API traits (i.e. Connection) and implementing them for objects used by a various database client crates (i.e. rusqlite::Connection). connector_arrow treats databases as "data stores for Arrow format," aligning with the philosophy of data interoperability.

Documentation

Key features

  • Query: Query databases and retrieve results in Apache Arrow format.
  • Query Parameters: Utilize Arrow type system for query parameters.
  • Temporal and Container Types: Correctly handles temporal and container types.
  • Schema Introspection: Query the database for schema of specific tables.
  • Schema Migration: Basic schema migration commands.
  • Append: Write arrow::record_batch::RecordBatch into database tables.

Based on ConnectorX, but focus on being a Rust library, instead of a Python library. This means that this crate:

  • uses minimal dependencies (it even disables default features),
  • does not support multiple destinations, but only arrow,
  • does not include parallelism, but allows downstream crates to implement it themselves,
  • does not include connection pooling, but allows downstream crates to implement it themselves.

Similar to ADBC, but written in pure, safe Rust, without need for dynamic linking of C libraries.

Support matrix

RDBMS SQLite DuckDB PostgreSQL MySQL Microsoft SQL Server
feature src_sqlite src_duckdb src_postgres src_mysql src_tiberius
dependency rusqlite duckdb postgres mysql tiberius
query x x x x x
query params x x x x
schema get x x x x x
schema edit x x x x x
append x x x x x
roundtrip: null & bool x x x x x
roundtrip: int x x x x x
roundtrip: uint x x x x x
roundtrip: float x x x x x
roundtrip: decimal x x x x
roundtrip: timestamp x x x x x
roundtrip: date x x
roundtrip: time x x
roundtrip: duration x x
roundtrip: interval
roundtrip: utf8 x x x x x
roundtrip: binary x x x x
roundtrip: empty x x x x
containers
binary fallback x x

None of the sources are enabled by default, use features to enable them.

Type coercion

Converting relational data from and to Apache Arrow comes with an inherent problem: type system of any database does not map to arrow type system with a one-to-one relation. In practice this means that:

  • when querying data, multiple database types might be mapped into a single arrow type (for example, in PostgreSQL, both VARCHAR and TEXT will be mapped into LargeUtf8),
  • when pushing data, multiple arrow types might be mapped into a single database type (for example, in PostgreSQL, both Utf8 and LargeUtf8 will be mapped into TEXT).

As a consequence, a roundtrip of pushing data to a database and querying it back must convert at least some types. We call this process "type coercion" and is documented by Connection::coerce_type.

When designing the mapping, we:

  • guarantee that the roundtrip will not lose information (i.e. numbers losing precision),
  • prioritize correctness and predictability over storage efficiency.

For example, most databases don't support sorting UInt8. To get around that, we could:

  1. store it into Int8 by bounding the value to range 0..127, which would lose information.
  2. store it into Int8 by subtracting 128 (i.e. just reinterpreting the bytes as Int8). This coercion would be efficient, as the new type would not be any larger than the original, but it would be confusing as value 0 would be converted to -128 and value 255 to 127.
  3. store it into Int16. This uses more space, but it does not lose information or change the meaning of the stored value. So connector_arrow coerces UInt8 => Int16, UInt16 => Int32, UInt32 => Int64 and UInt64 => Decimal128(20, 0).

Dynamic vs static types

Another problem when converting between two type systems is a mismatch between static and dynamic type parameters. For example, arrow type Decimal128(20, 4) defines precision and scale statically, in the schema, which implies that it must the same for all values in the array. On the other hand, PostgreSQL type NUMERIC has dynamic precision and scale, which means that each value may have a different pair of parameters. PostgreSQL does allow specifying parameters statically with NUMERIC(20, 4), but that is only available for column definition and not for query results. Even when selecting directly from a table, the result will only the information that this column is NUMERIC.

This problem is even more prevalent with SQLite, which has a fully dynamic type system. This means that any table or result column may contain multiple different types. It is possible the declare table column types, but that information is not validated at all (i.e. you could set the type to PEANUT_BUTTER(42)) and is only used to determine the type affinity of the column.

This problem can be solved in the following ways:

  1. Convert to some other type. For PostgreSQL NUMERIC, that would mean conversion to a decimal textual representation, encoded as Utf8. This is generally slow and inconvenient to work with.
  2. Buffer and infer. If we are able receive and buffer all of the result data, we could infer the type information from the data. This is obviously not possible if we want to support streaming results, as it would defeat all of the benefits of streaming. It might happen that values don't have uniform types. In that case, we can reject the result entirely and inform the user to cast the data into a uniform type before returning the results. We can also try to cast the values to some uniform type, but is generally slow, error prone and that might not be possible.
  3. Infer from the first batch. If we don't want to rule out streaming, we can opt for buffering only the first batch of data and inferring the types from that. If any of the subsequent batches turns out to have different types, we again have the options: reject or cast.

At the moment, connector_arrow does not have a common way of solving this problem. Connector for SQLite uses option 2 and other connectors don't support types with dynamic types parameters.

Preferred way of solving the problem is option 3: infer from the first batch and reject non-uniform types. This option will result in more errors being presented to the users. We justify this decision with observation that because conversion between the two type systems is non trivial, silent conversion into a different type would mean unpredictable downstream behavior, which would require user attention anyway. By catching the problem early, we have the option to provide informative error messages and hints on how to specify the result type explicitly.

Commit count: 161

cargo fmt