# Connector Arrow A flexible database client that converts data into Apache Arrow format across various databases. This is achieved by defining API traits (i.e. `Connection`) and implementing them for objects used by a various database client crates (i.e. `rusqlite::Connection`). `connector_arrow` treats databases as "data stores for Arrow format," aligning with the philosophy of data interoperability. [Documentation](https://docs.rs/connector_arrow) ## Key features - **Query**: Query databases and retrieve results in Apache Arrow format. - **Query Parameters**: Utilize Arrow type system for query parameters. - **Temporal and Container Types**: Correctly handles temporal and container types. - **Schema Introspection**: Query the database for schema of specific tables. - **Schema Migration**: Basic schema migration commands. - **Append**: Write `arrow::record_batch::RecordBatch` into database tables. Based on [ConnectorX](https://github.com/sfu-db/connector-x), but focus on being a Rust library, instead of a Python library. This means that this crate: - uses minimal dependencies (it even disables default features), - does not support multiple destinations, but only [arrow](https://crates.io/crates/arrow), - does not include parallelism, but allows downstream crates to implement it themselves, - does not include connection pooling, but allows downstream crates to implement it themselves. Similar to [ADBC](https://arrow.apache.org/docs/format/ADBC.html), but written in pure, safe Rust, without need for dynamic linking of C libraries. ## Support matrix | RDBMS | SQLite | DuckDB | PostgreSQL | MySQL | Microsoft SQL Server | | --- | --- | --- | --- | --- | --- | | feature | `src_sqlite` | `src_duckdb` | `src_postgres` | `src_mysql` | `src_tiberius` | | dependency | [rusqlite](https://crates.io/crates/rusqlite) | [duckdb](https://crates.io/crates/duckdb) | [postgres](https://crates.io/crates/postgres) | [mysql](https://crates.io/crates/mysql) | [tiberius](https://crates.io/crates/tiberius) | | query | x | x | x | x | x | | query params | x | x | x | | x | | schema get | x | x | x | x | x | | schema edit | x | x | x | x | x | | append | x | x | x | x | x | | roundtrip: null & bool | x | x | x | x | x | | roundtrip: int | x | x | x | x | x | | roundtrip: uint | x | x | x | x | x | | roundtrip: float | x | x | x | x | x | | roundtrip: decimal | x | | x | x | x | | roundtrip: timestamp | x | x | x | x | x | | roundtrip: date | x | | x | | | | roundtrip: time | x | | x | | | | roundtrip: duration | x | | x | | | | roundtrip: interval | | | | | | | roundtrip: utf8 | x | x | x | x | x | | roundtrip: binary | x | x | x | x | | | roundtrip: empty | | x | x | x | x | | containers | | | | | | | binary fallback | x | | x | | | None of the sources are enabled by default, use features to enable them. ## Type coercion Converting relational data from and to Apache Arrow comes with an inherent problem: type system of any database does not map to arrow type system with a one-to-one relation. In practice this means that: - when querying data, multiple database types might be mapped into a single arrow type (for example, in PostgreSQL, both `VARCHAR` and `TEXT` will be mapped into `LargeUtf8`), - when pushing data, multiple arrow types might be mapped into a single database type (for example, in PostgreSQL, both `Utf8` and `LargeUtf8` will be mapped into `TEXT`). As a consequence, a roundtrip of pushing data to a database and querying it back must convert at least some types. We call this process "type coercion" and is documented by `Connection::coerce_type`. When designing the mapping, we: - guarantee that the roundtrip will not lose information (i.e. numbers losing precision), - prioritize correctness and predictability over storage efficiency. For example, most databases don't support sorting `UInt8`. To get around that, we could: 1. store it into `Int8` by bounding the value to range `0..127`, which would lose information. 2. store it into `Int8` by subtracting 128 (i.e. just reinterpreting the bytes as `Int8`). This coercion would be efficient, as the new type would not be any larger than the original, but it would be confusing as value 0 would be converted to -128 and value 255 to 127. 3. store it into `Int16`. This uses more space, but it does not lose information or change the meaning of the stored value. So `connector_arrow` coerces `UInt8 => Int16`, `UInt16 => Int32`, `UInt32 => Int64` and `UInt64 => Decimal128(20, 0)`. ## Dynamic vs static types Another problem when converting between two type systems is a mismatch between static and dynamic type parameters. For example, arrow type `Decimal128(20, 4)` defines precision and scale statically, in the schema, which implies that it must the same for all values in the array. On the other hand, PostgreSQL type `NUMERIC` has dynamic precision and scale, which means that each value may have a different pair of parameters. PostgreSQL does allow specifying parameters statically with `NUMERIC(20, 4)`, but that is only available for column definition and not for query results. Even when selecting directly from a table, the result will only the information that this column is `NUMERIC`. This problem is even more prevalent with SQLite, which has a fully dynamic type system. This means that any table or result column may contain multiple different types. It is possible the declare table column types, but that information is not validated at all (i.e. you could set the type to `PEANUT_BUTTER(42)`) and is only used to determine the [type affinity](https://www.sqlite.org/datatype3.html#type_affinity) of the column. This problem can be solved in the following ways: 1. **Convert to some other type.** For PostgreSQL `NUMERIC`, that would mean conversion to a decimal textual representation, encoded as `Utf8`. This is generally slow and inconvenient to work with. 2. **Buffer and infer.** If we are able receive and buffer all of the result data, we could infer the type information from the data. This is obviously not possible if we want to support streaming results, as it would defeat all of the benefits of streaming. It might happen that values don't have uniform types. In that case, we can reject the result entirely and inform the user to cast the data into a uniform type before returning the results. We can also try to cast the values to some uniform type, but is generally slow, error prone and that might not be possible. 3. **Infer from the first batch.** If we don't want to rule out streaming, we can opt for buffering only the first batch of data and inferring the types from that. If any of the subsequent batches turns out to have different types, we again have the options: reject or cast. At the moment, `connector_arrow` does not have a common way of solving this problem. Connector for SQLite uses option 2 and other connectors don't support types with dynamic types parameters. Preferred way of solving the problem is option 3: infer from the first batch and reject non-uniform types. This option will result in more errors being presented to the users. We justify this decision with observation that because conversion between the two type systems is non trivial, silent conversion into a different type would mean unpredictable downstream behavior, which would require user attention anyway. By catching the problem early, we have the option to provide informative error messages and hints on how to specify the result type explicitly.