datafusion-ffi

Crates.io: datafusion-ffi
lib.rs: datafusion-ffi
version: 52.1.0
created_at: 2024-11-09 04:06:03.548485+00
updated_at: 2026-01-24 16:02:48.608598+00
description: Foreign Function Interface implementation for DataFusion
homepage: https://datafusion.apache.org
repository: https://github.com/apache/datafusion
id: 1441797
size: 579,247
Andrew Lamb (alamb)

Apache DataFusion Foreign Function Interface

This crate contains code to allow interoperability of Apache DataFusion with functions from other libraries and/or DataFusion versions using a stable interface.

One of the limitations of the Rust programming language is that there is no stable Rust ABI (Application Binary Interface). If a library is compiled with one version of the Rust compiler and you attempt to use that library with a program compiled by a different Rust compiler, there is no guarantee that you can access the data structures. In order to share code between libraries loaded at runtime, you need to use Rust's Foreign Function Interface (FFI).

The purpose of this crate is to define interfaces between DataFusion libraries that will remain stable across different versions of DataFusion. This allows users to write libraries that can interoperate with each other at runtime rather than requiring all of the code to be compiled into a single executable.

In general, we recommend that both the producer and the consumer of the data and functions shared across the FFI run the same version of DataFusion, but this is not strictly required.

See API Docs for details and examples.

Use Cases

Two use cases have been identified for this crate, but the list is not intended to be exhaustive.

  1. datafusion-python, which will use the FFI to let external packages provide services such as a TableProvider without needing to re-export the entire datafusion-python code base. With datafusion-ffi, these packages do not need datafusion-python as a dependency at all.
  2. Users may want to create a modular interface that allows runtime loading of libraries. For example, you may wish to design a program that only uses the built-in table sources, but also allows for extension from the community-led datafusion-contrib repositories. You could enable module loading so that users could load a library at runtime to access additional data sources. Alternatively, you could use this approach so that customers could interface with their own proprietary data sources.

Limitations

One limitation of the approach in this crate is that it is designed specifically to work across Rust libraries. In general, you can use Rust's FFI to operate across different programming languages, but that is not the design intent of this crate. Instead, we are using external crates that provide stable interfaces that closely mirror the Rust native approach. To learn more about this approach see the abi_stable and async-ffi crates.

If you have a library in another language that you wish to interface with DataFusion, we recommend creating a Rust wrapper crate around your library and then connecting it to DataFusion using this crate. Alternatively, you could use bindgen to interface directly with the FFI provided by this crate, but that is currently not supported.

FFI Boundary

We expect this crate to be used by both sides of the FFI boundary. It should provide ergonomic ways to both produce and consume structs and functions across this layer.

For example, if you have a library that provides a custom TableProvider, you can expose it by using FFI_TableProvider::new(). When you need to consume an FFI_TableProvider, you can convert it using ForeignTableProvider::from(), which creates a struct that implements TableProvider.
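The produce and consume steps can be illustrated with a minimal, standard-library-only sketch. It does not use this crate's API: RowSource, FfiRowSource, ForeignRowSource, FixedRows, and round_trip are hypothetical names standing in for TableProvider, FFI_TableProvider, and ForeignTableProvider, and the struct layout is an assumption made for illustration.

```rust
use std::ffi::c_void;
use std::sync::Arc;

/// Hypothetical stand-in for a DataFusion trait such as `TableProvider`.
pub trait RowSource: Send + Sync {
    fn row_count(&self) -> u64;
}

/// Hypothetical FFI-safe struct: plain function pointers plus an opaque
/// pointer to data that only the producer knows how to interpret.
#[repr(C)]
pub struct FfiRowSource {
    row_count: unsafe extern "C" fn(source: &FfiRowSource) -> u64,
    release: unsafe extern "C" fn(source: &mut FfiRowSource),
    private_data: *mut c_void,
}

// Sound in this sketch because the hidden value is an `Arc<dyn RowSource>`
// and `RowSource` requires `Send + Sync`.
unsafe impl Send for FfiRowSource {}
unsafe impl Sync for FfiRowSource {}

unsafe extern "C" fn row_count_wrapper(source: &FfiRowSource) -> u64 {
    // Producer side: recover the trait object hidden behind the pointer.
    let inner = unsafe { &*(source.private_data as *const Arc<dyn RowSource>) };
    inner.row_count()
}

unsafe extern "C" fn release_wrapper(source: &mut FfiRowSource) {
    // Producer side: reclaim and drop the boxed Arc exactly once.
    if !source.private_data.is_null() {
        drop(unsafe { Box::from_raw(source.private_data as *mut Arc<dyn RowSource>) });
        source.private_data = std::ptr::null_mut();
    }
}

impl FfiRowSource {
    /// Producer side: wrap a trait object for transport across the boundary.
    pub fn new(inner: Arc<dyn RowSource>) -> Self {
        Self {
            row_count: row_count_wrapper,
            release: release_wrapper,
            private_data: Box::into_raw(Box::new(inner)) as *mut c_void,
        }
    }
}

impl Drop for FfiRowSource {
    fn drop(&mut self) {
        unsafe { (self.release)(self) }
    }
}

/// Consumer side: owns the FFI struct and implements the trait again by
/// calling through the function pointers only.
pub struct ForeignRowSource(FfiRowSource);

impl From<FfiRowSource> for ForeignRowSource {
    fn from(ffi: FfiRowSource) -> Self {
        Self(ffi)
    }
}

impl RowSource for ForeignRowSource {
    fn row_count(&self) -> u64 {
        unsafe { (self.0.row_count)(&self.0) }
    }
}

/// Hypothetical producer implementation used by `round_trip`.
struct FixedRows(u64);

impl RowSource for FixedRows {
    fn row_count(&self) -> u64 {
        self.0
    }
}

/// Wraps a producer value, converts it on the "consumer" side, and calls it.
pub fn round_trip(rows: u64) -> u64 {
    let ffi = FfiRowSource::new(Arc::new(FixedRows(rows)));
    let consumer: Arc<dyn RowSource> = Arc::new(ForeignRowSource::from(ffi));
    consumer.row_count()
}
```

In this sketch the consumer never dereferences private_data itself. Every access goes through a function pointer that was compiled into the producer.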

There is a complete end-to-end demonstration in the examples.

Asynchronous Calls

Some of the functions in this crate require asynchronous operation. These behave similarly to their pure Rust counterparts by using the async-ffi crate. In general, any call to an asynchronous function in this interface will not block the rest of the program's execution.

Struct Layout

In this crate we have a variety of structs which closely mimic the behavior of their internal counterparts. To see detailed notes about how to use them, see the example in FFI_TableProvider.

Memory Management

One of the advantages of Rust is the ownership model, which means programmers usually do not need to worry about memory management. When interacting with foreign code, this is not necessarily true. If you review the structures in this crate, you will find that many of them implement the Drop trait and perform a foreign call.

Suppose we have an FFI_CatalogProvider, for example. This struct is safe to pass across the FFI boundary, so it may be owned by either the library that produces the underlying CatalogProvider or by another library that consumes it. If we look closer at the FFI_CatalogProvider, it has a pointer to some private data. That private data is only accessible on the producer's side. If you attempt to access it on the consumer's side, you may get segmentation faults or other bad behavior. Within that private data is the actual Arc<dyn CatalogProvider>. That Arc<> must be freed, but if the FFI_CatalogProvider is only owned on the consumer's side, we have no way to access the private data and free it.

To account for this, most structs in this crate have a release method that cleans up any privately held data. This method calls into the producer's side, regardless of whether it is invoked from the local or the foreign side. Most of the structs in this crate carry atomic reference counts to the underlying data, which makes the release straightforward. Some structs, like FFI_Accumulator, contain an inner Box<dyn Accumulator> instead, because the Accumulator trait definition requires mutable access. For these, the release code is slightly more complicated and depends on whether the struct is being dropped on the local or the foreign side. Structs that use a Box<> for their underlying data also cannot implement Clone.
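The reference-counted case can be sketched with the standard library alone. This is an illustration under assumptions, not this crate's code: FfiHandle, PrivateData, and strong_counts are hypothetical names, and an Arc<String> stands in for an Arc<dyn CatalogProvider>. The sketch shows clone and release both calling back into producer code, with the Arc strong count tracking how many handles are alive.

```rust
use std::ffi::c_void;
use std::sync::Arc;

/// Hypothetical FFI-safe handle with producer-supplied clone and release.
#[repr(C)]
pub struct FfiHandle {
    clone: unsafe extern "C" fn(handle: &FfiHandle) -> FfiHandle,
    release: unsafe extern "C" fn(handle: &mut FfiHandle),
    private_data: *mut c_void,
}

/// Producer-only contents of `private_data`.
struct PrivateData {
    inner: Arc<String>,
}

unsafe extern "C" fn clone_wrapper(handle: &FfiHandle) -> FfiHandle {
    // Runs in producer code, so reading the private data is allowed here.
    let private = unsafe { &*(handle.private_data as *const PrivateData) };
    FfiHandle::new(Arc::clone(&private.inner))
}

unsafe extern "C" fn release_wrapper(handle: &mut FfiHandle) {
    if handle.private_data.is_null() {
        return;
    }
    // Dropping the box drops the Arc, which decrements the strong count.
    drop(unsafe { Box::from_raw(handle.private_data as *mut PrivateData) });
    handle.private_data = std::ptr::null_mut();
}

impl FfiHandle {
    pub fn new(inner: Arc<String>) -> Self {
        let private = Box::new(PrivateData { inner });
        Self {
            clone: clone_wrapper,
            release: release_wrapper,
            private_data: Box::into_raw(private) as *mut c_void,
        }
    }
}

impl Clone for FfiHandle {
    fn clone(&self) -> Self {
        unsafe { (self.clone)(self) }
    }
}

impl Drop for FfiHandle {
    fn drop(&mut self) {
        // Whoever owns the handle, cleanup is delegated to the producer.
        unsafe { (self.release)(self) }
    }
}

/// Returns the strong count with two handles, one handle, and no handles.
pub fn strong_counts() -> [usize; 3] {
    let data = Arc::new(String::from("catalog"));
    let first = FfiHandle::new(Arc::clone(&data));
    let second = first.clone();
    let with_two_handles = Arc::strong_count(&data);
    drop(first);
    let with_one_handle = Arc::strong_count(&data);
    drop(second);
    [with_two_handles, with_one_handle, Arc::strong_count(&data)]
}
```

A Box-backed struct in the same style would have no clone function pointer, since a Box has a single owner and there is nothing to reference count.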

Library Marker ID

When reviewing the code, you will find that many of the structs in this crate contain a call to library_marker_id. The purpose of this call is to determine if a library is accessing local code through the FFI structs. Consider this example: you have a primary program that exposes functions to create a schema provider. You have a secondary library that exposes a function to create a catalog provider and the secondary library uses the schema provider of the primary program. From the point of view of the secondary library, the schema provider is foreign code.

Now when we register the secondary library with the primary program as a catalog provider and we make calls to get a schema, the secondary library will return an FFI-wrapped schema provider back to the primary program. In this case that schema provider is actually local code to the primary program, except that it is wrapped in the FFI code!

We work around this with the library_marker_id calls. Each call returns the address, as a usize, of a global variable defined within the library. This address is guaranteed to be unique for every library that contains FFI code. By comparing these usize addresses we can determine whether an FFI struct is local or foreign.

In our example of the schema provider, if you were to make a call in your primary program to get the schema provider, it would reach out to the foreign catalog provider and send back an FFI_SchemaProvider object. By then comparing the library_marker_id of this object to that of the primary program, we determine it is local code. This means it is safe to access the underlying private data.
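The address comparison can be sketched as follows. The function names echo the ones mentioned in this section, but the bodies are assumptions written for illustration rather than the crate's actual implementation, and FfiSchemaHandle with its is_local method is a hypothetical struct.

```rust
/// Returns the address of a static that lives in the library this function
/// is compiled into. Every library that compiles this code gets its own
/// copy of the static, and therefore its own address.
pub extern "C" fn get_library_marker_id() -> usize {
    static LIBRARY_MARKER: u8 = 0;
    &LIBRARY_MARKER as *const u8 as usize
}

/// Test helper (assumed behavior): a second, distinct static yields a
/// different address, which imitates a struct created by another library.
pub extern "C" fn mock_foreign_marker_id() -> usize {
    static MOCK_MARKER: u8 = 1;
    &MOCK_MARKER as *const u8 as usize
}

/// Hypothetical FFI struct that records which library created it.
#[repr(C)]
pub struct FfiSchemaHandle {
    pub library_marker_id: extern "C" fn() -> usize,
}

impl FfiSchemaHandle {
    /// True when the struct was created by the library running this code,
    /// in which case its private data could be accessed directly.
    pub fn is_local(&self) -> bool {
        (self.library_marker_id)() == get_library_marker_id()
    }
}
```

A conversion from the FFI struct back to the underlying trait could branch on a check like is_local: unwrap the private data directly when the struct is local, and build a foreign wrapper otherwise.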

Users of the FFI code should not need to access these functions. If you are implementing a new FFI struct, then it is recommended that you follow the established patterns for converting from an FFI struct into the underlying traits. Specifically, you should use crate::get_library_marker_id, and in your unit tests you should override this with crate::mock_foreign_marker_id to force your test to create the foreign variant of your struct.

Task Context Provider

Many of the FFI structs in this crate contain an FFI_TaskContextProvider. The purpose of this struct is to weakly hold a reference to a method to access the current TaskContext. We need this accessor because we use the datafusion-proto crate to serialize and deserialize data across the FFI boundary. In particular, we need to serialize and deserialize functions using a TaskContext, which implements FunctionRegistry.

This becomes difficult because we may need to register multiple user defined functions, table or catalog providers, etc., with a Session, and each of these will need the TaskContext to perform the processing. For this reason we cannot simply include the TaskContext at the time of registration because it would not have knowledge of anything registered afterward.

The FFI_TaskContextProvider is built from a trait that provides a method to get the current TaskContext. FFI_TaskContextProvider only holds a Weak reference to the TaskContextProvider, because otherwise we could create a circular dependency at runtime. If you use these methods, it is imperative that your provider remains valid for the lifetime of the calls. The FFI_TaskContextProvider is implemented on SessionContext and it is easy to implement on any struct that implements Session.
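The weak-reference design can be sketched with std::sync::Weak. All types here are simplified, hypothetical stand-ins (TaskContext, TaskContextProvider, Session, WeakTaskContextProvider, and demo are defined locally and are not the DataFusion types of the same name). The sketch shows two properties described above: the context is fetched at call time, so it reflects functions registered after the handle was created, and the handle stops working once the provider is dropped.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical stand-in for DataFusion's `TaskContext`.
pub struct TaskContext {
    pub registered_functions: Vec<String>,
}

/// Hypothetical stand-in for the trait that yields the current context.
pub trait TaskContextProvider: Send + Sync {
    fn task_ctx(&self) -> Arc<TaskContext>;
}

/// Hypothetical session that accumulates registrations over time.
#[derive(Default)]
pub struct Session {
    functions: Mutex<Vec<String>>,
}

impl Session {
    pub fn register_udf(&self, name: &str) {
        self.functions.lock().unwrap().push(name.to_string());
    }
}

impl TaskContextProvider for Session {
    fn task_ctx(&self) -> Arc<TaskContext> {
        // Snapshot the registrations as they are at the time of the call.
        let registered_functions = self.functions.lock().unwrap().clone();
        Arc::new(TaskContext { registered_functions })
    }
}

/// Holds only a weak reference, so it never keeps the session alive and
/// cannot take part in a reference cycle with it.
pub struct WeakTaskContextProvider {
    provider: Weak<dyn TaskContextProvider>,
}

impl WeakTaskContextProvider {
    pub fn new(provider: &Arc<dyn TaskContextProvider>) -> Self {
        Self { provider: Arc::downgrade(provider) }
    }

    /// Returns `None` once the provider has been dropped.
    pub fn task_ctx(&self) -> Option<Arc<TaskContext>> {
        self.provider.upgrade().map(|provider| provider.task_ctx())
    }
}

/// Returns how many functions the handle sees after a late registration,
/// and whether the handle is dead after the session is dropped.
pub fn demo() -> (usize, bool) {
    let session = Arc::new(Session::default());
    let provider: Arc<dyn TaskContextProvider> = session.clone();
    let handle = WeakTaskContextProvider::new(&provider);

    // Registered after the handle was created, yet still visible through it.
    session.register_udf("my_udf");
    let seen = handle
        .task_ctx()
        .map(|ctx| ctx.registered_functions.len())
        .unwrap_or(0);

    drop(provider);
    drop(session);
    (seen, handle.task_ctx().is_none())
}
```

Returning an Option makes the lifetime requirement explicit in this sketch: a caller that outlives the provider gets None instead of a dangling reference.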

