| Field | Value |
| --- | --- |
| Crates.io | inference |
| lib.rs | inference |
| version | 0.3.0 |
| source | src |
| created_at | 2023-02-18 05:32:26.240638 |
| updated_at | 2023-02-18 05:32:26.240638 |
| description | A crate for managing the machine learning inference process |
| homepage | |
| repository | https://github.com/opensensordotdev/inference |
| max_upload_size | |
| id | 788018 |
| size | 372,777 |
A Rust crate for managing the inference process for machine learning (ML) models. Currently, we support interacting with a Triton Inference Server, loading models from a MinIO Model Store.
The `triton` service in docker compose is configured to not require a GPU...YMMV based on whether the model you're trying to serve inference requests from was already compiled/optimized for GPU-only inference.

`docker-compose` is now deprecated and the compose functionality is integrated into the `docker compose` command. To install it alongside an existing docker installation, run `sudo apt-get install docker-compose-plugin`. ref.

For Debian-based Linux distros, you can install `inference`'s dependencies (except Docker & the NVIDIA container toolkit, which require the special repository configuration documented above) with the following command:
```sh
apt-get install clang build-essential lld protobuf-compiler libprotobuf-dev zstd libzstd-dev make cmake pkg-config libssl-dev
```
`inference` is tested on Ubuntu 22.04 LTS, but pull requests to fix Windows or macOS issues are welcome.
```sh
git clone https://github.com/opensensordotdev/inference.git
```
Make sure you have `protoc` installed! Otherwise `inference` won't build!

- `make`: Download the latest versions of the Triton Inference Server Protocol Buffer files & the Triton sample ML models
- `docker compose up`: Start the MinIO and Triton containers + monitoring infrastructure

To enable GPU inference, uncomment the following section of the `inference.triton` service in `docker-compose.yaml`. In order for your GPU to work with Triton, the CUDA versions on your host OS and the CUDA version expected by Triton have to be compatible:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: [gpu]
```
- Upload the contents of the `sample_models` directory to the `models` bucket via the MinIO web UI at `localhost:9001`
- `cargo test`: Verify all cargo tests pass

Querying `http://localhost:8000/v2/models/simple` will print the model name and the parameters required to set up the inputs and outputs.
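The same metadata check can be done from Rust; this is a minimal sketch, assuming the `reqwest` and `tokio` crates, which are not necessarily dependencies of `inference`:

```rust
// Sketch: fetch Triton's HTTP/REST model metadata for the `simple` sample model.
// Assumes `tokio` (features "macros", "rt-multi-thread") and `reqwest` are available.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = reqwest::get("http://localhost:8000/v2/models/simple")
        .await?
        .text()
        .await?;
    // The response is JSON describing the model's name, inputs, and outputs.
    println!("{body}");
    Ok(())
}
```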
The `proto` folder will contain the protocol buffers. Only `grpc_service.proto` is referenced in `build.rs` because `model_config.proto` is included by `grpc_service.proto`. The code tonic generates from these files is in `inference.rs`.
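For reference, a `build.rs` along these lines compiles only `grpc_service.proto` and picks up `model_config.proto` through its import; this is a sketch assuming a `tonic-build` 0.x version where the method is `compile` (later versions renamed it `compile_protos`), not necessarily the exact build script in this repository:

```rust
// build.rs sketch: generate the Triton gRPC client with tonic-build.
// `model_config.proto` does not need to be listed because `grpc_service.proto` imports it.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    tonic_build::configure()
        .build_server(false) // only a client is needed to talk to Triton
        .compile(&["proto/grpc_service.proto"], &["proto"])?;
    Ok(())
}
```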
Submitting requests to a gRPC service requires a mutable reference to the Client. This prevents you from sharing a single Client across multiple tasks and creates a bottleneck for async code. Trying to hide this from users by wrapping what amounts to a synchronous resource in a struct and using async message passing to access it might help somewhat, but it still doesn't fix the core problem.
While it would be possible to make a connection pool of multiple `Client<Channel>`s and hide this pool in a struct accessed with async message passing, this is complicated.
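To make the message-passing idea concrete, here is a minimal sketch of a single-client actor; the type and method names are assumed from the generated Triton client (`GrpcInferenceServiceClient`, `model_infer`), and the channel capacity is arbitrary. Every request still funnels through one task, which is exactly the bottleneck described above:

```rust
use tokio::sync::{mpsc, oneshot};
use tonic::transport::Channel;

// Generated types; the exact module path depends on how build.rs is configured.
use crate::inference::grpc_inference_service_client::GrpcInferenceServiceClient;
use crate::inference::{ModelInferRequest, ModelInferResponse};

// One message: an inference request plus a oneshot channel for the reply.
type InferMsg = (
    ModelInferRequest,
    oneshot::Sender<Result<ModelInferResponse, tonic::Status>>,
);

/// Spawn a task that owns the client and serializes all access to it.
fn spawn_client_actor(mut client: GrpcInferenceServiceClient<Channel>) -> mpsc::Sender<InferMsg> {
    let (tx, mut rx) = mpsc::channel::<InferMsg>(32);
    tokio::spawn(async move {
        while let Some((request, reply)) = rx.recv().await {
            let response = client.model_infer(request).await.map(|r| r.into_inner());
            let _ = reply.send(response);
        }
    });
    tx
}
```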
It also doesn't work to store a `tonic::transport::Channel` in the `TritonClient` struct...it requires the struct to implement some obscure internal tonic traits.
The idiomatic way appears to be storing a single master Client in a struct and then providing a function that returns a clone of the Client, since cloning clients is cheap.
A limitation of this could be that gRPC servers usually have a finite number of concurrent streams they can multiplex over a single connection (100 seems to be the number a lot of places throw out). See gRPC performance best practices.
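A minimal sketch of that clone-per-call pattern, assuming the tonic-generated client is named `GrpcInferenceServiceClient` and lives in the generated `inference` module (exact paths depend on the build configuration):

```rust
use tonic::transport::Channel;

// Generated client; the module path is an assumption based on inference.rs.
use crate::inference::grpc_inference_service_client::GrpcInferenceServiceClient;

pub struct TritonClient {
    // Single "master" client; clones share the same underlying Channel.
    inner: GrpcInferenceServiceClient<Channel>,
}

impl TritonClient {
    pub async fn connect(url: &'static str) -> Result<Self, tonic::transport::Error> {
        Ok(Self {
            inner: GrpcInferenceServiceClient::connect(url).await?,
        })
    }

    /// Hand out a cheap clone that each task can own and borrow mutably.
    pub fn client(&self) -> GrpcInferenceServiceClient<Channel> {
        self.inner.clone()
    }
}
```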
`tonic` seems to have a default buffer size of 1024. Source: `DEFAULT_BUFFER_SIZE` in `channel/mod.rs`.
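If that buffer ever needs tuning, tonic exposes it on the endpoint builder. A sketch, assuming Triton's default gRPC port of 8001 and an arbitrary buffer size:

```rust
use tonic::transport::{Channel, Endpoint};

// Sketch: override tonic's default channel buffer (DEFAULT_BUFFER_SIZE = 1024)
// when connecting to Triton's gRPC endpoint.
async fn connect() -> Result<Channel, tonic::transport::Error> {
    Endpoint::from_static("http://localhost:8001")
        .buffer_size(2048) // arbitrary value, larger than the 1024 default
        .connect()
        .await
}
```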
This might be useful eventually if you have multiple Triton pods and want to discover which ones are live and update the endpoint list: grpc load balancing, github. It's not clear whether there's a connection pool under the hood there or how it's able to connect to multiple servers.
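For what it's worth, tonic can already spread a single logical channel across several endpoints via `Channel::balance_list`; a sketch with placeholder addresses for two Triton replicas:

```rust
use tonic::transport::{Channel, Endpoint};

// Sketch: one load-balanced Channel over multiple Triton gRPC endpoints.
// The addresses are placeholders for however live pods would be discovered.
fn balanced_channel() -> Channel {
    let endpoints = ["http://triton-0:8001", "http://triton-1:8001"]
        .into_iter()
        .map(Endpoint::from_static);
    // tonic's balance layer distributes requests across the endpoints.
    Channel::balance_list(endpoints)
}
```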