# **Perform easy and efficient ML model inference**

This crate is designed to run **any** Machine Learning model on **any** architecture with ease and efficiency. It leverages the [Triton Inference Server](https://github.com/triton-inference-server/server) (specifically the [Triton C library](https://github.com/triton-inference-server/core)) and provides a similar API with comparable advantages. However, **Tritonserver-rs** allows you to build the inference server locally, which yields significant performance benefits. Check the [benchmark](./BENCH.md) for more details.

---

# Usage

Run inference in three simple steps:

## **Step 1. Prepare the model repository**

Organize your model files in the following structure:

```
models/
├── yolov8/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   ├── 2/
│   │   └── model.onnx
│   └── <other versions of yolov8>/
└── <other models>/
```

**Rules**:

- All models must be stored in the same root directory (`models/` in this example).
- Each model resides in its own folder containing:
  - A `config.pbtxt` configuration file (a minimal sketch follows this list).
  - One or more subdirectories, each representing a version of the model and containing the model file (e.g., `model.onnx`).
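As an illustration, here is a sketch of a minimal `config.pbtxt` for the `yolov8` model above, written in Triton's standard model-configuration format. The backend choice, tensor names, data types, and dimensions are placeholder assumptions; they must match your exported model (see the [model configuration guide](MODEL_CONFIGURATION.md)).

```protobuf
name: "yolov8"
backend: "onnxruntime"   # placeholder; pick the backend matching your model format
max_batch_size: 1

input [
  {
    name: "IMAGE"            # must match the tensor name passed to `add_input`
    data_type: TYPE_UINT8    # placeholder data type
    dims: [ 3, 640, 640 ]    # placeholder input shape
  }
]

output [
  {
    name: "OUTPUT"           # placeholder output tensor name
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]       # placeholder output shape
  }
]
```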
---

## **Step 2. Write the code**

Add **Tritonserver-rs** to your `Cargo.toml`:

```toml
[dependencies]
tritonserver-rs = "0.1"
```

Then write your application code:

```rust
use std::time::Duration;

use tritonserver_rs::{options::Options, Buffer, Server};

// Configure server options.
let mut opts = Options::new("models/")?;

opts.exit_timeout(Duration::from_secs(5))?
    .backend_directory("/opt/tritonserver/backends")?;

// Create the server.
let server = Server::new(opts).await?;

// Input data.
let image = image::open("/data/cats.jpg")?;
let image = image.as_flat_samples_u8();

// Create a request (specify the model name and version).
let mut request = server.create_request("yolov8", 2)?;

// Add input data and an allocator.
request
    .add_default_allocator()
    .add_input("IMAGE", Buffer::from(image))?;

// Run inference.
let fut = request.infer_async()?;

// Obtain results.
let response = fut.await?;
```

---

## **Step 3. Deploy**

Here is an example of how to deploy using `docker-compose.yml`:

```yml
my_app:
  image: {DEV_IMAGE}
  volumes:
    - ./Cargo.toml:/project/
    - ./src:/project/src
    - ../models:/models
    - ../cats.jpg:/data/cats.jpg
  entrypoint: ["cargo", "run", "--manifest-path=/project/Cargo.toml"]
```

We recommend using [Dockerfile.dev](./Dockerfile.dev) as `{DEV_IMAGE}`. For more details on suitable images and deployment instructions, see [DEPLOY.md](./DEPLOY.md).

---

# **More Information**

For further details, check out the following resources:

- [Examples](./examples/): Learn how to run various ML models using **Tritonserver-rs**, configure inference, prepare models, and deploy.
- [Model configuration guide](MODEL_CONFIGURATION.md).
- [Build and deployment instructions](DEPLOY.md).
- [Benchmark results](BENCH.md).
- [Triton Inference Server guides](https://github.com/triton-inference-server/server/tree/main/docs/README.md).
- Documentation on [docs.rs](TODO.com).

---

# **Advantages of the Crate**

- **Versatility**: Extensive configuration options for models and servers.
- **High performance**: Optimized for maximum efficiency.
- **Broad backend support**: Run PyTorch, ONNX, TensorFlow, TensorRT, OpenVINO, model pipelines, and custom backends out of the box.
- **Compatibility**: Supports most GPUs and architectures.
- **Multi-model handling**: Run multiple models simultaneously (see the sketch after this list).
- **Prometheus integration**: Built-in support for monitoring.
- **CUDA-optimized**: Handle model inputs and outputs directly in GPU memory.
- **Dynamic server management**: Advanced runtime control features.
- **Rust-based**: Enjoy the safety, speed, and concurrency benefits of Rust.
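To make the multi-model point concrete, here is a minimal sketch that reuses the request API from Step 2 (`create_request`, `add_default_allocator`, `add_input`, `infer_async`) to drive two models concurrently. The second model (`resnet50`), its tensor name and input buffer, and the `tokio` runtime are assumptions for illustration; see the [examples](./examples/) for the crate's actual patterns.

```rust
// Sketch only: reuses `server`, `image`, and `Buffer` from Step 2.
// "resnet50", "INPUT", and `preprocessed` are hypothetical placeholders.
let mut detect_request = server.create_request("yolov8", 2)?;
detect_request
    .add_default_allocator()
    .add_input("IMAGE", Buffer::from(image))?;

let mut classify_request = server.create_request("resnet50", 1)?;
classify_request
    .add_default_allocator()
    .add_input("INPUT", Buffer::from(preprocessed))?;

// Start both inferences and await the results concurrently
// (assumes a tokio runtime).
let (detection, classification) =
    tokio::try_join!(detect_request.infer_async()?, classify_request.infer_async()?)?;
```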