ctranslate2-server

Crate: ctranslate2-server (crates.io / lib.rs)
Version: 0.1.2
Created: 2026-01-13
Updated: 2026-01-14
Description: A high-performance inference server for CTranslate2 models, compatible with OpenAI's API.
Repository: https://github.com/any35/ctranslate2-server
Size: 109,922
Author: Walter Ye (any35)

README

CTranslate2 Server

A high-performance, OpenAI-compatible HTTP server for CTranslate2 models, built with Rust and Axum.

Features

  • OpenAI Compatible: Implements /v1/chat/completions for text generation.
  • High Performance: Powered by CTranslate2 (C++) with efficient Rust bindings (ct2rs).
  • Multi-Model Support: Native support for NLLB and T5 models.
  • Flexible Configuration:
    • Aliases: Map friendly names (e.g., nllb) to specific model folders.
    • Lazy Loading: Models are loaded into memory only when first requested.
    • Hardware Acceleration: Support for CPU (MKL/DNNL) and GPU (CUDA).
  • Advanced Generation Control:
    • Beam Size
    • Repetition Penalty
    • Target Language (for multilingual models like NLLB)

Installation

cargo install ctranslate2-server

Quick Start

1. Prepare Models

Download your CTranslate2-converted models into a models/ directory. Structure example:

models/
├── nllb-200-distilled-600M/
│   ├── model.bin
│   ├── sentencepiece.model
│   └── shared_vocabulary.txt
└── t5-small/
    ├── model.bin
    └── ...
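
If you have not converted a model yet, the CTranslate2 converter can produce this layout from a Hugging Face checkpoint. A minimal sketch (the model name and tokenizer file name are assumptions; check your checkpoint's actual files):

# Convert the Hugging Face NLLB checkpoint into CTranslate2 format.
# --copy_files copies the SentencePiece tokenizer next to model.bin;
# in the upstream repo the file may be named sentencepiece.bpe.model.
pip install ctranslate2 transformers sentencepiece
ct2-transformers-converter \
  --model facebook/nllb-200-distilled-600M \
  --output_dir models/nllb-200-distilled-600M \
  --copy_files sentencepiece.bpe.model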

2. Generate Configuration

Run the built-in config generator to scan your models/ directory and create a config.toml.

Using Docker:

docker run --rm -v $(pwd):/app -w /app any35/ctranslate2-server:latest /app/server --config-gen
# Note: You might need to adjust permissions or run the generator locally if using the provided binary directly.
# Alternatively, copy the example config below.

Using Cargo (Local):

cargo run --bin config_generator

3. Run with Docker Compose

Create a docker-compose.yml (a minimal example is sketched below) and run:

docker-compose up -d
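
A minimal docker-compose.yml might look like this; the image tag, port, and mount paths mirror the docker run commands in the Docker section and are otherwise assumptions:

version: "3.8"
services:
  ctranslate2-server:
    image: any35/ctranslate2-server:cpu   # use :gpu with the NVIDIA runtime
    ports:
      - "8080:8080"
    volumes:
      - ./models:/app/models
      - ./config.toml:/app/config.toml
    restart: unless-stopped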

4. Test API

Generate Text (NLLB Translation):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nllb",
    "messages": [
      {"role": "user", "content": "Hello world"}
    ],
    "target_lang": "zho_Hans"
  }'
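
Because the server follows the OpenAI chat-completions schema, a successful response should look roughly like this (field values are illustrative):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "nllb",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "你好，世界"},
      "finish_reason": "stop"
    }
  ]
}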

Configuration (config.toml)

# Global Defaults
default_model = "nllb"
target_lang = "eng_Latn"
device = "cpu"          # "cpu" or "cuda"
device_indices = [0]    # GPU IDs
beam_size = 5
repetition_penalty = 1.2

[server]
host = "0.0.0.0"
port = 8080

[aliases]
"nllb" = "nllb-200-distilled-600M"

[models]
[models."nllb-200-distilled-600M"]
path = "./models/nllb-200-distilled-600M"
model_type = "nllb"
target_lang = "fra_Latn" # Per-model default

API Reference

POST /v1/chat/completions

Parameters:

  • model: (string) Model alias or directory name.
  • messages: (array) List of messages; the last user message is used as the prompt.
  • target_lang: (string, optional) Target language code (e.g., fra_Latn, zho_Hans). Overrides config.
  • beam_size: (int, optional) Beam size for search (default: 5).
  • repetition_penalty: (float, optional) Penalty for repeated tokens (default: 1.2).
  • no_repeat_ngram_size: (int, optional) Prevent repeating n-grams of this size.
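
Example request combining the generation parameters above (values are illustrative):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nllb",
    "messages": [{"role": "user", "content": "Hello world"}],
    "target_lang": "fra_Latn",
    "beam_size": 4,
    "repetition_penalty": 1.3,
    "no_repeat_ngram_size": 3
  }'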

Docker

Build Locally

./scripts/build_docker.sh

Run (CPU)

docker run -p 8080:8080 -v $(pwd)/models:/app/models -v $(pwd)/config.toml:/app/config.toml any35/ctranslate2-server:cpu

Run (GPU)

Requires NVIDIA Container Toolkit.

docker run --gpus all -p 8080:8080 -v $(pwd)/models:/app/models -v $(pwd)/config.toml:/app/config.toml any35/ctranslate2-server:gpu

Convert Models with Quantization

# For NLLB-3.3B: quantize with int8_float16
ct2-transformers-converter --model ./ --output_dir ./quant_16 --quantization int8_float16

# For NLLB-1.3B: quantize with int8
ct2-transformers-converter --model ./ --output_dir ./quant_8 --quantization int8
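
Once converted, register the quantized directory in config.toml like any other model (the table name and path here are examples):

[models."nllb-quant8"]
path = "./models/quant_8"
model_type = "nllb"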
