# ggml

Tensor library for machine learning

***Note that this project is under development and not ready for production use. \
Some of the development is currently happening in the [llama.cpp](https://github.com/ggerganov/llama.cpp) and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) repos***

## Features

- Written in C
- 16-bit float support
- 4-bit integer quantization support
- Automatic differentiation (WIP)
- ADAM and L-BFGS optimizers
- Optimized for Apple silicon via NEON intrinsics and the Accelerate framework
- On x86 architectures utilizes AVX intrinsics
- No third-party dependencies
- Zero memory allocations during runtime

## Roadmap

- [X] Example of GPT-2 inference [examples/gpt-2](https://github.com/ggerganov/ggml/tree/master/examples/gpt-2)
- [X] Example of GPT-J inference [examples/gpt-j](https://github.com/ggerganov/ggml/tree/master/examples/gpt-j)
- [X] Example of Whisper inference [examples/whisper](https://github.com/ggerganov/ggml/tree/master/examples/whisper)
- [X] Support 4-bit integer quantization https://github.com/ggerganov/ggml/pull/27
- [X] Example of Cerebras-GPT inference [examples/gpt-2](https://github.com/ggerganov/ggml/tree/master/examples/gpt-2)
- [ ] Example of FLAN-T5 inference https://github.com/ggerganov/ggml/pull/12
- [X] Example of LLaMA inference [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
- [X] Example of RWKV inference [saharNooby/rwkv.cpp](https://github.com/saharNooby/rwkv.cpp)
- [ ] Example of [SAM](https://github.com/facebookresearch/segment-anything) inference
- [ ] Idea for GPU support: https://github.com/ggerganov/llama.cpp/discussions/915
- [X] Example of StableLM (GPT-NeoX) inference [examples/stablelm](https://github.com/ggerganov/ggml/tree/master/examples/stablelm)
- [X] Example of BERT inference [skeskinen/bert.cpp](https://github.com/skeskinen/bert.cpp)

## Whisper inference (example)

With ggml you can efficiently run [Whisper](examples/whisper) inference on the CPU.

Memory requirements:

| Model  | Disk   | Mem     |
| ---    | ---    | ---     |
| tiny   |  75 MB | ~280 MB |
| base   | 142 MB | ~430 MB |
| small  | 466 MB | ~1.0 GB |
| medium | 1.5 GB | ~2.6 GB |
| large  | 2.9 GB | ~4.7 GB |

## GPT inference (example)

With ggml you can efficiently run [GPT-2](examples/gpt-2) and [GPT-J](examples/gpt-j) inference on the CPU.

Here is how to run the example programs:

```bash
# Build ggml + examples
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2 gpt-j

# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/download-ggml-model.sh 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

# Run the Cerebras-GPT 111M model
# Download from: https://huggingface.co/cerebras
python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"
```

The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:

| Model | Size  | Time / Token |
| ---   | ---   | ---          |
| GPT-2 |  117M |   5 ms       |
| GPT-2 |  345M |  12 ms       |
| GPT-2 |  774M |  23 ms       |
| GPT-2 | 1558M |  42 ms       |
| ---   | ---   | ---          |
| GPT-J |    6B | 125 ms       |

For more information, check out the corresponding programs in the [examples](examples) folder.
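Under the hood, each example program builds a static compute graph of tensor operations inside a single pre-allocated memory pool and then evaluates it, which is how the library avoids memory allocations during runtime. The following is a minimal sketch of that flow for computing `f = a*x^2 + b`, based on the `ggml.h` API as of this writing (`ggml_build_forward`, `ggml_graph_compute`, etc.); exact names and signatures may change as the library evolves:

```c
#include "ggml/ggml.h"

#include <stdio.h>

int main(void) {
    // single memory pool for all tensors and graph data - nothing else is allocated during compute
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };

    struct ggml_context * ctx = ggml_init(params);

    // declare the tensors and define the computation: f = a*x^2 + b
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, ggml_mul(ctx, x, x)), b);

    // build the forward compute graph for f
    struct ggml_cgraph gf = ggml_build_forward(f);
    gf.n_threads = 1;

    // set the input values and evaluate the graph
    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute(ctx, &gf);

    printf("f = %.1f\n", ggml_get_f32_1d(f, 0)); // 3*2^2 + 4 = 16.0

    ggml_free(ctx);

    return 0;
}
```

The example programs follow the same pattern, only with graphs built from the model weights loaded into the context.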
## Using cuBLAS

```bash
# fix the path to point to your CUDA compiler
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..
```
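The rest of the workflow is the same as in the CPU-only build above; for instance, reusing the GPT-2 model downloaded earlier (a sketch — with `GGML_CUBLAS` enabled, ggml uses cuBLAS to accelerate the large matrix multiplications):

```bash
# build and run exactly as before; the cuBLAS-enabled binary offloads large matrix
# multiplications to the GPU
make -j4 gpt-2
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
```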