# StableLM

Transformer architecture: GPT-NeoX

Ref: https://github.com/stability-AI/stableLM/#stablelm-alpha

## Usage

```bash
# get the repo and build it
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j

# get the StableLM 3B Alpha model
git clone https://huggingface.co/stabilityai/stablelm-base-alpha-3b

# convert model to FP16
python3 ../examples/stablelm/convert-h5-to-ggml.py ./stablelm-base-alpha-3b/ 1

# run inference using FP16 precision
make -j && ./bin/stablelm -m ./stablelm-base-alpha-3b/ggml-model-f16.bin -p "I believe the meaning of life is" -t 8 -n 64

main: seed = 1681940611
stablelm_model_load: loading model from 'models/stablelm-base-alpha-3b/ggml-model-f16.bin' - please wait ...
stablelm_model_load: n_vocab = 50688
stablelm_model_load: n_ctx   = 4096
stablelm_model_load: n_embd  = 4096
stablelm_model_load: n_head  = 32
stablelm_model_load: n_layer = 16
stablelm_model_load: n_rot   = 32
stablelm_model_load: ftype   = 1
stablelm_model_load: ggml ctx size = 10011.10 MB
stablelm_model_load: memory_size =  2048.00 MB, n_mem = 65536
stablelm_model_load: ................................ done
stablelm_model_load: model size =  6939.28 MB / num tensors = 260
main: number of tokens in prompt = 7
main: token[0] =     42, I
main: token[1] =   2868,  believe
main: token[2] =    253,  the
main: token[3] =   4495,  meaning
main: token[4] =    273,  of
main: token[5] =   1495,  life
main: token[6] =    310,  is

I believe the meaning of life is to grow, to find a way, to love, to find an appreciation for life, and to live it with all of its beauty. For I am the child of God. I am the offspring of God's love. I am the offspring of the light of the world. I am the offspring of the

main: mem per token = 12186760 bytes
main:     load time =  2118.55 ms
main:   sample time =     9.59 ms
main:  predict time =  4474.07 ms / 63.92 ms per token
main:    total time =  6911.26 ms
```

## 4-bit integer quantization mode

```bash
# quantize the model to 4-bits using Q4_3 quantization
./bin/stablelm-quantize ./stablelm-base-alpha-3b/ggml-model-f16.bin ./stablelm-base-alpha-3b/ggml-model-q4_3.bin 6

# run the quantized model
./bin/stablelm -m ./stablelm-base-alpha-3b/ggml-model-q4_3.bin -p "I believe the meaning of life is" -t 8 -n 64

main: seed = 1682021489
stablelm_model_load: loading model from 'models/stablelm-base-alpha-3b/ggml-model-q4_3.bin' - please wait ...
stablelm_model_load: n_vocab = 50688
stablelm_model_load: n_ctx   = 4096
stablelm_model_load: n_embd  = 4096
stablelm_model_load: n_head  = 32
stablelm_model_load: n_layer = 16
stablelm_model_load: n_rot   = 32
stablelm_model_load: ftype   = 6
stablelm_model_load: ggml ctx size = 5676.10 MB
stablelm_model_load: memory_size =  1024.00 MB, n_mem = 65536
stablelm_model_load: ........................ done
stablelm_model_load: model size =  2604.28 MB / num tensors = 196
main: number of tokens in prompt = 7
main: token[0] =     42, I
main: token[1] =   2868,  believe
main: token[2] =    253,  the
main: token[3] =   4495,  meaning
main: token[4] =    273,  of
main: token[5] =   1495,  life
main: token[6] =    310,  is

I believe the meaning of life is to love and be loved. The last three verses were enough to tie us all together. If you love someone you love them all. There are some things in this world that are just not equal in Heaven. - Be here in this moment. This world is not what is outside of us. It is what

main: mem per token = 12958024 bytes
main:     load time =   850.51 ms
main:   sample time =     9.95 ms
main:  predict time =  3103.81 ms / 44.34 ms per token
main:    total time =  4177.68 ms
```

## Notes

- No guarantees for correctness
- The tokenizer is currently hacked - it probably works only for English
- Non-parallel residual is not supported
- Contributions and improvements are welcome

## Note about possible bug

**There might be an issue with this implementation - I am not 100% sure.
The magnitude of the embeddings increases after each layer, which is unexpected.
To observe this, uncomment the following line:**

https://github.com/ggerganov/ggml/blob/abea4b7609c14b837015ab625e3ac36c4708dd03/src/ggml.c#L9208

```
...
p[ 0] =   65.5842
p[ 1] =   61.6951
p[ 2] =   59.3500
p[ 3] =   61.2421
p[ 4] =   65.9653
p[ 5] =   59.4936
p[ 6] =   58.4164

p[ 0] = -209.6351
p[ 1] = -214.0987
p[ 2] = -217.0928
p[ 3] = -215.0267
p[ 4] = -208.2430
p[ 5] = -215.3692
p[ 6] = -214.1981

p[ 0] = -301.0286
p[ 1] = -308.6521
p[ 2] = -310.7513
p[ 3] = -307.0832
p[ 4] = -299.9238
p[ 5] = -306.0667
p[ 6] = -302.1777
...
```

**Instead, I think the magnitude should remain around `1`.
See https://github.com/ggerganov/llama.cpp/issues/1063#issuecomment-1527730562 for more analysis.**
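One way to sanity-check the expected magnitudes is to print the per-layer hidden states of the original Hugging Face checkpoint and compare them against the `p[...]` values above. The following is a minimal sketch, not part of this example, assuming the `torch` and `transformers` packages and the `stablelm-base-alpha-3b` checkout from the Usage section; the printout format is illustrative only.

```python
# Minimal sketch (assumption, not part of this example): print the mean absolute
# value of each layer's hidden state in the reference Hugging Face model, for
# comparison with the p[...] values produced by the ggml debug line above.
# Requires `torch` and `transformers`, and enough RAM for the fp32 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./stablelm-base-alpha-3b"  # checkout from the Usage section above

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("I believe the meaning of life is", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; the rest are the per-layer outputs
for i, h in enumerate(out.hidden_states):
    print(f"layer {i:2d}: mean |h| = {h.abs().mean().item():.4f}")
```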