We will support different attention approaches. [Candle](https://github.com/huggingface/candle) provides us a broad variety of existing implementations.

| Type | From where |
|-----------------------|---------------------------------------------------------------------------------------------------|
| SelfAttention | Integrated - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/wuerstchen/attention_processor.rs). Operates on one input sequence. Depending on the implementation it is also called *dot product attention* or *global attention*. |
| CrossAttention (aka Co-Attention) | Not integrated so far - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/stable_diffusion/attention.rs). Operates on multiple input sequences. |
| CausalSelfAttention | Not integrated so far - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/llama2_c.rs). Operates on parts of one or multiple input sequences, e.g., only the tokens before the present one. Depending on the implementation it is also called *local attention*. |
| MultiHeadAttention | Not integrated so far - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/segment_anything/transformer.rs). Runs multiple concerns/questions (heads) in parallel. |
| MultiQueryAttention | Not integrated so far - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/chatglm.rs). Runs multiple concerns/questions in parallel, but they all share a single set of keys/values. |
| GroupQueryAttention | Not integrated so far - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/quantized_mpt.rs). Builds logical groups between the questions; each group shares its keys/values. |

*Terms*:
- Heads: number of parallel questions (queries) on a given stream.
- Contexts: number of parallel streams.
- Temporal: the time dimension.
- Spatial: the space dimensions.

*Note: All attention types should be available for multiple dimensions. This includes spatial transformers, which act in >= 2D space (=spatial) as required for CNN applications.*

More complex models are mapped as their own layers:
- https://arxiv.org/html/2312.06635v3 - [here](https://github.com/huggingface/candle/blob/main/candle-transformers/src/models/rwkv_v6.rs)
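
To make the table rows concrete, below is a minimal sketch of the underlying computation, assuming the `candle-core` and `candle-nn` crates. The helper names (`causal_mask`, `attention`) and the shapes are illustrative assumptions, not an existing Candle layer; passing no mask gives plain (global) self-attention, passing the causal mask gives CausalSelfAttention, and feeding `k`/`v` from a second sequence would give CrossAttention.

```rust
// A minimal sketch, assuming candle-core and candle-nn; the helper
// names (`causal_mask`, `attention`) are illustrative, not an
// existing Candle layer.
use candle_core::{Device, Result, Tensor, D};

/// Mask with 1 above the diagonal, so token i only attends to j <= i.
fn causal_mask(t: usize, device: &Device) -> Result<Tensor> {
    let mask: Vec<u8> = (0..t)
        .flat_map(|i| (0..t).map(move |j| u8::from(j > i)))
        .collect();
    Tensor::from_slice(&mask, (t, t), device)
}

/// Scaled dot-product attention over (batch, seq, dim) tensors.
/// SelfAttention: q, k, v come from the same sequence.
/// CrossAttention: k and v come from a second sequence.
/// CausalSelfAttention: pass the mask from `causal_mask`.
fn attention(q: &Tensor, k: &Tensor, v: &Tensor, mask: Option<&Tensor>) -> Result<Tensor> {
    let dim = q.dim(D::Minus1)? as f64;
    // Attention scores: (batch, seq_q, seq_k), scaled by sqrt(dim).
    let mut scores = (q.matmul(&k.transpose(D::Minus2, D::Minus1)?)? / dim.sqrt())?;
    if let Some(mask) = mask {
        // Set masked positions to -inf so softmax assigns them weight 0.
        let neg_inf = Tensor::new(f32::NEG_INFINITY, q.device())?.broadcast_as(scores.dims())?;
        scores = mask.broadcast_as(scores.dims())?.where_cond(&neg_inf, &scores)?;
    }
    let weights = candle_nn::ops::softmax(&scores, D::Minus1)?;
    weights.matmul(v)
}

fn main() -> Result<()> {
    let device = Device::Cpu;
    let (batch, seq, dim) = (2, 5, 8);
    let x = Tensor::randn(0f32, 1f32, (batch, seq, dim), &device)?;
    // Global (dot product) self-attention: no mask.
    let global = attention(&x, &x, &x, None)?;
    // Causal self-attention: only tokens before the present one.
    let causal = attention(&x, &x, &x, Some(&causal_mask(seq, &device)?))?;
    println!("{:?} {:?}", global.shape(), causal.shape());
    Ok(())
}
```

The multi-head variants build on the same computation: they split the last dimension into heads, reshaping (batch, seq, heads * head_dim) to (batch, heads, seq, head_dim), and run attention per head; MultiQueryAttention then shares one key/value set across all heads, and GroupQueryAttention shares one key/value set per group of heads.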