# Example of using epilogue visitor in the CUTLASS Python interface
This notebook walks through a basic example of using the CUTLASS Python interface to declare, compile, and run GEMMs with custom epilogues built with the CUTLASS Epilogue Visitor.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/04_epilogue_visitor.ipynb)


## Prerequisites for running on Colab
This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
!#nvidia-smi

If running on Colab, you will need to install the CUTLASS Python interface. To do so, uncomment the following line and run the cell:

In [None]:
!#pip install nvidia-cutlass

## General setup
We first import the packages needed for the example and construct the input and output tensors that it will use.

In [None]:
import torch
import cutlass
from cutlass.epilogue import relu
from cutlass import Tensor as FakeTensor
from cutlass.utils.profiler import CUDAEventProfiler

# This controls whether the C++ GEMM declaration will be printed at each step. Set to `False` to
# omit this information.
print_module = True

# The Epilogue Visitor feature currently only works for SM80 and 90
from cutlass.backend.utils.device import device_cc
if device_cc() not in [80, 90]:
    import sys
    sys.exit()

m = 16384
n = m
k = 512

type_A = torch.float16
type_B = torch.float16
type_C = torch.float16
type_D = torch.float16

torch.manual_seed(2023)
scope_min = -4
scope_max = 4
tensor_A = torch.ceil(torch.empty(size=(m, k), dtype=type_A, device="cuda").uniform_(scope_min, scope_max))
tensor_B = torch.ceil(torch.empty(size=(k, n), dtype=type_B, device="cuda").uniform_(scope_min, scope_max))
tensor_C = torch.ceil(torch.empty(size=(m, n), dtype=type_C, device="cuda").uniform_(scope_min, scope_max))
tensor_D = torch.zeros_like(tensor_C)

plan = cutlass.op.Gemm(element=torch.float16, layout=cutlass.LayoutType.RowMajor, element_accumulator=torch.float32)

## Define the epilogue visitor functor
The epilogue functor is defined by a simple Python function together with a set of example tensors for its inputs and outputs. The example below illustrates a complex epilogue whose dataflow forms a directed acyclic graph (`F` is used twice). The epilogue takes source tensors of different ranks: `alpha` and `beta` are scalars, `bias` is a column vector that is broadcast, and `C` and `aux` are matrices. It combines several kinds of math operations, from basic arithmetic to built-in callable functions like `relu`. It also accommodates multiple outputs, `D` and `F`. Note that there are some restrictions on syntax.
* Each named variable must be assigned exactly once and defined before it is used.
* Reserved names: `accum`, `C`, and `D` are reserved for the accumulator, tensor_C, and tensor_D.
* Each return value must be a named variable.
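
The restrictions above can be illustrated with a small static check. The following is only a sketch built on Python's standard `ast` module; `check_epilogue_source` is a hypothetical helper written for this notebook, not part of CUTLASS, and it assumes that re-assigning an input parameter also counts as a second assignment.

```python
import ast
import textwrap


def check_epilogue_source(source: str) -> list:
    """Return human-readable rule violations for one epilogue function (hypothetical helper)."""
    func = ast.parse(textwrap.dedent(source)).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("expected a function definition")

    defined = {arg.arg for arg in func.args.args}  # inputs are defined on entry
    problems = []

    for stmt in func.body:
        if isinstance(stmt, ast.Assign):
            # Names used as call targets (e.g. `relu`) are functions, not variables.
            callees = {id(n.func) for n in ast.walk(stmt.value)
                       if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
            for node in ast.walk(stmt.value):
                if (isinstance(node, ast.Name) and id(node) not in callees
                        and node.id not in defined):
                    problems.append(f"'{node.id}' is used before it is defined")
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if target.id in defined:
                        problems.append(f"'{target.id}' is assigned more than once")
                    defined.add(target.id)
        elif isinstance(stmt, ast.Return):
            value = stmt.value
            items = value.elts if isinstance(value, ast.Tuple) else [value]
            if not all(isinstance(item, ast.Name) for item in items):
                problems.append("return values must be named variables")
    return problems


VALID = """
def epilogue(accum, alpha, C, beta, aux, bias):
    F = alpha * accum + (beta * C + aux)
    E = relu(F + 1) + bias
    D = E + F
    return D, F
"""

INVALID = """
def epilogue(accum, C):
    D = accum + T      # T is read before it exists
    T = C * 2
    T = T + 1          # second assignment to T
    return D + T       # returns an expression, not a named variable
"""

print(check_epilogue_source(VALID))
for problem in check_epilogue_source(INVALID):
    print(problem)
```

The valid source mirrors the epilogue used in this notebook and yields no violations, while the invalid one breaks each of the three assignment and return rules once.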

The example tensors are given as a dictionary with tensor names as keys and reference tensors as values. The reference tensors can be `float`, `torch.Tensor`, `numpy.ndarray`, or our `FakeTensor`. They provide the shape and data type information of the inputs and outputs of the epilogue.
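
To see how operands of different ranks combine, here is a minimal pure-Python sketch of NumPy-style broadcasting applied to the shapes used in this notebook. `broadcast_shape` is a hypothetical helper, and the assumption that the epilogue follows these same rules is ours; it matches how the PyTorch reference computation behaves.

```python
def broadcast_shape(*shapes):
    """Combine shapes NumPy-style: align from the right, size-1 dimensions stretch."""
    rank = max((len(s) for s in shapes), default=0)
    result = []
    for axis in range(rank):
        sizes = set()
        for shape in shapes:
            # Left-pad shorter shapes with 1s so every shape has the same rank.
            padded = (1,) * (rank - len(shape)) + tuple(shape)
            sizes.add(padded[axis])
        sizes.discard(1)
        if len(sizes) > 1:
            raise ValueError(f"incompatible shapes: {shapes}")
        result.append(sizes.pop() if sizes else 1)
    return tuple(result)


m, n = 16384, 16384
example_shapes = {
    "accum": (m, n),   # matrix
    "alpha": (),       # scalar
    "C": (m, n),       # matrix
    "beta": (),        # scalar
    "aux": (m, n),     # matrix
    "bias": (m, 1),    # column vector, stretched along n
}
out_shape = broadcast_shape(*example_shapes.values())
print(out_shape)  # (16384, 16384)
```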

The epilogue can be generated simply through `cutlass.epilogue.trace(<epilogue function>, <example tensors>)`.

In [None]:
# Define epilogue visitor
def example_epilogue(accum, alpha, C, beta, aux, bias):
    F = alpha * accum + (beta * C + aux)
    E = relu(F + 1) + bias
    D = E + F
    return D, F

# Construct inputs and outputs
alpha = 0.5
beta = 0.5
aux = torch.ceil(torch.empty(size=(m, n), dtype=type_C, device="cuda").uniform_(scope_min, scope_max))
bias = torch.ceil(torch.empty(size=(m, 1), dtype=type_C, device="cuda").uniform_(scope_min, scope_max))
tensor_F = torch.zeros_like(tensor_D)
examples_tensors = {
    "accum": FakeTensor(element=torch.float32, shape=(m, n), layout_tag=cutlass.LayoutType.RowMajor),
    "alpha": alpha,
    "C": tensor_C,
    "beta": beta,
    "aux": aux,
    "bias": bias,
    "D": tensor_D,
    "F": tensor_F
}

# Trace the epilogue visitor
epilogue_visitor = cutlass.epilogue.trace(example_epilogue, examples_tensors)
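
To make the dataflow concrete, the same expression can be evaluated on plain Python floats, treating each value as a single matrix element. This is only an illustrative sketch: `scalar_relu` and `scalar_epilogue` are hypothetical stand-ins for `relu` and `example_epilogue`, renamed so they do not shadow the definitions above, and the input values are arbitrary.

```python
def scalar_relu(x):
    # Stand-in for the built-in `relu`, applied to one Python float.
    return max(x, 0.0)


def scalar_epilogue(accum, alpha, C, beta, aux, bias):
    F = alpha * accum + (beta * C + aux)   # F feeds both E and D (the DAG part)
    E = scalar_relu(F + 1) + bias
    D = E + F
    return D, F


# F = 0.5*2 + (0.5*4 - 1) = 2, E = relu(3) + 3 = 6, D = 6 + 2 = 8
D_val, F_val = scalar_epilogue(accum=2.0, alpha=0.5, C=4.0, beta=0.5, aux=-1.0, bias=3.0)
print(D_val, F_val)  # 8.0 2.0
```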

## Run a GEMM with the epilogue visitor functor
The `epilogue_visitor` can be used by setting the plan's `epilogue_visitor` field. The arguments for the epilogue visitor are provided as a `dict` through the `visitor_args` keyword argument.

In [None]:
visitor_args = {
    "alpha": alpha, "C": tensor_C, "beta": beta,
    "aux": aux, "bias": bias, "D": tensor_D, "F": tensor_F
}

plan.epilogue_visitor = epilogue_visitor
plan.run(
    tensor_A, tensor_B, tensor_C, tensor_D,
    visitor_args=visitor_args, print_module=print_module)
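
Note that `visitor_args` above contains every key of the example-tensor dictionary except `accum`. A minimal sketch of deriving one from the other is shown below; the placeholder strings stand in for the real tensors so the snippet runs without a GPU, and the `_sketch` names are hypothetical. That `accum` is omitted because the GEMM itself produces it is our reading of the example, not a documented rule.

```python
# Placeholder values stand in for the real tensors so the sketch runs anywhere.
examples_tensors_sketch = {
    "accum": "fake accumulator", "alpha": 0.5, "C": "tensor_C", "beta": 0.5,
    "aux": "aux", "bias": "bias", "D": "tensor_D", "F": "tensor_F",
}

# Assumption: `accum` comes from the GEMM, so it is not passed as an argument.
visitor_args_sketch = {
    name: value for name, value in examples_tensors_sketch.items() if name != "accum"
}
print(sorted(visitor_args_sketch))
```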

The epilogue function `example_epilogue` can also serve as a reference implementation, so the results can be verified with:

In [None]:
class TorchReference(torch.nn.Module):
    def forward(self, A, B, alpha, C, beta, aux, bias):
        accum = torch.matmul(A, B)
        return example_epilogue(accum, alpha, C, beta, aux, bias)

torch_reference = TorchReference()
tensor_D_ref, tensor_F_ref = torch_reference(tensor_A, tensor_B, alpha, tensor_C, beta, aux, bias)

assert torch.equal(tensor_D, tensor_D_ref)
assert torch.equal(tensor_F, tensor_F_ref)

The performance of the CUTLASS fused kernel can be profiled with:

In [None]:
warmup_iterations = 10
profile_iterations = 50
# Profile CUTLASS fused kernel
duration = CUDAEventProfiler(
    plan, warmup_iterations, profile_iterations,
    tensor_A, tensor_B, tensor_C, tensor_D,
    visitor_args=visitor_args)()

print(f"CUTLASS duration: {duration:.2f} ms")
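
A duration in milliseconds can be converted into an approximate throughput figure. The sketch below assumes the usual convention of 2 FLOPs per multiply-add for an m x n x k GEMM and ignores the element-wise epilogue work; `gemm_tflops` is a hypothetical helper, and the 1.0 ms duration is a made-up value used only to show the arithmetic, not a measurement.

```python
def gemm_tflops(m: int, n: int, k: int, duration_ms: float) -> float:
    """Approximate TFLOP/s for an m x n x k GEMM, counting 2 FLOPs per multiply-add."""
    flops = 2 * m * n * k
    seconds = duration_ms * 1e-3
    return flops / seconds / 1e12


# Hypothetical duration; substitute the `duration` value measured above.
hypothetical_duration_ms = 1.0
tflops = gemm_tflops(16384, 16384, 512, hypothetical_duration_ms)
print(f"Approximate throughput: {tflops:.1f} TFLOP/s")
```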