// Tile size: 4x3 // Accumulators: 0-11 // Col regs: ymm12 // Row regs: ymm13-15 // Load col of A vmovaps ymm12, [rax] // Fill 3 cols of B vbroadcastss ymm13, dword ptr [rcx + 0] vbroadcastss ymm14, dword ptr [rcx + 4] vbroadcastss ymm15, dword ptr [rcx + 8] // N.B. Stepping cols in inner loop vfmadd231ps ymm0, ymm12, ymm13 vfmadd231ps ymm4, ymm12, ymm14 vfmadd231ps ymm8, ymm12, ymm15 vmovaps ymm12, [rax+32] vfmadd231ps ymm1, ymm12, ymm13 vfmadd231ps ymm5, ymm12, ymm14 vfmadd231ps ymm9, ymm12, ymm15 vmovaps ymm12, [rax+64] vfmadd231ps ymm2, ymm12, ymm13 vfmadd231ps ymm6, ymm12, ymm14 vfmadd231ps ymm10, ymm12, ymm15 vmovaps ymm12, [rax+96] vfmadd231ps ymm3, ymm12, ymm13 vfmadd231ps ymm7, ymm12, ymm14 vfmadd231ps ymm11, ymm12, ymm15 // Load col of A, switching col! vmovaps ymm13, [rax + 128] // Fill 3 cols of B vbroadcastss ymm14, dword ptr [rcx + 12] vbroadcastss ymm15, dword ptr [rcx + 16] vbroadcastss ymm12, dword ptr [rcx + 20] // N.B. Stepping cols in inner loop vfmadd231ps ymm0, ymm13, ymm14 vfmadd231ps ymm4, ymm13, ymm15 vfmadd231ps ymm8, ymm13, ymm12 vmovaps ymm13, [rax + 160] vfmadd231ps ymm1, ymm13, ymm14 vfmadd231ps ymm5, ymm13, ymm15 vfmadd231ps ymm9, ymm13, ymm12 vmovaps ymm13, [rax + 192] vfmadd231ps ymm2, ymm13, ymm14 vfmadd231ps ymm6, ymm13, ymm15 vfmadd231ps ymm10, ymm13, ymm12 vmovaps ymm13, [rax + 224] vfmadd231ps ymm3, ymm13, ymm14 vfmadd231ps ymm7, ymm13, ymm15 vfmadd231ps ymm11, ymm13, ymm12 add rcx, 24 add rax, 256