// Tile size: 2x6 // Accumulators: 0-11 // Col regs: ymm14-15 // Row regs: ymm12-13 vbroadcastss ymm14, dword ptr [rcx] vmovaps ymm12, [rax] vmovaps ymm13, [rax + 32] vbroadcastss ymm15, dword ptr [rcx + 4] vfmadd231ps ymm0, ymm12, ymm14 vfmadd231ps ymm1, ymm13, ymm14 vbroadcastss ymm14, dword ptr [rcx + 8] vfmadd231ps ymm2, ymm12, ymm15 vfmadd231ps ymm3, ymm13, ymm15 vbroadcastss ymm15, dword ptr [rcx + 12] vfmadd231ps ymm4, ymm12, ymm14 vfmadd231ps ymm5, ymm13, ymm14 vbroadcastss ymm14, dword ptr [rcx + 16] vfmadd231ps ymm6, ymm12, ymm15 vfmadd231ps ymm7, ymm13, ymm15 vbroadcastss ymm15, dword ptr [rcx + 20] vfmadd231ps ymm8, ymm12, ymm14 vfmadd231ps ymm9, ymm13, ymm14 vbroadcastss ymm14, dword ptr [rcx+24] vfmadd231ps ymm10, ymm12, ymm15 vfmadd231ps ymm11, ymm13, ymm15 // Iteration two vmovaps ymm12, [rax + 64] vmovaps ymm13, [rax + 96] vbroadcastss ymm15, dword ptr [rcx + 24 + 4] vfmadd231ps ymm0, ymm12, ymm14 vfmadd231ps ymm1, ymm13, ymm14 vbroadcastss ymm14, dword ptr [rcx + 24 + 8] vfmadd231ps ymm2, ymm12, ymm15 vfmadd231ps ymm3, ymm13, ymm15 vbroadcastss ymm15, dword ptr [rcx + 24 + 12] vfmadd231ps ymm4, ymm12, ymm14 vfmadd231ps ymm5, ymm13, ymm14 vbroadcastss ymm14, dword ptr [rcx + 24 + 16] vfmadd231ps ymm6, ymm12, ymm15 vfmadd231ps ymm7, ymm13, ymm15 vbroadcastss ymm15, dword ptr [rcx + 24 + 20] vfmadd231ps ymm8, ymm12, ymm14 vfmadd231ps ymm9, ymm13, ymm14 vfmadd231ps ymm10, ymm12, ymm15 vfmadd231ps ymm11, ymm13, ymm15 add rax, 128 add rcx, 48