// Tile size: 5x2 // Accumulators: 0-9 // Col regs: ymm10-13 // Row regs: ymm14-15 vmovaps ymm10, [rax] vbroadcastss ymm14, dword ptr [rcx + 0] vbroadcastss ymm15, dword ptr [rcx + 4] vmovaps ymm11, [rax + 32] // NB stepping column-wise vfmadd231ps ymm0, ymm10, ymm14 vfmadd231ps ymm5, ymm10, ymm15 vmovaps ymm12, [rax + 64] vfmadd231ps ymm1, ymm11, ymm14 vfmadd231ps ymm6, ymm11, ymm15 vmovaps ymm13, [rax + 96] vfmadd231ps ymm2, ymm12, ymm14 vfmadd231ps ymm7, ymm12, ymm15 vmovaps ymm10, [rax + 128] vfmadd231ps ymm3, ymm13, ymm14 vfmadd231ps ymm8, ymm13, ymm15 vmovaps ymm11, [rax + 160] vfmadd231ps ymm4, ymm10, ymm14 vfmadd231ps ymm9, ymm10, ymm15 vbroadcastss ymm14, dword ptr [rcx + 8] vbroadcastss ymm15, dword ptr [rcx + 12] vmovaps ymm12, [rax + 192] // NB stepping column-wise vfmadd231ps ymm0, ymm11, ymm14 vfmadd231ps ymm5, ymm11, ymm15 vmovaps ymm13, [rax + 224] vfmadd231ps ymm1, ymm12, ymm14 vfmadd231ps ymm6, ymm12, ymm15 vmovaps ymm10, [rax + 256] vfmadd231ps ymm2, ymm13, ymm14 vfmadd231ps ymm7, ymm13, ymm15 vmovaps ymm11, [rax + 288] vfmadd231ps ymm3, ymm10, ymm14 vfmadd231ps ymm8, ymm10, ymm15 vfmadd231ps ymm4, ymm11, ymm14 vfmadd231ps ymm9, ymm11, ymm15 add rax, 320 add rcx, 16