// Accumulators: 0-9 // Columns: 14-15 // Rows: 10-13 vbroadcastss ymm10, dword ptr [rcx] vbroadcastss ymm11, dword ptr [rcx + 4] vbroadcastss ymm12, dword ptr [rcx + 8] vbroadcastss ymm13, dword ptr [rcx + 12] vmovaps ymm14, [rax] vmovaps ymm15, [rax + 32] vfmadd231ps ymm0, ymm14, ymm10 vfmadd231ps ymm1, ymm15, ymm10 vfmadd231ps ymm2, ymm14, ymm11 vfmadd231ps ymm3, ymm15, ymm11 // Use register 11 as it's "middle" use, leading to a decent // trade-off between required use next iteration and when it has // to be used this iteration. vbroadcastss ymm11, dword ptr [rcx + 16] vfmadd231ps ymm4, ymm14, ymm12 vfmadd231ps ymm5, ymm15, ymm12 vfmadd231ps ymm6, ymm14, ymm13 vfmadd231ps ymm7, ymm15, ymm13 vfmadd231ps ymm8, ymm14, ymm11 vfmadd231ps ymm9, ymm15, ymm11