# Vectorized and performance-portable Quicksort

## Introduction

As of 2022-06-07, this sorts large arrays of built-in types about ten times as fast as LLVM's `std::sort`. Note that as of 2023-06, other algorithms such as pdqsort can be about twice as fast as LLVM's `std::sort`.

See also our
[blog post](https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html)
and [paper](https://arxiv.org/abs/2205.05982).

## Instructions

Here are instructions for reproducing our results with cross-platform CMake, Linux, or AWS (SVE, NEON).

### CMake, any platform

Please first ensure that Clang (tested with 13.0.1 and 15.0.6) is installed, and if it is not the default compiler, point the CC and CXX environment variables to it, e.g.

```
export CC=clang-15
export CXX=clang++-15
```

Then run the usual CMake workflow, also documented in the Highway README, e.g.:

```
mkdir -p build && cd build && cmake .. && make -j
taskset -c 2 tests/bench_sort
```

The optional `taskset -c 2` part reduces the variability of measurements by preventing the OS from migrating the benchmark between cores.

### Linux

Please first ensure that golang and Clang (tested with 13.0.1) are installed via your system's package manager.

```
go install github.com/bazelbuild/bazelisk@latest
git clone https://github.com/google/highway
cd highway
CC=clang CXX=clang++ ~/go/bin/bazelisk build -c opt hwy/contrib/sort:all
bazel-bin/hwy/contrib/sort/sort_test
bazel-bin/hwy/contrib/sort/bench_sort
```

### AWS Graviton3

Instance config: Amazon Linux 5.10 arm64, c7g.8xlarge (the largest allowed config is 32 vCPU). The initial launch will fail; wait a few minutes for an email saying the config is verified, then re-launch. Look up the IPv4 hostname in the list of instances, then `ssh -i /path/key.pem ec2-user@hostname`.

Note that the AWS CMake package is too old for LLVM, so we build CMake first:

```
wget https://cmake.org/files/v3.23/cmake-3.23.2.tar.gz
tar -xvzf cmake-3.23.2.tar.gz && cd cmake-3.23.2/
./bootstrap -- -DCMAKE_USE_OPENSSL=OFF
make -j8 && sudo make install
cd ..
```

AWS Clang is at version 11.1, which generates unnecessary `AND` instructions that slow down the sort by 1.15x. We tested with Clang trunk as of June 13 (which reports Git hash 8f6512fea000c3a0d394864bb94e524bee375069). To build it:

```
git clone --depth 1 https://github.com/llvm/llvm-project.git
cd llvm-project
mkdir -p build && cd build
/usr/local/bin/cmake ../llvm -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi" -DCMAKE_BUILD_TYPE=Release
make -j32 && sudo make install
```

```
sudo yum install go
go install github.com/bazelbuild/bazelisk@latest
git clone https://github.com/google/highway
cd highway
CC=/usr/local/bin/clang CXX=/usr/local/bin/clang++ ~/go/bin/bazelisk build -c opt --copt=-march=armv8.2-a+sve hwy/contrib/sort:all
bazel-bin/hwy/contrib/sort/sort_test
bazel-bin/hwy/contrib/sort/bench_sort
```

The above command line enables SVE, which is currently only available on Graviton3. You can also test NEON on the same processor, or on other Arm CPUs, by changing the `-march=` option to `--copt=-march=armv8.2-a+crypto`. Note that such flags will become unnecessary once Clang supports `#pragma target` for NEON and SVE intrinsics, as it already does for x86.
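Once built, calling the sorter from C++ looks roughly like the following minimal sketch. This is an illustration rather than part of the build instructions; it assumes the `hwy::VQSort` entry points declared in `hwy/contrib/sort/vqsort.h`, and include/link details depend on your build setup.

```c++
#include <cstdint>
#include <cstdio>
#include <vector>

#include "hwy/contrib/sort/vqsort.h"  // assumed header for hwy::VQSort

int main() {
  std::vector<int64_t> keys = {3, -1, 4, 1, -5, 9, 2, 6};
  // Sorts in place, ascending; overloads exist for other built-in key types.
  hwy::VQSort(keys.data(), keys.size(), hwy::SortAscending());
  for (const int64_t k : keys) printf("%lld ", static_cast<long long>(k));
  printf("\n");
  return 0;
}
```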
## Results

`bench_sort` outputs the instruction set (AVX3 refers to AVX-512), the sort algorithm (`std` for `std::sort`, `vq` for our vqsort), the type of keys being sorted (`f32` is float), the distribution of keys (`uniform32` for uniform random with range 0-2^32), the number of keys, and finally the throughput of sorted keys (i.e. the number of key bytes output per second).

Example excerpt from a Xeon 6154 (Skylake-X) CPU clocked at 3 GHz:

```
[ RUN      ] BenchSortGroup/BenchSort.BenchAllSort/AVX3
AVX3:       std:   f32: uniform32: 1.00E+06   54 MB/s ( 1 threads)
AVX3:        vq:   f32: uniform32: 1.00E+06 1143 MB/s ( 1 threads)
```

## Additional results

Thanks to Lukas Bergdoll for his thorough
[performance analysis](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md)
of various sort implementations. It helped us identify a performance bug caused by obtaining entropy from the OS on each call. This was fixed in #1334 and we look forward to the updated results.

### Optimizations for small arrays

Our initial focus was on large arrays. Since the VQSort paper was published, we have improved its performance for small arrays:

- Previously, each call to VQSort obtained entropy from the OS. Unpredictable seeding does help avoid worst cases, and the cost is negligible when the input size is at least 100K elements. However, the overhead is very costly for arrays of just 100 or 1000 elements, so we now obtain entropy only once per thread and cache the seeds in TLS. This significantly improves performance on subsequent calls. Users can also explicitly initialize the random generator.

- We also improved the efficiency of our sorting network for inputs shorter than half its size. Our approach avoids costly transposes by interpreting inputs as a 2D matrix. Previously, we always used 16 rows, which means only a single vector lane is active for up to 16 elements. We have added 8x2 and 8x4 networks that use more lanes when available, and also 4x1 and 8x1 networks for very small inputs.

- Previously, we also loaded (overlapping) full vectors, with the offsets determined by the number of columns. Now we use the minimum vector size sufficient for the number of columns, which enables higher IPC on Skylake and reduces the cost of unaligned loads. Unfortunately this decreases code reuse; VQSort now consists of about 1500 instructions (https://gcc.godbolt.org/z/ojYKfjPe6). The size of the sorting networks has nearly doubled to 10.8 KiB, 70% of the total. Although large, this still fits comfortably within 32 KiB instruction caches, and possibly even in micro-op caches (DSB, 1500-2300 micro-ops), especially given that not all instructions are guaranteed to execute.

### Study of AVX-512 downclocking

We study whether AVX-512 downclocking affects performance. Using the GHz reported by perf, we find an upper bound on the effect of downclocking, and observe that it is negligible compared to the speedup that 512-bit SIMD provides over scalar or narrower SIMD code.

This issue has attracted far more attention than seems warranted. An attempt by Daniel Lemire to measure the
[worst case](https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-throttling-a-3-impact/)
only saw a **3% decrease**, and Intel CPUs since Icelake, as well as AMD Zen4, are much less affected by throttling, if at all. By contrast, "Silver" and "Bronze" Intel Xeons have more severe throttling and would require a large(r) speedup from AVX-512 to outweigh the downclocking.
However, these CPUs are marketed towards "entry compute, network and storage" and "small business and storage server solutions", and are thus less suitable for the high-performance workloads we consider.

Our test workstation runs Linux (6.1.20-2rodete1-amd64) and has the same Xeon Gold 6154 CPU used in our paper, because its Skylake microarchitecture is the most (potentially) affected. The compiler is a Clang similar to LLVM trunk.

We added a new 'cold' benchmark that initializes random seeds, fills an array with a constant except at one random index, calls VQSort, and then prints a random element to ensure the computations are not elided (a sketch of this harness is shown below). To run it, we build bench_sort with `-DSORT_ONLY_COLD=1` and then invoke `taskset -c 6 setarch -R x86_64 perf stat -r 15 -d bench_sort`. The taskset and setarch serve to reduce variability by avoiding thread migration and disabling address space randomization. `-r 15` requests 15 runs so that perf can display the variability of the measurements: < 1% for cycles, instructions and L1 dcache loads; LLC miss variability is much higher (> 10%), presumably due to the remaining background activity on this machine.

For our measurements, we use the GHz value reported by `perf`. This does not include time spent in the kernel, and is thus noisy for short runtimes. Note that running `perf` under `sudo` is not an option because it results in "Workload failed: Cannot allocate memory".

We see results between 2.6 and 2.9 GHz when running AVX-512 code. This is relative to the 3.0 GHz nominal frequency; we disabled Turbo Boost via MSR and ran `sudo cpupower frequency-set --governor performance` to prevent unnecessary frequency reductions. To the best of our knowledge, the remaining gap is explained by time spent in the kernel (in particular handling page faults) and downclocking. Thus an *upper bound* for the latter is (3 - 2.9)/3 to (3 - 2.6)/3, or **1.03 - 1.13x**. Such a frequency reduction would already be negligible compared to the 2-4x increase in work per cycle from 512-bit SIMD relative to 256- or 128-bit SIMD, which is typically less affected by downclocking, or not at all.

To further tighten this bound, we compare AVX-512 code against non-AVX-512 code in the form of `std::sort`. Ensuring the remainder of the binary does not use AVX-512 is nontrivial. Library functions such as `memset` are known to use AVX-512, yet they would not show up in a disassembly of our binary. Neither would they raise exceptions if run on a CPU lacking AVX-512 support, because software typically verifies CPU support before executing AVX-512 instructions. As a first step, we take care to avoid calls to such library functions in our test, which is more feasible with a small, self-contained binary. In particular, array zero-initialization typically compiles to `memset` (verified with clang-16), so we manually initialize the array to the return value of an `Unpredictable1` function whose implementation is not visible to the compiler. This indeed compiles to a scalar loop.

To further increase confidence that the binary lacks AVX-512 instructions before VQSort, we replace the initialization loop with AVX-512 stores. This indeed raises the measured throughput from a fairly consistent 9 GB/s to 9-15 GB/s, likely because some of the AVX-512 startup now occurs outside of our timings. We examine this effect in the next section, but for now we can conclude that because adding AVX-512 makes a difference, the binary was otherwise not using it. Now we can revert to scalar initialization and compare the GHz reported for VQSort vs. `std::sort`.
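Putting the pieces together, the 'cold' measurement roughly corresponds to the following hypothetical sketch (the actual code lives in bench_sort behind `-DSORT_ONLY_COLD=1`). It assumes the `hwy::VQSort` entry point from `hwy/contrib/sort/vqsort.h`; `Unpredictable1` here stands for a function returning 1 that is defined in a separate translation unit, so the compiler cannot turn the fill loop into a `memset`.

```c++
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <memory>

#include "hwy/contrib/sort/vqsort.h"  // assumed header for hwy::VQSort

int Unpredictable1();  // opaque to the compiler; always returns 1

int main() {
  // (The actual benchmark also initializes VQSort's random seeds here.)
  constexpr size_t kNum = 10 * 1000;  // 10K keys: roughly 7-10 us per sort
  // new[] without () leaves the keys uninitialized, i.e. no memset call.
  std::unique_ptr<uint64_t[]> keys(new uint64_t[kNum]);
  const uint64_t value = static_cast<uint64_t>(Unpredictable1());
  for (size_t i = 0; i < kNum; ++i) keys[i] = value;  // scalar loop, no memset
  keys[static_cast<size_t>(rand()) % kNum] = value + 1;  // one differing key
  hwy::VQSort(keys.get(), kNum, hwy::SortAscending());
  // Print a random element so the compiler cannot elide the sort.
  const uint64_t elem = keys[static_cast<size_t>(rand()) % kNum];
  printf("%llu\n", static_cast<unsigned long long>(elem));
  return 0;
}
```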
Across three runs, the GHz ranges for VQSort and `std::sort` are 2.8-2.9 and 2.8-2.8, respectively. Thus we conclude: if there is any downclocking for a single core running AVX-512 on this Skylake-X CPU, the effect is **under the noise floor of our measurement**, and certainly far below any speedup one can reasonably predict from 512-bit SIMD. We expect this result to generalize to AMD Zen4 and any Gold/Platinum Intel Xeon.

### Study of AVX-512 startup overhead

In the previous section, we saw that downclocking is negligible on our system, but that there is a noticeable benefit to warming up AVX-512 before the sort. To understand why, we refer to Travis Downs' excellent
[measurements](https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html#summary)
of how Skylake reacts to an AVX-512 instruction: 8-20 us of reduced instruction throughput, a potential additional halt of 10 us, and then downclocking (which, per the previous section, is negligible on a single core).

We choose an array length of 10K unsigned 64-bit keys such that VQSort completes in 7-10 us. Thus in this benchmark, VQSort (almost) finishes before AVX-512 is fully warmed up, and the speedup is reduced because the startup costs are amortized over relatively little data. Across five series of 15 runs, the average of average throughputs is 9.3 GB/s, implying a runtime of 8.6 us including startup costs.

Note that the two-valued, almost-all-equal input distribution is quite skewed, so the above throughput does not reflect the performance attainable on other distributions, especially uniform random. However, this choice is deliberate: Quicksort can terminate early if all values in a partition are equal, and when measuring such a 'best-case' input, we are more likely to observe the cost of startup overhead in the surrounding code. Otherwise, this overhead might be hidden by the increase in sorting time.

Now let us compare this throughput to the previously mentioned measurement with AVX-512 warmed up (via slow scatter instructions, so that initialization takes about 100 us, well in excess of the warmup period): 15.2 GB/s, or 5.3 us without startup cost. It appears the 10 us halt is not happening, possibly because we do not use SIMD floating-point or multiplication instructions. Thus we only experience reduced instruction throughput and/or increased latency. The ratio between cold and warmed-up time is only 1.6, which is plausible if the Skylake throttling actually rounds latencies up to a multiple of four cycles, as Downs speculates. Indeed, a large fraction of the SIMD instructions, especially in the VQSort base case, are cross-lane or 64-bit min/max operations with latencies of 3 cycles on Skylake, so their slowdown might only be 1.3x. The measured 1.6x could plausibly result from 7/8 of the instructions slowing down by 1.3x and the remaining 1/8 (single-cycle-latency instructions) by 4x.

Assuming this understanding of AVX-512 startup cost is valid, how long does it remain active before the CPU reverts to its previous settings? The CPU cannot know what future instructions are coming, so to prevent unnecessary transitions it has a hysteresis (a delay after the last AVX-512 instruction before shutting down) which Downs measures as 680 us. Thus our benchmark subsequently sleeps for 100 ms to ensure the next run of the binary sees the original CPU state. Indeed, across the five series the slopes of the lines of best fit are negative in one case, positive in two, and flat in two, indicating no consistent pattern of benefit for earlier or later runs.
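As a quick cross-check of the runtimes quoted above (illustrative arithmetic only, not benchmark code), the 8.6 us and 5.3 us figures follow directly from the measured throughputs and the 80,000 bytes of key data:

```c++
// 10,000 uint64 keys = 80,000 bytes of key data.
constexpr double kKeyBytes = 10000.0 * 8.0;
constexpr double kColdSeconds = kKeyBytes / 9.3E9;   // ~8.6e-6 s, startup included
constexpr double kWarmSeconds = kKeyBytes / 15.2E9;  // ~5.3e-6 s, AVX-512 warmed up
// kColdSeconds / kWarmSeconds ~= 1.6, the cold/warm ratio discussed above.
```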
What are the implications for users of VQSort? If the surrounding code executes an AVX-512 instruction at least every 500 us, then AVX-512 remains active and **any call to VQSort will benefit from it, no matter how small the input**. This is a reasonable expectation for modern systems whose designers are aware of data-oriented programming principles, because many (though not all) domains and operations can benefit from SIMD.

By contrast, consider dropping VQSort into an existing legacy system that does not yet use SIMD. For 10K input sizes, we still observe a 2.3x speedup vs. `std::sort`. However, the following code may have to deal with throttling for the remainder of the 20 us startup period. With VQSort we have 8.6 us of runtime, plus up to 11.4 us of throttled code (potentially running at quarter speed), plus the remaining 3/4 of that 11.4 us, for a total of 28.6 us. With `std::sort` we have 19.5 us of runtime plus 20 us of normal subsequent code, or 39.5 us. Thus the overall speedup for the 20 us region plus VQSort **shrinks to 1.4x**, and it is possible to imagine an actual slowdown for sufficiently small inputs once the throttling of subsequent code is factored in.

This unfortunate 'beggar thy neighbor' effect cannot be solved at the level of individual building blocks such as a sort; it must instead be addressed at the system level. For example:

- vectorizing more and more parts of the code to amortize the startup cost;
- relying on newer CPUs than Skylake (launched in 2015!) which have little or no AVX-512 startup overhead, such as Intel Icelake (2021) or AMD Zen4 (2022);
- ensuring sorts (or anything else using AVX-512) process at least 100 KiB of data, such that the expected speedup outweighs any startup cost.

Any one of these solutions is sufficient to render AVX-512 startup overhead a non-issue.

### Comparison with Intel's x86-simd-sort and vxsort

Our May 2022 paper compared performance with `ips4o` and `std::sort`. We now add results for Intel's
[x86-simd-sort](https://github.com/intel/x86-simd-sort),
released as open source around October 2022, and
[vxsort](https://github.com/damageboy/vxsort-cpp/tree/master).
We find that VQSort is generally about 1.4 times as fast as either, and in a few cases equal or up to 2% slower. Note that vxsort was open-sourced around May 2020; we were unaware of it at the time of writing because it had been published in the form of a blog series.

We imported both from GitHub on 2023-06-06 at about 10:15 UTC. Both are integrated into our bench_sort, running on the same Linux OS and Xeon 6154 CPU mentioned above.

We use uniform random inputs because vxsort and x86-simd-sort appear to have much less robust handling of skewed input distributions: they choose the pivot as the median of three keys, or of 64 bytes, respectively. By contrast, VQSort draws a 384-byte sample and analyzes its distribution, which improves load balance and prevents recursing into all-equal partitions. Lacking this, the other algorithms are more vulnerable to worst cases, so choosing uniform random inputs avoids disadvantaging them.

We sample performance across a range of input sizes and types:

- To isolate the performance of the sorting networks used by all three algorithms, we start with powers of two up to 128. VQSort is generally the fastest for 64-bit keys, with the following exceptions: a tie with vxsort at N=2 (537 MB/s), slower than vxsort at N=16 (2114 vs. 2147 MB/s), and a tie with x86-simd-sort at N=32 (2643 MB/s).
  Note that VQSort is about 1.6 times as fast as both others for N=128, possibly because its 2D structure enables larger networks.

- The `kPow10` mode in bench_sort measures power-of-ten input sizes between 10 and 100K. Note that this covers non-power-of-two sizes, as well as the crossover point between sorting networks and Quicksort recursion. The speedups of VQSort relative to x86-simd-sort range from 1.33 to 1.81 (32-bit keys) and 1.25 to 1.68 (64-bit keys), with geomeans of 1.48 and 1.44. The speedups of VQSort relative to vxsort range from 1.08 to 2.10 (32-bit keys) and 1.00 to 1.47 (64-bit keys), with geomeans of 1.41 and 1.20. Note that vxsort matches VQSort at 10 64-bit elements; in all other cases, VQSort is strictly faster.

- Finally, we study the effect of key type at a fixed input size of 10K elements. x86-simd-sort requires AVX512-VBMI2 for int16, which our CPU does not support. Also, neither of the other algorithms supports 128-bit keys, so we only consider 32/64-bit integer and float types. The results in MB/s are:

|Type|VQSort|x86-simd-sort|vxsort|
|---|---|---|---|
|f32|**1551**| 798| 823|
|f64|**1773**|1147| 745|
|i32|**1509**|1042| 968|
|i64|**1365**|1043|1145|

VQSort is the fastest for each type, in some cases about twice as fast. Interestingly, vxsort performs best on i64, whereas the others perform best on f64. A potential explanation is that this CPU can execute two f64 min/max operations per cycle, but only one for i64.

In conclusion, VQSort is generally more efficient than vxsort and x86-simd-sort across a range of input sizes and types. Occasionally it is up to 2% slower, but the geomean of its speedup (32-bit keys and power-of-ten sizes) is **1.41** vs. vxsort and **1.48** vs. x86-simd-sort.
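For reference, the geomean figures quoted here and above are geometric means of the per-size speedup ratios. A generic helper (a sketch for illustration, not code from bench_sort) might look like:

```c++
#include <cmath>
#include <vector>

// Geometric mean of per-size speedup ratios, e.g. VQSort vs. vxsort across
// the power-of-ten input sizes measured above.
double Geomean(const std::vector<double>& speedups) {
  double sum_log = 0.0;
  for (const double s : speedups) sum_log += std::log(s);
  return std::exp(sum_log / static_cast<double>(speedups.size()));
}
```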