# IsoAlloc Performance

## The Basics

Performance is a top priority for any memory allocator. Balancing those performance priorities with security is difficult. Anytime we introduce heavy-handed security checks we usually have to pay a performance penalty. But we can be deliberate about our design choices and use good tooling to find places where we can mitigate those tradeoffs. In the end correct code is better than endless security mitigations.

## Configuration and Optimizations

IsoAlloc is only designed and tested for 64 bit, so we don't have to worry about portability hurting our performance. We can assume a large address space will always be present and we can optimize for simple things like always fetching 64 bits of memory as we iterate over an array. The remainder of this section is a basic overview of the performance optimizations in IsoAlloc.

Perhaps the most important optimization in IsoAlloc is the design choice to use a simple bitmap for tracking chunk states. Combining this with zones comprised of contiguous chunks of pages results in good performance at the cost of memory. This is in contrast to typical allocator designs full of linked list code that tends to result in far more complex code, slow page faults, and exploitable designs. All data fetches from a zone bitmap are 64 bits at a time, which takes advantage of fast CPU pipelining. Fetching bits at a different bit width results in performance that is slower by an order of magnitude in allocation-intensive tests. All user chunks are 8 byte aligned no matter how big each chunk is. Accessing this memory with proper alignment will minimize CPU cache flushes.

When `PRE_POPULATE_PAGES` is enabled in the Makefile, global caches, the root, and zone bitmaps (but not the pages that hold user data) are created with `MAP_POPULATE`, which instructs the kernel to pre-populate the page tables. This reduces page faults and results in better performance. Note that by default at zone creation time user pages will have canaries written at random aligned offsets. This will cause page faults and populate those PTEs when the pages are first written to, whether those pages are ever used at runtime or not. Disabling canaries will result in a lower RSS and faster runtime performance.

The `MAX_ZONES` value in `conf.h` limits the total number of zones that can be allocated at runtime. If your program is being killed with OOM errors you can safely increase this value up to a maximum of 65535. However, increasing it will result in a larger allocation for the `root->zones` array, which holds meta data for each zone whether that zone is currently mapped and in use or not. To calculate the total number of bytes available for allocations you can do (`MAX_ZONES * ZONE_USER_SIZE`). Note that `ZONE_USER_SIZE` is not configurable in `conf.h`.

Default zones for common sizes are created in the library constructor. This helps speed up allocations for long running programs. New zones are created on demand when needed, but this will incur a small performance penalty in the allocation path. All chunk sizes are multiples of 32 with a minimum value of `SMALLEST_CHUNK_SZ` (32 by default; alignment and smallest chunk size should be in sync) and a maximum value of `SMALL_SIZE_MAX` (65536 by default). In a configuration with `SMALL_SIZE_MAX` set to 65536, zones will only be created for 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, and 65536. You can increase `SMALL_SIZE_MAX` up to 131072. If you choose to change this value, be mindful of the pages it can waste (e.g. allocating a chunk of 16385 bytes will result in returning a chunk of 32768 bytes).
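To make that waste concrete, here is a minimal sketch of rounding a request up to the next power-of-two size class, assuming the size classes listed above. The helper below is purely illustrative and is not IsoAlloc's internal code.

```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper, not IsoAlloc internals: round a request up to the
 * next power-of-two size class to show where the wasted bytes come from */
static uint64_t next_size_class(uint64_t size) {
    uint64_t sz = 16;

    while(sz < size) {
        sz <<= 1;
    }

    return sz;
}

int main(void) {
    /* A 16385 byte request is satisfied by a 32768 byte chunk */
    printf("%" PRIu64 " -> %" PRIu64 "\n", (uint64_t) 16385, next_size_class(16385));
    return 0;
}
```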
Everything above `SMALL_SIZE_MAX` is allocated by the big zone path, which has a limitation of 4 GB and a size granularity that is only limited by page size alignment. Big zones have meta data allocated separately. Guard pages for this meta data can be enabled or disabled using `BIG_ZONE_META_DATA_GUARD`. Likewise `BIG_ZONE_GUARD` can be used to enable or disable guard pages for big zone user data pages.

By default user chunks are not sanitized upon free. Sanitizing chunks helps mitigate uninitialized memory vulnerabilities, but it is a very slow operation. You can enable this feature by changing the `SANITIZE_CHUNKS` flag in the Makefile.

When `USE_MLOCK` is enabled in the Makefile the meta data for the IsoAlloc root and other caches will be locked with `mlock`. This means this data will never be swapped to disk. We do this because these data structures are required for both the alloc and free hot paths. This operation may fail if we are running inside a container with memory limits. Failure to lock the memory will not cause an abort and all failures will be silently ignored.

By default IsoAlloc randomizes the freelist per zone, which results in a small performance hit. You can disable this by setting `RANDOMIZE_FREELIST` to 0 in the Makefile.

When enabled, `USE_SPINLOCK` will use spinlocks via `atomic_flag` instead of a pthread mutex. Performance and load testing of IsoAlloc has shown spinlocks are slightly slower than a mutex, so it is not the preferred default option.

If you know your program will not require multi-threaded access to IsoAlloc you can disable threading support by setting the `THREAD_SUPPORT` define to 0 in the Makefile. This will remove all atomic/mutex lock/unlock operations from the allocator, which will result in significant performance gains in some programs. If you do require thread support then you may want to profile your program to determine what default zone sizes will benefit performance.

`DISABLE_CANARY` can be set to 1 to disable the creation and verification of canary chunks. This removes a useful security feature but will significantly improve performance and RSS.

By default IsoAlloc will attempt to use Huge Pages (on both Linux and Mac OS) for any allocations that are a multiple of 2 MB in size. This is the default huge page size on most systems but it might not be on yours. On Linux you can check the value for your system by running the following command:

```
cat /proc/meminfo | grep Hugepagesize
Hugepagesize:       2048 kB
```

If necessary you can adjust the value of `HUGE_PAGE_SZ` in `conf.h` to reflect the size on your system.

## Caches and Memoization

There are a few important caches and memoization techniques used in IsoAlloc. These significantly improve the performance of the alloc/free hot paths and keep the design as simple as possible.

### Zone Free List

Each zone contains an array of bitslots that represent free chunks in that zone. The allocation hot path searches this list first for a free chunk that can satisfy the allocation request. Allocating chunks from this list is a lot faster than iterating through a zone's bitmap for a free bitslot. This cache is refilled whenever it is low. Freed chunks are added to this list after they've been through the quarantine.
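Below is a minimal, single-threaded sketch of this free list caching idea. The `toy_zone_t` structure, names, and sizes are invented for illustration and are far simpler than the real zone layout; this toy tracks each chunk with a single bit and ignores canaries and the quarantine entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define FREE_LIST_SZ 256
#define BITMAP_QWORDS 1024

typedef struct {
    uint64_t bitmap[BITMAP_QWORDS];   /* 1 bit per chunk, 0 = free, 1 = in use */
    uint64_t free_list[FREE_LIST_SZ]; /* cached indexes of known-free chunks */
    size_t free_count;
} toy_zone_t;

/* Refill the cache by scanning the bitmap 64 bits at a time */
static void refill_free_list(toy_zone_t *zone) {
    zone->free_count = 0;

    for(size_t i = 0; i < BITMAP_QWORDS && zone->free_count < FREE_LIST_SZ; i++) {
        uint64_t qword = zone->bitmap[i];

        if(qword == UINT64_MAX) {
            /* Every chunk tracked by this qword is in use, skip it */
            continue;
        }

        for(size_t bit = 0; bit < 64 && zone->free_count < FREE_LIST_SZ; bit++) {
            if(((qword >> bit) & 1) == 0) {
                zone->free_list[zone->free_count++] = (i * 64) + bit;
            }
        }
    }
}

/* Allocation hot path: pop a cached slot if one is available, refill if not */
static int64_t get_free_slot(toy_zone_t *zone) {
    if(zone->free_count == 0) {
        refill_free_list(zone);

        if(zone->free_count == 0) {
            return -1; /* zone is full */
        }
    }

    uint64_t slot = zone->free_list[--zone->free_count];
    zone->bitmap[slot / 64] |= (1ULL << (slot % 64));
    return (int64_t) slot;
}

int main(void) {
    toy_zone_t zone = {0};
    return (get_free_slot(&zone) >= 0) ? 0 : 1;
}
```

The hot path touches the small cached array in the common case and only falls back to the full bitmap scan when the cache runs dry.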
### Chunk to Zone Lookup

The chunk-to-zone lookup table is a high hit rate cache for finding which zone owns a user chunk. It works by mapping the MSB of the chunk address to a zone index. Misses are handled gracefully and become more common with a higher RSS and more mappings.

### MRU Zone Cache

It is not uncommon to write a program that uses multiple threads for different purposes. Some threads will never make an allocation request above or below a certain size. This thread local cache optimizes for this by storing a TLS array of the thread's most recently used zones. These zones are checked in the `iso_find_zone_range` free path if the chunk-to-zone lookup fails.

### Thread Chunk Quarantine

This thread local cache speeds up the free hot path by quarantining chunks until a threshold has been met. Until that threshold is reached frees are very cheap. When the cache is emptied and the chunks are freed it's still faster because we take advantage of keeping the zone meta data in a CPU cache line.

### Zone Lookup Table

Zones are linked by their `next_sz_index` member, which tells the allocator where in the `_root->zones` array it can find the next zone that holds the same size chunks. This lookup table helps us find the first zone that holds a specific size in O(1) time. This is achieved by placing a zone's index value at that zone's size index in the table, e.g. `zone_lookup_table[zone->size] = zone->index`. From there we just need to use the next zone's index member and walk it like a singly linked list to find other zones of that size. Zones are added to the front of the list as they are created.
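Here is a simplified sketch of that lookup pattern. The types, constants, and helper names are invented for illustration and do not match IsoAlloc's real definitions; the point is the O(1) head lookup followed by a singly linked list walk over `next_sz_index`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins, not IsoAlloc's real types or sizes */
#define TOY_MAX_ZONES 1024
#define TOY_SIZE_MAX 65536

typedef struct {
    size_t size;            /* chunk size this zone holds */
    uint16_t index;         /* this zone's slot in the zones array */
    uint16_t next_sz_index; /* index of the next zone holding the same size */
    bool has_next;          /* whether next_sz_index is valid */
} zone_meta_t;

static zone_meta_t zones[TOY_MAX_ZONES];
/* chunk size -> index of the first zone holding that size */
static uint16_t zone_lookup_table[TOY_SIZE_MAX + 1];

/* New zones are pushed onto the front of the per-size list */
void link_new_zone(zone_meta_t *zone) {
    zone_meta_t *head = &zones[zone_lookup_table[zone->size]];

    /* If a zone of this size already exists the new zone points at it */
    if(head != zone && head->size == zone->size) {
        zone->next_sz_index = head->index;
        zone->has_next = true;
    }

    zone_lookup_table[zone->size] = zone->index;
}

/* O(1) jump to the first zone for this size; an empty table slot reads
 * as 0, so verify the size actually matches before trusting it */
zone_meta_t *first_zone_for_size(size_t size) {
    zone_meta_t *zone = &zones[zone_lookup_table[size]];
    return (zone->size == size) ? zone : NULL;
}

/* Walk every zone holding `size`, e.g. while searching for a free chunk */
void for_each_zone_of_size(size_t size, void (*cb)(zone_meta_t *)) {
    zone_meta_t *zone = first_zone_for_size(size);

    while(zone != NULL) {
        cb(zone);
        zone = zone->has_next ? &zones[zone->next_sz_index] : NULL;
    }
}
```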
## ARM NEON

IsoAlloc tries to use ARM NEON instructions to make loops in the hot path faster. If you want to disable any of this code at compile time and rely instead on more portable, but potentially slower, code just set `DONT_USE_NEON` to 1 in the Makefile.

## Tests

I've spent a good amount of time testing IsoAlloc to ensure it's reasonably fast compared to glibc/ptmalloc. But it is impossible for me to model or simulate all the different states a program that uses IsoAlloc may be in. This section briefly covers the existing performance related tests for IsoAlloc and the data I have collected so far.

The `malloc_cmp_test` build target will build 2 different versions of the test/tests.c program, which runs roughly 1.4 million alloc/calloc/realloc operations. We free every other allocation and then free the remaining ones once the allocation loop is complete. This helps simulate some fragmentation in the heap. On average IsoAlloc is faster than ptmalloc but the difference is so close it will likely not be noticeable.

The following test was run in an Ubuntu 20.04.3 LTS (Focal Fossa) ARM64 Docker container with libc version 2.31-0ubuntu9.2 on a macOS host. The kernel used was `Linux f7f23ca7dc44 5.10.76-linuxkit`.

```
Running IsoAlloc Performance Test
build/tests
iso_alloc/iso_free 1441616 tests completed in 0.168293 seconds
iso_calloc/iso_free 1441616 tests completed in 0.171274 seconds
iso_realloc/iso_free 1441616 tests completed in 0.231350 seconds

Running glibc/ptmalloc Performance Test
malloc/free 1441616 tests completed in 0.166813 seconds
calloc/free 1441616 tests completed in 0.223232 seconds
realloc/free 1441616 tests completed in 0.306684 seconds

Running jemalloc Performance Test
LD_PRELOAD=/code/mimalloc-bench/extern/jemalloc/lib/libjemalloc.so build/malloc_tests
malloc/free 1441616 tests completed in 0.064520 seconds
calloc/free 1441616 tests completed in 0.178228 seconds
realloc/free 1441616 tests completed in 0.271620 seconds

Running mimalloc Performance Test
LD_PRELOAD=/code/mimalloc-bench/extern/mimalloc/out/release/libmimalloc.so build/malloc_tests
malloc/free 1441616 tests completed in 0.085471 seconds
calloc/free 1441616 tests completed in 0.099644 seconds
realloc/free 1441616 tests completed in 0.143821 seconds

Running mimalloc-secure Performance Test
LD_PRELOAD=/code/mimalloc-bench/extern/mimalloc/out/secure/libmimalloc-secure.so build/malloc_tests
malloc/free 1441616 tests completed in 0.128479 seconds
calloc/free 1441616 tests completed in 0.148797 seconds
realloc/free 1441616 tests completed in 0.191719 seconds

Running tcmalloc Performance Test
LD_PRELOAD=/code/mimalloc-bench/extern/gperftools/.libs/libtcmalloc_minimal.so build/malloc_tests
malloc/free 1441616 tests completed in 0.093779 seconds
calloc/free 1441616 tests completed in 0.103634 seconds
realloc/free 1441616 tests completed in 0.131152 seconds

Running scudo Performance Test
LD_PRELOAD=/code/mimalloc-bench/extern/scudo/compiler-rt/lib/scudo/standalone/libscudo.so build/malloc_tests
malloc/free 1441616 tests completed in 0.227757 seconds
calloc/free 1441616 tests completed in 0.204610 seconds
realloc/free 1441616 tests completed in 0.258962 seconds
```

The same test run on an AWS t2.xlarge Ubuntu 20.04 instance with 4 `Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz` CPUs and 16 GB of memory:

```
Running IsoAlloc Performance Test
iso_alloc/iso_free 1441616 tests completed in 0.147336 seconds
iso_calloc/iso_free 1441616 tests completed in 0.161482 seconds
iso_realloc/iso_free 1441616 tests completed in 0.244981 seconds

Running glibc malloc Performance Test
malloc/free 1441616 tests completed in 0.182437 seconds
calloc/free 1441616 tests completed in 0.246065 seconds
realloc/free 1441616 tests completed in 0.332292 seconds
```

Here is the same test as above on Mac OS 12.1:

```
Running IsoAlloc Performance Test
build/tests
iso_alloc/iso_free 1441616 tests completed in 0.149818 seconds
iso_calloc/iso_free 1441616 tests completed in 0.183772 seconds
iso_realloc/iso_free 1441616 tests completed in 0.274413 seconds

Running system malloc Performance Test
build/malloc_tests
malloc/free 1441616 tests completed in 0.084803 seconds
calloc/free 1441616 tests completed in 0.194901 seconds
realloc/free 1441616 tests completed in 0.240934 seconds
```

This same test can be used with the `perf` utility to measure basic stats like page faults and CPU utilization using both heap implementations. The output below is from the same AWS t2.xlarge instance as above.
```
$ perf stat build/tests
iso_alloc/iso_free 1441616 tests completed in 0.416603 seconds
iso_calloc/iso_free 1441616 tests completed in 0.575822 seconds
iso_realloc/iso_free 1441616 tests completed in 0.679546 seconds

 Performance counter stats for 'build/tests':

        1709.07 msec task-clock          # 1.000 CPUs utilized
              7      context-switches    # 0.004 K/sec
              0      cpu-migrations      # 0.000 K/sec
         145562      page-faults         # 0.085 M/sec

    1.709414837 seconds time elapsed

    1.405068000 seconds user
    0.304239000 seconds sys

$ perf stat build/malloc_tests
malloc/free 1441616 tests completed in 0.359380 seconds
calloc/free 1441616 tests completed in 0.569044 seconds
realloc/free 1441616 tests completed in 0.597936 seconds

 Performance counter stats for 'build/malloc_tests':

        1528.51 msec task-clock          # 1.000 CPUs utilized
              5      context-switches    # 0.003 K/sec
              0      cpu-migrations      # 0.000 K/sec
         433055      page-faults         # 0.283 M/sec

    1.528795324 seconds time elapsed

    0.724352000 seconds user
    0.804371000 seconds sys
```

The following benchmarks were collected from [mimalloc-bench](https://github.com/daanx/mimalloc-bench) with the default configuration of IsoAlloc. As you can see from the data, IsoAlloc is competitive with jemalloc, tcmalloc, and glibc/ptmalloc for some benchmarks but clearly falls behind in the Redis benchmark. For any benchmark that IsoAlloc scores poorly on I was able to tweak its build to improve the CPU time and memory consumption. It's worth noting that IsoAlloc was able to stay competitive even while performing many security checks not present in other allocators. Please note these are 'best case' measurements, not averages.

```
# benchmark allocator elapsed rss user sys page-faults page-reclaims
cfrac jemalloc 03.47 3948 3.46 0.00 0 422
cfrac mimalloc 03.19 2688 3.18 0.00 0 337
cfrac smimalloc 03.57 2860 3.56 0.00 0 375
cfrac tcmalloc 03.25 7392 3.24 0.00 0 1325
cfrac scudo 06.00 3920 5.99 0.00 0 561
cfrac isoalloc 05.69 12920 5.58 0.10 0 3016
espresso jemalloc 03.61 4508 3.56 0.00 5 553
espresso mimalloc 03.43 3828 3.40 0.01 1 1299
espresso smimalloc 03.65 5760 3.60 0.01 0 2682
espresso tcmalloc 03.43 8132 3.39 0.01 0 1485
espresso scudo 04.53 4028 4.49 0.01 0 514
espresso isoalloc 04.59 48984 4.49 0.07 0 24276
barnes jemalloc 01.93 59412 1.91 0.01 3 16646
barnes mimalloc 01.91 57860 1.89 0.01 0 16539
barnes smimalloc 01.98 57928 1.96 0.01 0 16557
barnes tcmalloc 01.91 62664 1.89 0.01 0 17515
barnes scudo 01.92 58940 1.91 0.01 0 16595
barnes isoalloc 01.92 58328 1.91 0.01 0 16714
redis jemalloc 5.019 31280 2.37 0.17 0 7268
redis mimalloc 4.487 29204 2.07 0.20 0 6825
redis smimalloc 4.909 30992 2.28 0.20 0 7410
redis tcmalloc 4.675 37336 2.17 0.20 0 8682
redis scudo 6.105 36968 2.85 0.23 0 8623
redis iso 7.967 105332 3.48 0.54 0 112953
cache-thrash1 jemalloc 01.28 3648 1.27 0.00 1 240
cache-thrash1 mimalloc 01.28 3408 1.28 0.00 0 197
cache-thrash1 smimalloc 01.28 3256 1.27 0.00 0 202
cache-thrash1 tcmalloc 01.27 7100 1.26 0.00 0 1127
cache-thrash1 scudo 01.27 3240 1.26 0.00 0 200
cache-thrash1 isoalloc 01.26 3460 1.26 0.00 0 363
cache-thrashN jemalloc 00.21 3936 1.64 0.00 0 360
cache-thrashN mimalloc 00.21 3516 1.63 0.00 0 239
cache-thrashN smimalloc 00.22 3584 1.68 0.01 0 249
cache-thrashN tcmalloc 02.74 6992 20.36 0.00 0 1151
cache-thrashN scudo 00.61 3164 2.53 0.00 0 237
cache-thrashN isoalloc 00.71 4032 5.63 0.00 0 472
larsonN jemalloc 4.892 84172 39.71 0.20 1 52478
larsonN mimalloc 4.360 98504 39.61 0.17 0 26372
larsonN smimalloc 6.546 105724 39.77 0.16 3 27432
larsonN tcmalloc 4.450 63464 39.57 0.21 0 15299
larsonN scudo 44.707 33104 28.92 4.80 0 7826
larsonN isoalloc 249.791 63996 7.09 17.51 0 15567
larsonN-sized jemalloc 4.872 84428 39.56 0.22 1 52874
larsonN-sized mimalloc 4.335 95388 39.82 0.13 0 25625
larsonN-sized smimalloc 6.332 106372 39.71 0.17 0 27642
larsonN-sized tcmalloc 4.230 64956 39.59 0.15 0 15669
larsonN-sized scudo 44.601 32900 28.68 4.65 0 7793
larsonN-sized isoalloc 363.176 70240 39.59 0.29 0 17222
mstressN jemalloc 00.92 139772 2.10 1.76 1 984466
mstressN mimalloc 00.43 352132 1.56 0.15 0 88171
mstressN smimalloc 00.62 352204 1.85 0.67 0 95538
mstressN tcmalloc 00.51 147680 1.80 0.25 0 37111
mstressN scudo 01.38 142068 3.23 1.63 0 616639
mstressN isoalloc 03.11 225352 4.90 5.91 0 722991
xmalloc-testN jemalloc 2.307 64460 25.13 5.14 1 22975
xmalloc-testN mimalloc 0.513 82212 36.03 1.05 0 26689
xmalloc-testN smimalloc 0.857 73504 36.58 1.05 0 28285
xmalloc-testN tcmalloc 6.055 40824 9.31 18.77 0 9642
xmalloc-testN scudo 13.416 56708 10.06 14.30 0 16560
xmalloc-testN isoalloc 10.049 15268 7.19 20.91 0 3668
glibc-simple jemalloc 01.96 2984 1.95 0.00 1 313
glibc-simple mimalloc 01.50 1900 1.49 0.00 0 212
glibc-simple smimalloc 01.77 2032 1.76 0.00 0 229
glibc-simple tcmalloc 01.52 6880 1.52 0.00 0 1212
glibc-simple scudo 04.58 2776 4.58 0.00 0 281
glibc-simple isoalloc 04.21 10892 4.12 0.09 0 4674
glibc-thread jemalloc 6.772 4160 15.98 0.00 1 457
glibc-thread mimalloc 3.759 3320 15.98 0.00 0 585
glibc-thread smimalloc 9.012 17144 15.89 0.02 0 4018
glibc-thread tcmalloc 10.434 8508 15.99 0.00 0 1580
glibc-thread scudo 80.979 4076 15.90 0.01 0 582
glibc-thread isoalloc 374.692 2240 2.56 5.14 0 348
```

IsoAlloc isn't quite ready for performance-sensitive server workloads. However, it's more than fast enough for client-side mobile/desktop applications with risky C/C++ attack surfaces. These environments have threat models similar to what IsoAlloc was designed for.