.. _Bandwidth_and_Cache_Affinity:

Bandwidth and Cache Affinity
============================

For a sufficiently simple function ``Foo``, the examples might not show good
speedup when written as parallel loops. The cause could be insufficient system
bandwidth between the processors and memory. In that case, you may have to
rethink your algorithm to take better advantage of cache. Restructuring to
better utilize the cache usually benefits the parallel program as well as the
serial program.

An alternative to restructuring that works in some cases is
``affinity_partitioner``. It not only automatically chooses the grainsize, but
also optimizes for cache affinity and tries to distribute the data uniformly
among threads. Using ``affinity_partitioner`` can significantly improve
performance when:

- The computation does a few operations per data access.

- The data acted upon by the loop fits in cache.

- The loop, or a similar loop, is re-executed over the same data.

- There are more than two hardware threads available (and especially if the
  number of threads is not a power of two). If only two threads are
  available, the default scheduling in |full_name| usually provides
  sufficient cache affinity.

The following code shows how to use ``affinity_partitioner``.

::

   #include "oneapi/tbb.h"

   void ParallelApplyFoo( float a[], size_t n ) {
       // The same affinity_partitioner instance must be passed each time
       // the loop runs, so it is declared static.
       static affinity_partitioner ap;
       parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), ap);
   }

   void TimeStepFoo( float a[], size_t n, int steps ) {
       for( int t=0; t<steps; ++t )
           ParallelApplyFoo( a, n );
   }

Notice revision #20201201
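The advice above to restructure an algorithm for better cache use applies even
without oneTBB. As a minimal standard-C++ sketch (no TBB required, names are
illustrative), the following compares a naive matrix transpose, which strides
through one matrix column-wise and thrashes the cache for large ``n``, with a
cache-blocked version that works on tiles small enough to stay resident:

.. code:: cpp

   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Naive transpose: writes to b stride-n apart, a cache-unfriendly access
   // pattern once a row of b no longer fits in cache.
   void transpose_naive(const std::vector<float>& a, std::vector<float>& b,
                        std::size_t n) {
       for (std::size_t i = 0; i < n; ++i)
           for (std::size_t j = 0; j < n; ++j)
               b[j * n + i] = a[i * n + j];
   }

   // Blocked transpose: processes B x B tiles so both the source and the
   // destination tile fit in cache while being touched.
   void transpose_blocked(const std::vector<float>& a, std::vector<float>& b,
                          std::size_t n) {
       const std::size_t B = 64;  // tile edge; tune to the cache size
       for (std::size_t ii = 0; ii < n; ii += B)
           for (std::size_t jj = 0; jj < n; jj += B)
               for (std::size_t i = ii; i < ii + B && i < n; ++i)
                   for (std::size_t j = jj; j < jj + B && j < n; ++j)
                       b[j * n + i] = a[i * n + j];
   }

   int main() {
       const std::size_t n = 300;
       std::vector<float> a(n * n), b1(n * n), b2(n * n);
       for (std::size_t k = 0; k < a.size(); ++k)
           a[k] = static_cast<float>(k);
       transpose_naive(a, b1, n);
       transpose_blocked(a, b2, n);
       // Both versions compute the same result; only the memory access
       // pattern (and therefore cache behavior) differs.
       assert(b1 == b2);
       return 0;
   }

The same restructuring carries over to the parallel case: a cache-friendly
serial loop body generally remains cache-friendly when handed to
``parallel_for``.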