scx_p2dq

Crates.ioscx_p2dq
lib.rsscx_p2dq
version1.0.21
created_at2025-03-06 19:01:24.272559+00
updated_at2025-09-05 23:15:26.487995+00
descriptionscx_p2dq A simple pick two load balancing scheduler in BPF
homepage
repository
max_upload_size
id1581682
size342,100
Tejun Heo (htejun)

documentation

README

scx_p2dq

Overview

A simple pick 2 load balancing scheduler with multi-layer queueing.

The p2dq scheduler is a general purpose scheduler that uses a pick two algorithm for load balancing tasks across last level caches (LLCs) and NUMA nodes. p2dq uses in kernel task dispatch queues (DSQs) and BPF arena based queues (ATQs) depending on configuration.

The scheduler handles all scheduling decisions in BPF and the userspace component is only for metric reporting and some integrations with other subsystems such as power management.

How it works

p2dq has a number of task queues for different purposes (and subject to change). For a high level on how tasks are enqueued see the following diagram for a system that has two LLCs and four CPUs with two CPUs per LLC. For each LLC there are generally three DSQs; the LLC DSQ (dsq), per CPU DSQ, and migration (mig_dsq). Tasks that are considered migration eligible are placed in the migration dsq.

graph TD;
    LLC0[LLC 0]-->id1[(dsq)];
    LLC0[LLC 0]-->id3[(mig_dsq)];
    id1-->cpu0;
    id1-->cpu1;
    id3-->cpu0;
    id3-->cpu1;

    id3-..->cpu2;
    id3-..->cpu3;

    LLC1[LLC 1]-->id4[(mig_dsq)];
    LLC1[LLC 1]-->id5[(dsq)];
    id4-->cpu2;
    id4-->cpu3;
    id5-->cpu2;
    id5-->cpu3;

    id4-..->cpu0;
    id4-..->cpu1;

When a CPU is ready to run a task it will look at all the DSQs and try to find the DSQ with the most eligible task based on virtual runtime (vtime) for the LLC to ensure fairness across the DSQs. If no task is found then the CPU will try to enqueue tasks from the DSQs in the following order: CPU, LLC, and migration.

If no tasks are found then p2dq does pick two load balancing. In the pick two process the load balancer randomly selects two LLCs and compares the relative load. The LLC with the most load is chosen and the migration DSQ is attempted to be consumed. If that fails then the second migraiton DSQ is attempted.

Use Cases

p2dq can perform well in a variety of workloads including interactive workloads such as gaming, desktop, batch processing and server applications. Tuning of of p2dq for each use case is required.

p2dq can also be extended to create other schedulers. One example is scx_chaos, which is a scheduler that can introduce entropy into a system via scheduling. p2dq makes use of BPF arenas for many parts of the scheduler including topology awareness, task state tracking, and task queueing.

Gaming

p2dq can work well as a gaming scheduler with some tuning. A list of relevant options for gaming:

  • --deadline adds deadline scaling of timeslices.
  • --task-slice creates more stable slice durations for better consistency. Overrides other slice scaling methods.
  • --autoslice auto scaling of interactive slice duration based on utilization of interactive tasks.
  • --freq-control for controling CPU frequency with certain drivers.
  • --cpu-priority uses a min-heap to schedule on CPUs based on a score of most recently used and preferred core value.
  • --sched-mode can use the performance mode to schedule on Big cores.
  • --idle-resume-us how long a CPU stays idle before dropping to a lower C-state.

Big/Little Support

p2dq has support for Big/Little architectures and the scheduling can be set with --sched-mode. The performance mode biases to scheduling on big cores. The efficiency mode biases to scheduling on little cores and the default mode schedules interactive tasks on efficiency cores and high throughput tasks on big cores.

Configuration

The main idea behind p2dq is being able to classify which tasks are interactive and using a separate dispatch queue (DSQ) for them. Non interactive tasks can have special properties such as being able to be load balanced across LLCs/NUMA nodes. The --autoslice option will attempt to scale DSQ time slices based on the --interactive-ratio. DSQ time slices can also be set manually if the duration/distribution of tasks that are considered to be interactive is known in advance. scxtop can be used to get an understanding of time slice utilization so that DSQs can be properly configured. For desktop systems keeping the interactive ratio small (ex: <5) and using a small number of queues will give a general performance with autoslice enabled.

Things you probably shouldn't use p2dq for

  • Workload that require advanced cgroup features such as CPU bandwidth throttling.

Support

Create an Issue

FAQ

  • Why does p2dq do poorly in benchmark XYZ?
    • The design of p2dq makes some workloads perform worse than others. In particular if you run benchmarks that spin up load quickly across many cores p2dq may have less throughput due to the load balancing and the "stickiness" of tasks to last level caches (LLCs). If you run the system at full CPU saturation (100% CPU utilization) then p2dq may perform worse compared to other schedulers due to the scheduler overhead.
    • Some benchmarks really prefer when tasks stay sticky to their current CPU. The general design for p2dq is to enqueue tasks to run on queues (DSQs/ATQs) at the LLC level. This can improve work conservation, but if the workload is heavily dependent on the iTLB then p2dq may perform worse than other schedulers.
  • Is p2dq production ready?
    • p2dq has been tested on various services at Meta and has shown similar performance to EEVDF without causing incidents.
  • What features are missing?
    • Hotplug support is not supported if you hotplug a CPU then the scheduler will get kicked.
    • Advanced features such as cgroup CPU bandwidth throttling.
  • What things sort of work?
    • Big/Little core support.
    • Power management features (idle resume).
Commit count: 0

cargo fmt