Pthread WorkQueue Regulator
===========================
The Pthread Workqueue Regulator is meant to help userland regulate thread
pools based on the actual number of running threads, the capacity of the
machine, the number of blocked threads, and so on.
In the kernel, threads registered in the pwq regulator can be in one of the
following states:
This is the state of threads that are currently blocked in a syscall.
This is the state of threads that are either really running, or have
been preempted out by the kernel. In other words it's the number of

This is the state of threads that are currently in a `PWQR_WAIT` call
from userspace (see `pwqr_ctl`) but that would not overcommit if
released by a `PWQR_WAKE` call.

This is the state of threads that are currently in a `PWQR_WAIT` call
from userspace (see `pwqr_ctl`) but that would overcommit if released
by a `PWQR_WAKE` call.
This state avoids waking a thread just to force userland to "park" it, which
would be racy and would make the scheduler work for nothing useful. However,
if `PWQR_WAKE` is called, quarantined threads are woken, but with errno set
to `EDQUOT`.
This is the state of threads currently in a `PWQR_PARK` call from
userspace (see `pwqr_ctl`).
The regulator tries to maintain the following invariant:

    running + waiting == target_concurrency
    || (running + waiting < target_concurrency && waiting > 0)
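
As a rough illustration, the invariant can be written as a C predicate. The
function name `pwqr_invariant_holds` and its parameters are hypothetical; they
only mirror the names used in the expression above and are not taken from the
kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the pwqr API): returns true when the
 * invariant described in the text holds for the given counters. */
static bool pwqr_invariant_holds(int running, int waiting,
                                 int target_concurrency)
{
    int active = running + waiting;

    return active == target_concurrency
        || (active < target_concurrency && waiting > 0);
}
```

For instance, 3 running threads and 1 waiting thread satisfy the invariant for
a target of 4, while 3 running threads and no waiting thread do not.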
When `running + waiting` overcommits::
The kernel puts waiting threads into the quarantine, which doesn't
require anything from userland. It's something userland discovers only
when it needs a waiting thread, which may never happen.

If there are no waiting threads, then the workqueue simply overcommits, and
that's one of the TODO items at the moment (see Notes).
When `running + waiting` undercommits::
If waiting is non-zero then we don't care: it means that userland
doesn't actually need more work to be performed.

If waiting is zero, then a parked thread (if there is one) is woken up so
that userland has a chance to consume jobs.
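
The two cases above can be sketched as a small decision function. This is only
a model under stated assumptions: the enum, the function `model_regulate` and
its parameters are hypothetical names chosen for illustration, and the real
regulator may decide differently (see the notes on delays and overcommit).

```c
/* Hypothetical actions, named for illustration only. */
enum model_action {
    MODEL_ACT_NONE,        /* nothing to do */
    MODEL_ACT_QUARANTINE,  /* overcommit: quarantine waiting threads */
    MODEL_ACT_UNPARK,      /* undercommit with no waiter: wake a parked thread */
};

/* Model of the regulation rule described in the text. */
static enum model_action model_regulate(int running, int waiting,
                                        int parked, int target)
{
    int active = running + waiting;

    if (active > target) {
        /* With no waiting thread the pool simply overcommits. */
        return waiting > 0 ? MODEL_ACT_QUARANTINE : MODEL_ACT_NONE;
    }
    if (active < target && waiting == 0 && parked > 0)
        return MODEL_ACT_UNPARK;

    return MODEL_ACT_NONE;
}
```

Note that an undercommit with at least one waiting thread yields no action,
which matches the "we don't care" case.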
Unparking threads only when waiting becomes zero avoids flip-flops when the
job flow is small and some of the running threads sometimes block (IOW
running sometimes decreases, making `running + waiting` fall below the target
concurrency for a very short amount of time).
The regulation between running and waiting threads is left to userspace, which
is a far better judge than kernel land, which has absolutely no knowledge of
the current workload. Also, doing so means that when there are lots of jobs to
process and the pool has a size that doesn't require more regulation, the
kernel isn't called for mediation/regulation AT ALL.
NOTE: right now threads are unparked as soon as `running + waiting`
undercommits; some delay should be applied to be sure it's not a really
short blocking syscall that made us undercommit.
NOTE: when we're overcommitting for a "long" time, userspace should be
notified in some way that it should try to reduce its number of running
threads. Note that the Apple implementation (before Lion at least) has the
same issue. Though if you imagine someone who spawns a zillion jobs that make
very slow `msync()` calls or blocking `read()` calls over the network, and all
of those then go back to the running state, the overcommit is huge.
A way to mitigate this at the moment is for userspace, when it believes the
number of threads is abnormally high, to periodically try to PARK the threads.
If that blocks the thread, then we were overcommitting. Note that this may be
a better solution than a kernel-side implementation. To be thought
int pwqr_create(int flags);

This call returns a new PWQR file-descriptor. The regulator is initialized
with a concurrency corresponding to the number of online CPUs at the time of
the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`.
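
For reference, the value mentioned above can be queried from userspace as
follows. The helper name `online_cpus` is hypothetical, and the fallback to 1
when `sysconf` fails is an assumption made for this sketch, not documented
regulator behaviour.

```c
#include <unistd.h>

/* Hypothetical helper: number of online CPUs, as userspace would see it. */
static long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);

    /* Assumption: fall back to 1 if the value is unavailable. */
    return n > 0 ? n : 1;
}
```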
A mask of flags; currently only `O_CLOEXEC` is supported.

On success, this call returns a nonnegative file descriptor.
On error, -1 is returned, and errno is set to indicate the error.
Invalid value specified in flags.

The system limit on the total number of open files has been reached.

There was insufficient memory to create the kernel object.

int pwqr_ctl(int pwqrfd, int op, int val, void *addr);

This system call performs control operations on the pwqr instance referred to
by the file descriptor `pwqrfd`.

Valid values for the `op` argument are:
Requests the current concurrency level for this regulator.

Modifies the current concurrency level for this regulator. The new
value is passed as the `val` argument. The request returns the old
concurrency level on success.
A zero or negative value for `val` means 'automatic' and is recomputed
as the current number of online CPUs as
`sysconf(_SC_NPROCESSORS_ONLN)` would return.

Registers the calling thread to be taken into account by the pool
regulator. If the thread is already registered into another regulator,
then it's automatically unregistered from it.

Deregisters the calling thread from the pool regulator.
Tries to wake `val` threads from the pool. This is done according to
the current concurrency level, so as not to overcommit. On success, the
number of woken threads is returned; it can be 0.

Tries to wake `val` threads from the pool. This is done bypassing the
current concurrency level (`OC` stands for `OVERCOMMIT`). On success,
the number of woken threads is returned; it can be 0.
Puts the thread to wait for a future `PWQR_WAKE` command. If this
thread must be parked to maintain concurrency below the target, then
the call blocks with no further ado.

If the concurrency level is below the target, then the kernel checks if the
address `addr` still contains the value `val` (in the fashion of `futex(2)`).
If it doesn't then the call doesn't block. Else the calling thread is blocked
until a `PWQR_WAKE` command is received.

Puts the thread in park mode. Those are spare threads to avoid
cloning/exiting threads when the pool is regulated. Those threads are
released by the regulator only, and can only be woken from userland
with the `PWQR_WAKE_OC` command, and once all waiting threads have
The call blocks until an overcommitting wake requires the thread, or the
kernel regulator needs to grow the pool with new running threads.
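
The wake and wait semantics described above can be modelled in plain C. This
is a sketch under stated assumptions, not the kernel's logic: every name below
(`model_wake`, `model_wake_oc`, `model_wait_check` and the enum) is
hypothetical, `val` is assumed to be nonnegative, parked threads are ignored
by the overcommitting wake, and the wait check only covers the case where the
concurrency level is below the target.

```c
#include <stdatomic.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Model of `PWQR_WAKE`: never lets the running count exceed the target. */
static int model_wake(int val, int running, int waiting, int target)
{
    int room = target > running ? target - running : 0;

    return min_int(val, min_int(waiting, room));
}

/* Model of `PWQR_WAKE_OC`: bounded only by the number of waiting threads
 * (assumption: parked threads are left out of this sketch). */
static int model_wake_oc(int val, int waiting)
{
    return min_int(val, waiting);
}

enum model_wait_result {
    MODEL_WAIT_BLOCKS,         /* *addr == val: sleep until a wake arrives */
    MODEL_WAIT_VALUE_CHANGED,  /* *addr != val: return without blocking */
};

/* Model of the futex-like check done by `PWQR_WAIT`. */
static enum model_wait_result model_wait_check(atomic_int *addr, int val)
{
    if (atomic_load(addr) != val)
        return MODEL_WAIT_VALUE_CHANGED;

    return MODEL_WAIT_BLOCKS;
}
```

The value check lets userland publish new work by changing `*addr` before
waking, so that a thread about to wait on a stale value returns immediately
instead of missing the wake-up.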
When successful, `pwqr_ctl` returns a nonnegative value.
On error, -1 is returned, and errno is set to indicate the error.
`pwqrfd` is not a valid file descriptor.

`pwqrfd` is a valid pwqr file descriptor but is in a broken state: it
has been closed while other threads were in a `pwqr_ctl` call.
NOTE: this is due to the current implementation and would probably not be here

Error in reading value from `addr` from userspace.

Errors specific to `PWQR_REGISTER`:

There was insufficient memory to perform the operation.

Errors specific to `PWQR_WAIT`:
When the kernel checked whether `addr` still contained `val`, it did not.
This works like `futex(2)`.
Errors specific to `PWQR_WAIT` and `PWQR_PARK`:

The call was interrupted by a signal (note that sometimes the kernel
masks this fact when it has more important "errors" to report like

The thread has been woken by a `PWQR_WAKE` or `PWQR_WAKE_OC` call, but