Pthread WorkQueue Regulator
===========================
The Pthread Workqueue Regulator is meant to help userland regulate thread
pools based on the actual number of threads that are running, the capacity of
the machine, the number of blocked threads, and so on.
In the kernel, threads registered in the pwq regulator can be in five states:

blocked::
    This is the state of threads that are currently blocked in a syscall.

running::
    This is the state of threads that are either actually running or have
    been preempted by the kernel scheduler.
waiting::
    This is the state of threads that are currently in a `PWQR_CTL_WAIT`
    call from userspace (see `pwqr_ctl`) but that would not overcommit if
    released by a `PWQR_CTL_WAKE` call.
quarantined::
    This is the state of threads that are currently in a `PWQR_CTL_WAIT`
    call from userspace (see `pwqr_ctl`) but that would overcommit if
    released by a `PWQR_CTL_WAKE` call.

This state avoids waking a thread only to force userland to "park" it again:
that is racy and makes the scheduler work for nothing useful. If
`PWQR_CTL_WAKE` is called anyway, quarantined threads are woken, but with
errno set to `EDQUOT`, and only one at a time, no matter how many wakes have
been requested.
This state has only one other impact: when `PWQR_CTL_WAKE` is called for more
than one thread, say 4, and userland knows that there are 5 threads in WAIT
state, but 3 of them are actually in the quarantine, only 2 will be woken up,
and the `PWQR_CTL_WAKE` call will return 2. Any subsequent `PWQR_CTL_WAKE`
call will wake up one quarantined thread to let it be parked, but will return
0 each time to hide that from userland.
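The accounting in the example above can be sketched as a toy userland-visible
model (the struct and function names are made up for illustration; this is
not kernel code):

```c
#include <assert.h>

/* Toy model of the PWQR_CTL_WAKE hint described above.
 * `waiting` includes quarantined threads (userland can't tell them apart);
 * the kernel only reports wakes that don't overcommit. */
struct pwqr_model {
    int waiting;     /* threads userland believes are in WAIT */
    int quarantined; /* subset of `waiting` the kernel has quarantined */
};

/* Returns the hint PWQR_CTL_WAKE would give back for `asked` wakes. */
static int model_wake(struct pwqr_model *m, int asked)
{
    int usable = m->waiting - m->quarantined; /* threads that may run */
    int woken  = asked < usable ? asked : usable;

    if (woken > 0) {
        m->waiting -= woken;
        return woken;
    }
    /* Only quarantined threads are left: release one (it will see EDQUOT
     * and park itself), but report 0 to hide it from userland. */
    if (m->quarantined > 0) {
        m->quarantined--;
        m->waiting--;
    }
    return 0;
}
```

Running the document's own numbers through this model (5 waiting, 3 of them
quarantined, wake 4) returns 2, and every later wake returns 0 while quietly
releasing one quarantined thread.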
parked::
    This is the state of threads currently in a `PWQR_CTL_PARK` call from
    userspace (see `pwqr_ctl`).
The regulator tries to maintain the following invariant:

    running + waiting == target_concurrency
    || (running + waiting < target_concurrency && waiting > 0)
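For illustration, the invariant transcribes directly into a C predicate (a
sketch only; the function name is an assumption, not part of any API):

```c
#include <stdbool.h>

/* The regulator's target invariant, transcribed literally:
 * either the pool is exactly at target concurrency, or it is below
 * target but at least one thread is waiting for work. */
static bool pwqr_invariant(int running, int waiting, int target_concurrency)
{
    return running + waiting == target_concurrency
        || (running + waiting < target_concurrency && waiting > 0);
}
```

Note that a pool below target with zero waiting threads violates the
invariant, which is exactly the undercommit case the regulator reacts to.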
When `running + waiting` overcommits::
    The kernel puts waiting threads into the quarantine, which doesn't
    require anything from userland. It's something userland discovers only
    when it needs a waiting thread, which may never happen.
If there are no waiting threads, then the workqueue overcommits, and the pwqr
file descriptor is made available for reading. When userland reads, it will
get the amount of overcommit at the time of the `read(2)` call.
NOTE: the fd is made readable only after a delay (roughly 0.05s in the
current implementation). Once you have read, a new notification is armed for
another 0.05s, and so on, to allow a slow decrease of the pool.
When `running + waiting` undercommits::
    If waiting is non-zero, then we don't care: it means that userland
    actually doesn't need more work to be performed.
If waiting is zero, then a parked thread (if there is one) is woken up so
that userland has a chance to consume jobs.
Unparking threads only when waiting becomes zero avoids flip-flops when the
job flow is small and some of the running threads sometimes block (IOW
running sometimes decreases, making `running + waiting` drop below the target
concurrency for very small amounts of time).
NOTE: unparking only happens after a delay (0.1s in the current
implementation) during which `waiting` must have remained zero and the
overall load must have been undercommitting resources for the whole period.
The regulation between running and waiting threads is left to userspace,
which is a far better judge than kernel land, which has absolutely no
knowledge about the current workload. Doing it this way also means that when
there are lots of jobs to process and the pool has a size that doesn't
require more regulation, the kernel isn't called for mediation/regulation AT
ALL.
When we're overcommitting for a "long" time, userspace should be notified in
some way that it should try to reduce its number of running threads. Note
that the Apple implementation (before Lion at least) has the same issue.
Imagine someone who spawns a zillion jobs that call very slow `msync()`s or
blocking `read()`s over the network: when all of those go back to the
running state, the overcommit is huge.

There are several ways to "fix" this:
in kernel (poll solution, actually implemented right now)::
    Let the file descriptor be pollable, and let it be readable (returning
    something like the amount of overcommit at `read()` time) so that
    userland is notified that it should try to reduce the number of running
    threads.
It sounds very easy, but it has one major drawback: it means the pwqr fd must
somehow be registered into the event loop, which isn't very suitable for a
pthread_workqueue implementation. In other words, if you can plug into the
event loop because it's a custom one, or one that provides thread regulation,
then it's fine; if you can't (glib, libdispatch, ...) then you need a thread
that will basically just `poll()` on this file descriptor, which is really
wasteful.
NOTE: this has been implemented now, but it still looks "expensive" to hook
into for some users, so an alternative way to be signalled would be welcome.
in userspace (probing)::
    Userspace knows how many "running" threads there are; it's easy to track
    the number of registered threads, and parked/waiting threads are already
    accounted for. When "waiting" is zero, if "registered - parked" is high,
    userspace could choose to randomly try to park one thread.
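The heuristic above could be sketched as a simple predicate (all names,
including the `high_watermark` knob, are hypothetical, not part of any pwqr
API):

```c
#include <stdbool.h>

/* Sketch of the userspace parking heuristic: consider parking a thread
 * only when no thread is waiting and the pool of potentially running
 * threads (registered minus parked) exceeds some tuning threshold. */
static bool should_try_park(int registered, int parked, int waiting,
                            int high_watermark)
{
    return waiting == 0 && (registered - parked) > high_watermark;
}
```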
Userspace can use a non-blocking `read()` to probe whether it's
overcommitting.

It's in NONE mode when userspace believes probing isn't necessary (e.g. when
the number of running + waiting threads isn't that large, say less than 110%
of the concurrency, or any similar rule).
Otherwise it's in SLOW mode. In SLOW mode each thread probes every 32 or 64
jobs, to mitigate the cost of the syscall. If the probe returns '1', ask for
down-committing and stay in SLOW mode; if it returns `EAGAIN`, all is fine;
if it returns more than '1', ask for down-committing and go to AGGRESSIVE.
When AGGRESSIVE, threads check whether they must park more often and in a
more controlled fashion (every 32 or 64 jobs isn't nice because jobs can be
very long), for example based on some poor man's timer
(`clock_gettime(CLOCK_MONOTONIC)` sounds fine). State transitions work as for
SLOW.
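One possible reading of this NONE/SLOW/AGGRESSIVE machine, sketched in C
(the names and the encoding of probe results are assumptions; in particular,
a negative `probe_result` stands in for an `EAGAIN` failure):

```c
/* Sketch of the probing state machine described above. */
enum probe_mode { PROBE_NONE, PROBE_SLOW, PROBE_AGGRESSIVE };

/* Returns the next mode; sets *park_one when the thread should ask for
 * down-committing (i.e. park itself). A negative probe_result encodes
 * EAGAIN (not overcommitting), 1 means mild overcommit, more than 1
 * means heavy overcommit. */
static enum probe_mode probe_step(enum probe_mode mode, int probe_result,
                                  int *park_one)
{
    *park_one = 0;
    if (probe_result < 0)            /* EAGAIN: all is fine */
        return mode;
    if (probe_result == 1) {         /* mild overcommit */
        *park_one = 1;
        return PROBE_SLOW;
    }
    *park_one = 1;                   /* heavy overcommit */
    return PROBE_AGGRESSIVE;
}
```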
The issue I have with this is that it adds quite some code to the fast path,
hence I dislike it a lot.
in kernel (signal solution)::
    Define a new signal that we could asynchronously send to the process.
    The signal handler would just set some global flag to '1'; the threads
    in turn would check for this flag in their job-consuming loop, and the
    first thread that sees it at '1' `xchg()`s 0 for it, and goes to PARK
    mode if it got the '1'. It's fast and inexpensive.
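A minimal sketch of that flag handoff, assuming C11 atomics (the names are
illustrative, not an existing API):

```c
#include <stdatomic.h>

/* The hypothetical signal handler just raises the flag; atomic_store on
 * a lock-free atomic_int is async-signal-safe. */
static atomic_int pwqr_park_flag;

static void pwqr_sighandler(int signo)
{
    (void)signo;
    atomic_store(&pwqr_park_flag, 1);
}

/* Called by each worker in its job-consuming loop: the atomic exchange
 * guarantees that exactly one thread claims the flag and goes to PARK
 * mode, no matter how many race for it. */
static int pwqr_should_park(void)
{
    return atomic_exchange(&pwqr_park_flag, 0) == 1;
}
```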
Sadly, AFAICT, defining new signals isn't such a good idea. Another
possibility is to give an address for the flag at `pwqr_create()` time and
let the kernel write directly into userland. The problem is, it feels like a
very wrong interface somehow. I should ask some kernel hacker whether that
would really be frowned upon. If not, it's the leanest solution of all.
    int pwqr_create(int flags);
This call returns a new PWQR file descriptor. The regulator is initialized
with a concurrency corresponding to the number of online CPUs at the time of
the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`.

`flags`::
    A mask of flags among `PWQR_FL_CLOEXEC` and `PWQR_FL_NONBLOCK`.
Available operations on the pwqr file descriptor are:

`poll`, `epoll` and friends::
    The PWQR file descriptor can be watched for `POLLIN` events (not
    `POLLOUT` ones, as it cannot be written to).
`read(2)`::
    The file returned can be read from. The read blocks (or fails with
    `EAGAIN` if in non-blocking mode) until the regulator believes the pool
    is overcommitting. The buffer passed to `read(2)` must be able to hold
    an integer. On success, `read(2)` returns in that buffer the number of
    overcommitting threads (understand: the number of threads to park so
    that the pool isn't overcommitting anymore).
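Consuming such a notification might look like this sketch (the helper name
is made up; only the "read one int" layout comes from the description above):

```c
#include <unistd.h>

/* Sketch of consuming a POLLIN notification on the pwqr fd: read one
 * integer holding the number of threads to park. Returns that count,
 * or 0 if the read failed (e.g. EAGAIN on a non-blocking fd). */
static int pwqr_read_overcommit(int fd)
{
    int n;

    if (read(fd, &n, sizeof(n)) != (ssize_t)sizeof(n))
        return 0;
    return n;
}
```

In an event loop, the returned count would drive how many workers to send
into `PWQR_CTL_PARK`.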
On success, this call returns a nonnegative file descriptor. On error, -1 is
returned, and errno is set to indicate the error.

`EINVAL`::
    Invalid value specified in `flags`.

`ENFILE`::
    The system limit on the total number of open files has been reached.

`ENOMEM`::
    There was insufficient memory to create the kernel object.
    int pwqr_ctl(int pwqrfd, int op, int val, void *addr);
This system call performs control operations on the pwqr instance referred to
by the file descriptor `pwqrfd`.

Valid values for the `op` argument are:
`PWQR_CTL_GET_CONC`::
    Requests the current concurrency level for this regulator.
`PWQR_CTL_SET_CONC`::
    Modifies the current concurrency level for this regulator. The new value
    is passed as the `val` argument. The request returns the old concurrency
    level on success.

A zero or negative value for `val` means 'automatic' and is recomputed as the
current number of online CPUs, as `sysconf(_SC_NPROCESSORS_ONLN)` would
return.
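The 'automatic' rule can be sketched as follows (the helper name is
hypothetical; only the `sysconf` fallback comes from the text above):

```c
#include <unistd.h>

/* Sketch of the PWQR_CTL_SET_CONC 'automatic' rule: a zero or negative
 * value means "use the current number of online CPUs". */
static int pwqr_resolve_conc(int val)
{
    if (val > 0)
        return val;
    return (int)sysconf(_SC_NPROCESSORS_ONLN);
}
```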
`PWQR_CTL_REGISTER`::
    Registers the calling thread to be taken into account by the pool
    regulator. If the thread is already registered with another regulator,
    it is automatically unregistered from it.

`PWQR_CTL_UNREGISTER`::
    Unregisters the calling thread from the pool regulator.
`PWQR_CTL_WAKE`::
    Tries to wake `val` threads from the pool. This is done according to the
    current concurrency level so as not to overcommit. On success, a hint of
    the number of woken threads is returned; it can be 0.

This is only a hint of the number of threads woken up, for two reasons.
First, the kernel could really have woken up a thread, but when it becomes
scheduled, it could *then* decide that it would overcommit (because some
other thread unblocked in between, for example), and block it again.

But it can also lie in the other direction: userland is supposed to account
for waiting threads. So when we're overcommitting and userland wants a
waiting thread to be unblocked, we actually say we woke none, but still
unblock one (the famous quarantined threads we talked about above). This
allows the userland counter of waiting threads to decrease, but we know the
thread won't actually be usable to run jobs.
`PWQR_CTL_WAKE_OC`::
    Tries to wake `val` threads from the pool, bypassing the current
    concurrency level (`OC` stands for `OVERCOMMIT`). On success, the number
    of woken threads is returned; it can be 0, but it's the real count that
    has been (or will soon be) woken up. If it's less than requested, it's
    because there aren't enough parked threads.
`PWQR_CTL_WAIT`::
    Puts the thread to wait for a future `PWQR_CTL_WAKE` command. If this
    thread must be parked to maintain concurrency below the target, then the
    call blocks with no further ado.

If the concurrency level is below the target, then the kernel checks whether
the address `addr` still contains the value `val` (in the fashion of
`futex(2)`). If it doesn't, the call doesn't block. Otherwise the calling
thread is blocked until a `PWQR_CTL_WAKE` command is received.

`addr` must of course be a pointer to an aligned integer which stores the
reference ticket in userland.
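A toy model of that ticket check (not the kernel implementation; the helper
name is made up):

```c
/* Model of the futex(2)-style check PWQR_CTL_WAIT performs: block only
 * if the ticket at `addr` still holds the value the caller saw. If
 * userland already bumped the ticket (a wake happened in between), the
 * real call would fail with EWOULDBLOCK instead of blocking. */
static int pwqr_wait_would_block(const int *addr, int val)
{
    return *addr == val;
}
```

Userland is expected to increment the ticket when publishing new work, so a
stale `val` means a wake already happened and the thread should not sleep.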
`PWQR_CTL_PARK`::
    Puts the thread in park mode. Parked threads are spare threads kept
    around to avoid cloning/exiting threads when the pool is regulated. They
    are released by the regulator only, and can only be woken from userland
    with the `PWQR_CTL_WAKE_OC` command, and only once all waiting threads
    have been woken.

The call blocks until an overcommitting wake requires the thread, or the
kernel regulator needs to grow the pool with new running threads.
When successful, `pwqr_ctl` returns a nonnegative value. On error, -1 is
returned, and errno is set to indicate the error.
`EBADF`::
    `pwqrfd` is not a valid file descriptor.

`EBADFD`::
    `pwqrfd` is a valid pwqr file descriptor but is in a broken state: it
    has been closed while other threads were in a `pwqr_ctl` call.

NOTE: this is due to the current implementation and would probably not exist
in a final version.

`EFAULT`::
    Error reading the value at `addr` from userspace.

Errors specific to `PWQR_CTL_REGISTER`:

`ENOMEM`::
    There was insufficient memory to perform the operation.
Errors specific to `PWQR_CTL_WAIT`:

`EWOULDBLOCK`::
    When the kernel evaluated whether `addr` still contained `val`, it
    didn't. This works like `futex(2)`.
Errors specific to `PWQR_CTL_WAIT` and `PWQR_CTL_PARK`:

`EINTR`::
    The call was interrupted by a signal (note that sometimes the kernel
    masks this fact when it has more important "errors" to report, such as
    `EDQUOT`).

`EDQUOT`::
    The thread has been woken by a `PWQR_CTL_WAKE` or `PWQR_CTL_WAKE_OC`
    call, but is overcommitting.