Pthread WorkQueue Regulator
===========================
The Pthread Workqueue Regulator is meant to help userland regulate thread
pools based on the actual number of threads that are running, the capacity of
the machine, the number of blocked threads, and so on.
In the kernel, threads registered in the pwq regulator can be in five states:

blocked::
    This is the state of threads that are currently blocked in a syscall.

running::
    This is the state of threads that are either really running, or have
    been preempted out by the kernel. In other words, it is the number of
    registered threads that are neither blocked nor in one of the states
    below.

waiting::
    This is the state of threads that are currently in a `PWQR_WAIT` call
    from userspace (see `pwqr_ctl`) but that would not overcommit if
    released by a `PWQR_WAKE` call.

quarantined::
    This is the state of threads that are currently in a `PWQR_WAIT` call
    from userspace (see `pwqr_ctl`) but that would overcommit if released
    by a `PWQR_WAKE` call.
This state avoids waking a thread just to force userland to "park" it: that
is racy and makes the scheduler work for nothing useful. Though if
`PWQR_WAKE` is called, quarantined threads are woken, but with errno set to
`EDQUOT`.
This state actually has only one impact: when `PWQR_WAKE` is called for more
than one thread, for example 4, and userland knows that there are 5 threads
in WAIT state, but actually 3 of them are in the quarantine, only 2 will be
woken up, and the `PWQR_WAKE` call will return 2. Any subsequent `PWQR_WAKE`
call will wake up one quarantined thread to let it be parked, but will
return 0 each time to hide that from userland.
parked::
    This is the state of threads currently in a `PWQR_PARK` call from
    userspace (see `pwqr_ctl`).
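The quarantine wake accounting described above can be sketched in a few lines
of userland-side C. This is only a model of the behaviour the text describes,
not kernel code; the function name and counters are illustrative:

```c
#include <assert.h>

/* Model of PWQR_WAKE accounting: `waiting` is the number of threads in
 * WAIT state that would not overcommit, `quarantined` the number that
 * would.  Only real waiters count in the return value. */
static int pwqr_wake_accounting(int n, int *waiting, int *quarantined)
{
    int woken = 0;

    /* Wake non-quarantined waiters first; these are reported. */
    while (n > 0 && *waiting > 0) {
        (*waiting)--;
        woken++;
        n--;
    }
    /* Otherwise release one quarantined thread (it sees EDQUOT and parks
     * itself), hidden from userland: it is not counted. */
    if (woken == 0 && n > 0 && *quarantined > 0)
        (*quarantined)--;

    return woken;
}
```

With 2 waiting and 3 quarantined threads, a wake of 4 returns 2, and each
subsequent wake releases one quarantined thread while returning 0, matching
the example in the text.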
The regulator tries to maintain the following invariant:

    running + waiting == target_concurrency
    || (running + waiting < target_concurrency && waiting > 0)
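Read as code, the invariant is a simple predicate over the regulator's
counters (a sketch; the function name is illustrative, not part of the API):

```c
#include <assert.h>

/* The invariant above, as a predicate: the pool is exactly at target, or
 * under target but with at least one waiter ready to absorb new jobs. */
static int pwqr_invariant_holds(int running, int waiting,
                                int target_concurrency)
{
    return running + waiting == target_concurrency
        || (running + waiting < target_concurrency && waiting > 0);
}
```

With a target of 4: (running=4, waiting=0) and (2, 1) satisfy it; (3, 0)
undercommits with no waiter (so the regulator unparks a thread), and (5, 0)
overcommits (so waiters would be quarantined).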
When `running + waiting` overcommits::
    The kernel puts waiting threads into the quarantine, which doesn't
    require anything from userland. It's something userland discovers only
    when it needs a waiting thread, which may never happen.
If there are no waiting threads, then the workqueue overcommits; this is one
of the current TODO items (see Notes).
When `running + waiting` undercommits::
    If waiting is non-zero, then we don't care: it means userland actually
    doesn't need more work to be performed.
If waiting is zero, then a parked thread (if there is one) is woken up so
that userland has a chance to consume jobs.
Unparking threads only when waiting becomes zero avoids flip-flops when the
job flow is small and some of the running threads occasionally block (IOW
running sometimes decreases, making `running + waiting` drop below the
target concurrency for a very small amount of time).
NOTE: unparking only happens after a delay (0.1s in the current
implementation) during which `waiting` must have remained zero and the
overall load must have been undercommitting for the whole period.
The regulation between running and waiting threads is left to userspace,
which is a far better judge than the kernel, which has absolutely no
knowledge of the current workload. Doing it this way also means that when
there are lots of jobs to process and the pool has a size that doesn't
require more regulation, the kernel isn't called for mediation/regulation
AT ALL.
Notes
=====

When we're overcommitting for a "long" time, userspace should be notified
in some way that it should try to reduce its number of running threads.
Note that the Apple implementation (before Lion at least) has the same
issue. If you imagine someone spawning a zillion jobs that call very slow
`msync()`s or blocking `read()`s over the network, then when all of those
go back to the running state, the overcommit is huge.
There are several ways to "fix" this:
in kernel (poll solution)::
    Let the file descriptor be pollable, and let it be readable (returning
    something like the amount of overcommit at read() time, for example)
    so that userland is notified that it should try to reduce the number
    of running threads.

It sounds very easy, but it has one major drawback: it means the pwqfd must
somehow be registered into the event loop, and it's not very suitable for a
pthread_workqueue implementation.
in kernel (hack-ish solution)::
    The kernel could voluntarily unpark/unblock a thread with a distinct
    errno that would signal overcommitting. Unlike the pollable proposal,
    this doesn't require hooking into the event loop. Though it requires
    having one such thread, which may not be the case when userland has
    reached the peak number of threads it would ever want to use.
Is this really a problem? I'm not sure, especially since when that happens
userland could pick a victim thread that would call `PWQR_PARK` after each
processed job, which would allow some kind of poor man's poll.
The drawback I see in that solution is that we wake up YET ANOTHER thread
at a moment when we're already overcommitting, which sounds
counterproductive. That's why I didn't implement it.
in userland (probing solution)::
    Userspace knows how many "running" threads there are; it's easy to
    track the number of registered threads, and parked/waiting threads are
    already accounted for. When "waiting" is zero, if "registered - parked"
    is high, userspace could choose to randomly try to park one thread.
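That bookkeeping fits in one predicate. The high-watermark threshold and all
names below are assumptions for illustration, not part of any proposed API:

```c
#include <assert.h>

/* Userspace-side heuristic sketched above: consider parking a thread only
 * when nobody is waiting and the non-parked population looks too large. */
static int should_try_park(int registered, int parked, int waiting,
                           int high_watermark)
{
    return waiting == 0 && registered - parked > high_watermark;
}
```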
I think `PWQR_PARK` could use `val` to offer some "probing" mode that would
return `0` if it wouldn't block and `-1/EWOULDBLOCK` if it would in the
non-probing mode. Userspace could maintain some global probing_mode flag,
a tristate: NONE, SLOW, AGGRESSIVE.
It's in NONE when userspace believes it's not necessary to probe (e.g. when
the number of running + waiting threads isn't that large, say less than
110% of the concurrency, or any similar rule).
It's in SLOW mode otherwise. In SLOW mode each thread does a probe every 32
or 64 jobs to amortize the cost of the syscall. If the probe returns
EWOULDBLOCK then the thread goes to PARK mode, and the probing_mode goes to
AGGRESSIVE.
When AGGRESSIVE, threads check whether they must park more often and in a
more controlled fashion (every 32 or 64 jobs isn't nice because jobs can be
very long), for example based on some poor man's timer
(clock_gettime(CLOCK_MONOTONIC) sounds fine). As soon as a probe returns 0,
or the NONE conditions hold, the probing_mode goes back to NONE/SLOW.
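The transitions above can be sketched as a pure function. The 110% rule, the
probe-result encoding, and all names are illustrative assumptions taken from
the surrounding text:

```c
#include <assert.h>

/* Tristate probing mode described in the text. */
enum probing_mode { PROBE_NONE, PROBE_SLOW, PROBE_AGGRESSIVE };

#define PROBE_NOT_RUN  1   /* no probe was made this round */
#define PROBE_OK       0   /* probe says PWQR_PARK would not block */
#define PROBE_BLOCK   -1   /* probe says -1/EWOULDBLOCK: park this thread */

/* `load_pct` is (running + waiting) as a percentage of the target
 * concurrency; returns the next probing_mode. */
static enum probing_mode
next_probing_mode(enum probing_mode cur, int load_pct, int probe_result)
{
    if (load_pct <= 110)                 /* "NONE conditions" hold */
        return PROBE_NONE;
    if (probe_result == PROBE_BLOCK)     /* this thread goes to PARK mode */
        return PROBE_AGGRESSIVE;
    if (probe_result == PROBE_OK)        /* pressure released: relax */
        return PROBE_SLOW;
    return cur == PROBE_NONE ? PROBE_SLOW : cur;
}
```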
The issue I have with this is that it sounds like it adds quite some code
to the fastpath, hence I dislike it a lot.
in kernel (signal solution)::
    Define a new signal that the kernel could asynchronously send to the
    process. The signal handler would just set some global flag to '1';
    the threads in turn would check this flag in their job-consuming loop,
    and the first thread that sees it at '1' xchg()s 0 for it, and goes to
    PARK mode if it got the '1'. It's fast and inexpensive.
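The flag scheme can be sketched with C11 atomics. `on_overcommit` stands in
for the hypothetical signal handler, and the names are illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int park_flag;

/* What the (hypothetical) overcommit signal handler would do. */
static void on_overcommit(void)
{
    atomic_store(&park_flag, 1);
}

/* Called by each worker in its job-consuming loop; exactly one caller
 * wins the xchg() and would then call PWQR_PARK. */
static int claim_park_request(void)
{
    return atomic_exchange(&park_flag, 0) == 1;
}
```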
Sadly, AFAICT, defining new signals isn't such a good idea. Another
possibility is to pass an address for the flag at pwqr_create() time and
let the kernel write directly into userland. The problem is, it feels like
a very wrong interface somehow. I should ask some kernel hackers whether
that would really be frowned upon. If not, then that's the leanest solution
of all.
pwqr_create
===========

SYNOPSIS

    int pwqr_create(int flags);

DESCRIPTION

This call returns a new PWQR file descriptor. The regulator is initialized
with a concurrency corresponding to the number of online CPUs at the time
of the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`.
flags::
    A mask of flags, currently only `O_CLOEXEC`.
RETURN VALUE

On success, this call returns a nonnegative file descriptor.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS

EINVAL::
    Invalid value specified in `flags`.

ENFILE::
    The system limit on the total number of open files has been reached.

ENOMEM::
    There was insufficient memory to create the kernel object.
pwqr_ctl
========

SYNOPSIS

    int pwqr_ctl(int pwqrfd, int op, int val, void *addr);

DESCRIPTION

This system call performs control operations on the pwqr instance referred
to by the file descriptor `pwqrfd`.

Valid values for the `op` argument are:
PWQR_GET_CONC::
    Requests the current concurrency level for this regulator.

PWQR_SET_CONC::
    Modifies the current concurrency level for this regulator. The new
    value is passed as the `val` argument. The request returns the old
    concurrency level on success.

    A zero or negative value for `val` means 'automatic' and is recomputed
    as the current number of online CPUs, as
    `sysconf(_SC_NPROCESSORS_ONLN)` would return.
PWQR_REGISTER::
    Registers the calling thread to be taken into account by the pool
    regulator. If the thread is already registered with another regulator,
    it is automatically unregistered from it.

PWQR_UNREGISTER::
    Unregisters the calling thread from the pool regulator.
PWQR_WAKE::
    Tries to wake `val` threads from the pool. This is done according to
    the current concurrency level so as not to overcommit. On success, the
    number of woken threads is returned; it can be 0.

PWQR_WAKE_OC::
    Tries to wake `val` threads from the pool. This is done bypassing the
    current concurrency level (`OC` stands for `OVERCOMMIT`). On success,
    the number of woken threads is returned; it can be 0.
PWQR_WAIT::
    Puts the thread to wait for a future `PWQR_WAKE` command. If this
    thread must be parked to maintain concurrency below the target, then
    the call blocks with no further ado.

    If the concurrency level is below the target, then the kernel checks
    whether the address `addr` still contains the value `val` (in the
    fashion of `futex(2)`). If it doesn't, the call doesn't block.
    Otherwise the calling thread is blocked until a `PWQR_WAKE` command is
    received.
PWQR_PARK::
    Puts the thread in park mode. Parked threads are spare threads that
    avoid cloning/exiting threads when the pool is regulated. They are
    released by the regulator only, and can only be woken from userland
    with the `PWQR_WAKE_OC` command, and only once all waiting threads
    have been woken.

    The call blocks until an overcommitting wake requires the thread, or
    the kernel regulator needs to grow the pool with new running threads.
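The futex(2)-style test that `PWQR_WAIT` performs before blocking can be
modelled in userland terms. This is only a sketch of the documented
semantics (the function name is illustrative), not the kernel
implementation:

```c
#include <assert.h>

/* PWQR_WAIT blocks only while *addr still holds the value the caller
 * passed as `val`; if it changed, work arrived in the meantime and the
 * call must return instead of sleeping. */
static int pwqr_wait_may_block(const int *addr, int val)
{
    return *addr == val;
}
```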
RETURN VALUE

When successful, `pwqr_ctl` returns a nonnegative value.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS

EBADF::
    `pwqrfd` is not a valid file descriptor.

EBADFD::
    `pwqrfd` is a valid pwqr file descriptor but is in a broken state: it
    has been closed while other threads were in a pwqr_ctl call.

NOTE: this is due to the current implementation and would probably not
exist in a final version.

EFAULT::
    Error reading the value at `addr` from userspace.
Errors specific to `PWQR_REGISTER`:

ENOMEM::
    There was insufficient memory to perform the operation.
Errors specific to `PWQR_WAIT`:

EWOULDBLOCK::
    When the kernel evaluated whether `addr` still contained `val`, it
    didn't. This works like `futex(2)`.
Errors specific to `PWQR_WAIT` and `PWQR_PARK`:

EINTR::
    The call was interrupted by a signal (note that sometimes the kernel
    masks this fact when it has more important "errors" to report, like
    `EDQUOT`).

EDQUOT::
    The thread has been woken by a `PWQR_WAKE` or `PWQR_WAKE_OC` call, but
    releasing it would overcommit; userland should park it (see the
    quarantined state above).