Pthread WorkQueue Regulator
===========================

The Pthread WorkQueue Regulator is meant to help userland regulate thread
pools based on the number of threads that are actually running, the capacity
of the machine, the number of blocked threads, and so on.
 
kernel-land design
------------------

In the kernel, threads registered in the pwq regulator can be in one of 5
states:

blocked::
        This is the state of threads that are currently blocked in a syscall.

running::
        This is the state of threads that are either really running, or have
        been preempted by the kernel. In other words, these are the
        schedulable threads.

waiting::
        This is the state of threads that are currently in a `PWQR_WAIT` call
        from userspace (see `pwqr_ctl`) but that would not overcommit if
        released by a `PWQR_WAKE` call.

quarantined::
        This is the state of threads that are currently in a `PWQR_WAIT` call
        from userspace (see `pwqr_ctl`) but that would overcommit if released
        by a `PWQR_WAKE` call.
+
This state avoids waking a thread only to force userland to "park" it, which
is racy and makes the scheduler work for nothing useful. If `PWQR_WAKE` is
called, though, quarantined threads are woken with errno set to `EDQUOT`, and
only one at a time, no matter how many wakes have been asked for.
+
This state actually has only one impact. Suppose `PWQR_WAKE` is called for
more than one thread, for example 4, and userland knows that there are 5
threads in WAIT state, but 3 of them are actually in the quarantine: only 2
will be woken up, and the `PWQR_WAKE` call will return 2. Any subsequent
`PWQR_WAKE` call will wake up one quarantined thread to let it be parked,
while returning 0 each time to hide that from userland.

parked::
        This is the state of threads currently in a `PWQR_PARK` call from
        userspace (see `pwqr_ctl`).
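The quarantine example above can be replayed with a small userspace model.
This is only a sketch of the accounting described in the text, not the kernel
code: `struct pwqr_model` and `model_wake()` are hypothetical names, and the
rule that a quarantined thread is released only when no waiting thread could
be woken is an assumption drawn from the example.

```c
#include <assert.h>

/* Hypothetical userspace model of the counters, not the kernel structures. */
struct pwqr_model {
    int waiting;          /* in PWQR_WAIT, would not overcommit if released */
    int quarantined;      /* in PWQR_WAIT, would overcommit if released */
    int edquot_released;  /* quarantined threads sent off to park so far */
};

/*
 * Model of PWQR_WAKE asking for n threads: wake up to n waiting threads and
 * return that count. If nothing could be woken (assumption), release exactly
 * one quarantined thread, which would see EDQUOT, and still report 0.
 */
int model_wake(struct pwqr_model *m, int n)
{
    assert(n > 0);

    int woken = n < m->waiting ? n : m->waiting;
    m->waiting -= woken;

    if (woken == 0 && m->quarantined > 0) {
        m->quarantined--;
        m->edquot_released++;
    }
    return woken;
}
```

Starting from 2 waiting and 3 quarantined threads, `model_wake(&m, 4)` returns
2, and the next call returns 0 while moving one thread out of the quarantine.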
 
The regulator tries to maintain the following invariant:

        running + waiting == target_concurrency
    ||  (running + waiting < target_concurrency && waiting > 0)
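As a quick way to reason about the invariant, here is a minimal C sketch that
classifies a set of counters. The names `model_balance` and `model_classify()`
are made up for this illustration, and the three-way split is an assumption
about how one might read the invariant, not something the implementation
exposes.

```c
#include <assert.h>

enum model_balance {
    MODEL_BALANCED,    /* the invariant holds */
    MODEL_OVERCOMMIT,  /* running + waiting exceeds the target */
    MODEL_STARVED      /* below the target with no waiting thread */
};

/* Classify the counters against the invariant stated above. */
enum model_balance model_classify(int running, int waiting, int target)
{
    assert(running >= 0 && waiting >= 0 && target > 0);

    int committed = running + waiting;

    if (committed > target)
        return MODEL_OVERCOMMIT;
    if (committed == target || waiting > 0)
        return MODEL_BALANCED;
    return MODEL_STARVED;
}
```

With a target of 4, the counters (3 running, 1 waiting) and (1 running, 1
waiting) both satisfy the invariant, while (2 running, 0 waiting) does not.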
 
When `running + waiting` overcommits::
        The kernel puts waiting threads into the quarantine, which doesn't
        require anything from userland. It's something userland discovers only
        when it needs a waiting thread, which may never happen.
+
If there are no waiting threads, then the workqueue simply overcommits.
Handling that case is one of the TODO items at the moment (see Todos).

When `running + waiting` undercommits::
        If waiting is non-zero then we don't care: it means that userland
        actually doesn't need more work to be performed.
+
If waiting is zero, then a parked thread (if there is one) is woken up so
that userland has a chance to consume jobs.
+
Unparking threads only when waiting becomes zero avoids flip-flops when the
job flow is small and some of the running threads sometimes block (IOW
running sometimes decreases, making `running + waiting` fall below the target
concurrency for a very small amount of time).
+
NOTE: unparking only happens after a delay (0.1s in the current
implementation) during which `waiting` must have remained zero and the pool
must have kept undercommitting for the whole period.
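The delayed unparking rule can be sketched as follows. This is a userspace
illustration under stated assumptions, not the kernel logic: `pool_sample`,
`unpark_timer` and `should_unpark()` are hypothetical, time is passed in as
milliseconds by the caller, and only the 0.1s delay comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define UNPARK_DELAY_MS 100  /* 0.1s, the delay mentioned in the text */

struct pool_sample { int running, waiting, parked, target; };

/* Remembers since when the pool has been starved (hypothetical). */
struct unpark_timer { bool armed; long since_ms; };

/*
 * Returns true once waiting has stayed at zero and the pool has stayed
 * under the target for the whole delay, and a parked thread exists.
 */
bool should_unpark(struct unpark_timer *t, const struct pool_sample *s,
                   long now_ms)
{
    assert(s->target > 0);

    bool starved = s->waiting == 0 && s->running + s->waiting < s->target;

    if (!starved || s->parked == 0) {
        t->armed = false;        /* condition broken: restart the delay */
        return false;
    }
    if (!t->armed) {
        t->armed = true;         /* first starved observation */
        t->since_ms = now_ms;
        return false;
    }
    return now_ms - t->since_ms >= UNPARK_DELAY_MS;
}
```

Any sample where `waiting` is non-zero disarms the timer, which is what
prevents the flip-flops described above.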
 
The regulation between running and waiting threads is left to userspace,
which is a far better judge than kernel land, which has absolutely no
knowledge about the current workload. Also, doing so means that when there are
lots of jobs to process and the pool has a size that doesn't require more
regulation, the kernel isn't called for mediation/regulation AT ALL.


Todos
~~~~~
When we're overcommitting for a "long" time, userspace should be notified in
some way that it should try to reduce its number of running threads. Note that
the Apple implementation (before Lion at least) has the same issue. If you
imagine someone who spawns a zillion jobs that call very slow `msync()` or
blocking `read()` over the network, and all of those then go back to running
state, the overcommit is huge.
 
There are several ways to "fix" this:

in kernel (poll solution)::
        Let the file descriptor be pollable, and let it be readable (returning
        something like the amount of overcommit at read() time for example) so
        that userland is notified that it should try to reduce the number of
        runnable threads.
+
It sounds very easy, but it has one major drawback: it means the pwqr file
descriptor must somehow be registered into the event loop, which is not very
suitable for a pthread_workqueue implementation. In other words, if you can
plug into the event loop because it's a custom one or one that provides thread
regulation then it's fine. If you can't (glib, libdispatch, ...) then you need
a thread that will basically just poll() on this file descriptor, which is
really wasteful.
+
NOTE: this has been implemented now, but it still looks "expensive" to hook
for some users. So if some alternative way to be signalled could exist, it'd
be really awesome.
 
in userspace::
        Userspace knows how many "running" threads there are: it's easy to
        track the number of registered threads, and parked/waiting threads are
        already accounted for. When "waiting" is zero, if "registered - parked"
        is "high", userspace could choose to randomly try to park one thread.
+
Userspace can use a non-blocking read() to probe whether it's overcommitting,
with a probing mode that is one of NONE, SLOW or AGGRESSIVE.
+
It's in NONE when userspace believes it's not necessary to probe (e.g. when
the number of running + waiting threads isn't that large, say less than 110%
of the concurrency or any similar rule).
+
It's in SLOW mode otherwise. In SLOW mode each thread does a probe every 32 or
64 jobs to mitigate the cost of the syscall. If the probe returns '1' then ask
for down-committing and stay in SLOW mode, if it fails with `EAGAIN` all is
fine, if it returns more than '1' ask for down-committing and go to
AGGRESSIVE.
+
When in AGGRESSIVE mode, threads check whether they must park more often and
in a more controlled fashion (every 32 or 64 jobs isn't nice because jobs can
be very long), for example based on some poor man's timer
(`clock_gettime(CLOCK_MONOTONIC)` sounds fine). State transitions work as for
SLOW.
+
The issue I have with this is that it seems to add quite some code to the
fast path, hence I dislike it a lot.
 
my dream::
        To be able to define a new signal we could asynchronously send to the
        process. The signal handler would just set some global flag to '1',
        the threads in turn would check this flag in their job consuming
        loop, and the first thread that sees it at '1' xchg()s 0 into it, and
        goes to PARK mode if it got the '1'. It's fast and inexpensive.
+
Sadly, AFAICT defining new signals isn't such a good idea. Another
possibility is to give an address for the flag at pwqr_create() time and let
the kernel directly write into userland. The problem is, I feel like it's a
very wrong interface somehow. I should ask some kernel hacker to know if that
would be really frowned upon. If not, then that's the leanest solution of all.
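Two of the userspace-side ideas above are small enough to sketch in C: the
probing mode transitions and the "first thread that sees the flag takes it"
exchange. Everything here is hypothetical (`probe_mode`, `pick_mode()`,
`mode_after_probe()`, `claim_park_request()`); the 110% threshold is the
example figure from the text, and treating an `EAGAIN` probe as "overcommit of
0, back to SLOW" is an assumption, since the text only says that all is fine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum probe_mode { PROBE_NONE, PROBE_SLOW, PROBE_AGGRESSIVE };

/* NONE while running + waiting stays under 110% of the target concurrency. */
enum probe_mode pick_mode(enum probe_mode mode, int running, int waiting,
                          int target)
{
    assert(target > 0);
    if ((running + waiting) * 100 < target * 110)
        return PROBE_NONE;
    return mode == PROBE_NONE ? PROBE_SLOW : mode;
}

/*
 * overcommit is the value read from the descriptor, or 0 when the
 * non-blocking read failed with EAGAIN. *park tells the caller whether
 * it should ask for down-committing.
 */
enum probe_mode mode_after_probe(int overcommit, bool *park)
{
    assert(overcommit >= 0);
    *park = overcommit > 0;
    return overcommit > 1 ? PROBE_AGGRESSIVE : PROBE_SLOW;
}

/*
 * "my dream" variant: whoever swaps the 1 out of the flag goes to park.
 * Only one thread can observe the 1, the others get 0 back.
 */
bool claim_park_request(atomic_int *flag)
{
    return atomic_exchange(flag, 0) == 1;
}
```

A worker loop would call `claim_park_request()` between jobs, which costs one
atomic exchange; a plain load before the exchange would make the common
"flag is 0" case cheaper still.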
 
pwqr_create
-----------
SYNOPSIS
~~~~~~~~

        int pwqr_create(int flags);

DESCRIPTION
~~~~~~~~~~~
This call returns a new PWQR file descriptor. The regulator is initialized
with a concurrency corresponding to the number of online CPUs at the time of
the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`.
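For reference, the same figure can be computed from userspace. This is a
minimal sketch: `online_cpus()` is a hypothetical helper, and falling back to
1 when the query fails is an assumption of this example, not documented
behaviour of the regulator.

```c
#include <assert.h>
#include <limits.h>
#include <unistd.h>

/* Number of online CPUs, or 1 if the system cannot tell us. */
int online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);

    if (n < 1)
        return 1;            /* query failed or returned nonsense */
    assert(n <= INT_MAX);
    return (int)n;
}
```

Userland can compare this value with its own thread counts when deciding how
many threads to keep running.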
 
`flags`::
        a mask of flags among `O_CLOEXEC` and `O_NONBLOCK`.

Available operations on the pwqr file descriptor are:

`poll`, `epoll` and friends::
        the PWQR file descriptor can be watched for POLLIN events (not POLLOUT
        ones as it cannot be written to).

`read`::
        The file descriptor returned can be read from. The read blocks (or
        fails with `EAGAIN` if in non-blocking mode) until the regulator
        believes the pool is overcommitting. The buffer passed to read should
        be able to hold an integer. When `read(2)` is successful, it writes
        the number of overcommitting threads (that is, the number of threads
        to park so that the pool isn't overcommitting anymore).
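A non-blocking probe built on the `read` behaviour described above might look
like this. It is a sketch under assumptions: `probe_overcommit()` is a
hypothetical helper, the value is assumed to be a plain `int`, and the
descriptor is assumed to have been created with `O_NONBLOCK`. Only standard
`read(2)` semantics are used, so the function works on any descriptor that
delivers an `int`.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Returns the number of threads to park, 0 if the descriptor has nothing
 * to report (EAGAIN), or -1 on any other failure, short reads included.
 */
int probe_overcommit(int fd)
{
    int overcommit = 0;
    ssize_t n;

    do {
        n = read(fd, &overcommit, sizeof(overcommit));
    } while (n < 0 && errno == EINTR);   /* retry if a signal interrupted us */

    if (n == (ssize_t)sizeof(overcommit))
        return overcommit;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}
```

A caller would typically park as many threads as the returned count, and
treat 0 as "nothing to do".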
 
RETURN VALUE
~~~~~~~~~~~~
On success, this call returns a nonnegative file descriptor.
On error, -1 is returned, and errno is set to indicate the error.

ERRORS
~~~~~~
[EINVAL]::
        Invalid value specified in flags.
[ENFILE]::
        The system limit on the total number of open files has been reached.
[ENOMEM]::
        There was insufficient memory to create the kernel object.
 

pwqr_ctl
--------
SYNOPSIS
~~~~~~~~

        int pwqr_ctl(int pwqrfd, int op, int val, void *addr);


DESCRIPTION
~~~~~~~~~~~

This system call performs control operations on the pwqr instance referred to
by the file descriptor `pwqrfd`.

Valid values for the `op` argument are:

`PWQR_GET_CONC`::
        Requests the current concurrency level for this regulator.

`PWQR_SET_CONC`::
        Modifies the current concurrency level for this regulator. The new
        value is passed as the `val` argument. The request returns the old
        concurrency level on success.
+
A zero or negative value for `val` means 'automatic' and is recomputed as the
current number of online CPUs as `sysconf(_SC_NPROCESSORS_ONLN)` would return.
 
`PWQR_REGISTER`::
        Registers the calling thread to be taken into account by the pool
        regulator. If the thread is already registered into another regulator,
        then it's automatically unregistered from it.

`PWQR_UNREGISTER`::
        Deregisters the calling thread from the pool regulator.

`PWQR_WAKE`::
        Tries to wake `val` threads from the pool. This is done according to
        the current concurrency level so as not to overcommit. On success, a
        hint of the number of woken threads is returned; it can be 0.
+
This is only a hint of the number of threads woken up, for two reasons. First,
the kernel could really have woken up a thread, but when it becomes scheduled,
it could *then* decide that it would overcommit (because some other thread
unblocked in between for example), and block it again.
+
But it can also lie in the other direction: userland is supposed to account
for waiting threads. So when we're overcommitting and userland wants a waiting
thread to be unblocked, we actually say we woke none, but still unblock one
(the quarantined threads described above). This allows the userland counter of
waiting threads to decrease, but we know the thread won't be usable so we
return 0.
 
`PWQR_WAKE_OC`::
        Tries to wake `val` threads from the pool. This is done bypassing the
        current concurrency level (`OC` stands for `OVERCOMMIT`). On success,
        the number of woken threads is returned; it can be 0, but it's the
        real count of threads that have been (or will soon be) woken up. If
        it's less than requested, it's because there aren't enough parked
        threads.

`PWQR_WAIT`::
        Puts the thread to wait for a future `PWQR_WAKE` command. If this
        thread must be parked to maintain concurrency below the target, then
        the call blocks with no further ado.
+
If the concurrency level is below the target, then the kernel checks if the
address `addr` still contains the value `val` (in the fashion of `futex(2)`).
If it doesn't, then the call doesn't block. Otherwise the calling thread is
blocked until a `PWQR_WAKE` command is received.
+
`addr` must of course be a pointer to an aligned integer which stores the
reference ticket in userland.

`PWQR_PARK`::
        Puts the thread in park mode. Parked threads are spare threads that
        avoid cloning/exiting threads when the pool is regulated. Those
        threads are released by the regulator only, and can only be woken from
        userland with the `PWQR_WAKE_OC` command, and once all waiting threads
        have been woken.
+
The call blocks until an overcommitting wake requires the thread, or the
kernel regulator needs to grow the pool with new running threads.
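The decision taken by `PWQR_WAIT` can be modelled in a few lines. This is a
sketch of the description above, not the kernel code: `wait_result` and
`model_wait()` are hypothetical, and reading "must be parked to maintain
concurrency" as `running > target` is an assumption.

```c
#include <assert.h>
#include <stddef.h>

enum wait_result {
    WAIT_SLEEPS,       /* blocks until a PWQR_WAKE arrives */
    WAIT_QUARANTINED,  /* blocks right away, the pool is overcommitting */
    WAIT_EWOULDBLOCK   /* the ticket changed, the call does not block */
};

/*
 * ticket points to the userland integer passed as addr, val is the value
 * the caller read from it before deciding to wait (futex-like check).
 */
enum wait_result model_wait(int running, int target, const int *ticket,
                            int val)
{
    assert(ticket != NULL && target > 0);

    if (running > target)
        return WAIT_QUARANTINED;   /* no ticket check in this case */
    if (*ticket != val)
        return WAIT_EWOULDBLOCK;   /* something changed in the meantime */
    return WAIT_SLEEPS;
}
```

The usual futex-style pattern would be for a producer to bump the ticket
before waking, and for a consumer to read the ticket, check its job queue, and
only then wait with the value it read, so that a job queued in between is not
missed.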
 
RETURN VALUE
~~~~~~~~~~~~
When successful, `pwqr_ctl` returns a nonnegative value.
On error, -1 is returned, and errno is set to indicate the error.

ERRORS
~~~~~~
[EBADF]::
        `pwqrfd` is not a valid file descriptor.

[EBADFD]::
        `pwqrfd` is a valid pwqr file descriptor but is in a broken state: it
        has been closed while other threads were in a pwqr_ctl call.
+
NOTE: this is due to the current implementation and would probably not be here
with a real syscall.

[EFAULT]::
        Error in reading value from `addr` from userspace.

[EINVAL]::
        TODO
 
Errors specific to `PWQR_REGISTER`:

[ENOMEM]::
        There was insufficient memory to perform the operation.

Errors specific to `PWQR_WAIT`:

[EWOULDBLOCK]::
        When the kernel checked whether `addr` still contained `val`, it
        didn't. This works like `futex(2)`.

Errors specific to `PWQR_WAIT` and `PWQR_PARK`:

[EINTR]::
        The call was interrupted by a signal (note that sometimes the kernel
        masks this fact when it has more important "errors" to report like
        `EDQUOT`).
[EDQUOT]::
        The thread has been woken by a `PWQR_WAKE` or `PWQR_WAKE_OC` call, but
        is overcommitting.
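One possible way for a worker to react to these results after a `PWQR_WAIT`
call is sketched below. The mapping is an assumption made for illustration
(`wait_action` and `action_after_wait()` are hypothetical names); the text
only states what each errno means, not what userland must do about it.

```c
#include <assert.h>
#include <errno.h>

enum wait_action {
    ACT_RUN_JOBS,    /* go look at the job queue */
    ACT_RETRY_WAIT,  /* interrupted by a signal, wait again */
    ACT_PARK,        /* woken while overcommitting, go park */
    ACT_FAIL         /* unexpected error, let the caller decide */
};

/* ret and err are the return value and errno of the PWQR_WAIT call. */
enum wait_action action_after_wait(int ret, int err)
{
    if (ret >= 0)
        return ACT_RUN_JOBS;

    switch (err) {
    case EWOULDBLOCK:            /* the ticket moved: work may be pending */
        return ACT_RUN_JOBS;
    case EINTR:
        return ACT_RETRY_WAIT;
    case EDQUOT:
        return ACT_PARK;
    default:
        return ACT_FAIL;
    }
}
```

Since the kernel may report `EDQUOT` in place of `EINTR`, a caller built this
way parks in that case instead of retrying the wait.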