Pthread WorkQueue Regulator
===========================

The Pthread Workqueue Regulator is meant to help userland regulate thread
pools based on the number of threads that are actually running, the capacity
of the machine, the number of blocked threads, and so on.

kernel-land design
------------------

In the kernel, threads registered in the pwq regulator can be in 4 states:

blocked::
    This is the state of threads that are currently blocked in a syscall.

running::
    This is the state of threads that are either really running, or have
    been preempted out by the kernel. In other words, it's the number of
    schedulable threads.

waiting::
    This is the state of threads that are currently in a `PWQR_CTL_WAIT`
    call from userspace (see `pwqr_ctl`) but that would not overcommit if
    released by a `PWQR_CTL_WAKE` call.

quarantined::
    This is the state of threads that are currently in a `PWQR_CTL_WAIT`
    call from userspace (see `pwqr_ctl`) but that would overcommit if
    released by a `PWQR_CTL_WAKE` call.
+
This state avoids waking a thread just to force userland to "park" it,
which is racy and makes the scheduler work for nothing useful. If
`PWQR_CTL_WAKE` is called anyway, quarantined threads are woken, but with
an `EDQUOT` errno set, and only one by one, no matter how many wakes have
been requested.
+
This state actually has only one visible impact: when `PWQR_CTL_WAKE` is
called for more than one thread, say 4, and userland knows there are 5
threads in WAIT state, but 3 of them are actually in the quarantine, only
2 will be woken up, and the `PWQR_CTL_WAKE` call will return 2. Any
subsequent `PWQR_CTL_WAKE` call will wake up one quarantined thread so that
it can be parked, but will return 0 each time to hide that from userland.

parked::
    This is the state of threads currently in a `PWQR_CTL_PARK` call from
    userspace (see `pwqr_ctl`).

The regulator tries to maintain the following invariant:

    running + waiting == target_concurrency ||
        (running + waiting < target_concurrency && waiting > 0)

When `running + waiting` overcommits::
    The kernel puts waiting threads into the quarantine, which doesn't
    require anything from userland. It's something userland discovers only
    when it needs a waiting thread, which may never happen.
+
If there are no waiting threads, then well, the workqueue overcommits, and
that's one of the TODO items at the moment (see the Todos below).

When `running + waiting` undercommits::
    If waiting is non-zero then well, we don't care: userland simply
    doesn't need more work to be performed.
+
If waiting is zero, then a parked thread (if there is one) is woken up so
that userland has a chance to consume jobs.
+
Unparking threads only when waiting becomes zero avoids flip-flopping when
the job flow is small and some of the running threads occasionally block
(IOW running sometimes decreases, making `running + waiting` drop below the
target concurrency for a very short time).
+
NOTE: unparking only happens after a delay (0.1s in the current
implementation) during which `waiting` must have remained zero and the
overall load must have been undercommitting resources for the whole period.

The regulation between running and waiting threads is left to userspace,
which is a far better judge than kernel land, which has absolutely no
knowledge of the current workload. Also, doing so means that when there are
lots of jobs to process and the pool has a size that doesn't require more
regulation, the kernel isn't called for mediation/regulation AT ALL.
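To make the invariant concrete, here is a minimal sketch, in C, of the
decision it implies when the counters change. The `pwqr_pool` and
`pwqr_action` names and the `main()` driver are purely illustrative; they
are not the kernel's actual data structures or code.

[source,c]
----
#include <stdio.h>

/* Counters the regulator maintains per pool (see the states above). */
struct pwqr_pool {
    int running;            /* schedulable threads                    */
    int waiting;            /* threads blocked in PWQR_CTL_WAIT       */
    int parked;             /* threads blocked in PWQR_CTL_PARK       */
    int target_concurrency; /* defaults to the number of online CPUs  */
};

enum pwqr_action {
    PWQR_DO_NOTHING,     /* invariant holds, or userland is simply idle  */
    PWQR_QUARANTINE_ONE, /* overcommit: quarantine a waiting thread      */
    PWQR_UNPARK_ONE,     /* undercommit with waiting == 0: unpark one
                          * thread (after the 0.1s delay described above) */
    PWQR_OVERCOMMITTED,  /* overcommit with nobody waiting (see Todos)    */
};

static enum pwqr_action pwqr_rebalance(const struct pwqr_pool *p)
{
    int committed = p->running + p->waiting;

    if (committed > p->target_concurrency)
        return p->waiting > 0 ? PWQR_QUARANTINE_ONE : PWQR_OVERCOMMITTED;

    if (committed < p->target_concurrency && p->waiting == 0 && p->parked > 0)
        return PWQR_UNPARK_ONE;

    return PWQR_DO_NOTHING;
}

int main(void)
{
    struct pwqr_pool p = { .running = 3, .waiting = 0, .parked = 2,
                           .target_concurrency = 4 };
    printf("action = %d\n", pwqr_rebalance(&p)); /* PWQR_UNPARK_ONE */
    return 0;
}
----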
Todos
~~~~~

When we're overcommitting for a "long" time, userspace should be notified
in some way that it should try to reduce its number of running threads.
Note that the Apple implementation (before Lion at least) has the same
issue. Though if you imagine someone who spawns a zillion jobs that call
very slow `msync()`s or blocking `read()`s over the network, and then all
of those go back to the running state, the overcommit is huge.

There are several ways to "fix" this:

in kernel (poll solution)::
    Let the file descriptor be pollable, and let it be readable (returning
    something like the amount of overcommit at read() time, for example) so
    that userland is notified that it should try to reduce the number of
    runnable threads.
+
It sounds very easy, but it has one major drawback: it means the pwqr fd
must somehow be registered into the event loop, and it's not very suitable
for a pthread_workqueue implementation. In other words, if you can plug
into the event loop because it's a custom one or one that provides thread
regulation, then it's fine; if you can't (glib, libdispatch, ...) then you
need a thread that will basically just poll() on this file descriptor,
which is really wasteful.
+
NOTE: this has been implemented now, but it still looks "expensive" to hook
into for some users. So if some alternative way to be signalled could
exist, it'd be really awesome.

in userspace::
    Userspace knows how many "running" threads there are; it's easy to
    track the number of registered threads, and parked/waiting threads are
    already accounted for. When "waiting" is zero, if "registered - parked"
    is high, userspace could choose to randomly try to park one thread.
+
Userspace can use a non-blocking read() to probe whether it's
overcommitting (see the probe sketch after this list). The probing runs in
one of three modes.
+
It's in NONE when userspace believes it's not necessary to probe (e.g. when
the number of running + waiting threads isn't that large, say less than
110% of the concurrency, or any kind of similar rule).
+
It's in SLOW mode otherwise. In SLOW mode each thread does a probe every 32
or 64 jobs to mitigate the cost of the syscall. If the probe returns '1'
then ask for down-committing and stay in SLOW mode; if it returns `EAGAIN`
all is fine; if it returns more than '1' ask for down-committing and go to
AGGRESSIVE.
+
When in AGGRESSIVE mode, threads check whether they must park more often
and in a more controlled fashion (every 32 or 64 jobs isn't nice because
jobs can be very long), for example based on some poor man's timer
(clock_gettime(MONOTONIC) sounds fine). State transitions work as for SLOW.
+
The issue I have with this is that it seems to add quite some code to the
fast path, hence I dislike it a lot.

my dream::
    To be able to define a new signal we could asynchronously send to the
    process. The signal handler would just set some global flag to '1'; the
    threads in turn would check for this flag in their job-consuming loop,
    and the first thread that sees it set to '1' xchg()s 0 for it, and goes
    to PARK mode if it got the '1'. It's fast and inexpensive.
+
Sadly, AFAICT, defining new signals isn't such a good idea. Another
possibility is to give an address for the flag at pwqr_create() time and
let the kernel directly write into userland. The problem is, I feel like
it's a very wrong interface somehow. I should ask some kernel hacker
whether that would really be frowned upon. If not, then that's the leanest
solution of all.
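As a rough illustration of the userspace probing described above, here is a
sketch of the check a worker could run every 32 or 64 jobs. It assumes the
pwqr file descriptor was created with `PWQR_FL_NONBLOCK` (see `pwqr_create`
below); the function name and the caller's policy are illustrative only.

[source,c]
----
#include <errno.h>
#include <unistd.h>

/*
 * Illustrative sketch only.  With a non-blocking pwqr fd, read() fails
 * with EAGAIN while the pool is fine, and otherwise fills the buffer with
 * the number of overcommitting threads (i.e. how many should park).
 */
int pwqr_overcommit_probe(int pwqr_fd)
{
    int overcommit = 0;
    ssize_t n = read(pwqr_fd, &overcommit, sizeof(overcommit));

    if (n < 0 && errno == EAGAIN)
        return 0;                  /* not overcommitting: keep running    */
    if (n == (ssize_t)sizeof(overcommit))
        return overcommit;         /* number of threads that should park  */
    return 0;                      /* other error: stay conservative      */
}
----

A caller could, for instance, try to park itself when this returns non-zero,
and switch between the SLOW and AGGRESSIVE probing modes depending on how
large the returned value is.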
pwqr_create
-----------

SYNOPSIS
~~~~~~~~

    int pwqr_create(int flags);

DESCRIPTION
~~~~~~~~~~~

This call returns a new PWQR file descriptor. The regulator is initialized
with a concurrency corresponding to the number of online CPUs at the time
of the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`.

`flags`::
    a mask of flags among `PWQR_FL_CLOEXEC` and `PWQR_FL_NONBLOCK`.

Available operations on the pwqr file descriptor are:

`poll`, `epoll` and friends::
    the PWQR file descriptor can be watched for POLLIN events (not POLLOUT
    ones, as it cannot be written to).

`read`::
    The file descriptor can be read from. The read blocks (or fails,
    setting `EAGAIN`, if in non-blocking mode) until the regulator believes
    the pool is overcommitting. The buffer passed to read should be able to
    hold an integer. When `read(2)` is successful, it writes the number of
    overcommitting threads (understand: the number of threads to park so
    that the pool isn't overcommitting anymore).

RETURN VALUE
~~~~~~~~~~~~

On success, this call returns a nonnegative file descriptor. On error, -1
is returned, and errno is set to indicate the error.

ERRORS
~~~~~~

[EINVAL]::
    Invalid value specified in flags.
[ENFILE]::
    The system limit on the total number of open files has been reached.
[ENOMEM]::
    There was insufficient memory to create the kernel object.

pwqr_ctl
--------

SYNOPSIS
~~~~~~~~

    int pwqr_ctl(int pwqrfd, int op, int val, void *addr);

DESCRIPTION
~~~~~~~~~~~

This system call performs control operations on the pwqr instance referred
to by the file descriptor `pwqrfd`.

Valid values for the `op` argument are:

`PWQR_CTL_GET_CONC`::
    Requests the current concurrency level for this regulator.

`PWQR_CTL_SET_CONC`::
    Modifies the current concurrency level for this regulator. The new
    value is passed as the `val` argument. The request returns the old
    concurrency level on success.
+
A zero or negative value for `val` means 'automatic' and is recomputed as
the current number of online CPUs, as `sysconf(_SC_NPROCESSORS_ONLN)` would
return.

`PWQR_CTL_REGISTER`::
    Registers the calling thread to be taken into account by the pool
    regulator. If the thread is already registered with another regulator,
    it is automatically unregistered from it.

`PWQR_CTL_UNREGISTER`::
    Deregisters the calling thread from the pool regulator.

`PWQR_CTL_WAKE`::
    Tries to wake `val` threads from the pool. This is done according to
    the current concurrency level so as not to overcommit. On success, a
    hint of the number of woken threads is returned; it can be 0.
+
This is only a hint of the number of threads woken up, for two reasons.
First, the kernel could really have woken up a thread, but when it becomes
scheduled, it could *then* decide that it would overcommit (because some
other thread unblocked in between, for example), and block it again.
+
But it can also lie in the other direction: userland is supposed to account
for waiting threads. So when we're overcommitting and userland wants a
waiting thread to be unblocked, we actually say we woke none, but still
unblock one (the famous quarantined threads we talked about above). This
allows the userland counter of waiting threads to decrease, but we know the
thread won't be usable, so we return 0.

`PWQR_CTL_WAKE_OC`::
    Tries to wake `val` threads from the pool. This is done bypassing the
    current concurrency level (`OC` stands for `OVERCOMMIT`). On success,
    the number of woken threads is returned; it can be 0, but it's the real
    count that has been (or will soon be) woken up. If it's less than
    requested, it's because there aren't enough parked threads.

`PWQR_CTL_WAIT`::
    Puts the thread to wait for a future `PWQR_CTL_WAKE` command.
    If this thread must be parked to maintain concurrency below the target,
    then the call blocks with no further ado.
+
If the concurrency level is below the target, then the kernel checks
whether the address `addr` still contains the value `val` (in the fashion
of `futex(2)`). If it doesn't, the call doesn't block. Else the calling
thread is blocked until a `PWQR_CTL_WAKE` command is received.
+
`addr` must of course be a pointer to an aligned integer which stores the
reference ticket in userland.

`PWQR_CTL_PARK`::
    Puts the thread in park mode. Parked threads are spare threads kept
    around to avoid cloning/exiting threads while the pool is regulated.
    They are released by the regulator only, and can only be woken from
    userland with the `PWQR_CTL_WAKE_OC` command, and only once all waiting
    threads have been woken.
+
The call blocks until an overcommitting wake requires the thread, or the
kernel regulator needs to grow the pool with new running threads.

RETURN VALUE
~~~~~~~~~~~~

When successful, `pwqr_ctl` returns a nonnegative value. On error, -1 is
returned, and errno is set to indicate the error.

ERRORS
~~~~~~

[EBADF]::
    `pwqrfd` is not a valid file descriptor.
[EBADFD]::
    `pwqrfd` is a valid pwqr file descriptor, but is in a broken state: it
    has been closed while other threads were in a pwqr_ctl call.
+
NOTE: this is due to the current implementation and would probably not
exist with a real syscall.
[EFAULT]::
    Error reading the value from `addr` in userspace.
[EINVAL]::
    TODO

Errors specific to `PWQR_CTL_REGISTER`:

[ENOMEM]::
    There was insufficient memory to perform the operation.

Errors specific to `PWQR_CTL_WAIT`:

[EWOULDBLOCK]::
    When the kernel evaluated whether `addr` still contained `val`, it
    didn't. This works like `futex(2)`.

Errors specific to `PWQR_CTL_WAIT` and `PWQR_CTL_PARK`:

[EINTR]::
    The call was interrupted by a signal (note that sometimes the kernel
    masks this fact when it has more important "errors" to report, like
    `EDQUOT`).
[EDQUOT]::
    The thread has been woken by a `PWQR_CTL_WAKE` or `PWQR_CTL_WAKE_OC`
    call, but is overcommitting.
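To tie the operations together, here is a rough sketch of how a worker
thread might use `PWQR_CTL_REGISTER`, `PWQR_CTL_WAIT` and `PWQR_CTL_PARK`,
including the futex-style ticket and the `EDQUOT` convention described
above. The `pwqr.h` wrapper, the job-queue helpers and the ticket scheme
are assumptions made for the example, not part of this document's
interface.

[source,c]
----
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#include "pwqr.h"   /* hypothetical: PWQR_CTL_* constants and pwqr_ctl() */

struct job;
extern struct job *job_queue_pop(void);     /* hypothetical job queue */
extern bool        job_queue_empty(void);
extern void        run_job(struct job *job);

/* Producers bump this ticket before issuing PWQR_CTL_WAKE, so that a wake
 * racing with our "queue is empty" check cannot be lost. */
static _Atomic int wait_ticket;

void worker_loop(int pwqrfd)
{
    pwqr_ctl(pwqrfd, PWQR_CTL_REGISTER, 0, NULL);

    for (;;) {
        struct job *job;

        while ((job = job_queue_pop()) != NULL)
            run_job(job);

        /* Queue looked empty: snapshot the ticket, re-check, then WAIT.
         * The kernel compares *addr with val, futex(2)-style, and fails
         * with EWOULDBLOCK if the ticket moved in between. */
        int ticket = atomic_load(&wait_ticket);
        if (!job_queue_empty())
            continue;

        if (pwqr_ctl(pwqrfd, PWQR_CTL_WAIT, ticket, &wait_ticket) < 0) {
            switch (errno) {
            case EWOULDBLOCK:   /* ticket moved: new work arrived       */
            case EINTR:         /* interrupted by a signal              */
                break;
            case EDQUOT:        /* woken while overcommitting: park, as
                                 * the quarantine description suggests  */
                pwqr_ctl(pwqrfd, PWQR_CTL_PARK, 0, NULL);
                break;
            }
        }
    }
}
----

On the producer side, the matching convention would be to increment
`wait_ticket` and then issue `PWQR_CTL_WAKE` (or `PWQR_CTL_WAKE_OC`) after
pushing work, so that a worker about to call `PWQR_CTL_WAIT` with a stale
ticket gets `EWOULDBLOCK` instead of sleeping through the wake.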