Pthread WorkQueue Regulator
===========================

The Pthread Workqueue Regulator is meant to help userland regulate thread
pools based on the number of threads that are actually running, the capacity
of the machine, the number of blocked threads, and so on.

kernel-land design
------------------

In the kernel, threads registered in the pwq regulator can be in 5 states:

blocked::
    This is the state of threads that are currently blocked in a syscall.

running::
    This is the state of threads that are either really running, or have
    been preempted out by the kernel. In other words it's the number of
    schedulable threads.

waiting::
    This is the state of threads that are currently in a `PWQR_WAIT` call
    from userspace (see `pwqr_ctl`) but that would not overcommit if
    released by a `PWQR_WAKE` call.

quarantined::
    This is the state of threads that are currently in a `PWQR_WAIT` call
    from userspace (see `pwqr_ctl`) but that would overcommit if released by
    a `PWQR_WAKE` call.
+
This state avoids waking a thread only to force userland to "park" it, which
would be racy and would make the scheduler work for nothing useful. If
`PWQR_WAKE` is called, though, quarantined threads are woken, but with errno
set to `EDQUOT`.

parked::
    This is the state of threads currently in a `PWQR_PARK` call from
    userspace (see `pwqr_ctl`).

The regulator tries to maintain the following invariant:

    running + waiting == target_concurrency
    || (running + waiting < target_concurrency && waiting > 0)

When `running + waiting` overcommits::
    The kernel puts waiting threads into quarantine, which doesn't require
    anything from userland. Userland discovers it only when it needs a
    waiting thread, which may never happen.
+
If there are no waiting threads, then the workqueue simply overcommits;
handling that case is one of the TODO items at the moment (see Notes).

When `running + waiting` undercommits::
    If waiting is non-zero then nothing needs to be done: it means that
    userland doesn't currently need work to be performed.
+
If waiting is zero, then a parked thread (if there is one) is woken up so that
userland has a chance to consume jobs.
+
Unparking threads only when waiting becomes zero avoids flip-flops when the
job flow is small and some of the running threads sometimes block (IOW
running sometimes decreases, making `running + waiting` fall below the target
concurrency for a very short amount of time).

The regulation between running and waiting threads is left to userspace,
which is a far better judge than kernel land, which has absolutely no
knowledge of the current workload. Also, doing so means that when there are
lots of jobs to process and the pool has a size that doesn't require more
regulation, the kernel isn't called for mediation/regulation AT ALL.

NOTE: right now threads are unparked as soon as `running + waiting`
undercommits; some delay should be applied to be sure it's not a really short
blocking syscall that made us undercommit.

NOTE: when we're overcommitting for a "long" time, userspace should be
notified in some way that it should try to reduce its number of running
threads. Note that the Apple implementation (before Lion at least) has the
same issue. Imagine someone who spawns a zillion jobs that call very slow
`msync()` calls or blocking `read()` calls over the network: when all of those
go back to the running state, the overcommit is huge. A way to mitigate this
at the moment is that when userspace believes the number of threads is
abnormally high, it should periodically try to PARK the threads. If that
blocks the thread, then we were overcommitting. Note that this may be a
better solution than a kernel-side implementation. To be thought over.

pwqr_create
-----------

SYNOPSIS
~~~~~~~~

    int pwqr_create(int flags);

DESCRIPTION
~~~~~~~~~~~

This call returns a new PWQR file-descriptor. The regulator is initialized
with a concurrency corresponding to the number of online CPUs at the time of
the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`.
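
For reference, the initial concurrency described above can be reproduced in
userland with a plain `sysconf(3)` call. This is only a sketch:
`default_concurrency` is a hypothetical helper name, not part of pwqr, and
the fallback to 1 when `sysconf` fails is an assumption.

```c
#include <unistd.h>

/*
 * Hypothetical helper (not part of pwqr): the value a freshly created
 * regulator would start with, according to the description above.
 */
static long default_concurrency(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);

    /* sysconf() returns -1 on error; falling back to 1 is an assumption. */
    return n > 0 ? n : 1;
}
```
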
`flags`::
    a mask of flags, currently only `O_CLOEXEC`.

RETURN VALUE
~~~~~~~~~~~~

On success, this call returns a nonnegative file descriptor.
On error, -1 is returned, and errno is set to indicate the error.

ERRORS
~~~~~~

[EINVAL]::
    Invalid value specified in `flags`.

[ENFILE]::
    The system limit on the total number of open files has been reached.

[ENOMEM]::
    There was insufficient memory to create the kernel object.

pwqr_ctl
--------

SYNOPSIS
~~~~~~~~

    int pwqr_ctl(int pwqrfd, int op, int val, void *addr);

DESCRIPTION
~~~~~~~~~~~

This system call performs control operations on the pwqr instance referred to
by the file descriptor `pwqrfd`.

Valid values for the `op` argument are:

`PWQR_GET_CONC`::
    Requests the current concurrency level for this regulator.

`PWQR_SET_CONC`::
    Modifies the current concurrency level for this regulator. The new value
    is passed as the `val` argument. The request returns the old concurrency
    level on success.
+
A zero or negative value for `val` means 'automatic': the level is recomputed
as the current number of online CPUs, as `sysconf(_SC_NPROCESSORS_ONLN)` would
return.

`PWQR_REGISTER`::
    Registers the calling thread to be taken into account by the pool
    regulator. If the thread is already registered with another regulator,
    then it's automatically unregistered from it.

`PWQR_UNREGISTER`::
    Deregisters the calling thread from the pool regulator.

`PWQR_WAKE`::
    Tries to wake `val` threads from the pool. This is done according to the
    current concurrency level, so as not to overcommit. On success, the
    number of woken threads is returned; it can be 0.

`PWQR_WAKE_OC`::
    Tries to wake `val` threads from the pool. This is done bypassing the
    current concurrency level (`OC` stands for `OVERCOMMIT`). On success, the
    number of woken threads is returned; it can be 0.

`PWQR_WAIT`::
    Puts the thread to wait for a future `PWQR_WAKE` command. If this thread
    must be parked to maintain concurrency below the target, then the call
    blocks with no further ado.
+
If the concurrency level is below the target, then the kernel checks whether
the address `addr` still contains the value `val` (in the fashion of
`futex(2)`). If it doesn't, then the call doesn't block. Otherwise the calling
thread is blocked until a `PWQR_WAKE` command is received.

`PWQR_PARK`::
    Puts the thread in park mode. Parked threads are spare threads kept to
    avoid cloning/exiting threads when the pool is regulated. They are
    released by the regulator only, and can only be woken from userland with
    the `PWQR_WAKE_OC` command, and only once all waiting threads have been
    woken.
+
The call blocks until an overcommitting wake requires the thread, or the
kernel regulator needs to grow the pool with new running threads.

RETURN VALUE
~~~~~~~~~~~~

When successful, `pwqr_ctl` returns a nonnegative value.
On error, -1 is returned, and errno is set to indicate the error.

ERRORS
~~~~~~

[EBADF]::
    `pwqrfd` is not a valid file descriptor.

[EBADFD]::
    `pwqrfd` is a valid pwqr file descriptor but is in a broken state: it has
    been closed while other threads were in a `pwqr_ctl` call.
+
NOTE: this is due to the current implementation and would probably not be here
with a real syscall.

[EFAULT]::
    An error occurred while reading the value at `addr` from userspace.

[EINVAL]::
    TODO

Errors specific to `PWQR_REGISTER`:

[ENOMEM]::
    There was insufficient memory to perform the operation.

Errors specific to `PWQR_WAIT`:

[EWOULDBLOCK]::
    When the kernel evaluated whether `addr` still contained `val`, it
    didn't. This works like `futex(2)`.

Errors specific to `PWQR_WAIT` and `PWQR_PARK`:

[EINTR]::
    The call was interrupted by a signal (note that sometimes the kernel
    masks this fact when it has more important "errors" to report, like
    `EDQUOT`).

[EDQUOT]::
    The thread has been woken by a `PWQR_WAKE` or `PWQR_WAKE_OC` call, but is
    overcommitting.
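
The errors listed for `PWQR_WAIT` suggest a small decision table for a worker
thread once the call returns. The sketch below shows one possible mapping; the
`worker_action` enum and `after_pwqr_wait` are hypothetical names that are not
part of pwqr, and parking on `EDQUOT` is an assumed policy, since the text only
says that the thread is overcommitting at that point.

```c
#include <errno.h>

/* Hypothetical follow-up actions for a worker thread; not part of pwqr. */
enum worker_action {
    WORKER_LOOK_FOR_JOBS, /* woken normally, or addr no longer held val */
    WORKER_RETRY,         /* interrupted: issue the same call again */
    WORKER_PARK,          /* woken while overcommitting: go to PWQR_PARK */
    WORKER_GIVE_UP        /* any other error: stop using this regulator */
};

/*
 * Map the result of a PWQR_WAIT call to the worker's next step.
 * ret is the return value of pwqr_ctl(), err is the errno value saved
 * immediately after the call.
 */
static enum worker_action after_pwqr_wait(int ret, int err)
{
    if (ret >= 0)
        return WORKER_LOOK_FOR_JOBS;

    switch (err) {
    case EWOULDBLOCK:
        /* Same idea as futex(2): the value changed, so recheck the queue. */
        return WORKER_LOOK_FOR_JOBS;
    case EINTR:
        return WORKER_RETRY;
    case EDQUOT:
        /* Assumed policy: an overcommitting thread parks itself. */
        return WORKER_PARK;
    default:
        return WORKER_GIVE_UP;
    }
}
```

A worker would save `errno` right after
`pwqr_ctl(pwqrfd, PWQR_WAIT, val, addr)` returns and pass both values to
`after_pwqr_wait`. Keeping the mapping in a pure function makes the policy
easy to test without a regulator file descriptor.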