running::
        schedulable threads.

waiting::
        This is the state of threads that are currently in a `PWQR_CTL_WAIT`
        call from userspace (see `pwqr_ctl`) but that would not overcommit if
        released by a `PWQR_CTL_WAKE` call.

quarantined::
        This is the state of threads that are currently in a `PWQR_CTL_WAIT`
        call from userspace (see `pwqr_ctl`) but that would overcommit if
        released by a `PWQR_CTL_WAKE` call.
+
This state avoids waking a thread only to force userland to "park" it: that
would be racy and would make the scheduler work for nothing useful. If
`PWQR_CTL_WAKE` is called anyway, quarantined threads are woken, but with
`EDQUOT` set in errno, and only one by one, no matter how many wakes were
requested.
+
This state actually has only one visible impact: when `PWQR_CTL_WAKE` is
called for more than one thread, say 4, and userland knows of 5 threads in
the WAIT state while 3 of them are actually quarantined, only 2 will be woken
up and the `PWQR_CTL_WAKE` call will return 2. Each subsequent `PWQR_CTL_WAKE`
call wakes one quarantined thread so that it can be parked, but returns 0 to
hide that from userland.

parked::
        This is the state of threads currently in a `PWQR_CTL_PARK` call from
        userspace (see `pwqr_ctl`).

Unparking threads only when `waiting` becomes zero avoids flip-flops when the
job flow is small and some of the running threads block now and then (IOW
`running` sometimes decreases, making `running + waiting` drop below the
target concurrency for very short amounts of time).
+
NOTE: unparking only happens after a delay (0.1s in the current
implementation) during which `waiting` must have remained zero and the
overall load must have been undercommitting resources for the whole period.

The regulation between running and waiting threads is left to userspace,
which is a far better judge than the kernel, which has absolutely no
knowledge of the current workload. Also, doing so means that when there are
lots of jobs to process and the pool already has a size that doesn't require
more regulation, the kernel isn't called for mediation/regulation AT ALL.
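
To make this split of work concrete, here is a minimal sketch of how such a
userland runtime could tie its job queue to the regulator. The `pwqr.h`
header, the exact `pwqr_ctl()` wrapper prototype and the `job_*()` helpers
are assumptions of this example, not something the excerpt above defines.

[source,c]
----
/*
 * Hypothetical glue code: workers call PWQR_CTL_WAIT when they run out of
 * work, producers bump a ticket and call PWQR_CTL_WAKE so that the kernel
 * only wakes a thread when that would not overcommit.
 */
#include <errno.h>
#include <stdatomic.h>

#include "pwqr.h"                 /* assumed: pwqr_ctl() and PWQR_CTL_* constants */

struct job;
extern struct job *job_pop(void); /* assumed: returns NULL when the queue is empty */
extern void job_push(struct job *job);
extern void job_run(struct job *job);

static atomic_int pool_ticket;    /* the addr/val pair handed to PWQR_CTL_WAIT */

static void worker_loop(int pwqrfd)
{
    for (;;) {
        /* Snapshot the ticket before looking at the queue, so that a job
         * pushed after a failed pop invalidates the WAIT below. */
        int ticket = atomic_load(&pool_ticket);
        struct job *job = job_pop();

        if (job) {                /* the thread is "running" here */
            job_run(job);
            continue;
        }

        /* Become "waiting" (or "quarantined" when waking us would overcommit)
         * until a PWQR_CTL_WAKE arrives or the ticket has already moved
         * (EWOULDBLOCK).  EINTR/EDQUOT handling is runtime policy; this
         * sketch just loops and looks for work again. */
        pwqr_ctl(pwqrfd, PWQR_CTL_WAIT, ticket, &pool_ticket);
    }
}

static void job_submit(int pwqrfd, struct job *job)
{
    job_push(job);
    atomic_fetch_add(&pool_ticket, 1);         /* invalidate concurrent WAITs */
    pwqr_ctl(pwqrfd, PWQR_CTL_WAKE, 1, NULL);  /* wake at most one thread */
}
----

The ticket plays the role of the `addr`/`val` pair checked by `PWQR_CTL_WAIT`
(described below), so a wake-up cannot be lost between a failed pop and the
call into the kernel.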

Todos
~~~~~
When we're overcommitting for a "long" time, userspace should be notified in
some way that it should try to reduce its number of running threads. Note
that the Apple implementation (before Lion at least) has the same issue. If
you imagine someone who spawns a zillion jobs that call very slow `msync()`s
or blocking `read()`s over the network, then when all of those go back to the
running state, the overcommit is huge.

There are several ways to "fix" this:

in kernel (poll solution)::
        Let the file descriptor be pollable, and let it be readable
        (returning something like the amount of overcommit at read() time,
        for example) so that userland is notified that it should try to
        reduce the number of runnable threads.
+
It sounds very easy, but it has one major drawback: it means the pwqr file
descriptor must somehow be registered into the event loop, which is not very
suitable for a pthread_workqueue implementation. In other words, if you can
plug into the event loop because it's a custom one or one that provides
thread regulation, then it's fine; if you can't (glib, libdispatch, ...) then
you need a thread that basically just poll()s on this file descriptor, which
is really wasteful.
+
NOTE: this has been implemented now, but it still looks "expensive" to hook
into for some users. So if some alternative way to be signalled could exist,
it'd be really awesome.

in userspace::
        Userspace knows how many "running" threads there are, it's easy to
        track the number of registered threads, and parked/waiting threads
        are already accounted for. When "waiting" is zero and "registered -
        parked" is high, userspace could choose to randomly try to park one
        thread.
+
Userspace can use a non-blocking read() on the pwqr file descriptor to probe
whether it's overcommitting; the probing logic is a small state machine with
three modes (NONE, SLOW, AGGRESSIVE).
+
It is in NONE mode when userspace believes probing isn't necessary (e.g. when
the number of running + waiting threads isn't that large, say less than 110%
of the concurrency, or any similar rule).
+
Otherwise it is in SLOW mode. In SLOW mode each thread does a probe every 32
or 64 jobs to amortize the cost of the syscall. If the probe returns '1',
ask for down-committing and stay in SLOW mode; if it fails with `EAGAIN`, all
is fine; if it returns more than '1', ask for down-committing and go to
AGGRESSIVE.
+
In AGGRESSIVE mode threads check whether they must park more often and in a
more controlled fashion (every 32 or 64 jobs isn't nice because jobs can be
very long), for example based on some poor man's timer
(clock_gettime(CLOCK_MONOTONIC) sounds fine). State transitions work as for
SLOW.
+
The issue I have with this is that it adds quite some code to the fast path,
hence I dislike it a lot. A sketch of such a probing loop is given after this
list.

my dream::
        To be able to define a new signal we could asynchronously send to the
        process. The signal handler would just set some global flag to '1';
        the threads in turn would check this flag in their job-consuming
        loop, and the first thread that sees it at '1' xchg()s 0 for it, and
        goes to PARK mode if it got the '1'. It's fast and inexpensive.
+
Sadly, AFAICT, defining new signals isn't such a good idea. Another
possibility is to give an address for the flag at pwqr_create() time and let
the kernel write directly into userland. The problem is, I feel like it's a
very wrong interface somehow. I should ask some kernel hacker whether that
would really be frowned upon. If not, then that's the leanest solution of
all.
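
As an illustration of the "in userspace" alternative, here is a hedged sketch
of that NONE/SLOW/AGGRESSIVE probing loop, assuming the pwqr descriptor was
created with `PWQR_FL_NONBLOCK`. The thresholds, the per-thread job counter
and the `park_one_thread()` helper are inventions of this example.

[source,c]
----
#include <errno.h>
#include <time.h>
#include <unistd.h>

enum probe_mode { PROBE_NONE, PROBE_SLOW, PROBE_AGGRESSIVE };

extern void park_one_thread(void);  /* assumed: makes one thread call PWQR_CTL_PARK */

/* Per-thread in a real runtime; NONE<->SLOW switching on the "110% of the
 * concurrency" rule is left to the caller. */
static enum probe_mode mode = PROBE_NONE;
static struct timespec last_probe;

/* Non-blocking probe: > 0 is the number of threads to park, 0 means EAGAIN,
 * i.e. the regulator doesn't think we are overcommitting. */
static int probe_overcommit(int pwqrfd)
{
    int count = 0;

    if (read(pwqrfd, &count, sizeof(count)) != sizeof(count))
        return 0;                       /* EAGAIN or short read: all is fine */
    return count;
}

static void maybe_probe(int pwqrfd, unsigned long jobs_done)
{
    if (mode == PROBE_NONE)
        return;

    if (mode == PROBE_SLOW) {
        if (jobs_done % 64 != 0)        /* amortize the syscall cost */
            return;
    } else {                            /* PROBE_AGGRESSIVE: poor man's timer */
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec == last_probe.tv_sec &&
            now.tv_nsec - last_probe.tv_nsec < 10 * 1000 * 1000)
            return;                     /* at most one probe every ~10ms */
        last_probe = now;
    }

    int n = probe_overcommit(pwqrfd);

    if (n > 1)
        mode = PROBE_AGGRESSIVE;        /* badly overcommitting: probe harder */
    else if (n == 0 && mode == PROBE_AGGRESSIVE)
        mode = PROBE_SLOW;              /* pressure released: back off */
    if (n > 0)
        park_one_thread();              /* ask for down-committing */
}
----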
The signal handler would just put some global flag to '1', + the threads in turn would check for this flag in their job consuming + loop, and the first thread that sees it to '1', xchg()s 0 for it, and + goes to PARK mode if it got the '1'. It's fast, inexpensive. ++ +Sadly AFAICT defining new signals() isn't such a good idea. Another +possibility is to give an address for the flag at pwqr_create() time and let +the kernel directly write into userland. The problem is, I feel like it's a +very wrong interface somehow. I should ask some kernel hacker to know if that +would be really frowned upon. If not, then that's the leanest solution of all. pwqr_create ----------- @@ -98,7 +161,21 @@ with a concurrency corresponding to the number of online CPUs at the time of the call, as would be returned by `sysconf(_SC_NPROCESSORS_ONLN)`. `flags`:: - a mask of flags, currently only O_CLOEXEC. + a mask of flags among `PWQR_FL_CLOEXEC`, and `PWQR_FL_NONBLOCK`. + +Available operations on the pwqr file descriptor are: + +`poll`, `epoll` and friends:: + the PWQR file descriptor can be watched for POLLIN events (not POLLOUT + ones as it can not be written to). + +`read`:: + The file returned can be read upon. The read blocks (or fails setting + `EAGAIN` if in non blocking mode) until the regulator believes the + pool is overcommitting. The buffer passed to read should be able to + hold an integer. When `read(3)` is successful, it writes the amount of + overcommiting threads (understand: the number of threads to park so + that the pool isn't overcommiting anymore). RETURN VALUE ~~~~~~~~~~~~ @@ -131,51 +208,67 @@ by the file descriptor `pwqrfd`. Valid values for the `op` argument are: -`PWQR_GET_CONC`:: +`PWQR_CTL_GET_CONC`:: Requests the current concurrency level for this regulator. -`PWQR_SET_CONC`:: +`PWQR_CTL_SET_CONC`:: Modifies the current concurrency level for this regulator. The new value is passed as the `val` argument. The requests returns the old concurrency level on success. + - A zero or negative value for `val` means 'automatic' and is recomputed - as the current number of online CPUs as - `sysconf(_SC_NPROCESSORS_ONLN)` would return. +A zero or negative value for `val` means 'automatic' and is recomputed as the +current number of online CPUs as `sysconf(_SC_NPROCESSORS_ONLN)` would return. -`PWQR_REGISTER`:: +`PWQR_CTL_REGISTER`:: Registers the calling thread to be taken into account by the pool regulator. If the thread is already registered into another regulator, then it's automatically unregistered from it. -`PWQR_UNREGISTER`:: +`PWQR_CTL_UNREGISTER`:: Deregisters the calling thread from the pool regulator. -`PWQR_WAKE`:: +`PWQR_CTL_WAKE`:: Tries to wake `val` threads from the pool. This is done according to - the current concurrency level not to overcommit. On success, the - number of woken threads is returned, it can be 0. - -`PWQR_WAKE_OC`:: + the current concurrency level not to overcommit. On success, a hint of + the number of woken threads is returned, it can be 0. ++ +This is only a hint of the number of threads woken up for two reasons. First, +the kernel could really have woken up a thread, but when it becomes scheduled, +it could *then* decide that it would overcommit (because some other thread +unblocked inbetween for example), and block it again. ++ +But it can also lie in the other direction: userland is supposed to account +for waiting threads. 

RETURN VALUE
~~~~~~~~~~~~
On success, `pwqr_create()` returns the new file descriptor. On error, -1 is
returned and `errno` is set appropriately.

pwqr_ctl
--------
This call acts on the pool regulator referenced by the file descriptor
`pwqrfd`. Valid values for the `op` argument are:

`PWQR_CTL_GET_CONC`::
        Requests the current concurrency level for this regulator.

`PWQR_CTL_SET_CONC`::
        Modifies the current concurrency level for this regulator. The new
        value is passed as the `val` argument. The request returns the old
        concurrency level on success.
+
A zero or negative value for `val` means 'automatic': the concurrency is
recomputed as the current number of online CPUs, as
`sysconf(_SC_NPROCESSORS_ONLN)` would return it.

`PWQR_CTL_REGISTER`::
        Registers the calling thread to be taken into account by the pool
        regulator. If the thread is already registered into another
        regulator, it is automatically unregistered from it.

`PWQR_CTL_UNREGISTER`::
        Deregisters the calling thread from the pool regulator.

`PWQR_CTL_WAKE`::
        Tries to wake `val` threads from the pool. This is done according to
        the current concurrency level, so as not to overcommit. On success, a
        hint of the number of woken threads is returned; it can be 0.
+
This is only a hint of the number of threads woken up, for two reasons.
First, the kernel could really have woken up a thread, but when that thread
becomes scheduled it could *then* decide that it would overcommit (because
some other thread unblocked in between, for example) and block it again.
+
But it can also lie in the other direction: userland is supposed to account
for waiting threads. So when we're overcommitting and userland wants a
waiting thread to be unblocked, we actually report that we woke none, but
still unblock one (the famous quarantined threads discussed above). This
allows the userland counter of waiting threads to decrease, but since we know
the thread won't be usable, we return 0.

`PWQR_CTL_WAKE_OC`::
        Tries to wake `val` threads from the pool. This is done bypassing the
        current concurrency level (`OC` stands for `OVERCOMMIT`). On success,
        the number of woken threads is returned; it can be 0, but it is the
        real count that has been (or will soon be) woken up. If it's less
        than requested, it's because there aren't enough parked threads.

`PWQR_CTL_WAIT`::
        Puts the thread to wait for a future `PWQR_CTL_WAKE` command. If this
        thread must be parked to keep the concurrency below the target, then
        the call blocks with no further ado.
+
If the concurrency level is below the target, then the kernel checks whether
the address `addr` still contains the value `val` (in the fashion of
`futex(2)`). If it doesn't, the call doesn't block; otherwise the calling
thread is blocked until a `PWQR_CTL_WAKE` command is received.
+
`addr` must of course be a pointer to an aligned integer which stores the
reference ticket in userland.

`PWQR_CTL_PARK`::
        Puts the thread in park mode. Parked threads are spare threads kept
        around to avoid cloning/exiting threads while the pool is regulated.
        They are released by the regulator only, and can only be woken from
        userland with the `PWQR_CTL_WAKE_OC` command, and only once all
        waiting threads have been woken.
+
The call blocks until an overcommitting wake requires the thread, or the
kernel unparks it because the pool has been undercommitting (see the
regulation notes above).

ERRORS
~~~~~~

[EINVAL]::
        TODO

Errors specific to `PWQR_CTL_REGISTER`:

[ENOMEM]::
        There was insufficient memory to perform the operation.

Errors specific to `PWQR_CTL_WAIT`:

[EWOULDBLOCK]::
        When the kernel evaluated whether `addr` still contained `val`, it
        didn't. This works like `futex(2)`.

Errors specific to `PWQR_CTL_WAIT` and `PWQR_CTL_PARK`:

[EINTR]::
        The call was interrupted by a signal (note that the kernel sometimes
        masks this fact when it has more important "errors" to report, like
        `EDQUOT`).

[EDQUOT]::
        The thread has been woken by a `PWQR_CTL_WAKE` or `PWQR_CTL_WAKE_OC`
        call, but is overcommitting.
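
To make this error contract concrete, here is a hedged sketch of how a
registered worker might react to these errors around `PWQR_CTL_WAIT` and
`PWQR_CTL_PARK`; the `pwqr.h` header, the exact `pwqr_ctl()` wrapper and the
policy choices spelled out in the comments are assumptions of this example.

[source,c]
----
#include <errno.h>

#include "pwqr.h"        /* assumed: pwqr_ctl() wrapper and PWQR_CTL_* constants */

enum wait_outcome { OUTCOME_RUN, OUTCOME_PARK, OUTCOME_RETRY };

/* Block as "waiting" and translate the possible errors into what the worker
 * should do next. */
static enum wait_outcome wait_for_wake(int pwqrfd, int ticket, int *ticket_addr)
{
    if (pwqr_ctl(pwqrfd, PWQR_CTL_WAIT, ticket, ticket_addr) == 0)
        return OUTCOME_RUN;     /* woken by PWQR_CTL_WAKE: go look for a job */

    switch (errno) {
    case EWOULDBLOCK:
        return OUTCOME_RUN;     /* *ticket_addr already changed: work exists */
    case EDQUOT:
        return OUTCOME_PARK;    /* woken while overcommitting: become a spare */
    case EINTR:
    default:
        return OUTCOME_RETRY;   /* interrupted: re-check the queue and retry */
    }
}

/* Stay parked until an overcommitting wake (PWQR_CTL_WAKE_OC) needs us. */
static void park_until_needed(int pwqrfd)
{
    for (;;) {
        if (pwqr_ctl(pwqrfd, PWQR_CTL_PARK, 0, NULL) == 0)
            return;             /* released: go back to the job loop */
        if (errno == EINTR || errno == EDQUOT)
            continue;           /* interrupted, or woken while the pool is
                                 * still overcommitting: this sketch simply
                                 * parks again (exiting the thread would be
                                 * another valid policy) */
        return;                 /* unexpected error: let the caller decide */
    }
}
----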