X-Git-Url: http://git.madism.org/?p=~madcoder%2Fpwqr.git;a=blobdiff_plain;f=Documentation%2Fpwqr.adoc;h=067b2aec896d2a75cee235562ea9776508b01306;hp=57a85d4a5c9c31eccdba0373a7d990aa89c5a177;hb=b29ccbcc4ab80bc7e7b2131349459cc9e96d49ef;hpb=0fb6dd1141fb6076b2753133c797f0325c55a0df diff --git a/Documentation/pwqr.adoc b/Documentation/pwqr.adoc index 57a85d4..067b2ae 100644 --- a/Documentation/pwqr.adoc +++ b/Documentation/pwqr.adoc @@ -31,6 +31,13 @@ quarantined:: This state avoids waking a thread to force userland to "park" the thread, this is racy, make the scheduler work for nothing useful. Though if `PWQR_WAKE` is called, quarantined threads are woken but with a `EDQUOT` errno set. ++ +This state actually has only one impact: when `PWQR_WAKE` is called for more +than one threads, for example 4, and that userland knows that there is 5 +threads in WAIT state, but that actually 3 of them are in the quarantine, only +2 will be woken up, and the `PWQR_WAKE` call will return 2. Any subsequent +`PWQR_WAKE` call will wake up one quarantined thread to let it be parked, but +returning 0 each time to hide that from userland. parked:: This is the state of threads currently in a `PWQR_PARK` call from @@ -61,6 +68,10 @@ Unparking threads only when waiting becomes zero avoid flip-flops when the job flow is small, and that some of the running threads sometimes blocks (IOW running sometimes decreases, making `running + waiting` be below target concurrency for very small amount of time). ++ +NOTE: unparking only happens after a delay (0.1s in the current +implementation) during which `waiting` must have been remained zero and the +overal load to be under commiting resources for the whole period. The regulation between running and waiting threads is left to userspace that is a way better judge than kernel land that has absolutely no knowledge about @@ -68,21 +79,83 @@ the current workload. Also, doing so means that when there are lots of jobs to process and that the pool has a size that doesn't require more regulation, kernel isn't called for mediation/regulation AT ALL. -NOTE: right now threads are unparked as soon as `running + waiting` -undercommit, and some delay should be applied to be sure it's not a really -short blocking syscall that made us undercommit. -NOTE: when we're overcommiting for a "long" time, userspace should be notified -in some way it should try to reduce its amount of running threads. Note that -the Apple implementation (before Lion at least) has the same issue. Though if -you imagine someone that spawns a zillion jobs that call very slow `msync()s` -or blocking `read()s` over the network, then that all those go back to running +Todos +~~~~~ +When we're overcommiting for a "long" time, userspace should be notified in +some way it should try to reduce its amount of running threads. Note that the +Apple implementation (before Lion at least) has the same issue. Though if you +imagine someone that spawns a zillion jobs that call very slow `msync()s` or +blocking `read()s` over the network, then that all those go back to running state, the overcommit is huge. -A way to mitigate this atm is that when userspace belives the amount of -threads is abnormally high it should periodically try to PARK the threads. If -that blocks the thread, then it's that we were overcommiting. Note that it may -be the best solution rather than a kernel-side implementation. To be thought -over. + +There are several ways to "fix" this: + +in kernel (poll solution):: + Let the file descriptor be pollable, and let it be readable (returning + something like the amount of overcommit at read() time for example) so + that userland is notified that it should try to reduce the amount of + runnable threads. ++ +It sounds very easy, but it has one major drawback: it meaks the pwqfd must be +somehow registered into the eventloop, and it's not very suitable for a +pthread_workqueue implementation. + +in kernel (hack-ish solution):: + The kernel could voluntarily unpark/unblock a thread with another + errno that would signal overcommiting. Unlike the pollable proposal, + this doesn't require hooking in the event loop. Though it requires + having one such thread, which may not be the case when userland has + reached the peak number of threads it would ever want to use. ++ +Is this really a problem? I'm not sure. Especially since when that happens +userland could pick a victim thread that would call `PWQR_PARK` after each +processed job, which would allow some kind of poor man's poll. ++ +The drawback I see in that solution is that we wake up YET ANOTHER thread at a +moment when we're already overcommiting, which sounds counter productive. +That's why I didn't implement that. + +in userspace:: + Userspace knows how many "running" threads there are, it's easy to + track the amount of registered threads, and parked/waiting threads are + already accounted for. When "waiting" is zero, if "registerd - parked" + is "High" userspace could choose to randomly try to park one thread. ++ +I think `PWQR_PARK` could use `val` to have some "probing" mode, that would +return `0` if it wouldn't block and `-1/EWOULDBLOCK` if it would in the non +probing mode. Userspace could maintain some global probing_mode flag, that +would be a tristate: NONE, SLOW, AGGRESSVE. ++ +It's in NONE when userspace belives it's not necessary to probe (e.g. when the +amount of running + waiting threads isn't that large, say less than 110% of +the concurrency or any kind of similar rule). ++ +It's in SLOW mode else. In slow mode each thread does a probe every 32 or 64 +jobs to mitigate the cost of the syscall. If the probe returns EWOULDBLOCK +then the thread goes to PARK mode, and the probing_mode goes to AGGRESSVE. ++ +When AGGRESSVE threads check if they must park more often and in a more +controlled fashion (every 32 or 64 jobs isn't nice because jobs can be very +long), for example based on some poor man's timer (clock_gettime(MONOTONIC) +sounds fine). As soon as a probe returns 0 or we're in the NONE conditions, +then the probing_mode goes back to NONE/SLOW. ++ +The issue I have with this is that it sounds to add quite some code in the +fastpath code, hence I dislike it a lot. + +my dream:: + To be able to define a new signal we could asynchronously send to the + process. The signal handler would just put some global flag to '1', + the threads in turn would check for this flag in their job consuming + loop, and the first thread that sees it to '1', xchg()s 0 for it, and + goes to PARK mode if it got the '1'. It's fast, inexpensive. ++ +Sadly AFAICT defining new signals() isn't such a good idea. Another +possibility is to give an address for the flag at pwqr_create() time and let +the kernel directly write into userland. The problem is, I feel like it's a +very wrong interface somehow. I should ask some kernel hacker to know if that +would be really frowned upon. If not, then that's the leanest solution of all. pwqr_create -----------