This state avoids waking a thread to force userland to "park" the thread, this
is racy, make the scheduler work for nothing useful. Though if `PWQR_WAKE` is
called, quarantined threads are woken but with a `EDQUOT` errno set.
++
+This state actually has only one impact: when `PWQR_WAKE` is called for more
+than one threads, for example 4, and that userland knows that there is 5
+threads in WAIT state, but that actually 3 of them are in the quarantine, only
+2 will be woken up, and the `PWQR_WAKE` call will return 2. Any subsequent
+`PWQR_WAKE` call will wake up one quarantined thread to let it be parked, but
+returning 0 each time to hide that from userland.
parked::
This is the state of threads currently in a `PWQR_PARK` call from
flow is small, and that some of the running threads sometimes blocks (IOW
running sometimes decreases, making `running + waiting` be below target
concurrency for very small amount of time).
++
+NOTE: unparking only happens after a delay (0.1s in the current
+implementation) during which `waiting` must have been remained zero and the
+overal load to be under commiting resources for the whole period.
The regulation between running and waiting threads is left to userspace that
is a way better judge than kernel land that has absolutely no knowledge about
process and that the pool has a size that doesn't require more regulation,
kernel isn't called for mediation/regulation AT ALL.
-NOTE: right now threads are unparked as soon as `running + waiting`
-undercommit, and some delay should be applied to be sure it's not a really
-short blocking syscall that made us undercommit.
-NOTE: when we're overcommiting for a "long" time, userspace should be notified
-in some way it should try to reduce its amount of running threads. Note that
-the Apple implementation (before Lion at least) has the same issue. Though if
-you imagine someone that spawns a zillion jobs that call very slow `msync()s`
-or blocking `read()s` over the network, then that all those go back to running
+Todos
+~~~~~
+When we're overcommiting for a "long" time, userspace should be notified in
+some way it should try to reduce its amount of running threads. Note that the
+Apple implementation (before Lion at least) has the same issue. Though if you
+imagine someone that spawns a zillion jobs that call very slow `msync()s` or
+blocking `read()s` over the network, then that all those go back to running
state, the overcommit is huge.
-A way to mitigate this atm is that when userspace belives the amount of
-threads is abnormally high it should periodically try to PARK the threads. If
-that blocks the thread, then it's that we were overcommiting. Note that it may
-be the best solution rather than a kernel-side implementation. To be thought
-over.
+
+There are several ways to "fix" this:
+
+in kernel (poll solution)::
+ Let the file descriptor be pollable, and let it be readable (returning
+ something like the amount of overcommit at read() time for example) so
+ that userland is notified that it should try to reduce the amount of
+ runnable threads.
++
+It sounds very easy, but it has one major drawback: it meaks the pwqfd must be
+somehow registered into the eventloop, and it's not very suitable for a
+pthread_workqueue implementation.
+
+in kernel (hack-ish solution)::
+ The kernel could voluntarily unpark/unblock a thread with another
+ errno that would signal overcommiting. Unlike the pollable proposal,
+ this doesn't require hooking in the event loop. Though it requires
+ having one such thread, which may not be the case when userland has
+ reached the peak number of threads it would ever want to use.
++
+Is this really a problem? I'm not sure. Especially since when that happens
+userland could pick a victim thread that would call `PWQR_PARK` after each
+processed job, which would allow some kind of poor man's poll.
++
+The drawback I see in that solution is that we wake up YET ANOTHER thread at a
+moment when we're already overcommiting, which sounds counter productive.
+That's why I didn't implement that.
+
+in userspace::
+ Userspace knows how many "running" threads there are, it's easy to
+ track the amount of registered threads, and parked/waiting threads are
+ already accounted for. When "waiting" is zero, if "registerd - parked"
+ is "High" userspace could choose to randomly try to park one thread.
++
+I think `PWQR_PARK` could use `val` to have some "probing" mode, that would
+return `0` if it wouldn't block and `-1/EWOULDBLOCK` if it would in the non
+probing mode. Userspace could maintain some global probing_mode flag, that
+would be a tristate: NONE, SLOW, AGGRESSVE.
++
+It's in NONE when userspace belives it's not necessary to probe (e.g. when the
+amount of running + waiting threads isn't that large, say less than 110% of
+the concurrency or any kind of similar rule).
++
+It's in SLOW mode else. In slow mode each thread does a probe every 32 or 64
+jobs to mitigate the cost of the syscall. If the probe returns EWOULDBLOCK
+then the thread goes to PARK mode, and the probing_mode goes to AGGRESSVE.
++
+When AGGRESSVE threads check if they must park more often and in a more
+controlled fashion (every 32 or 64 jobs isn't nice because jobs can be very
+long), for example based on some poor man's timer (clock_gettime(MONOTONIC)
+sounds fine). As soon as a probe returns 0 or we're in the NONE conditions,
+then the probing_mode goes back to NONE/SLOW.
++
+The issue I have with this is that it sounds to add quite some code in the
+fastpath code, hence I dislike it a lot.
+
+my dream::
+ To be able to define a new signal we could asynchronously send to the
+ process. The signal handler would just put some global flag to '1',
+ the threads in turn would check for this flag in their job consuming
+ loop, and the first thread that sees it to '1', xchg()s 0 for it, and
+ goes to PARK mode if it got the '1'. It's fast, inexpensive.
++
+Sadly AFAICT defining new signals() isn't such a good idea. Another
+possibility is to give an address for the flag at pwqr_create() time and let
+the kernel directly write into userland. The problem is, I feel like it's a
+very wrong interface somehow. I should ask some kernel hacker to know if that
+would be really frowned upon. If not, then that's the leanest solution of all.
pwqr_create
-----------
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/timer.h>
#include <linux/uaccess.h>
#include <linux/wait.h>
#include "pwqr.h"
+#define PWQR_UNPARK_DELAY (HZ / 10)
#define PWQR_HASH_BITS 5
#define PWQR_HASH_SIZE (1 << PWQR_HASH_BITS)
struct pwqr_sb {
struct kref kref;
struct rcu_head rcu;
+ struct timer_list timer;
wait_queue_head_t wqh;
pid_t tgid;
sb->running += running_delta;
overcommit = sb->running + sb->waiting - sb->concurrency;
- if (overcommit == 0)
- return;
-
- if (overcommit > 0) {
+ if (overcommit == 0) {
+ /* do nothing */
+ } else if (overcommit > 0) {
if (overcommit > sb->waiting) {
sb->quarantined += sb->waiting;
sb->waiting = 0;
sb->waiting += sb->quarantined;
sb->quarantined = 0;
} else if (sb->waiting == 0 && sb->parked) {
- wake_up_locked(&sb->wqh);
+ if (!timer_pending(&sb->timer)) {
+ mod_timer(&sb->timer, jiffies +
+ PWQR_UNPARK_DELAY);
+ }
+ return;
}
}
+
+ if (timer_pending(&sb->timer))
+ del_timer(&sb->timer);
+}
+
+static void pwqr_sb_timer_cb(unsigned long arg)
+{
+ struct pwqr_sb *sb = (struct pwqr_sb *)arg;
+ unsigned long flags;
+
+ pwqr_sb_lock_irqsave(sb, flags);
+ if (sb->waiting == 0 && sb->parked && sb->running < sb->concurrency) {
+ if (sb->overcommit_wakes == 0)
+ wake_up_locked(&sb->wqh);
+ }
+ pwqr_sb_unlock_irqrestore(sb, flags);
}
static struct pwqr_sb *pwqr_sb_create(void)
init_waitqueue_head(&sb->wqh);
sb->tgid = current->tgid;
sb->concurrency = num_online_cpus();
+ init_timer(&sb->timer);
+ sb->timer.function = pwqr_sb_timer_cb;
+ sb->timer.data = (unsigned long)sb;
__module_get(THIS_MODULE);
return sb;
{
struct pwqr_sb *sb = container_of(kref, struct pwqr_sb, kref);
+ del_timer_sync(&sb->timer);
call_rcu(&sb->rcu, pwqr_sb_finalize);
}
static inline void pwqr_sb_put(struct pwqr_sb *sb)