From: Pierre Habouzit
Date: Sat, 14 Jan 2012 20:01:47 +0000 (+0100)
Subject: Implement the reluctance to unpark threads.
X-Git-Url: http://git.madism.org/?p=~madcoder%2Fpwqr.git;a=commitdiff_plain;h=b29ccbcc4ab80bc7e7b2131349459cc9e96d49ef;ds=sidebyside

Implement the reluctance to unpark threads.

This means that a pool needs to undercommit for 0.1s before we allow it
to grow its number of in-pool threads.

Document the last todo: how to shrink the pool when we're overcommitting.
Right now we can only hope that userland will put some threads into the
WAIT state, which is less than ideal. With the reluctance to start a new
thread, though, we hope that the overcommit will never grow out of
proportion for now.

Signed-off-by: Pierre Habouzit
---

diff --git a/Documentation/pwqr.adoc b/Documentation/pwqr.adoc
index 57a85d4..067b2ae 100644
--- a/Documentation/pwqr.adoc
+++ b/Documentation/pwqr.adoc
@@ -31,6 +31,13 @@ quarantined::
 	This state avoids waking a thread to force userland to "park" the
 	thread, this is racy, make the scheduler work for nothing useful.
 	Though if `PWQR_WAKE` is called, quarantined threads are woken but
 	with a `EDQUOT` errno set.
++
+This state actually has only one impact: when `PWQR_WAKE` is called for
+more than one thread, for example 4, and userland knows that there are 5
+threads in the WAIT state but 3 of them are actually in the quarantine,
+only 2 will be woken up, and the `PWQR_WAKE` call will return 2. Any
+subsequent `PWQR_WAKE` call will wake up one quarantined thread to let it
+be parked, but will return 0 each time to hide that from userland.
 
 parked::
 	This is the state of threads currently in a `PWQR_PARK` call from
@@ -61,6 +68,10 @@
 Unparking threads only when waiting becomes zero avoid flip-flops when the
 job flow is small, and that some of the running threads sometimes blocks
 (IOW running sometimes decreases, making `running + waiting` be below
 target concurrency for very small amount of time).
++
+NOTE: unparking only happens after a delay (0.1s in the current
+implementation) during which `waiting` must have remained zero and the
+overall load must have been undercommitting resources for the whole
+period.
 
 The regulation between running and waiting threads is left to userspace
 that is a way better judge than kernel land that has absolutely no
 knowledge about
@@ -68,21 +79,83 @@ the current workload.
 Also, doing so means that when there are lots of jobs to process and that
 the pool has a size that doesn't require more regulation, kernel isn't
 called for mediation/regulation AT ALL.
 
-NOTE: right now threads are unparked as soon as `running + waiting`
-undercommit, and some delay should be applied to be sure it's not a really
-short blocking syscall that made us undercommit.
-
-NOTE: when we're overcommiting for a "long" time, userspace should be notified
-in some way it should try to reduce its amount of running threads. Note that
-the Apple implementation (before Lion at least) has the same issue. Though if
-you imagine someone that spawns a zillion jobs that call very slow `msync()s`
-or blocking `read()s` over the network, then that all those go back to running
+Todos
+~~~~~
+When we're overcommitting for a "long" time, userspace should be notified
+in some way that it should try to reduce its number of running threads.
+Note that the Apple implementation (before Lion at least) has the same
+issue. Though if you imagine someone who spawns a zillion jobs that call
+very slow `msync()`s or blocking `read()`s over the network, and then all
+of those go back to the running
 state, the overcommit is huge.
-A way to mitigate this atm is that when userspace belives the amount of
-threads is abnormally high it should periodically try to PARK the threads. If
-that blocks the thread, then it's that we were overcommiting. Note that it may
-be the best solution rather than a kernel-side implementation. To be thought
-over.
+
+There are several ways to "fix" this:
+
+in kernel (poll solution)::
+	Let the file descriptor be pollable, and let it be readable
+	(returning something like the amount of overcommit at read() time,
+	for example) so that userland is notified that it should try to
+	reduce the amount of runnable threads.
++
+It sounds very easy, but it has one major drawback: it means the pwqr fd
+must somehow be registered into the event loop, and it's not very
+suitable for a pthread_workqueue implementation.
+
+in kernel (hack-ish solution)::
+	The kernel could voluntarily unpark/unblock a thread with another
+	errno that would signal overcommitting. Unlike the pollable
+	proposal, this doesn't require hooking into the event loop. Though
+	it requires having one such thread, which may not be the case when
+	userland has reached the peak number of threads it would ever want
+	to use.
++
+Is this really a problem? I'm not sure. Especially since when that happens
+userland could pick a victim thread that would call `PWQR_PARK` after each
+processed job, which would allow some kind of poor man's poll.
++
+The drawback I see in that solution is that we wake up YET ANOTHER thread
+at a moment when we're already overcommitting, which sounds
+counterproductive. That's why I didn't implement it.
+
+in userspace::
+	Userspace knows how many "running" threads there are, it's easy to
+	track the number of registered threads, and parked/waiting threads
+	are already accounted for. When "waiting" is zero, if
+	"registered - parked" is high, userspace could choose to randomly
+	try to park one thread.
++
+I think `PWQR_PARK` could use `val` to offer some "probing" mode that
+would return `0` if it wouldn't block and `-1/EWOULDBLOCK` if it would in
+the non-probing mode. Userspace could maintain some global probing_mode
+flag that would be a tristate: NONE, SLOW, AGGRESSIVE.
++
+It's in NONE when userspace believes it's not necessary to probe (e.g.
+when the amount of running + waiting threads isn't that large, say less
+than 110% of the concurrency, or any kind of similar rule).
++
+It's in SLOW mode otherwise. In slow mode each thread does a probe every
+32 or 64 jobs to mitigate the cost of the syscall. If the probe returns
+EWOULDBLOCK then the thread goes to PARK mode, and the probing_mode goes
+to AGGRESSIVE.
++
+When AGGRESSIVE, threads check whether they must park more often and in a
+more controlled fashion (every 32 or 64 jobs isn't nice because jobs can
+be very long), for example based on some poor man's timer
+(clock_gettime(MONOTONIC) sounds fine). As soon as a probe returns 0, or
+we're back in the NONE conditions, the probing_mode goes back to
+NONE/SLOW.
++
+The issue I have with this is that it seems to add quite some code to the
+fastpath, hence I dislike it a lot.
+
+my dream::
+	To be able to define a new signal we could asynchronously send to
+	the process. The signal handler would just set some global flag to
+	'1'; the threads in turn would check this flag in their
+	job-consuming loop, and the first thread that sees it at '1'
+	xchg()s 0 for it, and goes to PARK mode if it got the '1'. It's
+	fast and inexpensive.
++
+Sadly, AFAICT, defining new signals isn't such a good idea. Another
+possibility is to give an address for the flag at pwqr_create() time and
+let the kernel directly write into userland. The problem is, I feel like
+it's a very wrong interface somehow. I should ask some kernel hacker
+whether that would be really frowned upon. If not, then that's the
+leanest solution of all.
 
 pwqr_create
 -----------
diff --git a/kernel/pwqr.c b/kernel/pwqr.c
index 4d4c4cd..e4e3f58 100644
--- a/kernel/pwqr.c
+++ b/kernel/pwqr.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -38,6 +39,7 @@
 #include "pwqr.h"
 
+#define PWQR_UNPARK_DELAY	(HZ / 10)
 #define PWQR_HASH_BITS		5
 #define PWQR_HASH_SIZE		(1 << PWQR_HASH_BITS)
@@ -49,6 +51,7 @@ struct pwqr_task_bucket {
 struct pwqr_sb {
 	struct kref		kref;
 	struct rcu_head		rcu;
+	struct timer_list	timer;
 	wait_queue_head_t	wqh;
 	pid_t			tgid;
@@ -97,10 +100,9 @@ static inline void __pwqr_sb_update_state(struct pwqr_sb *sb, int running_delta)
 	sb->running += running_delta;
 	overcommit = sb->running + sb->waiting - sb->concurrency;
-	if (overcommit == 0)
-		return;
-
-	if (overcommit > 0) {
+	if (overcommit == 0) {
+		/* do nothing */
+	} else if (overcommit > 0) {
 		if (overcommit > sb->waiting) {
 			sb->quarantined += sb->waiting;
 			sb->waiting = 0;
@@ -118,9 +120,29 @@ static inline void __pwqr_sb_update_state(struct pwqr_sb *sb, int running_delta)
 			sb->waiting += sb->quarantined;
 			sb->quarantined = 0;
 		} else if (sb->waiting == 0 && sb->parked) {
-			wake_up_locked(&sb->wqh);
+			if (!timer_pending(&sb->timer)) {
+				mod_timer(&sb->timer, jiffies +
+					  PWQR_UNPARK_DELAY);
+			}
+			return;
 		}
 	}
+
+	if (timer_pending(&sb->timer))
+		del_timer(&sb->timer);
+}
+
+static void pwqr_sb_timer_cb(unsigned long arg)
+{
+	struct pwqr_sb *sb = (struct pwqr_sb *)arg;
+	unsigned long flags;
+
+	pwqr_sb_lock_irqsave(sb, flags);
+	if (sb->waiting == 0 && sb->parked && sb->running < sb->concurrency) {
+		if (sb->overcommit_wakes == 0)
+			wake_up_locked(&sb->wqh);
+	}
+	pwqr_sb_unlock_irqrestore(sb, flags);
 }
 
 static struct pwqr_sb *pwqr_sb_create(void)
@@ -135,6 +157,9 @@ static struct pwqr_sb *pwqr_sb_create(void)
 	init_waitqueue_head(&sb->wqh);
 	sb->tgid = current->tgid;
 	sb->concurrency = num_online_cpus();
+	init_timer(&sb->timer);
+	sb->timer.function = pwqr_sb_timer_cb;
+	sb->timer.data = (unsigned long)sb;
 
 	__module_get(THIS_MODULE);
 	return sb;
@@ -156,6 +181,7 @@ static void pwqr_sb_release(struct kref *kref)
 {
 	struct pwqr_sb *sb = container_of(kref, struct pwqr_sb, kref);
 
+	del_timer_sync(&sb->timer);
 	call_rcu(&sb->rcu, pwqr_sb_finalize);
 }
 
 static inline void pwqr_sb_put(struct pwqr_sb *sb)