David's Blog

What I learned by moving blocking/CPU-bound work off of the Tokio runtime

Here's a dump of some notes about what I learned today while trying to move some operations in a small Tokio/Axum web service (this blog) off of the Tokio runtime threads. I have two commits that move the database and markdown-rendering operations off of the Tokio threads.

While blocking the Tokio runtime threads is (in theory) not a good idea, in practice moving the database and markdown rendering calls into the blocking threadpool (via tokio::task::spawn_blocking) didn't work very well for me.
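To make that concrete, the markdown half of the change looked roughly like this (a sketch, not the literal commit; markdown_to_html stands in for the real library call):

```rust
use tokio::task;

// Roughly the shape of the change: move the synchronous conversion onto the
// blocking pool and await the JoinHandle from the handler.
async fn render_post(markdown: String) -> String {
    task::spawn_blocking(move || markdown_to_html(&markdown))
        .await
        .expect("blocking task panicked")
}

// Hypothetical stand-in for whatever markdown-to-HTML call the handler uses.
fn markdown_to_html(source: &str) -> String {
    source.to_owned()
}
```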

In my light load testing, I found that the previous approach handled higher request throughput than the newer one that put these queries and CPU-expensive tasks on other threads. To quickly summarize the load generation: I used wrk2 to send traffic at my server, with both wrk2 and the web server running on the same 6-core (12 hyperthread) machine. Without the two thread-offloading changes, my server handled 1000 rps with a mean latency of 3ms, a p90 of 4ms, and a max of 6ms; at 5000 rps it had a mean latency of 8ms, a p90 of 22ms, and a max of 58ms. However, once my two commits were in to move the db calls and markdown-to-HTML logic off of the Tokio runtime, the server started to fall apart at 1000 rps, sometimes not even completing that workload.

There's some fair warning out there that simply shifting CPU-bound work to the blocking threadpool might not work very well, because the blocking pool is meant for I/O-bound work. That makes sense to me: you don't want an unbounded queue of CPU-bound work, and I've debugged systems with that problem before. The "CPU-bound" work in this case is a call to a markdown library to convert a post's markdown into HTML, which takes anywhere from under a millisecond up to low single-digit milliseconds in release builds. I wouldn't expect such a small bit of CPU work to build up faster than the runtime's blocking threads could keep up with.

The surprising bit, though, is that delegating my DB calls to the blocking pool exhibited the same behavior: the server going unresponsive under relatively low traffic. It's complicated by the fact that I use a sqlite database, so nothing is going over a network (in fact, sqlite executes queries in-process). That means the db queries are a mix of IO (file operations) and CPU (executing queries), and this is on an especially small table. The query operations, when I instrumented them, were unsurprisingly in the sub-millisecond range (I was seeing ~80-200us, IIRC). I suppose that since the table is so small, the majority of the work here is again CPU, since the db is probably entirely in page cache.

Problem solving


The next step to try would be to take the advice of the Tokio docs and limit the number of threads doing this CPU work; their suggestion is to use a separate CPU threadpool. One alternative would be to assume that leaving tasks that take under a millisecond up to a few milliseconds on the runtime threads is fine, and to instead use tokio::task::spawn at the places I want concurrency. That seems a little silly, because I'd have to wrap the whole thing in an async block to make a future, but it'd potentially help for the two queries I can do concurrently. On the whole, I'm not actually sure whether the overhead of shifting these sqlite queries across threads outweighs just running the queries synchronously inline, but that's why I was experimenting. Another approach might be to keep using spawn_blocking but introduce some concurrency control and load shedding before scheduling the blocking tasks (sketched below). I ended up trying the rayon approach.
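For reference, the concurrency-control idea would look roughly like this (something I considered, not something I shipped): cap the number of in-flight blocking tasks with a tokio Semaphore and reject work instead of queueing it when the cap is hit. markdown_to_html is again a hypothetical stand-in.

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;
use tokio::task;

// Sketch: bounded blocking work with load shedding.
async fn render_with_limit(
    permits: Arc<Semaphore>,
    markdown: String,
) -> Result<String, &'static str> {
    // try_acquire errors immediately when all permits are taken; that's the
    // load-shedding part. acquire().await would queue the work instead.
    let _permit = permits.try_acquire().map_err(|_| "busy, try again later")?;
    task::spawn_blocking(move || markdown_to_html(&markdown))
        .await
        .map_err(|_| "render task panicked")
}

// Hypothetical stand-in for the real markdown-to-HTML call.
fn markdown_to_html(source: &str) -> String {
    source.to_owned()
}
```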

The Tokio docs recommend using rayon as a CPU-bound thread pool plus a tokio::sync::oneshot to communicate the result back to a Tokio task. Before gluing those bits together myself, I figured I'd see if someone had already done something similar, and found tokio-rayon. It provides the convenience function I needed to submit tasks to rayon's global executor (by default, sized to the number of logical cores in your system), and it actually has both LIFO and FIFO versions of spawn. The downside: this wasn't sufficient to stop my server from going non-responsive under load when offloading these particular calls to other threads.
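The pattern itself is small; roughly this shape (a sketch of the idea, not tokio-rayon's actual internals):

```rust
use tokio::sync::oneshot;

// The general shape: hand the CPU-heavy closure to rayon's global pool and
// await the result from async code via a oneshot channel.
async fn run_on_rayon<T, F>(work: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // If the receiver was dropped (say, the request got cancelled),
        // there's nobody to tell, so just discard the result.
        let _ = tx.send(work());
    });
    rx.await.expect("rayon task panicked before sending a result")
}

// Usage from a handler would look something like:
// let html = run_on_rayon(move || markdown_to_html(&post_body)).await;
```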

The next rabbit hole


I eventually picked at something in the logs that I'd noticed before but discarded too early, because I was fixated on limiting the number of active threads doing work. The log message was just Error(None), coming from within my application, and for whatever reason the logs weren't providing a module or file (or I missed it). I had already been experimenting with moving some things over from the log crate to the tracing crate; log messages can be converted to tracing Events just by changing which library the call sites use, with no other code changes required. This pointed me at the module and function at least, and that's when I started suspecting my database connection pool. A few edits to the log message I suspected confirmed that suspicion.
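For what it's worth, getting that module/file context out of tracing is mostly a subscriber-configuration thing; something along these lines (a sketch, not my exact setup):

```rust
// Sketch of a tracing-subscriber setup that attaches the target module,
// source file, and line number to each event.
fn init_tracing() {
    tracing_subscriber::fmt()
        .with_target(true)      // module path of the event
        .with_file(true)        // source file
        .with_line_number(true) // line number
        .init();
}
```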

It turns out that at least part of my problem wasn't related to the threadpool usage at all! Instead, my http handler was hitting errors acquiring db connections from the connection pool (a pool of sqlite "connections" managed by diesel::r2d2). The default connection pool size is 10 and the default timeout for acquiring a connection is 30s, so when the pool was overwhelmed by connection requests, a bunch of them would stack up and all fail after 30s. This partly explains why my changes made the server go unresponsive: I was potentially requesting more connections than before my code changes.

Two reasons for this:
  1. Before, I was doing synchronous db calls and getting one connection per http request. I'd then use that connection sequentially for each of the 3 db calls (post content, related photos, tags -- and yes, that could be a join, but it isn't). When those queries ran on the tokio runtime threads, the concurrent db calls were limited to the number of tokio worker threads (the number of cores), and I suspect that kept me from hitting the db connection pool limit just by luck. After my change, we'd get a connection for each task that was spawned, and since we were now spawning a task and awaiting the db query result, the runtime got a chance to start handling a new request on that same runtime thread and therefore use more db connections.
  2. The other change I made was to do the tag and photo queries concurrently, which uses 2 connections at once for 1 http request (roughly the shape sketched below). This probably helped pile up connection use, but it's a constant factor rather than a request-driven one.
The whole problem wasn't just the blocking work queuing up like I thought it was. It was that plus resource exhaustion on my db connection pool!
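To make the second point concrete, here's roughly the shape of the concurrent version (a sketch with hypothetical query helpers, not my actual handler). Each spawned task checks its own connection out of the pool, so one request now holds two connections at once instead of reusing a single one sequentially; the same applies whether the tasks go to spawn_blocking or rayon.

```rust
use diesel::r2d2::{ConnectionManager, Pool};
use diesel::sqlite::SqliteConnection;
use tokio::task;

type DbPool = Pool<ConnectionManager<SqliteConnection>>;

// Hypothetical stand-ins for the real queries.
fn load_related_photos(_conn: &mut SqliteConnection, _post_id: i32) -> Vec<String> {
    Vec::new()
}
fn load_tags(_conn: &mut SqliteConnection, _post_id: i32) -> Vec<String> {
    Vec::new()
}

async fn load_photos_and_tags(pool: DbPool, post_id: i32) -> (Vec<String>, Vec<String>) {
    let photos_pool = pool.clone();
    let photos = task::spawn_blocking(move || {
        // Each task acquires its own connection; this call can block for up to
        // the pool's acquire timeout when the pool is exhausted.
        let mut conn = photos_pool.get().expect("db connection");
        load_related_photos(&mut conn, post_id)
    });
    let tags = task::spawn_blocking(move || {
        let mut conn = pool.get().expect("db connection");
        load_tags(&mut conn, post_id)
    });
    let (photos, tags) = tokio::join!(photos, tags);
    (photos.expect("task panicked"), tags.expect("task panicked"))
}
```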

I made a small change to increase that db connection pool size to 36 (12 logical cores * 3, pulled out of thin air, but it seemed likely to work better on my machine than the default 10) and lowered the acquisition timeout to 5 seconds. After this, my server was at least able to complete both the 1000 and 5000 rps workloads I was sending at it, but latency was much worse than the starting point above. I could push further on this, but I'm gonna wrap it up here.
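Concretely, that's just a couple of builder settings when constructing the pool; roughly this (the database path is a placeholder):

```rust
use std::time::Duration;

use diesel::r2d2::{ConnectionManager, Pool};
use diesel::sqlite::SqliteConnection;

// Roughly the pool setup after the change.
fn build_pool() -> Pool<ConnectionManager<SqliteConnection>> {
    let manager = ConnectionManager::<SqliteConnection>::new("blog.db");
    Pool::builder()
        .max_size(36)                               // default is 10
        .connection_timeout(Duration::from_secs(5)) // default is 30s
        .build(manager)
        .expect("failed to build the db connection pool")
}
```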

Conclusion

At the end of the day, none of this matters for my use case because my blog doesn't get thousands or tens of thousands of requests per second like a production web service would. I just like to use this site as a test bed for these sorts of things.

I think I have decent intuition for debugging these sorts of server threadpool <-> resource threadpool interactions in production systems, but it's always good to explore them on new platforms. The concepts are the same, but learning more tools (I experimented with Rust tracing and tokio console as part of this) and debugging even "for fun" systems is always worthwhile for firming up intuition.

I haven't yet gone back to see whether, instead of introducing rayon, I could now use the regular old Tokio blocking threadpool now that I've tuned up my db connection pool. But like I mentioned before, I think the main takeaway is that tasks taking less than a few milliseconds might just not be worth offloading, depending on your performance requirements.