Tuesday, July 05, 2011

The First Rule of Threading: You Don’t Need Threads!

I’ve recently been introduced to a code base that illustrates a very common threading anti-pattern. Say you’ve got a batch of data that you need to process, but processing each item takes a significant amount of time. Doing each item sequentially means that the entire batch takes an unacceptably long time. A naive approach to solving this problem is to create a new thread to process each item. Something like this:

foreach (var item in batch)
{
    var itemToProcess = item; // copy the loop variable so each closure captures its own item
    var thread = new Thread(_ => ProcessItem(itemToProcess));
    thread.Start();
}

The problem with this is that each thread takes significant resources to set up and maintain (on Windows, typically 1 MB of reserved stack space per thread by default). If there are hundreds of items in the batch we could find ourselves short of memory.

It’s worth considering why ProcessItem takes so long. Most business applications don’t do processor-intensive work. If you’re not protein folding, the reason your process is taking a long time is usually because it’s waiting on IO – communicating with the database or web services somewhere, or reading and writing files. Remember, IO operations aren’t just somewhat slower than processor-bound ones, they are many orders of magnitude slower. As Gustavo Duarte says in his excellent post What Your Computer Does While You Wait:

Reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), and main system memory is taking a 4-minute walk down the hall to buy a Twix bar. Keeping with the office analogy, waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months.

You don’t need to keep a thread around while you’re waiting for an IO operation to complete. Windows will look after the IO operation for you, so long as you use the correct API. If you are writing these kinds of batch operations, you should always favour asynchronous IO over spawning threads. Most (but unfortunately not all) IO operations in the Base Class Library (BCL) have asynchronous versions based on the Asynchronous Programming Model (APM). So, for example:

string MyIoOperation(string arg);

Would have an equivalent pair of APM methods:

IAsyncResult BeginMyIoOperation(string arg, AsyncCallback callback, object state);
string EndMyIoOperation(IAsyncResult asyncResult);

You typically ignore the return value from BeginXXX and call the EndXXX inside a delegate you provide for the AsyncCallback:

BeginMyIoOperation("Hello World", asyncResult =>
{
    // Runs on a thread-pool thread once the IO operation has completed.
    var result = EndMyIoOperation(asyncResult);
}, null);

Your main thread doesn’t block when you call BeginMyIoOperation, so you can run hundreds of them in short order. Eventually your IO operations will complete and the callback you defined will be run on a worker thread in the CLR’s thread pool. Profiling your application will show that only a handful of threads are used while your hundreds of IO operations happily run in parallel. Much nicer!
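As a minimal sketch of this with a real BCL pair, the following uses FileStream.BeginRead and FileStream.EndRead to read a batch of small temporary files concurrently. The helper name ReadAllAsyncApm, the batch size of 100 and the file contents are assumptions made purely for illustration, and the CountdownEvent is only there so that a console program can wait for the callbacks before exiting. An exception thrown by EndRead is not handled here; real code would need to catch it inside the callback.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;

public static class ApmBatchSketch
{
    // Hypothetical helper: starts one asynchronous read per file and
    // returns the contents once every callback has run.
    public static string[] ReadAllAsyncApm(string[] paths)
    {
        var results = new string[paths.Length];
        using (var pending = new CountdownEvent(paths.Length))
        {
            for (var i = 0; i < paths.Length; i++)
            {
                var index = i; // copy for the closure
                var stream = new FileStream(paths[index], FileMode.Open, FileAccess.Read,
                    FileShare.Read, 4096, FileOptions.Asynchronous);
                // One read is enough for these small files; large files would need a loop.
                var buffer = new byte[stream.Length];

                // BeginRead does not wait for the data; the callback runs on completion.
                stream.BeginRead(buffer, 0, buffer.Length, asyncResult =>
                {
                    try
                    {
                        var bytesRead = stream.EndRead(asyncResult);
                        results[index] = Encoding.UTF8.GetString(buffer, 0, bytesRead);
                    }
                    finally
                    {
                        stream.Dispose();
                        pending.Signal();
                    }
                }, null);
            }
            pending.Wait(); // block only here, after every read has been started
        }
        return results;
    }

    public static void Main()
    {
        // 100 small temporary files stand in for a batch of IO-bound work.
        var dir = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
        var paths = Enumerable.Range(0, 100)
            .Select(i => Path.Combine(dir, "item" + i + ".txt"))
            .ToArray();
        foreach (var path in paths) File.WriteAllText(path, "contents of " + path);

        var contents = ReadAllAsyncApm(paths);
        Console.WriteLine("Read {0} files", contents.Count(c => c != null));

        Directory.Delete(dir, true);
    }
}
```

Note that no thread is created per item: the loop only issues the reads, and the thread pool supplies a thread for each callback as its read completes.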

Of course all this will become much easier with the async features of C# 5, but that’s no excuse not to do the right thing today with the APM.
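For comparison, here is a rough sketch of how the same kind of batch might look with async and await. It assumes the Task-based API as it shipped with C# 5 and .NET 4.5 (Task.WhenAll), plus Task.Factory.FromAsync, which wraps an existing Begin/End pair in a Task. The names ReadFileAsync and ReadAllAsync are hypothetical, chosen only for this example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class AsyncBatchSketch
{
    // Hypothetical helper: wraps the BeginRead/EndRead pair in a Task and awaits it.
    public static async Task<string> ReadFileAsync(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
            FileShare.Read, 4096, FileOptions.Asynchronous))
        {
            // One read is enough for small files; large files would need a loop.
            var buffer = new byte[stream.Length];
            // No thread is blocked while the read is in flight.
            var bytesRead = await Task<int>.Factory.FromAsync(
                stream.BeginRead, stream.EndRead, buffer, 0, buffer.Length, null);
            return Encoding.UTF8.GetString(buffer, 0, bytesRead);
        }
    }

    // Starts every read up front, then completes when all of them have finished.
    public static Task<string[]> ReadAllAsync(string[] paths)
    {
        return Task.WhenAll(paths.Select(path => ReadFileAsync(path)));
    }

    public static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllText(path, "hello");

        // Blocking on Result is acceptable in a console Main; avoid it in UI code.
        var contents = ReadAllAsync(new[] { path, path, path }).Result;
        Console.WriteLine("Read {0} files", contents.Length);

        File.Delete(path);
    }
}
```

The callback disappears into the compiler-generated continuation, but the underlying mechanism is the same APM read described above.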


Ariel Ben Horesh said...

It is worth adding that I would also give Rx (Reactive Extensions) a check when going this way.
Rx really helps when you use the Begin/End pattern.


Thanks for this post!

Octoberclub said...

Jeffrey Richter also provides a nice explanation of this on Channel 9:


Brian Schlining said...

Here's a similar technique using Scala's new parallel collections:
batch.par.foreach( item => ProcessItem(item))

Craig Ringer said...

For anyone who's using Java for a project: async I/O on Java is primarily provided by the nio and nio.2 (jdk7 only) libraries in the java.io and java.nio packages. See http://download.oracle.com/javase/6/docs/technotes/guides/io/index.html, http://java.sun.com/developer/technicalArticles/javase/nio/.

Like C# programmers, Java programmers tend to be excessively thread-happy, so I'm glad to see you highlighting the (ab)use of threads for I/O concurrency in cases where async I/O is more than good enough.