Thursday, April 26, 2012

The Database As Queue Anti-Pattern

“When all you have is a hammer, every problem looks like a nail.”

When all you know is SQL Server, it’s tempting to try and solve every problem with a relational database, but often it’s not the best tool for the job. Today I want to introduce you to an excellent example of this, the ‘Database As Queue’ anti-pattern. It’s a common specimen that you’ll find in the wild in many enterprise development shops.

It works like this: Usually there’s a piece of data that needs to be processed through some workflow, often by a number of different applications or processes.

Database_as_queue

There’s usually a table with some data and a status column. Each process is designed to process data at a particular status. Process 1 in this example inserts records with the status ‘New’, Process 2 picks up ‘New’ records and updates the status to ‘P2 Complete’, Process 3 picks up ‘P2 Complete’ records… you get the picture.

When I say ‘picks up’, what I mean is that each process sits there polling the database at some interval running a select statement something like this:

SELECT * FROM MyWorkflow WHERE Status = 'New'

Why is it an anti-pattern?

Firstly, there’s the polling. Either you have a short interval and hammer your database into submission, or you have a long interval and even the fastest route through your system is the sum of all the (long) intervals.

Secondly there’s the fact that SQL Server is very efficient at insertions, or updates, or queries, but rarely all three on the same table. To make the polling queries perform, you need to put an index on the status, but indexes make for slow inserts. Locks can become a major problem.

Thirdly there’s the issue of how you clear records once the workflow is complete. If you don’t clear the table down at intervals, it will keep growing until it starts to hit the performances of your queries and eventually become a serious operations problem. The problem is, deletes are inefficient on SQL Server.

Fourthly, sharing a database between applications (or services) is a bad thing. It’s just too tempting to put amorphous shared state in there and before you know it you’ll have a hugely coupled monster.

So what should I do?

Simple, use the right tool for the job: this scenario is crying out for a messaging system. It solves all the problems described above; no more polling, efficient message delivery, no need to clear completed messages from queues, and no shared state.

56 comments:

  1. I would agree completely if it were renamed "The Database as an Enterprise Queue Anti-Pattern". In hosted environments where you don't have control over the hardware (or access to install proper queuing mechanisms) database-based queues can be quite useful. I'd imagine ecommerce apps depend heavily on such queues. Good article and definitely food for thought!

    ReplyDelete
  2. This is something which is on my mind at the moment. What about the scenario where all parts of the process are acting on the same database record and the record needs to be persisted forever?

    Would it be sensible in that case to use a messaging system as a queue/notification tool, effectively orchestrating to entire operation, but without passing any data around (other that the database record primary key)?

    ReplyDelete
  3. For someone still learning... Could you expand on your last paragraph with a little more detail on messaging systems?

    ReplyDelete
  4. Chris, Shyh,

    So a messaging system would be a piece of infrastructure that allows components to communicate via messages. They usually feature some kind of store-and-forward queue, and the more sophisticated solutions support patterns such as publish/subscribe and request/response. I'm currently using a product called RabbitMQ (http://www.rabbitmq.com), I'd recommend reading the documentation on their web site if you're interested in getting into this further.

    ReplyDelete
  5. Azure as Queue Storage, which in the absence of any messaging structure is perfect for this kind of job. It is FIFO also meaning you can put in a number of messages that refer to an object and they will be retrieved in the same order.

    The Azure AppFabric can be installed locally - meaning it would be an option for those who cannot stretch to a full messaging structure.

    ReplyDelete
  6. As with many anti-patterns, this works fine at first. It just leads you to scaling issues at lower unit of work throughput rates. Many apps will never hit that wall.

    As for queuing systems getting rid of polling, don't be so sure. While the client API may execute your code as a callback, under the hood there may be polling. This is true for Spring AMQP and SQL Server Service Broker, for example.

    ReplyDelete
  7. Most RDBMS have the ability to push information with triggers. The updates can be eliminated, by having a datetime field or some other incremental number field, so the most recent record's status (MAX) is selected. This is sort of a 'functional' approach.

    ReplyDelete
  8. Anonymous1:53 pm

    talkingscott: SQL Server message broker does not normally use polling. See http://msdn.microsoft.com/en-us/library/ms187331.aspx

    ReplyDelete
  9. Anonymous2:02 pm

    And of course guess what MSMQ uses for it's persistence ...

    ReplyDelete
  10. Topher Cooper2:19 pm

    This is not the only example of a tendency to use DBs where they are ill suited. Molinaro's
    "SQL Cookbook" has a chapter on "Date Arithmetic" which uses tables set up as fixed calendars giving relations that allow one to do complex selects to answer queries like "what is the date of the first Tuesday in May in 1943". For each algorithm I was easily able to write a line or two of non table referencing T-SQL code that solved the problem 5 to 10 times faster.

    ReplyDelete
  11. Anonymous2:56 pm

    I tried moving from this anti-pattern to a messaging system (RabbitMQ), but couldn't because of this problem:

    How do you track job progress in a messaging system?

    Think of a long running job like a video transcoding.

    I could hack a solution where I would post the progress updates to another queue, and when someone takes a progress update from that queue it reposts it back - this is required since we must support getting a progress report at any time and at any frequency.

    If I would have used a database to track the progress, now I would have two systems to manage - the data base and the messaging system.

    ReplyDelete
  12. I can not agree completely with your premise - the reality is (as others have pointed out) is that any messaging queuing system worth using, is going to employ a database.

    What I think is missing from these types of implementations is low level support in the database to perform easy to use asynchronous notifications - yes you can use external procedures fired by triggers, etc., but those techniques are cumbersome and really that easy to use.

    ReplyDelete
  13. Anonymous3:27 pm

    If you're using a database engine other than Microsoft's SQL Server, this becomes far less of a performance problem.

    Take MySQL for instance - keyed SELECTs are fast (faster with HandlerSocket), fast enough that even with 1000ms or 500ms polling the performance impact will be negligible, and I can partition the table using the job state as the key for partitioning. Suddenly reads are still fast and writes are negligibly impacted. Records that have completed their final step can either (A) be deleted by the final process or (B) have their status updated and then get stored off in another partition; either way, not impacting performance.

    Messaging systems take far more time to implement (read: have a higher impact on time to market) than simply doing it in the database, and if you're using a database engine that doesn't suck you're not going to have a performance problem.

    ReplyDelete
  14. Anonymole3:27 pm

    Messaging systems need persistence too, for message integrity and atomic consistency. Then there's the problem of brokered asynchronous processing by multiple "process update" agents. Next is the reportability of state of the system, "at what state are all my jobs - right now?"

    No, what we really need is a better all around RDBMS/queue/pub-sub/OLAP/OLTP system for dynamic data storage, processing and retrieval.

    ReplyDelete
  15. Anonymous3:39 pm

    The biggest problem with database messaging as I see it is multiple workers. Unless I lock whole table when doing SELECT ... WHERE status = 'new' there is a big chance that few workers will pick up the same task. And locking whole table does not increase performance. It can be done with transactions, but then a worker should be ready to drop task and ask new if transaction failed, and such collisions will be multiple. So combination of messaging + database status usually works better.

    ReplyDelete
  16. Robust message queues have to have a persistent store behind them, so you're back to a database. The design shown, having 4 records, was not the design that I would have used. Other comments reinforce this.

    ReplyDelete
  17. Klaus5:13 pm

    Yup, seen it all.

    An advantage of typical RDBMses is data consistency via transactions (all the ACID stuff), which you get for "free". However, senseless and excessive abuse as mentioned in the artcile can still lead to "issues". One of my former customers had to yank up the Isolation Level to the max, crippling almost all concurrency. On a database, which is accessed by (literally) hundreds (or maybe thousands) of sensors and actors. Over LAN and WAN connections. Worldwide. One developer fucks up, and a complete manufacturing plant goes silent. Yay.

    ReplyDelete
  18. Which is why SQL Server ships with service broker ...

    http://msdn.microsoft.com/en-us/library/ms345108(v=sql.90).aspx

    ReplyDelete
  19. Anonymous6:06 pm

    The example used is particularly bad, as this is taking the status as a string, possible a unicode string, therefore double byte, making indexing tricky.

    make this an tiny int, and stick it in a lightweight table which has the purpose of tracking workflow.

    with the last part of the process deleting, surely that will also be efficient in its own way, until you have multiple processes running through in workflow (talking thousands) as a full table scan on a tiny table with an int (key) and tiny int (status) won't take (that) long in Dvořák time.

    Message queuing is great, but takes time and resource to establish. and also used wrong can have massive ramifications.

    ReplyDelete
    Replies
    1. Anonymous6:08 pm

      Dvořák is meant to be database.
      (predictive phone)

      Delete
  20. Anonymous6:54 pm

    Sometimes you just need a small lightweight queue to offline process data that takes too long to process with your other work.

    Using a database table can certainly be a bad idea if timeliness is important, and if the volume is high. But if those two things are not true, using a dedicated message broker like Rabbit MQ is over-engineering it. Not only are you introducing an additional point of failure, you also lose the transactional integrity guarantee that your message is not consumed unless your process succeeds.

    The answer to the shared state problem is: don't. Message brokers can be used to share state too, that's not something they magically prevent any more than any other queueing mechanism.

    If timelinness or volume are a concern, or you already already have RabbitMQ (or other broker) running, and, you can give up transactional integrity, go for it. Otherwise you're creating fragility for little to no benefit.

    ReplyDelete
  21. Anonymous7:22 pm

    PostgreSQL has LISTEN/NOTIFY, it's well suited for this pattern.

    ReplyDelete
  22. Most of my apps run on Google's App Engine, which has Task Queues. They have worked well for me, as the web interface lets me see what's in the queue, pause it, purge it, delete individual tasks, etc. It's also quite nice that tasks run your code by hitting a URL. So I can kick off a task manually (when testing for example) by simply pointing my browser to the same URL.

    ReplyDelete
  23. I agree with the comments of others that SQL Server can be a "real" messaging system with the use of Service Broker. You can use the Query Notifications feature to get rid of polling.

    ReplyDelete
  24. With proper design, the DB as queue pattern works quite well. The issues you raised are all solvable with some thought.

    ReplyDelete
  25. Anonymous7:37 am

    You put some emphasis on =machine efficiency, which might be valuable if it allows use of a smaller or less expensive computer. Another way to get by with a lesser machine is to defer some processing until overnight. Stuff like mailing out bills and notices, posting transactions to the accounting system, processing wire transfers, etc, can expeditiously be done in batches overnight. Using off-peak resources, who cares if the database is a little slower in this mode? Everyone does this.

    ReplyDelete
  26. we've recently migrated our background workers from message queue to database and found database solution more flexible, trackable and migration a great success overall.

    we used clustered activemq on java with spring-jms. well, both parts of this system were broken. at least once in two month our activemq corrupted disk database and failed SILENTLY, simply ignoring incoming messages. spring-jms did not provide ability to control message acnowledgement(all messages were auto-acnowledged).

    historically, we had only one queue for each service, but messages of different type could be sent to those queues. we were unable to produce solution for fetching only specific messages type from queues. somehow we needed to peek into messages before accepting them. but this seems to be out of message queues functionality :(

    ReplyDelete
  27. I agree with what Seun Osewa said, all the "problems" with databases queues are easy to solve. databases have fine-grained locking for single row, that guarantee multithread safety. inefficiency of select/insert/update seems to be only feat of SQL Server - our solution on mysql worked quite smooth. and concerning "hammering a server" - polling is not a problem for server - they were designed to handle heavy load, but of course can be a problem for a client. problem with long delays can be overcame by careful delay policy. we, for instance, do delay only when previous query returned no tasks to process - when we just finished a task, we do a query immediately. this delay policy introduce some latency(1 delay period), but have maxium possible throughput.

    and in general, i try to use solutions i do understand. i do understand how complex database installations works, but i do not understand how do complex(e.g routing, clustering) message queue installations work. and standarts like ampq just scare me, this is uncomprehensible for a single person.

    ReplyDelete
  28. Anonymous5:18 am

    It depends. Id rather see a system built this way initially, as simple system are easy to reason about and extend. I recently worked on some horrible system with MSMQ and NserviceBus and all config that goes with it, and windows services etc, we revert the database queue system because the maint was less, any developer could understand it, it was easier to deploy, and it wasnt using some technology just because someone classed it as an anti-pattern.

    ReplyDelete
  29. Hi Mike,

    Agree generally. Sometimes it is useful having message queues in the DB- in which SQL Server Service broker or Oracle AQ fits the bill.

    Obviously using a table is an anti pattern (especially when other options exist).

    ReplyDelete
  30. Anonymous2:18 pm

    We all do not have the luxury of having access to Messaging, Memcached and other integration solutions so we have to stick with a database.

    We have used the "Poor Man's Job Queue" (http://ssmusoke.wordpress.com/2012/04/10/the-poor-mans-job-queue/) and it has held up to some peak uploads

    ReplyDelete
  31. Rant indeed. No, this is a very useful pattern. What you have posted is as though you just discovered proper message queues. The other commenters have pretty well documented some of the advantages of this GOOD pattern.

    ReplyDelete
  32. hmm... and a reliable queue has to use something transactional and durable in which to keep the state of the queue ! The point might be to delegate the use of the database to the queue-ing system instead of trying to do it yourself (b/c it's a lot more involved than it seems at first)

    ReplyDelete
  33. Moreover, SQL Server has special support for "queuing tables": WITH(READPAST, UPDLOCK) helps you run multiple "consumers" on a queue.

    On the other hand, as we found out, SQL Server's default statistics update is very bad for this sort of behavior: If a table goes from zero to more than zero records or vice versa, this will *always* trigger a statistics recomputation, and all your plans are gone ... So, disabling auto statistics on such tables is a must.

    ReplyDelete
  34. Using a database for *integration* isn't always a good idea but using it as a queue may be fine.

    It has to do with *how* you use it. What you are describing is most definitely not the correct way to implement the 'queue' concept.

    However, using a single table to store messages and having a single process access that queue (even with threading) in a FIFO way where messages being processed are removed from the queue would work just fine. And it would be using the table as q real queue.

    A queue should *never* be queried. Another reason why your example isn't really a queue.

    The workflow type scenario you describe is fine if the processing of 'messages' does not happen based on the quasi-queue table. The table you have represents the state, which is fine. However, proper queueing techniques should be used for the actual messages even if the transport is a sql table.

    ReplyDelete
  35. Thomas Kejser, formerly of the SQLCAT team, blogged about this and how to eliminate the problems for creating queue tables in SQL Server:

    http://blog.kejser.org/2012/05/25/implementing-message-queues-in-relational-databases/

    It's a good read for anyone that is interested in how to scale this.

    ReplyDelete
  36. Anonymous8:54 pm

    How do you think persistent queues are implemented ? I am marking this article as a FAIL!

    ReplyDelete
  37. I think the people talking about how you could use an RDBMS for queuing, or that some messaging systems use an RDBMS for persistence have missed the point somewhat. In those cases your queuing infrastructure is encapsulated and extracted away from your business logic. In the example given your business logic is directly coupled to your queuing mechanism.

    These two things are separate concerns and should be treated as such. It's easy to arrive at this design as it makes so much sense when you first start building your system. Things (let's say orders) arrive in the system and have a status of 0. You pick up all the orders at 0 and do something with them and update them to status 10. Repeat until they're at status 100 aka 'complete'.

    That could even work long-term if you don't need to scale. If however the number of orders (or whatever it is you're processing) increases you will hit a wall. It's very likely that this kind of table will require multiple indices on it, it's also certain that it will be being constantly inserted into, updated and read from. That is not a good combination for database performance. The article mentions 'clearing down' records, but often in these scenarios it's extremely unclear when this should happen as there are often business requirements about needing access to these records for a long time ('our customers need to be able to request a refund on their orders after they are complete').

    I think though that the killer blow in the article is this line 'sharing a database between applications (or services) is a bad thing.' Yes, yes it is. This will slow your development and change becomes harder and harder as your applications ossify from the shared schema up.

    ReplyDelete
  38. "...the fastest route through your system is the sum of all the (long) intervals."

    Shouldn't that read the *slowest* possible route through your system. That's analagous to hitting every red light on the way to work right as they turn red.
    The fastest possible route would be no wait at all, like hitting every single green light on the way to work.

    I would say, the much more useful statistic is that the average time through the system would be the sum of half of each polling interval.

    Regarding performance of a DB-driven queue, why is there an assumption that every exclusive lock will be a table lock? SQL Server's default behavior is row locks until the lock escelation threshold is met (default 1024). That means dozens of updates could be performed on a table, and dozens of other processes could be reading from it at the same time. WITH(READPAST) will ignore rows with locks on them.

    Realisitically speaking, I rarely have more than one consumer process looking at the table anyway so locks are a non-issue. All the examples I've seen seem to imply that most database queues will have many concurrent workers. Is that usually the case for others?

    ReplyDelete
  39. Forgot to subscribe to replies...

    ReplyDelete
  40. So, the next step is to draw up a Data Flow Diagram to understand how the system should be refactored :)

    ReplyDelete
  41. Anonymous7:12 am

    There are at least two poor arguments against this article:

    1. A message queue needs a persistent store, therefore using a message queue means I am using an RDBMS.

    Of the 500 ways to poke a hole in that argument, I will choose the easiest. Some message queues do *allow* you to use an RDBMS as the backing store (for instance, ActiveMQ on MySQL), it is always a less efficient choice than highly tuned persistence mechanisms you could be using (like KahaDB). RabbitMQ doesn't allow you to even make a choice of your store. Making the assumption that data storage for message queues is similar to data storage in a typical database is why database-as-queue is an anti-pattern.

    2. If you are saying "well, I can design around this limitation or plan for this scale," you are missing the point. Yes, you *can* be successful using a database as a queue. You *can* put months of effort into tuning and plenty of money scaling up hardware. You *can* design an elegant retry mechanism, the ability to wiretap messages, apply messaging patterns, content-based-routing, or whatever you can imagine. Does it mean you should? By the time you are done(and have spent precious time fixing the bugs), you end up with what a message queue would have given you off the bat.

    Put simply, an RDBMS is the right tool for some problems, but an insufficient solution for others. A message queue is a solution designed to fix a problem similar to the one described. To attempt to solve it with an RDBMS is not necessarily doomed to failure, but you are not using the tool best suited to the problem.

    ReplyDelete
  42. Anonymous6:06 pm

    oversimplification aside, did you just equate workflow to queue management?

    Queues can be a way to implement portions of workflows, but semantically speaking, a workflow is more than just an assembly of queues.

    ReplyDelete
  43. RE: “When all you have is a hammer, every problem looks like a nail.”

    A good RDBMS is a toolbox. An ignoramus sees a hammer.

    ReplyDelete
  44. I recently posted an answer to a question on stackoverflow in defense of messaging vs db integration if you’re interested:

    http://stackoverflow.com/a/19229234/569662

    It’s the first time I’ve been able to clearly articulate it.

    ReplyDelete
  45. I don't agree in general. Oracle AQ is a table-based queue implementation that uses some advanced internal features that have been made available through general SQL in 10g (I believe) via the FOR UPDATE SKIP LOCKED clause.

    While I wouldn't pass millions of message types through AQ, AQ is certainly capable of providing an easy transactional queue implementation for basic needs.

    ReplyDelete
  46. While I agree that it is certainly overused, for ETL code, it is generally preferable that the code runs entirely self-contained. Recently, I wrote some speed improvements that replaced a CURSOR loop with a table and MSSQL Jobs enabling the task to run in parallel for the Staging part. The request database, which we have no control over, force us to iterate over the inputs. By using a table as a task, it ensures that the ETL is self-contained and mobile - I can be up and running in a minutes and be almost guaranteed it will work.

    For client applications, I prefer queues and please don't mention MSMQ, that thing is the anti-Christ.

    ReplyDelete
  47. Anonymous4:10 am

    Your mom is an anti-pattern

    ReplyDelete
  48. Anonymous12:38 pm

    If your message needs to belong to a particular unit of work you must store it with that unit of work and forward it unless your messaging service supports distributed transactions.. Even then distributed transactions can be kinda gross under failure scenarios. Even in this case you may still want to forward into a messaging service for aggregation, routing, and integration. But yeah, as others have pointed out, sometimes you problem IS a nail and you need the RDBMS hammer.

    ReplyDelete
  49. Anonymous9:51 pm

    Made a queue in postgres the other day, spawned 100 workers, pdo errored on to many connections and everything locked .. googled .. found this.. Spot on..

    ReplyDelete
  50. Anonymous9:40 pm

    Almost only quotes
    "When all you know" deeply doesn't include "SQL Server" you can be sure that
    "SQL Server is very efficient at insertions, or updates, or queries, but rarely all three on the same table."
    and
    "sharing a database between applications (or services) is a bad thing"

    ReplyDelete
  51. Anonymous:
    'Messaging systems take far more time to implement (read: have a higher impact on time to market) than simply doing it in the database, and if you're using a database engine that doesn't suck you're not going to have a performance problem.'

    You refer to 'SQL Server' as the DB that sucks, yet you sing the praises of MySQL - I fell out of my seat laughing at that one.

    Both your comment and the post seem based on outdated knowledge of SQL Server, and as others have pointed out most people with a modicum of knowledge would know to use Message Broker (or other message queue) for this sort of thing, and as has been pointed out, it doesn't rely on the definitely not good method of polling like some annoying kid wondering 'are we there yet'.

    ReplyDelete
  52. That is exactly my thoughts on the subject. Unfortunately I've inherited a terrible J2EE app that does exactly this. What a nightmare to fix and maintain!

    ReplyDelete
  53. The article is bang on - it only looks easy to create queuing systems in a database.

    Message brokers are not storing messages in a single shared table and performing searches through them - messages are going to specific queues. In other words, a broker does not have to search for messages to process.

    A message broker can be used as a front-end to a database. This provides the ability to disconnect your applications performance from user load - if all of your database operations are driven out of message queues, your application will perform at a consistent level regardless of workload. Even better is if all of your workload, not just database operations, is driven out of message queues.

    Monitoring progress using a message queue is super easy - have the processing jobs post messages to a monitoring queue which is consumed by your monitoring application.

    Finally, using a message broker allows you to perform all processing asynchronously, yet allows callers to simulate synchronous calls if they would like to. If a client wants to execute a job synchronously, the producer for that job creates a callback queue and a consumer to monitor it. It then passes the callback queue name into the request. The consumer of that job sends completion messages to that callback queue on completion. When done properly, the resulting data can be fed back into the callers output stream - in other words, the caller appears to have made a synchronous call, but it actually was processed asynchronously.

    The hardest part about adopting a message broker is convincing people that it is easier than creating your own. And it is actually is easier and far more useful in the long run. A message broker will quickly become a central aspect of almost all of your most complex processing problems.

    Try it, you will like it!

    ReplyDelete
  54. Anonymous7:34 am

    You really want to make sure that both the data you are storing and sending the message (or storing a message to be sent) are in the same TRANSACTION, otherwise you might end up storing data and the process never sends the message (or vice versa).

    Here is a great presentation on the topic:

    https://vimeo.com/111998645

    ReplyDelete

Note: only a member of this blog may post a comment.