Integration Testing Resque with Cucumber

August 16, 2010

Square takes integration testing seriously. We use Cucumber and RSpec to test our code during every step of development: on developer pairing machines, continuous integration servers, staging servers and production servers.

The Problem

Traditional Cucumber tests exercise the web stack from the web server through the database, but they don't typically cover asynchronous tasks like background processing and scheduled jobs. Integration tests involving these asynchronous jobs are hard to write due to race conditions between the test process and the background workers. We faced this problem when we began using Resque for processing background jobs. We love Cucumber and we love Resque; we wanted to find a way to use them together.

Sample test with a race condition

When someone accepts a payment using Square, we queue up a notification email in Resque. A pool of Resque workers constantly monitors the queue. One of them immediately grabs the email job, renders the email, connects to the mail server, and sends it to the user.

The following test has a race condition. It passes if the Resque worker has processed the email job by the time the Cucumber test looks for a new email; otherwise it fails.

Scenario: Capturing an authorization successfully
          results in an email notification
  Given my name is Jools
  And I have a valid API session
  And I use a new capturable card authorization
  When I POST to API 1.0 "payments capture"
  And I open my newest email
  Then I should see a link to "the payment page for the last payment"

An immediate solution would be to run all Resque jobs synchronously and skip the enqueueing/dequeueing parts of the stack. We do this for RSpec unit tests, but we want integration tests to directly test the full stack.

The diagram below shows the processes (boxes) involved in running a Cucumber test and the communication channels between the processes (black lines). To solve the race condition problem, the goal was to add another channel between the Cucumber process and the Resque worker (gray line).

The Solution

To avoid the race condition, we start up a Resque worker as a child of the Cucumber test and then use Resque's signal handling to control when the worker processes jobs.

Our solution for integration testing with Resque works like this:

  1. Start a Cucumber test process.
  2. Start a Resque worker by forking the Cucumber test process.
    • In the child process: exec a Resque worker.
    • In the parent process: store the Resque worker's PID and continue on as normal.
  3. Pause the Resque worker on startup.
  4. Execute some Cucumber steps.
  5. Invoke a special Cucumber step to:
    1. Un-pause the Resque worker.
    2. Wait for it to finish processing all jobs.
    3. Re-pause the Resque worker.
  6. Make assertions about the result of the worker process.

View the code in the CucumberExternalResqueWorker gist.

Starting the Resque Worker

When the Cucumber test process starts, we immediately fork and start a Resque worker. It can take a little while for this worker to finish starting up, so we wait for up to a minute. In order to pause and un-pause the worker with signals, the PID returned to the parent process needs to be the PID of the Resque worker. We capture this PID by using Ruby's fork and exec methods: fork spawns a child process, and exec replaces that child process with the Resque worker.

This fork and exec trick gives us the Resque worker's PID as the return value of fork in the Cucumber test process, so we don't have to manage external PID files. However, we learned the hard way that exec behaves differently depending on whether you call it with the array form (the command and each argument passed separately) or the string form. With the array form, the command has the same PID as the process it replaces. With the string form, a string that contains shell metacharacters is passed to sh -c, so sh gets the PID of the process being replaced and the command usually runs as a child of that shell. We orphaned a lot of workers before we figured this out.
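As a rough illustration of the PID behavior described above, the sketch below forks, execs a throwaway Ruby one-liner with the command and its arguments passed separately, and compares the PID the command reports with the PID that fork returned. The method name fork_exec_and_report_pid and the one-liner are made up for this example; a real setup would exec the Resque worker command instead.

```ruby
require "rbconfig"

# Fork, exec a command without a shell, and return
# [pid_returned_by_fork, pid_reported_by_the_command].
def fork_exec_and_report_pid
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    # Command and arguments are separate values, so no shell is involved
    # and the exec'd program keeps the forked child's PID.
    exec(RbConfig.ruby, "-e", "print Process.pid", out: writer)
  end
  writer.close
  reported = reader.read.to_i # blocks until the child closes its stdout
  reader.close
  Process.wait(pid)
  [pid, reported]
end

fork_pid, reported_pid = fork_exec_and_report_pid
puts "fork returned #{fork_pid}, command reported #{reported_pid}"
```

The two numbers should match, which is what makes the value returned by fork usable as a signal target later on.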

Pausing the Resque Worker

Resque workers can be paused by sending them the USR2 signal, and un-paused by sending CONT. A Rails initializer adds a before_first_fork hook to the Resque worker and makes the worker send itself a USR2 signal before it can process any jobs.
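A minimal sketch of such an initializer might look like the following. It assumes Resque's before_first_fork hook accepts a block, as the hook name above suggests; pause_current_process is a hypothetical helper name, and the defined?(Resque) guard is only there so the file is harmless when Resque is not loaded.

```ruby
# Hypothetical helper: make the current process send itself USR2, which a
# Resque worker interprets as "stop picking up new jobs".
def pause_current_process
  Process.kill("USR2", Process.pid)
end

# Register the hook only when Resque is available. The hook runs once in the
# worker, before it forks to process its first job.
if defined?(Resque)
  Resque.before_first_fork { pause_current_process }
end
```

From the test process, the same worker can later be resumed with Process.kill("CONT", worker_pid) and paused again with Process.kill("USR2", worker_pid), where worker_pid is the value fork returned.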

To run all the queued jobs, we use CucumberExternalResqueWorker.process_all, which un-pauses the worker, waits until it finishes processing jobs, and then re-pauses it.
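The un-pause, wait, re-pause cycle could be sketched roughly as below. This is not the code from the gist: run_worker_until_idle, its timeout and poll_interval defaults, and the block that reports the number of outstanding jobs are all assumptions made for illustration.

```ruby
require "timeout"

# Un-pause the worker, wait until the caller's block reports that no jobs are
# outstanding, then pause the worker again (even if the wait times out).
#
# worker_pid    - PID of the worker process (the value fork returned).
# timeout       - seconds to wait before giving up (hypothetical default).
# poll_interval - seconds between checks (hypothetical default).
# The block must return the number of jobs still outstanding.
def run_worker_until_idle(worker_pid, timeout: 60, poll_interval: 0.1)
  Process.kill("CONT", worker_pid)           # resume job processing
  Timeout.timeout(timeout) do
    sleep(poll_interval) until yield.zero?   # poll until the count drains
  end
ensure
  Process.kill("USR2", worker_pid)           # stop picking up new jobs
end

# Usage, with a hypothetical outstanding_job_count helper:
#   run_worker_until_idle(worker_pid) { outstanding_job_count }
```

Re-pausing in an ensure clause keeps the worker paused for the next step even when a job runs longer than the timeout.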

Putting it together

We can now use asynchronous processing in a deterministic way.

We added a new Cucumber step to clear the Resque email queue:

Given "the email queue is empty" do
  CucumberExternalResqueWorker.reset_counter
  Resque.remove_queue(Mailer.queue)
  reset_mailer
end

And another new Cucumber step to process all Resque jobs:

When "all queued jobs are processed" do
  CucumberExternalResqueWorker.process_all
end

The updated Cucumber test uses the new steps to avoid race conditions between the Cucumber test and the Resque jobs:

Scenario: Capturing an authorization successfully
          results in an email notification
  Given my name is Jools
  And I have a valid API session
  And the email queue is empty
  And I use a new capturable card authorization
  When I POST to API 1.0 "payments capture"
  And all queued jobs are processed
  And I open my newest email
  Then I should see a link to "the payment page for the last payment"

Resque is almost totally awesome

How Resque helped us

We were able to solve this problem because Resque is so well architected. Its support for POSIX signal handling and its built-in extension hooks made it easy to control our child worker. We didn't have to monkey patch anything to control the worker, and we used standard signals to do it. Resque manages its own workers and exposes a single PID for controlling them, which was also a big help.

How Resque hurt us

The one big problem we ran into was that the pending jobs counter in Resque isn't atomic. When a job is processed, the worker decrements the number of pending jobs remaining, does a bunch of processing, and then increments the counter for jobs being worked. This turns out to be a problem when a Resque job spawns other jobs. We initially tried to use the Resque.info[:pending] and Resque.info[:working] counters to track when our child worker had finished all the jobs, but because they don't update atomically, we would occasionally have child jobs that were never processed. We solved this by alias method chaining a counter into the enqueue and perform methods in our base Resque worker class.
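The idea behind such a counter can be sketched as follows. This is a simplified stand-in, not the original code: it uses a plain template method (subclasses implement run) in place of alias method chaining, and an in-memory counter and queue in place of Resque and a store shared between processes. OutstandingJobs, CountedJob, ParentJob and ChildJob are hypothetical names.

```ruby
# In-memory stand-in for the shared counter. With the worker in a separate
# process, the count would have to live somewhere both processes can reach.
module OutstandingJobs
  @count = 0
  @lock = Mutex.new

  def self.increment; @lock.synchronize { @count += 1 }; end
  def self.decrement; @lock.synchronize { @count -= 1 }; end
  def self.count;     @lock.synchronize { @count };      end
end

# Hypothetical base class: enqueue and perform wrap the counter around run.
class CountedJob
  QUEUE = [] # toy queue standing in for Resque

  def self.enqueue(*args)
    OutstandingJobs.increment  # counted before the job is queued
    QUEUE << [self, args]
  end

  def self.perform(*args)
    run(*args)
  ensure
    OutstandingJobs.decrement  # uncounted only after the job has finished
  end
end

class ChildJob < CountedJob
  def self.run; end
end

class ParentJob < CountedJob
  def self.run(children)
    children.times { ChildJob.enqueue }
  end
end

ParentJob.enqueue(2)
job, args = CountedJob::QUEUE.shift
job.perform(*args)
puts OutstandingJobs.count # prints 2: both child jobs are still counted
```

Because a job is counted when it is enqueued and uncounted only after perform returns, a parent's child jobs are already counted by the time the parent stops being counted, so the total does not read zero while work remains.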

Future Work

Being able to test our Resque workers in a full integration environment has been a huge improvement. We're big fans of Resque and Redis; they've been a pleasure to work with. Our CucumberExternalResqueWorker has been a great solution for us so far. There are a few features we'd like to add when we have time:

  • Patch Resque to have atomic counters so we don't have to monkey patch our base worker.
  • Add the ability to run all jobs in a specific queue.
  • Add the ability to run exactly N jobs in a specific queue.
  • Add timeouts for long-running jobs.
  • Turn CucumberExternalResqueWorker into a gem.
  • Add an ENV option to disable starting a Resque worker.

Hopefully our solution will help other people get up and running with full stack testing of Resque using Cucumber.

Zach Brock
Server team engineer. Mostly made of water.
Matthew O'Connor
Server team engineer. Math geek. Dog lover.
