Integration Testing Resque with Cucumber
Square takes integration testing seriously. We use Cucumber and RSpec to test our code during every step of development: on developer pairing machines, continuous integration servers, staging servers and production servers.
Traditional Cucumber tests exercise the web stack from the web server through the database, but they don't typically cover asynchronous tasks like background processing and scheduled jobs. Integration tests involving these asynchronous jobs are hard to write due to race conditions between the test process and the background workers. We faced this problem when we began using Resque for processing background jobs. We love Cucumber and we love Resque; we wanted to find a way to use them together.
Sample test with a race condition
When someone accepts a payment using Square, we queue up a notification email in Resque. A pool of Resque workers constantly monitors the queue. One of them immediately grabs the email job, renders the email, connects to the mail server, and sends it to the user.
The following test is subject to a race condition. It passes if the Resque worker has processed the email job by the time the Cucumber test looks for a new email; otherwise it fails.
Scenario: Capturing an authorization successfully results in an email notification
  Given my name is Jools
  And I have a valid API session
  And I use a new capturable card authorization
  When I POST to API 1.0 "payments capture"
  And I open my newest email
  Then I should see a link to "the payment page for the last payment"
An immediate solution would be to run all Resque jobs synchronously and skip the enqueueing/dequeueing parts of the stack. We do this for RSpec unit tests, but we want integration tests to directly test the full stack.
The diagram below shows the processes (boxes) involved in running a Cucumber test and the communication channels between them (black lines). To solve the race condition, our goal was to add another channel between the Cucumber process and the Resque worker (gray line).
To avoid the race condition, we start up a Resque worker as a child of the Cucumber test and then use Resque's signal handling to control when the worker processes jobs.
Our solution for integration testing with Resque works like this:
- Start a Cucumber test process.
- Start a Resque worker by forking the Cucumber test process.
  - In the child process: exec a Resque worker.
  - In the parent process: store the Resque worker's PID and continue on as normal.
- In the child process: pause the Resque worker on startup.
- Execute some Cucumber steps.
- Invoke a special Cucumber step to:
  - Un-pause the Resque worker.
  - Wait for it to finish processing all jobs.
  - Re-pause the Resque worker.
- Make assertions about the result of the worker process.
View the code in the CucumberExternalResqueWorker gist.
Starting the Resque Worker
When the Cucumber test process starts, we immediately fork and start a Resque worker. It can take a little while for this worker to finish starting up, so we wait for up to a minute. In order to pause and un-pause the worker with signals, the PID returned to the parent process needs to be the PID of the Resque worker. We capture this PID by using Ruby's fork and exec. fork causes a child process to be spawned, and exec replaces the child process with the Resque worker. This fork/exec trick gives us the Resque worker's PID as the return value of fork in the Cucumber test process instead of having to manage external PID files. However, we learned the hard way that exec behaves differently depending on whether you call it with the array form or the string form. If you call exec with the array form, the command has the same PID as the process it replaces. If you call exec with the string form, the string is passed to sh -c, and sh gets the PID of the process being replaced. We orphaned a lot of workers before we figured this out.
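The fork and exec pattern can be sketched with nothing but the Ruby standard library. This is a minimal illustration, not the code from the gist: the spawn_with_exec helper is hypothetical, and a Ruby one-liner that prints its own PID stands in for the Resque worker command.

```ruby
require "rbconfig"

# Hypothetical helper: fork, then exec `argv` (array form, so no shell is
# involved). Returns the child's PID and a pipe connected to its stdout.
def spawn_with_exec(argv)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    $stdout.reopen(writer)   # the exec'd program inherits this stdout
    writer.close
    exec(*argv)              # multiple arguments: replaces the child in place
  end
  writer.close
  [pid, reader]
end

# Stand-in for the worker command: a Ruby one-liner that prints its own PID.
argv = [RbConfig.ruby, "-e", "puts Process.pid"]
pid, reader = spawn_with_exec(argv)
reported_pid = reader.read.to_i   # blocks until the child exits and closes stdout
reader.close
Process.wait(pid)

puts "fork returned #{pid}; the exec'd program reported #{reported_pid}"
```

Because the command is passed as separate arguments, the PID that fork hands back to the parent is the PID of the program that ends up running, which is what makes signalling it later straightforward.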
Pausing the Resque Worker
Resque workers can be paused by sending them the USR2 signal and un-paused by sending CONT. A Rails initializer adds a before_first_fork hook to the Resque worker and makes the worker send itself a USR2 signal before it can process any jobs.
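To make the pause and un-pause behaviour concrete, here is a self-contained sketch that uses a toy worker instead of Resque. Everything in it is an assumption made for illustration (the run_toy_worker helper, the pipe-based queue, the job names); only the choice of USR2 and CONT mirrors what Resque does.

```ruby
# Toy stand-in for a Resque worker (not Resque itself): it reads one job per
# line from `jobs` and writes one result per line to `results`, but only
# while un-paused.
def run_toy_worker(jobs, results)
  paused = true                     # start paused, like a worker that signals itself
  trap("USR2") { paused = true }    # pause
  trap("CONT") { paused = false }   # un-pause
  results.puts "ready"              # handlers are installed; safe to signal us now

  loop do
    if paused
      sleep 0.01
      next
    end
    next unless IO.select([jobs], nil, nil, 0.01)
    job = jobs.gets
    break if job.nil?               # queue closed by the parent
    results.puts "done:#{job.chomp}"
  end
end

job_reader, job_writer = IO.pipe
result_reader, result_writer = IO.pipe

worker_pid = fork do
  job_writer.close
  result_reader.close
  result_writer.sync = true
  run_toy_worker(job_reader, result_writer)
  exit!(0)
end
job_reader.close
result_writer.close
job_writer.sync = true

result_reader.gets                                   # wait for "ready"
%w[a b c].each { |job| job_writer.puts(job) }        # enqueue while paused
idle_while_paused = IO.select([result_reader], nil, nil, 0.2).nil?

Process.kill("CONT", worker_pid)                     # un-pause
processed = Array.new(3) { result_reader.gets.chomp }
Process.kill("USR2", worker_pid)                     # re-pause

Process.kill("KILL", worker_pid)                     # tear down the toy worker
Process.wait(worker_pid)
```

The "ready" handshake matters: if the parent sent CONT before the child had installed its handlers, the signal would be lost and the worker would stay paused.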
To run all the queued jobs, we use CucumberExternalResqueWorker.process_all, which un-pauses the worker, waits until it finishes processing jobs, and then re-pauses it.
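The un-pause, wait, re-pause cycle might look roughly like the following. This is a hedged sketch rather than the gist's implementation: the process_all signature, the timeout and poll_interval parameters, and the caller-supplied "are we done?" block are all assumptions, and the demo child only records which signals it receives.

```ruby
require "timeout"

# Hypothetical helper, not the gist's implementation: un-pause the worker,
# poll the caller's block until it reports that all jobs are done, and
# always re-pause afterwards (even on timeout).
def process_all(worker_pid, timeout: 60, poll_interval: 0.05)
  Process.kill("CONT", worker_pid)
  Timeout.timeout(timeout) do
    sleep(poll_interval) until yield
  end
ensure
  Process.kill("USR2", worker_pid)
end

# Demo with a child process that only records which signals it receives.
reader, writer = IO.pipe
child_pid = fork do
  reader.close
  trap("CONT") { writer.syswrite("CONT\n") }
  trap("USR2") { writer.syswrite("USR2\n") }
  writer.syswrite("ready\n")
  loop { sleep 1 }
end
writer.close
reader.gets   # wait until the child's handlers are installed

polls = 0
process_all(child_pid, timeout: 5) { (polls += 1) >= 3 }   # "done" on the third poll

received = [reader.gets.chomp, reader.gets.chomp]
Process.kill("KILL", child_pid)
Process.wait(child_pid)
```

Putting the re-pause in an ensure clause means a timed-out wait still leaves the worker paused, so a slow job cannot leak into the next scenario.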
Putting it together
We can now use asynchronous processing in a deterministic way.
We added a new Cucumber step to clear the Resque email queue:
Given "the email queue is empty" do
  CucumberExternalResqueWorker.reset_counter
  Resque.remove_queue(Mailer.queue)
  reset_mailer
end
And another new Cucumber step to process all Resque jobs:
When "all queued jobs are processed" do
  CucumberExternalResqueWorker.process_all
end
The updated Cucumber test uses the new steps to avoid race conditions between the Cucumber test and the Resque jobs:
Scenario: Capturing an authorization successfully results in an email notification
  Given my name is Jools
  And I have a valid API session
  And the email queue is empty
  And I use a new capturable card authorization
  When I POST to API 1.0 "payments capture"
  And all queued jobs are processed
  And I open my newest email
  Then I should see a link to "the payment page for the last payment"
Resque is almost totally awesome
How Resque helped us
We were able to solve this problem because of how well-architected Resque is. Its support for POSIX signal handling and its built-in extension hooks made it really easy to exercise control over our child worker. We didn't have to monkey patch anything, and we used standard signals to control the workers. The fact that Resque manages its own workers and provides a single PID to control them was also a big help.
How Resque hurt us
The one big problem we ran into was that the pending jobs counter in Resque isn't atomic. When a job is processed, the worker decrements the number of pending jobs remaining, does a bunch of processing, and then increments the counter for jobs being worked. In between, an observer can briefly see no pending jobs and no jobs being worked even though work is still in flight. This turns out to be a problem when a Resque job spawns other jobs. We initially tried to use the Resque.info[:working] counters to track when our child worker had finished all the jobs, but because they don't update atomically, we would occasionally have child jobs that were never processed. We solved this by alias method chaining a counter into the perform methods in our base Resque worker class.
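Alias method chaining a counter into perform can be sketched as follows. The JobCounter module and BaseJob class are hypothetical names, and the in-memory counter is a stand-in: with the worker running in a separate process, a real counter would have to live somewhere both processes can read.

```ruby
# In-memory stand-in for a counter that would need to live somewhere both
# processes can see (an assumption; this sketch stays in one process).
module JobCounter
  @processed = 0

  class << self
    attr_reader :processed

    def reset
      @processed = 0
    end

    def increment
      @processed += 1
    end
  end
end

# Hypothetical base job class; Resque jobs expose a class-level `perform`.
class BaseJob
  def self.perform(*numbers)
    numbers.inject(0, :+)
  end

  class << self
    # Alias method chaining: keep the original under a new name, define a
    # wrapper, then point `perform` at the wrapper.
    alias_method :perform_without_counter, :perform

    def perform_with_counter(*args)
      perform_without_counter(*args)
    ensure
      JobCounter.increment   # count the job once it has finished, even on error
    end

    alias_method :perform, :perform_with_counter
  end
end

JobCounter.reset
result = BaseJob.perform(1, 2, 3)
puts "result=#{result} processed=#{JobCounter.processed}"
```

One limitation of this sketch: a subclass that defines its own perform without calling super bypasses the wrapper, so the chaining has to be applied wherever perform is actually defined.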
Being able to test our Resque workers in a full integration environment has been a huge improvement. We're big fans of Resque and Redis; they've been a pleasure to work with. Our CucumberExternalResqueWorker has been a great solution for us so far. There are a few features we'd like to add when we have time:
- Patch Resque to have atomic counters so we don't have to monkey patch our base worker.
- Add the ability to run all jobs in a specific queue.
- Add the ability to run exactly N jobs in a specific queue.
- Add timeouts for long-running jobs.
- Turn CucumberExternalResqueWorker into a gem.
- Add an ENV option to disable starting a Resque worker.
Hopefully our solution will help other people get up and running with full stack testing of Resque using Cucumber.