Zero Downtime Deploys for Rails

Pedro Belo at RailsConf 2012

Ruby on Rails-focused, but equally applicable to webapps in JavaScript/Node.js and beyond. Describes techniques for deploying without downtime, even when a deployment includes database schema changes. While downtime for database changes can in my view be excused, the HTML-compatibility sections are very applicable to JavaScript-heavy sites that may run for extended periods in the client's browser. Some of the suggested technical solutions are irrelevant in 2018, but the overall approach is timeless.

Database Compatibility

Consider removing a model attribute and its related database column. Along with the necessary code changes, you create a database migration:

class ApparentlyHarmlessMigration < ActiveRecord::Migration
  def self.up
    remove_column :users, :notes
  end
end

Upon deploying, however, you're likely to see your existing webapp instances raise database errors:

PGError: ERROR: column "notes" does not exist

Ruby on Rails' ActiveRecord, like most ORMs, caches column names for performance in the production environment. App instances that were started before your deployment and are still servicing requests therefore still reference the removed notes column. Note this is likely the case even if you're not using a full ORM and are passing plain objects to a query library (akin to query.table("users").insert(user)), as those objects probably still refer to the now non-existent column.
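The caching problem can be reproduced without a database. The following is a minimal sketch in plain Ruby, not ActiveRecord itself: CachedModel and the schema hash are hypothetical stand-ins for a model class and the live database schema.

```ruby
# Plain-Ruby stand-in for an ORM model (hypothetical, not ActiveRecord).
class CachedModel
  def initialize(table, schema)
    @table = table
    # Column names are read from the schema once, at boot, and then reused.
    @columns = schema.fetch(table).dup
  end

  # Builds an INSERT statement from the cached column list.
  def insert_sql
    placeholders = (1..@columns.size).map { |i| "$#{i}" }
    "INSERT INTO #{@table} (#{@columns.join(', ')}) VALUES (#{placeholders.join(', ')})"
  end
end

schema = { "users" => ["id", "email", "notes"] }
old_instance = CachedModel.new("users", schema) # booted before the deploy

schema["users"].delete("notes")                 # the migration runs
new_instance = CachedModel.new("users", schema) # booted after the deploy

puts old_instance.insert_sql # still lists the dropped notes column
puts new_instance.insert_sql
```

The old instance keeps generating SQL that names notes until it is restarted, which is the window in which the PGError above shows up.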

Pedro proposes approaching this problem through hot compatibility: ensuring that every two consecutive deployments are compatible with each other and able to run in parallel. For example, removing a column requires 2 deployments on top of the starting state, which is listed first:

Write to notes.

Stop referencing notes while leaving it in the database:

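class User < ActiveRecord::Base
  # Hide notes from ActiveRecord so that no query references it,
  # while the column itself stays in the database for now.
  # (Rails 5 and later offer self.ignored_columns for the same purpose.)
  def self.columns
    super.reject { |column| column.name == "notes" }
  end
end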

Remove notes from the database.

For renaming columns, Pedro suggests temporarily writing to both columns, requiring a total of 3 deployments after the starting state:

Read and write to notes.
Add column remarks. Read from notes, write to notes and remarks.
Populate remarks where needed; read and write only to remarks.
Remove notes.
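The dual-write stage in the list above can be sketched in plain Ruby. This is only an illustration under assumptions: the User class here is a plain object rather than an ActiveRecord model, and backfill_remarks! is a hypothetical helper name. In a real Rails app the same idea would more likely live in a model callback and a batched data migration.

```ruby
# Plain-Ruby stand-in for the model during the dual-write deployment.
class User
  attr_reader :notes, :remarks

  def initialize(notes: nil, remarks: nil)
    @notes = notes
    @remarks = remarks
  end

  # Keep reading from notes, but write every change to both columns.
  def notes=(value)
    @notes = value
    @remarks = value
  end

  # Hypothetical backfill step for rows written before dual writes existed.
  def backfill_remarks!
    @remarks = @notes if @remarks.nil?
  end
end

legacy = User.new(notes: "old") # written before the dual-write deployment
fresh = User.new
fresh.notes = "new"             # written by the dual-write deployment

[legacy, fresh].each(&:backfill_remarks!)
puts legacy.remarks
puts fresh.remarks
```

Once every row has remarks populated, the next deployment can read and write remarks only, and the one after that can drop notes.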

HTML Compatibility

Beyond model and database mismatches, you're likely to see problems with HTML forms and API requests that were submitted from a page rendered by a previous version of the app. Take a form on the signin page with a renamed field for example:

 <form method="post" action="/session">
-  <input name="username">
+  <input name="email">
   <button>Sign in</button>
 </form>

In a fully server-side rendered app this is perhaps not that problematic: submitting the old form would merely tell the visitor they had not filled in their email field while presenting them with the updated form¹. A JavaScript-heavy signin form (or a single-page app), on the other hand, is bound to just break entirely. With both server- and client-side approaches there's a high risk of losing the person's submitted information, and that's a cardinal sin of user experience.

A solution is to be backwards compatible with old requests and migrate parameters in the controller:

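class AuthController < ApplicationController
  # Accept requests from forms rendered by the previous deployment,
  # which still submit the field under its old name.
  def filtered_params
    params.dup.tap do |migrated|
      # Only rename when the old field is present, so that a new-style
      # request's email is never overwritten with nil.
      migrated[:email] = migrated.delete(:username) if migrated.key?(:username)
    end
  end
end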

There's also a potential problem with page assets. If you're removing or renaming CSS stylesheets or JavaScript files, a person loading your page in the middle of a deployment may reference assets you've removed in the latest version. Functional and visual anomalies can also happen if you change stylesheets or scripts in ways that are incompatible with the HTML of the previous deployment. Pedro proposes versioning assets and keeping older assets around for some time.
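One way to picture versioned assets is content fingerprinting plus a retention window. The sketch below is an assumption-laden illustration, not the mechanism from the talk: fingerprint, AssetStore and the keep parameter are hypothetical names, and assets are held in memory instead of on disk or a CDN.

```ruby
require "digest"

# Returns a content-addressed file name such as app-<digest>.css.
def fingerprint(path, content)
  ext = File.extname(path)
  "#{path.chomp(ext)}-#{Digest::MD5.hexdigest(content)[0, 8]}#{ext}"
end

# Keeps the assets of the last `keep` releases available (hypothetical class).
class AssetStore
  def initialize(keep: 2)
    @keep = keep
    @releases = [] # each release maps fingerprinted name => content
  end

  # Publishes a release and drops releases older than the retention window.
  def publish(assets)
    release = assets.map { |path, content| [fingerprint(path, content), content] }.to_h
    @releases.push(release)
    @releases.shift while @releases.size > @keep
    release.keys
  end

  # Looks an asset up in the retained releases, newest first.
  def fetch(name)
    @releases.reverse_each do |release|
      return release[name] if release.key?(name)
    end
    nil
  end
end

store = AssetStore.new(keep: 2)
old_names = store.publish("app.css" => "body { color: black }")
new_names = store.publish("app.css" => "body { color: navy }")

# Pages rendered by the previous deployment can still load their stylesheet.
puts store.fetch(old_names.first)
```

Because the name changes whenever the content changes, old and new HTML never compete for the same URL, and the retention window decides how long pages from earlier deployments keep working.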

Migration Strategies

Once you get your app's consecutive versions compatible, it's important to ensure your deployment process doesn't prevent existing app instances from servicing requests. It's easy to accidentally lock entire database tables through migrations.

For PostgreSQL, the locking implications are:

Operation	Lock duration
`ADD COLUMN`	O(n) lock if you set a default value (writes to every row). O(1) otherwise.
`ALTER COLUMN`	O(n) lock when the change rewrites the table (e.g. changing the type). Setting or dropping a default is O(1).
`DROP COLUMN`	O(1)
`CREATE INDEX`	O(n) lock unless you create via `CREATE INDEX CONCURRENTLY`.
`DROP INDEX`	O(1)

To minimize the time spent holding a lock on the entire table when adding a column with a default value, split the operation into two parts: one that adds the column with no default value and another that sets the default:

ALTER TABLE users ADD COLUMN notes TEXT;
ALTER TABLE users ALTER COLUMN notes SET DEFAULT '';

Note that this will leave existing rows with NULL in the new column and the column itself nullable. You should probably do batched updates (e.g. setting a few thousand rows at a time) at some point to fill in the missing values and then alter the column to NOT NULL.
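The batched backfill can be sketched as a small statement generator. This is a hedged illustration only: id_batches and backfill_statements are hypothetical helpers, the id range and batch size are made-up values, and the sketch assumes an integer id primary key. It prints SQL instead of executing it, so each statement can be run and committed separately.

```ruby
# Yields inclusive [first_id, last_id] pairs covering min_id..max_id.
def id_batches(min_id, max_id, batch_size)
  return enum_for(:id_batches, min_id, max_id, batch_size) unless block_given?

  min_id.step(max_id, batch_size) do |first|
    yield first, [first + batch_size - 1, max_id].min
  end
end

# Builds one UPDATE per batch so that no single statement touches every row.
def backfill_statements(table, column, value_sql, min_id, max_id, batch_size)
  id_batches(min_id, max_id, batch_size).map do |first, last|
    "UPDATE #{table} SET #{column} = #{value_sql} " \
      "WHERE #{column} IS NULL AND id BETWEEN #{first} AND #{last};"
  end
end

statements = backfill_statements("users", "notes", "''", 1, 2500, 1000)
# Only after every batch has run is it safe to tighten the column.
statements << "ALTER TABLE users ALTER COLUMN notes SET NOT NULL;"
puts statements
```

The WHERE clause skips rows that already have a value, so the batches can be re-run safely if the backfill is interrupted.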

PostgreSQL 11 and later can add a column with a constant default value without rewriting the table, which makes this split unnecessary there.

Server

To do zero downtime deployments, you'll also need a web server that can start up new instances and gracefully shut down old ones while letting in-flight requests finish on their own.

This can be handled "above the stack" by a load balancer or proxy (Nginx, HAProxy or Heroku's platform) in front of your Ruby webserver, which optionally coordinates with your app on when to stop routing requests to it. Alternatively, you can go with Unicorn, which can also handle graceful restarts.
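What "graceful" means here can be shown with a toy worker. The sketch below is an assumption: GracefulWorker is a hypothetical class that stands in for one app instance, and it is not how Unicorn or any load balancer is implemented. It only illustrates the contract: stop accepting new requests, finish the ones already taken, then exit.

```ruby
# Toy stand-in for one app instance (hypothetical, single caller thread).
class GracefulWorker
  def initialize(&handler)
    @handler = handler
    @queue = Queue.new
    @accepting = true
    @thread = Thread.new { run }
  end

  # Returns false once shutdown has begun, so requests can be routed elsewhere.
  def accept(request)
    return false unless @accepting

    @queue << request
    true
  end

  # Stop accepting, let queued and in-flight requests finish, then exit.
  def shutdown
    @accepting = false
    @queue.close
    @thread.join
  end

  private

  # pop returns nil once the queue is closed and drained, ending the loop.
  def run
    while (request = @queue.pop)
      @handler.call(request)
    end
  end
end

handled = []
worker = GracefulWorker.new do |request|
  sleep 0.01 # simulate work
  handled << request
end

worker.accept("GET /a")
worker.accept("GET /b")
worker.shutdown                        # blocks until both requests are done
rejected = !worker.accept("GET /c")    # the old instance takes no new work
puts handled.inspect
```

During a deploy, new instances would start accepting traffic while each old instance goes through this shutdown sequence.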

¹ Although imagine the resulting frustration if that message instead instructed the visitor to click the browser's back button and try again, only to have the error persist.