Upgrading to Chef 0.10, and using Environments

19 Jul 2011

I've been running Chef 0.9 for some months and have been looking forward to trying out 0.10, in particular the "environments" feature described in Opscode's preview blog post and in Chef's documentation.

The Old Environments

In Chef 0.9 I had implemented different environments as follows:

  • have separate machines running chef-server for each environment
  • have separate chef role definitions for each role × environment combination (e.g. webserver_demo, webserver_staging)
  • use attribute overrides in the roles to have per-environment behaviour
  • selectively load roles into servers (e.g. webserver_demo is not needed on staging)
  • assign roles at boot time from a custom client.rb (in the AMI) using JSON from instance data
  • the same mechanism also names the nodes to include the role and environment, e.g. webserver-demo-i-a1b2c3d4
  • use git to share a chef repository between environments
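As a rough illustration of the boot-time step, here is a minimal Ruby sketch of how a node name and run list could be derived from instance data. The JSON shape (the "role" and "environment" keys) and the boot_settings helper are assumptions made for this example, not the actual contents of my client.rb.

```ruby
require "json"

# Hypothetical helper: derive the node name and run list from the JSON
# passed in as instance data. The key names are assumptions.
def boot_settings(user_data_json, instance_id)
  data = JSON.parse(user_data_json)
  role = data.fetch("role")
  env  = data.fetch("environment")
  {
    # e.g. webserver-demo-i-a1b2c3d4
    node_name: "#{role}-#{env}-#{instance_id}",
    # one role per role x environment combination, e.g. role[webserver_demo]
    run_list: ["role[#{role}_#{env}]"]
  }
end

settings = boot_settings('{"role": "webserver", "environment": "demo"}', "i-a1b2c3d4")
puts settings[:node_name]
puts settings[:run_list].inspect
```

Using fetch rather than [] means a missing key fails loudly at boot instead of producing a node named after an empty string.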

This has worked well, but the role × environment matrix size explosion was a headache to manage, and keeping the right roles (and only the right roles) in the right servers was error-prone.

The New Environments

In Chef 0.10 environments are a first-class concept.

Upgrading some of the servers involved a little trial and error, but fresh gem installs worked fine, and upgrading the existing clients was trivial.

Because I restrict port 22 on internal machines with EC2 security groups, I immediately ran into an "ssh hang" bug (KNIFE_EC2-2), but the linked patch works fine.

To then migrate the configuration for existing machines:

  1. create environment .rb files, applying the attributes from the old role files
  2. create the environments with knife
  3. move all nodes to the new environments using nodes.transform
  4. create new role files, which have the roles/recipes but not the attributes from the old roles
  5. load the new roles, and apply to the relevant machines
  6. remove the old roles
  7. re-run chef-client and verify all roles succeed
  8. verify normal system operation
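Steps 1 and 4 amount to splitting each old role × environment definition into a per-environment set of attributes and a single attribute-free role. The sketch below shows that split with plain Ruby hashes; the role names, recipes and attributes are made up for illustration, and split_roles is a hypothetical helper, not part of Chef.

```ruby
# Old-style roles keyed by "<role>_<environment>". All values are hypothetical.
old_roles = {
  "webserver_demo" => {
    run_list: ["recipe[apache2]"],
    override_attributes: { "apache" => { "listen_port" => 8080 } }
  },
  "webserver_staging" => {
    run_list: ["recipe[apache2]"],
    override_attributes: { "apache" => { "listen_port" => 80 } }
  }
}

# Returns [environments, roles]:
#   environments maps environment name => attributes taken from the old roles
#   roles maps role name => run list only, with no attributes
def split_roles(old_roles)
  environments = {}
  roles = {}
  old_roles.each do |old_name, definition|
    # Split on the last underscore: "webserver_demo" => "webserver", "demo"
    role, _separator, env = old_name.rpartition("_")
    # Shallow merge: if two old roles in the same environment set the same
    # top-level key, the later one wins. Review such cases by hand.
    (environments[env] ||= {}).merge!(definition[:override_attributes])
    # The first definition seen for a role supplies its run list.
    roles[role] ||= { run_list: definition[:run_list] }
  end
  [environments, roles]
end

environments, roles = split_roles(old_roles)
p environments
p roles
```

The resulting hashes would still need to be written out as environment and role files and loaded with knife; the sketch only covers how the old definitions are separated.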

And that was basically that.

The next step was to review how I create machines.