After upgrading Chef to use environments, I needed to update my custom AMIs. These AMIs are based on Ubuntu's published Amazon EC2 AMIs, have some extra software pre-installed, and carry a custom Chef client.rb that reads its configuration (Chef server info and client roles) from EC2 userdata to bootstrap itself. I then use scripts to instantiate machines from those AMIs and pass them the appropriate userdata. This has been working great, but it is not "the Chef way": the recommendation is to use knife ec2 server create, which creates a machine and then SSHes in to bootstrap it. In Chef 0.9 I ran into various routing and SSH timing bugs that made this approach too unreliable, but in 0.10 those appear to have been resolved. The main advantage of this approach is that you don't need to make special AMIs; you just use the latest official Ubuntu ones, in any region/arch/store. The disadvantage is that you then have to wait for chef-client to install all the software, which in the case of Java and RVM/Ruby takes a long time.

So the challenge is to:

  • make sure that the "knife ec2 server create" method produces functional machines from stock AMIs for all my roles
  • use custom AMIs to preload software, and use them from my existing scripts (which use ec2-run-instances) for selected roles

I also wanted to take the opportunity to upgrade OS and sanitise my Ruby install.

For the OS I wanted to switch from Ubuntu 10.10 Maverick Meerkat to 11.04 Natty Narwhal. Chef 0.10.2 only includes bootstrap templates for Ubuntu 10.04 (ubuntu10.04-apt.erb and ubuntu10.04-gems.erb). The apt one can be adapted for 11.04 by changing "lucid" to "natty" in the APT repository line (or by pulling the release name out of lsb_release), but then you end up with Chef 0.9, so you also want to append "-0.10" to the release name.
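
As a rough sketch of that repository line, assuming the same apt.opscode.com layout the template uses, the release name can be taken from lsb_release and suffixed with "-0.10". The function name opscode_repo_line is made up for illustration and is not part of Chef or knife.

```shell
# Hypothetical helper: build the APT source line for the Opscode repository.
# $1: Ubuntu release codename; defaults to whatever lsb_release -cs reports.
# The "-0.10" suffix selects the Chef 0.10 packages rather than 0.9.
opscode_repo_line() {
  repo_codename="${1:-$(lsb_release -cs)}"
  echo "deb http://apt.opscode.com ${repo_codename}-0.10 main"
}

# Explicit codename, so this also runs on machines without lsb_release
opscode_repo_line natty
```

Called without an argument on the instance itself, the function falls back to lsb_release -cs, so the same template should work unchanged on a later release, provided a matching repository exists.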

Here I ran into an interesting issue: the apt template does an apt-get install -y chef, then writes settings to client.rb, and then runs chef-client for the initial bootstrap. The problem is that the package install also starts the /etc/init.d/chef-client service, which therefore runs before the modifications to client.rb are made and before the chef-client bootstrap runs. My template modifications set the node_name, so the first chef-client run registered the client under the default name (the host name) and the subsequent invocation failed; I also ended up with nodes in the wrong environment. I think there is actually a generic template bug here.
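
One way around this is Debian's policy-rc.d hook: invoke-rc.d consults /usr/sbin/policy-rc.d before starting a service, and an exit status of 101 means the action is forbidden by policy. The sketch below shows the mechanism against a temporary file rather than the real path; the function name write_deny_policy is hypothetical.

```shell
# Hypothetical helper: write a deny-all policy-rc.d script to the given path.
# Exit status 101 tells invoke-rc.d "action forbidden by policy".
write_deny_policy() {
  policy_path="$1"
  printf '#!/bin/sh\nexit 101\n' > "$policy_path"
  chmod 755 "$policy_path"
}

# Demonstrate with a temporary file instead of /usr/sbin/policy-rc.d
demo_policy="$(mktemp)"
write_deny_policy "$demo_policy"

# invoke-rc.d passes the init script name and the requested action
policy_status=0
sh "$demo_policy" chef-client start || policy_status=$?
echo "policy-rc.d exit status: $policy_status"
rm -f "$demo_policy"
```

In a real bootstrap the script would be written to /usr/sbin/policy-rc.d before apt-get install and removed afterwards, so that later package installs start their services normally.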

We're using Ruby and RVM for applications on some machine roles, and I've run into various situations where there has been confusion between the system ruby, apt, RVM in /usr/local, RVM in user home directories, various gemsets, and the chef-client and our applications. To reduce that confusion I wanted to try the apt install rather than the default gem install, and limit RVM to a per-user install. [Update: there are some unique issues, such as knife not finding plugins (CHEF-2483)]

The Knife Template

Pulling it all together I ended up with this knife template ubuntu11.04-apt.erb:

#!/bin/bash
# This is a knife ec2 server create template for Ubuntu 11.04.
# It is based on the ubuntu10.04-apt.erb version in the 0.10.2 Chef distribution
# available here:
# https://github.com/opscode/chef/blob/master/chef/lib/chef/knife/bootstrap/
# with modifications to:
# - use the natty APT repository
# - install Chef 0.10.2
# - avoid starting the /etc/init.d/chef-client service until the client.rb
#   has been written
# - let a CHEF_NODE_NAME_PREFIX environment variable prefix the node name

bash -c '
# MAK: use lsb-release to pick up release name, and add -0.10 to get chef 0.10
<% chef_server_url = Chef::Config[:chef_server_url] -%>
<% validation_client_name = Chef::Config[:validation_client_name] -%>
<% environment = Chef::Config[:environment] -%>
if [ ! -f /usr/bin/chef-client ]; then
  echo "chef    chef/chef_server_url    string  <%= chef_server_url %>" \
   | debconf-set-selections
  [ -f /etc/apt/sources.list.d/opscode.list ] || \
    echo "deb http://apt.opscode.com "`lsb_release -cs`"-0.10 main" \
    > /etc/apt/sources.list.d/opscode.list
  wget -O- http://apt.opscode.com/packages@opscode.com.gpg.key | apt-key add -
fi
apt-get update

# MAK: use policy-rc.d to prevent chef-client starting and registering
# before we write client.rb
(cat <<'EOP'
#!/bin/sh
exit 101
EOP
) > /usr/sbin/policy-rc.d
chmod 755 /usr/sbin/policy-rc.d

apt-get install -y chef

# MAK: remove policy-rc.d
rm -f /usr/sbin/policy-rc.d

<% unless validation_client_name == "chef-validator" -%>
grep -qx "validation_client_name \"<%= validation_client_name %>\"" \
    /etc/chef/client.rb \
 || echo "validation_client_name \"<%= validation_client_name %>\"" \
 >> /etc/chef/client.rb
<% end -%>

(
cat <<'EOP'
<%= IO.read(Chef::Config[:validation_key]) %>
EOP
) > /tmp/validation.pem
awk NF /tmp/validation.pem > /etc/chef/validation.pem
rm /tmp/validation.pem

<% if @config[:chef_node_name] %>
grep -qx "node_name \"<%= @config[:chef_node_name] %>\"" \
   /etc/chef/client.rb \
 || echo "node_name \"<%= @config[:chef_node_name] %>\"" \
 >> /etc/chef/client.rb
<% end -%>

# MAK: use an environment variable to pass in a hostname prefix,
# so your node gets called e.g. web-server-i-123abc
# The ec2metadata check runs on the new instance, not on the workstation.
<% unless ENV['CHEF_NODE_NAME_PREFIX'].nil? -%>
if [ -x /usr/bin/ec2metadata ]; then
(
cat <<'EOP'
node_name "<%= ENV['CHEF_NODE_NAME_PREFIX'] %>`ec2metadata --instance-id`"
EOP
) >> /etc/chef/client.rb
fi
<% end -%>

<% unless (environment == "" or environment == "_default") -%>
grep -qx "environment \"<%= environment %>\"" /etc/chef/client.rb \
 || echo "environment \"<%= environment %>\"" >> /etc/chef/client.rb
<% end -%>

(
cat <<'EOP'
<%= { "run_list" => @run_list }.to_json %>
EOP
) > /etc/chef/first-boot.json

/usr/bin/chef-client -j /etc/chef/first-boot.json

# MAK: start chef-client because we prevented that previously
/etc/init.d/chef-client start
'

which you can use like this:

export CHEF_NODE_NAME_PREFIX=webserver-
knife ec2 server create -r "role[webserver]" \
  -I ami-ab16d2c2 --flavor m1.large -G webserver_demo \
  -x ubuntu --ssh-key demo-kp1 \
  --template ubuntu11.04-apt.erb \
  --environment demo

This works well for bringing up a generic instance with a given role from the command line, after which Chef kicks in and configures the machine.
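
The template writes several settings to client.rb with an append-only-if-missing pattern. As a minimal sketch of that pattern in isolation, the helper below tests grep's exit status directly, since grep -q prints nothing. The name append_once and the sample values are made up for illustration.

```shell
# Hypothetical helper: append a line to a file unless an identical line
# is already present. -x matches whole lines, -F treats the text literally.
append_once() {
  wanted_line="$1"
  target_file="$2"
  grep -qxF -- "$wanted_line" "$target_file" 2>/dev/null \
    || echo "$wanted_line" >> "$target_file"
}

# Demonstrate on a temporary file standing in for /etc/chef/client.rb
demo_conf="$(mktemp)"
append_once 'node_name "webserver-i-123abc"' "$demo_conf"
append_once 'node_name "webserver-i-123abc"' "$demo_conf"   # no second copy
append_once 'environment "demo"' "$demo_conf"
demo_lines=$(grep -c '' "$demo_conf")
echo "lines written: $demo_lines"
rm -f "$demo_conf"
```

Written this way, re-running the bootstrap on the same machine should leave client.rb unchanged rather than accumulating duplicate settings.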

The AMI

To create an AMI there are two approaches: snapshot a running instance, or build an AMI using loopback mounts and chroot. The former is somewhat easier, the latter is more secure and precise, and is recommended for public AMIs. For a discussion, see Eric Hammond's posts on Creating Public AMIs Securely for EC2 and Building EBS Boot AMIs Using Canonical's Downloadable EC2 Images.

For my private AMI I decided to use the simpler snapshot approach, at least initially while developing the install sequence, and I've split it into separate scripts for easier testing. See my GitHub create-ami repo.