A Cosmic Dance in a Little Box

It’s Hack Week again. This time around I decided to look at running TripleO on openSUSE. If you’re not familiar with TripleO, it’s short for OpenStack on OpenStack, i.e. it’s a project to deploy OpenStack clouds on bare metal, using the components of OpenStack itself to do the work. I take some delight in bootstrapping of this nature – I think there’s a nice symmetry to it. Or, possibly, I’m just perverse.

Anyway, onwards. I had a chat to Robert Collins about TripleO while at PyCon AU 2013. He introduced me to diskimage-builder and suggested that making it capable of building openSUSE images would be a good first step. It turned out that making diskimage-builder actually run on openSUSE was probably a better first step, but I managed to get most of that out of the way in a random fit of hackery a couple of months ago. Further testing this week uncovered a few more minor kinks, two of which I’ve fixed here and here. It’s always the cross-distro work that seems to bring out the edge cases.

Then I figured there’s not much point making diskimage-builder create openSUSE images without knowing I can set up some sort of environment to validate them. So I’ve spent large parts of the last couple of days working my way through the TripleO Dev/Test instructions, deploying the default Ubuntu images with my openSUSE 12.3 desktop as VM host. For those following along at home the install-dependencies script doesn’t work on openSUSE (some manual intervention required, which I’ll try to either fix, document, or both, later). Anyway, at some point last night, I had what appeared to be a working seed VM, and a broken undercloud VM which was choking during cloud-init:

Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed
Request timed out

Figuring that out, well…  There I was with a seed VM deployed from an image built with some scripts from several git repositories, automatically configured to run even more pieces of OpenStack than I’ve spoken about before, which in turn had attempted to deploy a second VM, which wanted to connect back to the first over a virtual bridge and via the magic of some iptables rules and I was running tcpdump and tailing logs and all the moving parts were just suddenly this GIANT COSMIC DANCE in a tiny little box on my desk on a hill on an island at the bottom of the world.

It was at this point I realised I had probably been sitting at my computer for too long.

It turns out the problem above was due to my_ip being set to an empty string in /etc/nova/nova.conf on the seed VM. Somehow I didn’t have the fix in my local source repo. An additional problem is that libvirt on openSUSE, like Fedora, doesn’t set uri_default="qemu:///system". This causes nova baremetal calls from the seed VM to the host to fail, as mentioned in bug #1226310. That bug is reportedly fixed, but the fix doesn’t seem to work for me (another thing to investigate), so I went with the workaround of putting uri_default="qemu:///system" in ~/.config/libvirt/libvirt.conf.
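For anyone wanting to script that libvirt workaround, here is a minimal sketch. The function name and the optional path argument are my own invention for illustration; only the uri_default line and the default file location come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: add the uri_default workaround to a libvirt client
# config file, but only if no uri_default setting is present yet.
set_libvirt_default_uri() {
    conf="${1:-$HOME/.config/libvirt/libvirt.conf}"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Leave the file alone if an uncommented uri_default already exists.
    if ! grep -q '^[[:space:]]*uri_default[[:space:]]*=' "$conf"; then
        printf 'uri_default="qemu:///system"\n' >> "$conf"
    fi
}
```

Calling set_libvirt_default_uri with no argument touches the per-user file mentioned above. Running it twice should leave a single uri_default line, which is the point of the grep guard.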

So now (after a rather spectacular amount of disk and CPU thrashing) there are three OpenStack clouds running on my desktop PC. No smoke has come out.

  • The seed VM has successfully spun up the “baremetal_0” undercloud VM and deployed OpenStack to it.
  • The undercloud VM has successfully spun up the “baremetal_1” and “baremetal_2” VMs and deployed them as the overcloud control and compute nodes.
  • I have apparently booted a demo VM in the overcloud, i.e. I’ve got a VM running inside a VM, although I haven’t quite managed to ssh into the latter yet (I suspect I’m missing a route or a firewall rule somewhere).

I think I had it right last night. There is a giant cosmic dance being performed in a tiny little box on my desk on a hill on an island at the bottom of the world.

Or, I’ve been sitting at my computer for too long again.

Coda

Keep your eyes on the road, your hands upon the wheel
— Roadhouse, The Doors

It’s been a bit over two and a half months since I rolled my car on an icy corner on the way to the last day of PyCon Au 2013. My body works properly again modulo some occasional faint stiffness in my right side and I’ve been discharged by my physiotherapist, so I thought publishing some retrospective thoughts might be appropriate.


I Want My… I Want My… NBN

I thought I’d start this post with some classic Dire Straits, largely for the extreme tech culture shock value, but also because “MTV” has the same number of syllables as “NBN”. I’ll wait while you watch it.

Now that that’s out of the way, I thought it might be interesting to share my recent NBN experience. Almost a year ago I said “Yay NBN! Bring it on! Especially if I don’t get stuck on satellite”. Thankfully it turns out I didn’t get stuck on satellite, instead I got NBN fixed wireless. For the uninitiated, this involves an antenna on your roof, pointed at a nearby tower:

An NTD (Network Terminating Device) is affixed to a wall inside your house:

The box on the left is the NTD, the box on the right is the line from the antenna on the roof. Not pictured are the four ports on the bottom of the NTD, one of which is now plugged into a BoB Lite from iiNet. According to the labels on the NTD, the unit remains the property of NBN Co, and you’re not allowed to tamper with it:

Getting online is easy. Whatever you plug into the NTD just gets an IP address via DHCP – there’s none of that screwing around with PPPoE usernames and passwords. The BoB Lite lets you configure one of its ethernet ports as a “WAN source” instead of using ADSL, which is what I’m doing. Alternatively you could plug in a random Linux box and use that as a router, or even just plug your laptop straight in, which is what I did later when trying to diagnose a fault.

The wireless itself is LTE/4G, but unlike 4G on your mobile phone (which gets swamped to a greater or lesser degree when lots of people are in the same place), each NBN fixed wireless tower apparently serves a set number of premises (see the fact sheet PDF), so speed should remain relatively consistent. Here are the obligatory speed tests, first from my ADSL connection:

And here’s what I get via NBN wireless:

Boo-yah!

Speaking as someone who works from home and who has to regularly download and upload potentially large amounts of data, this is a huge benefit. Subjectively, random web browsing doesn’t seem wildly different than before, but suddenly being able to download openSUSE ISOs, update test VMs, and pull build dependencies at ~2 megabytes per second has markedly decreased the amount of time I spend sitting around waiting. And let’s not leave uploads out of the picture here – I push code up to github, I publish my blog, I upload tarballs, I contribute to wikis. I’ve seen too much discussion of FTTP vs. FTTN focus on download speed only, which seems to assume that we’re all passive consumers of MTV videos. OK, fine, I’m on wireless and FTTP is never going to be an option where I live, but I don’t want anyone to lose sight of the fact that being able to produce and upload content is a vital part of participating in our shiny new digital future. A reliable connection with decent download and upload speed is vital.

Now that I’ve covered the happy part of my NBN experience, I’ll also share the kinks and glitches for completeness. I rang up to get connected on August 1st. At the time, the next available installation appointment was August 21st. On that day I got a call saying the technician wouldn’t be able to make it because his laptop was broken. I offered to let them use my laptop instead, but apparently this isn’t possible, so the installation was rescheduled for September 6th. All attempts at escalating this (i.e. getting the subsequent 2.5 week delay reduced, because after all it was their fault they had a broken laptop) failed. By the time the right person at iiNet was able to rattle enough sabres in the direction of NBN Co, the new installation date was close enough that it didn’t make a difference anymore. To be clear, as far as I can tell, iiNet is not at fault; the problem seems to be one of bureaucracy combined with probable understaffing/overdemand of NBN Co (apparently this newfangled interwebs thingy is popular). As an aside, I mentioned the “broken laptop” problem to a friend, who said a friend of his had also had installation rescheduled with the same excuse. I’m not quite sure what to make of that, but I will state for the record that I seriously doubt our new fearless leaders would have been able to make matters any better had they been in power at the time.

Anyway, installation finally went ahead and all was sweetness and light for just under three weeks, until one of those wacky spring days where it’s sun, then howling wind, then sideways rain, then sun again, then two orange signal strength lights and a red ODU light:

ODU stands for Outdoor Unit, which is apparently the antenna. For at least an hour, the lights cycled between red and green ODU, and two orange / three green signal strength. Half an hour on the phone to iiNet support, as well as plugging a laptop directly into the NTD, didn’t get me anywhere. I managed to get a DHCP lease on my laptop very briefly at one point, but otherwise the connection was completely hosed. The exceptionally courteous and helpful woman in iiNet’s Fibre team logged a fault with NBN Co, and I switched my ADSL – which I had kept for just such an emergency – back on, so that I could get some work done.

The next day everything was green, so I plugged back into the NTD. About an hour after this, I got a call from the same woman at iiNet saying they’d noticed I had a connection again and had thus cancelled the fault, but that she was going to keep an eye on things for a little while, and would check back with me over the next week or two. I also received an SMS with her direct email address, so I can advise her of any further trouble. I’ve since got smokeping running here against what I hope is a useful set of remote hosts so I’ll notice if anything goes freaky while I’m asleep or otherwise away from my desk:

I really wish I knew how to get a console on the NTD, or view some logs. I was told it’s basically a dumb box, but surely it knows a bit more internally than it indicates on the blinkenlights. As I said on Twitter the other day, I feel like a mechanic trapped in the driver’s seat of a car, with access only to the dashboard lights. Oh, well, fingers crossed…

Rebooted Pork: A Recipe

Recording this for posterity.

Ingredients:

  • Leftover roast pork (optimally home grown free range and happy, about which another post when I find the time), diced or thinly sliced.
  • A few onions, a couple of potatoes, some mushrooms, all thinly sliced.
  • Capsicum (red or green or both), diced.
  • Crushed garlic.
  • BBQ sauce (plus possibly tomato paste – see how you go).
  • Butter.
  • Salt.
  • Pepper (bonus points for Tasmanian bush pepper).
  • Oregano.
  • Cheese, chilli sauce, tortillas (optional).

Procedure:

  1. Heat frypan, add pork.
  2. Add onions and garlic, fry for a while.
  3. Add butter, potatoes, mushrooms, BBQ sauce, salt, pepper, oregano, fry some more.
  4. Cover, lower heat, add water and/or tomato paste if it seems necessary.
  5. Go away for at least half an hour. Check your email. Read Twitter. Do some actual useful work. But make sure you’re within smelling distance of the kitchen, just in case.
  6. Add capsicum.
  7. Wait a bit more, depending on how well done you like your capsicum.
  8. Serve, either in a bowl or wrapped in a tortilla, with or without cheese and chilli sauce according to taste.

Achievement Unlocked: A Cautionary Tale

This is the story of how I rolled my car and somehow didn’t die on the way to the last day of PyCon Au 2013. I’m writing this partly for catharsis (I think it helps emotionally to talk about events like this) and partly in the hope that my experience will remind others to be safe on the roads. If you find descriptions of physical and mental trauma distressing, you should probably stop reading now, but rest assured that I am basically OK.

The Pieces of OpenStack

I gave an introductory “WTF is OpenStack” talk at the OpenStack miniconf at PyCon-AU this morning. For the uninitiated, this eventually boiled down to defining OpenStack as:

A framework that allows you to efficiently and dynamically manage virtualization and storage, so your users can request computing power and disk space as they need it.

It was suggested, given recent developments, that I might change this to:

A framework that allows you to efficiently and dynamically manage ALL THE THINGS!

Leaving that aside for the moment though, several people pointed out that my state transition diagram serves to very clearly elucidate how the various pieces of OpenStack interact when someone wants to deploy a VM. Here it is for posterity:

There’s also an etherpad covering the day’s talks at https://etherpad.openstack.org/pyconau-os-hack. Note that the images in the above diagram (aside from the hand-drawn Tim) are courtesy of Martin Loschwitz.

Telework, Telework and the NBN

A few related (or semi-related) bits and pieces which turned up over the last month or so. The first is a video I was sent from onlinemba.com, entitled Telecommuting is Good for You and Good for Business. It cites a few studies showing greater productivity, reduced turnover and eco-friendliness (which I would tend to agree with based on personal experience). Unfortunately I can’t seem to see links to the studies themselves, but you should go watch it anyway, because it’s one of those cute “live animated” things, which I’m a sucker for.

That leads neatly to the second thing, which is from the iiNet blog, entitled The benefits of working from home. As with the above video, Yahoo gets a prominent mention for not allowing telework (although reports on that seem to be somewhat mixed). Anyway, benefits cited include lack of commute, lower overhead and fewer distractions, but the blog post does also make the very good point that you need to figure out whether or not telework is actually right for you. It also offers some tips I tend to agree with.

Now we get to the semi-related bit. Assuming telework is right for you, you want a good internet connection. As I’ve said before, “Yay NBN! Bring it on!”. As much disenchantment as I have with the major parties, this is something Labor is actually doing right. For an easy-to-appreciate comparison of what Labor and the LNP are proposing, check out How Fast is the NBN. Teleworkers, pay particular attention to the simulated Dropbox sync.

One More chef-client Run

Carrying on from my last post, the failed chef-client run came down to the init script in ceph 0.56 not yet knowing how to iterate /var/lib/ceph/{mon,osd,mds} and automatically start the appropriate daemons. This functionality seems to have been introduced in 0.58 or so by commit c8f528a. So I gave it another shot with a build of ceph 0.60.
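As a rough illustration of what “iterating /var/lib/ceph/{mon,osd,mds}” means in practice, here is a small sketch. It is not the code from the ceph init script or from commit c8f528a; the function name is made up. The only assumption it relies on is the <cluster>-<id> directory naming visible in the chef-client logs in these posts (e.g. /var/lib/ceph/mon/ceph-ceph-0 for mon.ceph-0).

```shell
#!/bin/sh
# Hypothetical helper: print the daemons implied by the data directories
# under a Ceph state root, one "type.id" per line.
list_ceph_daemons() {
    root="${1:-/var/lib/ceph}"
    cluster="${2:-ceph}"
    for kind in mon osd mds; do
        for dir in "$root/$kind/$cluster"-*; do
            # An unmatched glob stays literal, so skip anything that is
            # not actually a directory.
            [ -d "$dir" ] || continue
            base="${dir##*/}"          # e.g. ceph-ceph-0
            id="${base#"$cluster"-}"   # e.g. ceph-0
            echo "$kind.$id"
        done
    done
}
```

An init script could then loop over that output and start each daemon it finds, with no per-daemon sections needed in the config file.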

On each of my ceph nodes, a bit of upgrading and cleanup. Note the choice of ceph 0.60 was mostly arbitrary, I just wanted the latest thing I could find an RPM for in a hurry. Also some of the rm invocations won’t be necessary, depending on what state things are actually in:

# zypper ar -f http://download.opensuse.org/repositories/home:/dalgaaf:/ceph:/extra/openSUSE_12.3/home:dalgaaf:ceph:extra.repo
# zypper ar -f http://gitbuilder.ceph.com/ceph-rpm-opensuse12-x86_64-basic/ref/next/x86_64/ ceph.com-next_openSUSE_12_x86_64
# zypper in ceph-0.60
# kill $(pidof ceph-mon)
# rm /etc/ceph/*
# rm /var/run/ceph/*
# rm -r /var/lib/ceph/*/*

That last gets rid of any half-created mon directories.

I also edited the Ceph environment to only have one mon (one of my colleagues rightly pointed out that you need an odd number of mons, and I had declared two previously, for no good reason). That’s knife environment edit Ceph on my desktop, and set "mon_initial_members": "ceph-0" instead of "ceph-0,ceph-1".
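The odd-number advice comes down to majority arithmetic: the mons need a strict majority to form a quorum, so an even count buys no extra failure tolerance over the odd count below it. A quick sketch (the function name is just for illustration):

```shell
#!/bin/sh
# How many mon failures can a cluster of n mons survive, given that a
# quorum needs a strict majority (n / 2 + 1, integer division)?
mon_failures_tolerated() {
    n="$1"
    quorum=$(( n / 2 + 1 ))
    echo $(( n - quorum ))
}

for n in 1 2 3 4 5; do
    echo "$n mon(s): quorum $(( n / 2 + 1 )), tolerates $(mon_failures_tolerated "$n") failure(s)"
done
```

Two mons tolerate zero failures, the same as one, which is why declaring two was no better than declaring one.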

I also had to edit each node to add an osd_devices array, and to remove the mon role from ceph-1. That’s knife node edit ceph-0.example.com then insert:

  "normal": {
    ...
    "ceph": {
      "osd_devices": [  ]
    }
  ...

Without the osd_devices array defined, the osd recipe fails (“undefined method `each_with_index' for nil:NilClass”). I was kind of hoping an empty osd_devices array would allow ceph to use the root partition. No such luck, the cookbook really does expect you to be doing a sensible deployment with actual separate devices for your OSDs. Oh, well. I’ll try that another time. For now at least I’ve demonstrated that ceph-0.60 does give you what appears to be a clean mon setup when using the upstream cookbooks on openSUSE 12.3:

knife ssh name:ceph-0.example.com -x root chef-client
[2013-04-15T06:32:13+00:00] INFO: *** Chef 10.24.0 ***
[2013-04-15T06:32:13+00:00] INFO: Run List is [role[ceph-mon], role[ceph-osd], role[ceph-mds]]
[2013-04-15T06:32:13+00:00] INFO: Run List expands to [ceph::mon, ceph::osd, ceph::mds]
[2013-04-15T06:32:13+00:00] INFO: HTTP Request Returned 404 Not Found: No routes match the request: /reports/nodes/ceph-0.example.com/runs
[2013-04-15T06:32:13+00:00] INFO: Starting Chef Run for ceph-0.example.com
[2013-04-15T06:32:13+00:00] INFO: Running start handlers
[2013-04-15T06:32:13+00:00] INFO: Start handlers complete.
[2013-04-15T06:32:13+00:00] INFO: Loading cookbooks [apache2, apt, ceph]
[2013-04-15T06:32:13+00:00] INFO: Processing template[/etc/ceph/ceph.conf] action create (ceph::conf line 6)
[2013-04-15T06:32:13+00:00] INFO: template[/etc/ceph/ceph.conf] updated content
[2013-04-15T06:32:13+00:00] INFO: template[/etc/ceph/ceph.conf] mode changed to 644
[2013-04-15T06:32:13+00:00] INFO: Processing service[ceph_mon] action nothing (ceph::mon line 23)
[2013-04-15T06:32:13+00:00] INFO: Processing execute[ceph-mon mkfs] action run (ceph::mon line 40)
creating /var/lib/ceph/tmp/ceph-ceph-0.mon.keyring
added entity mon. auth auth(auid = 18446744073709551615 key=AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g== with 0 caps)
ceph-mon: mon.noname-a 192.168.4.118:6789/0 is local, renaming to mon.ceph-0
ceph-mon: set fsid to f80aba97-26c5-4aa3-971e-09c5a3afa32f
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph-0 for mon.ceph-0
[2013-04-15T06:32:14+00:00] INFO: execute[ceph-mon mkfs] ran successfully
[2013-04-15T06:32:14+00:00] INFO: execute[ceph-mon mkfs] sending start action to service[ceph_mon] (immediate)
[2013-04-15T06:32:14+00:00] INFO: Processing service[ceph_mon] action start (ceph::mon line 23)
[2013-04-15T06:32:15+00:00] INFO: service[ceph_mon] started
[2013-04-15T06:32:15+00:00] INFO: Processing ruby_block[tell ceph-mon about its peers] action create (ceph::mon line 64)
mon already active; ignoring bootstrap hint

[2013-04-15T06:32:16+00:00] INFO: ruby_block[tell ceph-mon about its peers] called
[2013-04-15T06:32:16+00:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 79)
2013-04-15 06:32:16.872040 7fca8e297780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-15 06:32:16.872042 7fca8e297780 -1 unable to authenticate as client.admin
2013-04-15 06:32:16.872400 7fca8e297780 -1 ceph_tool_common_init failed.
[2013-04-15T06:32:18+00:00] INFO: ruby_block[get osd-bootstrap keyring] called
[2013-04-15T06:32:18+00:00] INFO: Processing package[gdisk] action upgrade (ceph::osd line 37)
[2013-04-15T06:32:27+00:00] INFO: package[gdisk] upgraded from uninstalled to 
[2013-04-15T06:32:27+00:00] INFO: Processing service[ceph_osd] action nothing (ceph::osd line 48)
[2013-04-15T06:32:27+00:00] INFO: Processing directory[/var/lib/ceph/bootstrap-osd] action create (ceph::osd line 67)
[2013-04-15T06:32:27+00:00] INFO: Processing file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] action create (ceph::osd line 76)
[2013-04-15T06:32:27+00:00] INFO: entered create
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] owner changed to 0
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] group changed to 0
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] mode changed to 440
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] created file /var/lib/ceph/bootstrap-osd/ceph.keyring.raw
[2013-04-15T06:32:27+00:00] INFO: Processing execute[format as keyring] action run (ceph::osd line 83)
creating /var/lib/ceph/bootstrap-osd/ceph.keyring
added entity client.bootstrap-osd auth auth(auid = 18446744073709551615 key=AQAOl2tR0M4bMRAAatSlUh2KP9hGBBAP6u5AUA== with 0 caps)
[2013-04-15T06:32:27+00:00] INFO: execute[format as keyring] ran successfully
[2013-04-15T06:32:28+00:00] INFO: Chef Run complete in 14.479108446 seconds
[2013-04-15T06:32:28+00:00] INFO: Running report handlers
[2013-04-15T06:32:28+00:00] INFO: Report handlers complete

Witness:

ceph-0:~ # rcceph status
=== mon.ceph-0 === 
mon.ceph-0: running {"version":"0.60-468-g98de67d"}

On the note of building an easy-to-deploy Ceph appliance, assuming you’re not using Chef and just want something to play with, I reckon the way to go is to use a config pretty similar to what would be deployed by this Chef cookbook, i.e. an absolutely minimal /etc/ceph/ceph.conf specifying nothing other than the initial mons, then use the various Ceph CLI tools to create mons and osds on each node, and just rely on the init script in Ceph >= 0.58 to do the right thing with what it finds (having to explicitly specify each mon, osd and mds in the Ceph config by name always bugged me). Bonus points for using csync2 to propagate /etc/ceph/ceph.conf across the cluster.
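To make “absolute minimal” concrete, here is a sketch that prints such a config. The function name and argument order are made up. The three keys are the ones that appear in the [global] section the cookbook generated during my first chef-client attempt.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal ceph.conf containing only the
# cluster fsid and the initial mons.
#   $1 = fsid, $2 = comma-separated mon names, $3 = comma-separated mon addresses
minimal_ceph_conf() {
    printf '[global]\n  fsid = %s\n  mon initial members = %s\n  mon host = %s\n' \
        "$1" "$2" "$3"
}
```

Something like minimal_ceph_conf "$(uuidgen -r)" ceph-0 192.168.4.118:6789 > /etc/ceph/ceph.conf would then be the whole config, with everything else left to the CLI tools and the init script.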

The Ceph Chef Experiment

Sometimes it’s most interesting to just dive in and see what breaks. There’s a Chef cookbook for Ceph on github which seems rather more recently developed than the one in SUSE-Cloud/barclamp-ceph, and seeing as its use is documented in the Ceph manual, I reckon that’s the one I want to be using. Of course, the README says “Tested as working: Ubuntu Precise (12.04)”, and I’m using openSUSE 12.3…

First things first, need a Chef server, so I installed openSUSE 12.3 on a VM, then installed Chef 10 on that, roughly following the manual installation instructions. Note for those following along at home – sometimes the blocks I’ve copied here are just commands, sometimes they include command output as well. You’ll figure it out 🙂

# zypper ar -f http://download.opensuse.org/repositories/systemsmanagement:/chef:/10/openSUSE_12.3/systemsmanagement:chef:10.repo
# zypper in rubygem-chef-server
# chkconfig couchdb on
# rccouchdb start
# chkconfig rabbitmq-server on
# rcrabbitmq-server start
# rabbitmqctl add_vhost /chef
# rabbitmqctl add_user chef testing
# rabbitmqctl set_permissions -p /chef chef ".*" ".*" ".*"
# for service in solr expander server server-webui; do
      chkconfig chef-$service on
      rcchef-$service start
  done

I didn’t bother editing /etc/chef/server.rb, the config as shipped works fine (not that the AMQP password is very secure, mind). The only catch is the web UI didn’t start. IIRC this is due to /etc/chef/webui.pem not existing yet (chef-server creates it, but this doesn’t finish until later).
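If that really is the cause, one way around it would be to wait for the key to appear before starting the web UI. A minimal sketch, assuming the missing /etc/chef/webui.pem is the only thing holding it up (I haven’t verified that, and the function name is made up):

```shell
#!/bin/sh
# Hypothetical helper: wait until a file exists and is non-empty, or give
# up after a timeout in seconds (default 60).
wait_for_file() {
    path="$1"
    timeout="${2:-60}"
    waited=0
    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

Usage would be along the lines of wait_for_file /etc/chef/webui.pem 120 && rcchef-server-webui start.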

Then configured knife:

# knife configure -i
WARNING: No knife configuration file found
Where should I put the config file? [/root/.chef/knife.rb]
Please enter the chef server URL: [http://os-chef.example.com:4000]
Please enter a clientname for the new client: [root]
Please enter the existing admin clientname: [chef-webui]
Please enter the location of the existing admin client's private key: [/etc/chef/webui.pem]
Please enter the validation clientname: [chef-validator]
Please enter the location of the validation key: [/etc/chef/validation.pem]
Please enter the path to a chef repository (or leave blank):
Creating initial API user...
Created client[root]
Configuration file written to /root/.chef/knife.rb

And make a client for me:

# knife client create tserong -d -a -f /tmp/tserong.pem
Created client[tserong]

Then set up my desktop as a Chef workstation (roughly following these docs, and again pulling Chef from systemsmanagement:chef:10 on OBS):

# sudo zypper in rubygem-chef
# cd ~
# git clone git://github.com/opscode/chef-repo.git
# cd chef-repo
# mkdir -p ~/.chef
# scp root@os-chef:/etc/chef/validation.pem ~/.chef/
# scp root@os-chef:/tmp/tserong.pem ~/.chef/
# knife configure
WARNING: No knife configuration file found
Where should I put the config file? [/home/tserong/.chef/knife.rb]
Please enter the chef server URL: [http://desktop.example.com:4000] http://os-chef.example.com:4000
Please enter an existing username or clientname for the API: [tserong]
Please enter the validation clientname: [chef-validator]
Please enter the location of the validation key: [/etc/chef/validation.pem] /home/tserong/.chef/validation.pem
Please enter the path to a chef repository (or leave blank): /home/tserong/chef-repo
[...]
Configuration file written to /home/tserong/.chef/knife.rb

Make sure it works:

# knife client list
chef-validator
chef-webui
root
tserong

Grab the cookbooks and upload them to the Chef server. The Ceph cookbook claims to depend on apache and apt, although presumably the former is only necessary for RADOSGW, and the latter for Debian-based systems. Anyway:

# cd ~/chef-repo
# git submodule add git@github.com:opscode-cookbooks/apache2.git cookbooks/apache2
# git submodule add git@github.com:opscode-cookbooks/apt.git cookbooks/apt
# git submodule add git@github.com:ceph/ceph-cookbooks.git cookbooks/ceph
# knife cookbook upload apache2
# knife cookbook upload apt
# knife cookbook upload ceph

Boot up a couple more VMs to be Ceph nodes, using the appliance image from last time. These need chef-client installed, and need to be registered with the chef server. knife bootstrap will install chef-client and dependencies for you, but after looking at the source, if /usr/bin/chef doesn’t exist, it actually uses wget or curl to pull http://opscode.com/chef/install.sh and runs that. How this is considered a good idea is completely baffling to me, so again I installed our chef build from OBS on each of my Ceph nodes (note to self: should add this to appliance image on Studio):

# zypper ar -f http://download.opensuse.org/repositories/systemsmanagement:/chef:/10/openSUSE_12.3/systemsmanagement:chef:10.repo
# zypper in rubygem-chef

And ran the now-arguably-safe knife bootstrap from my desktop:

# knife bootstrap ceph-0.example.com
Bootstrapping Chef on ceph-0.example.com
[...]
# knife bootstrap ceph-1.example.com
Bootstrapping Chef on ceph-1.example.com
[...]

Then I roughly followed the Ceph “Deploying with Chef” document.

Generate a UUID and monitor secret (had to do the latter on one of my Ceph VMs, as ceph-authtool is conveniently already installed):

# uuidgen -r
f80aba97-26c5-4aa3-971e-09c5a3afa32f
# ceph-authtool /dev/stdout --name=mon. --gen-key
[mon.]
key = AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g==

Then on my desktop:

knife environment create Ceph

This I filled in with:

{
  "name": "Ceph",
  "description": "",
  "cookbook_versions": {
  },
  "json_class": "Chef::Environment",
  "chef_type": "environment",
  "default_attributes": {
    "ceph": {
      "monitor-secret": "AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g==",
      "config": {
        "fsid": "f80aba97-26c5-4aa3-971e-09c5a3afa32f",
        "mon_initial_members": "ceph-0,ceph-1",
        "global": {
        },
        "osd": {
          "osd journal size": "1000",
          "filestore xattr use omap": "true"
        }
      }
    }
  },
  "override_attributes": {
  }
}

Uploaded roles:

# knife role from file cookbooks/ceph/roles/ceph-mds.rb
# knife role from file cookbooks/ceph/roles/ceph-mon.rb
# knife role from file cookbooks/ceph/roles/ceph-osd.rb
# knife role from file cookbooks/ceph/roles/ceph-radosgw.rb

Assigned roles to nodes:

# knife node run_list add ceph-0.example.com 'role[ceph-mon],role[ceph-osd],role[ceph-mds]'
# knife node run_list add ceph-1.example.com 'role[ceph-mon],role[ceph-osd],role[ceph-mds]'

I didn’t bother with recipe[ceph::repo] as I don’t care about installation right now (Ceph is already installed in my VM images).

Had to set "chef_environment": "Ceph" for each node by running:

# knife node edit ceph-0.example.com
# knife node edit ceph-1.example.com

Didn’t set Ceph osd_devices per node – I’m just playing, so can sit on top of the root partition.

Now let’s see if it works:

# knife ssh name:ceph-0.example.com -x root chef-client
[2013-04-11T13:44:47+00:00] INFO: *** Chef 10.24.0 ***
[2013-04-11T13:44:48+00:00] INFO: Run List is [role[ceph-mon], role[ceph-osd], role[ceph-mds]]
[2013-04-11T13:44:48+00:00] INFO: Run List expands to [ceph::mon, ceph::osd, ceph::mds]
[2013-04-11T13:44:48+00:00] INFO: HTTP Request Returned 404 Not Found: No routes match the request: /reports/nodes/ceph-0.example.com/runs
[2013-04-11T13:44:48+00:00] INFO: Starting Chef Run for ceph-0.example.com
[2013-04-11T13:44:48+00:00] INFO: Running start handlers
[2013-04-11T13:44:48+00:00] INFO: Start handlers complete.
[2013-04-11T13:44:48+00:00] INFO: Loading cookbooks [apache2, apt, ceph]
No ceph-mon found.

[2013-04-11T13:44:48+00:00] INFO: Processing template[/etc/ceph/ceph.conf] action create (ceph::conf line 6)
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] backed up to /var/chef/backup/etc/ceph/ceph.conf.chef-20130411134448
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] updated content
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] owner changed to 0
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] group changed to 0
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] mode changed to 644
[2013-04-11T13:44:48+00:00] INFO: Processing service[ceph_mon] action nothing (ceph::mon line 23)
[2013-04-11T13:44:48+00:00] INFO: Processing execute[ceph-mon mkfs] action run (ceph::mon line 40)
creating /var/lib/ceph/tmp/ceph-ceph-0.mon.keyring
added entity mon. auth auth(auid = 18446744073709551615 key=AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g== with 0 caps)
ceph-mon: mon.noname-a 192.168.4.118:6789/0 is local, renaming to mon.ceph-0
ceph-mon: set fsid to f80aba97-26c5-4aa3-971e-09c5a3afa32f
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph-0 for mon.ceph-0
[2013-04-11T13:44:49+00:00] INFO: execute[ceph-mon mkfs] ran successfully
[2013-04-11T13:44:49+00:00] INFO: execute[ceph-mon mkfs] sending start action to service[ceph_mon] (immediate)
[2013-04-11T13:44:49+00:00] INFO: Processing service[ceph_mon] action start (ceph::mon line 23)
[2013-04-11T13:44:49+00:00] INFO: service[ceph_mon] started
[2013-04-11T13:44:49+00:00] INFO: Processing ruby_block[tell ceph-mon about its peers] action create (ceph::mon line 64)
connect to /var/run/ceph/ceph-mon.ceph-0.asok failed with (2) No such file or directory

connect to /var/run/ceph/ceph-mon.ceph-0.asok failed with (2) No such file or directory

[2013-04-11T13:44:49+00:00] INFO: ruby_block[tell ceph-mon about its peers] called
[2013-04-11T13:44:49+00:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 79)
2013-04-11 13:44:49.928800 7f58e9677700 0 -- :/23863 >> 192.168.4.117:6789/0 pipe(0x18f0d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

2013-04-11 13:44:52.928739 7f58efc1c700 0 -- :/23863 >> 192.168.4.118:6789/0 pipe(0x7f58e0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:44:55.929375 7f58e9677700 0 -- :/23863 >> 192.168.4.117:6789/0 pipe(0x7f58e0003010 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:44:58.929211 7f58efc1c700 0 -- :/23863 >> 192.168.4.118:6789/0 pipe(0x7f58e00039f0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:45:01.929787 7f58e9677700 0 -- :/23863 >> 192.168.4.117:6789/0 pipe(0x7f58e00023b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
[...]

And it’s stuck there, trying and failing to talk to something.

See those “no such file or directory” errors after “service[ceph_mon] started”? Yeah? Well, the mon isn’t started, hence the missing sockets in /var/run/ceph.
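
Since the "started" message clearly can't be trusted on its own, one way to double-check is to look for the admin socket directly. Here's a minimal Python sketch of that idea. The helper names (admin_socket_path, wait_for_socket) and the timeout values are made up for illustration; only the socket path layout comes from the log above.

```python
import os
import time


def admin_socket_path(mon_id, cluster="ceph", run_dir="/var/run/ceph"):
    """Build the socket path seen in the log, e.g. .../ceph-mon.ceph-0.asok."""
    return os.path.join(run_dir, f"{cluster}-mon.{mon_id}.asok")


def wait_for_socket(path, timeout=10.0, interval=0.5):
    """Return True once `path` exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_socket(admin_socket_path("ceph-0")) on ceph-0 at this point would come back False, which is the same story the "No such file or directory" errors are telling.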

Why isn’t the mon started? Turns out the ceph init script won’t start any mon (or osd or mds for that matter) if you don’t have entries in the config file with some suffix, e.g. [mon.a]. And all I’ve got is:

[global]
  fsid =  f80aba97-26c5-4aa3-971e-09c5a3afa32f
  mon initial members = ceph-0,ceph-1
  mon host = 192.168.4.118:6789, 192.168.4.117:6789

[osd]
    osd journal size = 1000
    filestore xattr use omap = true
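
To make that rule concrete, here's a small Python sketch that scans a config for per-daemon sections such as [mon.a]. It only mirrors the behaviour as described above, not the init script's actual logic, and the function name daemon_sections is invented for the example.

```python
import re

# Matches a section header such as "[global]" or "[mon.a]".
SECTION_RE = re.compile(r"^\s*\[([^\]]+)\]\s*$")
DAEMON_TYPES = ("mon", "osd", "mds")


def daemon_sections(conf_text):
    """Return section names like 'mon.a' that name one specific daemon."""
    found = []
    for line in conf_text.splitlines():
        match = SECTION_RE.match(line)
        if not match:
            continue
        name = match.group(1).strip()
        kind, dot, ident = name.partition(".")
        # "[osd]" has no suffix, so it does not count; "[osd.0]" would.
        if kind in DAEMON_TYPES and dot and ident:
            found.append(name)
    return found


# The config exactly as shown above.
CONF = """\
[global]
  fsid =  f80aba97-26c5-4aa3-971e-09c5a3afa32f
  mon initial members = ceph-0,ceph-1
  mon host = 192.168.4.118:6789, 192.168.4.117:6789

[osd]
    osd journal size = 1000
    filestore xattr use omap = true
"""

print(daemon_sections(CONF))                 # → []
print(daemon_sections(CONF + "\n[mon.a]\n"))  # → ['mon.a']
```

An empty list for the generated config is the whole problem in one line: there is nothing for the init script to iterate over.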

But given the mon recipe triggers ceph-mon-all-starter if using upstart (which it would be on the “Tested as working” Ubuntu Precise), and ceph-mon-all-starter seems to just ultimately run something like ceph-mon --cluster=ceph -i ceph-0 regardless of what’s in the config file… Maybe I can cheat.

Directly starting ceph-mon from a shell on ceph-0 before the chef-client run turned out to be a bad idea (bit of a chicken and egg problem figuring out what to inject into the “mon host” line of the config file). So I put a bit of evil into the mon recipe:

diff --git a/recipes/mon.rb b/recipes/mon.rb
index 5cd76de..a518830 100644
--- a/recipes/mon.rb
+++ b/recipes/mon.rb
@@ -61,6 +61,10 @@ EOH
   notifies :start, "service[ceph_mon]", :immediately
 end
 
+execute 'hack to force mon start' do
+  command "ceph-mon --cluster=ceph -i #{node['hostname']}"
+end
+
 ruby_block "tell ceph-mon about its peers" do
   block do
     mon_addresses = get_mon_addresses()

Try again:

# knife ssh name:ceph-0.example.com -x root chef-client
[2013-04-11T15:10:43+00:00] INFO: *** Chef 10.24.0 ***
[2013-04-11T15:10:44+00:00] INFO: Run List is [role[ceph-mon], role[ceph-osd], role[ceph-mds]]
[2013-04-11T15:10:44+00:00] INFO: Run List expands to [ceph::mon, ceph::osd, ceph::mds]
[2013-04-11T15:10:44+00:00] INFO: HTTP Request Returned 404 Not Found: No routes match the request: /reports/nodes/ceph-0.example.com/runs
[2013-04-11T15:10:44+00:00] INFO: Starting Chef Run for ceph-0.example.com
[2013-04-11T15:10:44+00:00] INFO: Running start handlers
[2013-04-11T15:10:44+00:00] INFO: Start handlers complete.
[2013-04-11T15:10:44+00:00] INFO: Loading cookbooks [apache2, apt, ceph]
[2013-04-11T15:10:44+00:00] INFO: Storing updated cookbooks/ceph/recipes/mon.rb in the cache.
No ceph-mon found.

[2013-04-11T15:10:44+00:00] INFO: Processing template[/etc/ceph/ceph.conf] action create (ceph::conf line 6)
[2013-04-11T15:10:44+00:00] INFO: Processing service[ceph_mon] action nothing (ceph::mon line 23)
[2013-04-11T15:10:44+00:00] INFO: Processing execute[ceph-mon mkfs] action run (ceph::mon line 40)
[2013-04-11T15:10:44+00:00] INFO: Processing execute[hack to force mon start] action run (ceph::mon line 65)
starting mon.ceph-0 rank 1 at 192.168.4.118:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-0 fsid f80aba97-26c5-4aa3-971e-09c5a3afa32f
[2013-04-11T15:10:44+00:00] INFO: execute[hack to force mon start] ran successfully
[2013-04-11T15:10:44+00:00] INFO: Processing ruby_block[tell ceph-mon about its peers] action create (ceph::mon line 69)
adding peer 192.168.4.118:6789/0 to list: 192.168.4.117:6789/0,192.168.4.118:6789/0

adding peer 192.168.4.117:6789/0 to list: 192.168.4.117:6789/0,192.168.4.118:6789/0

[2013-04-11T15:10:44+00:00] INFO: ruby_block[tell ceph-mon about its peers] called
[2013-04-11T15:10:44+00:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 84)
2013-04-11 15:10:44.432266 7f8f9f8c0700  0 -- :/25965 >> 192.168.4.117:6789/0 pipe(0x16d9d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

2013-04-11 15:10:50.433053 7f8f9f7bf700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94001d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:10:56.433268 7f8fa5e65700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94001d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:11:02.433987 7f8f9f8c0700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94002db0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:11:08.434358 7f8f9f7bf700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94004fb0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

At this point it’s stalled, presumably waiting to talk to the other mon, so in another terminal window I had to kick off a chef-client run on ceph-1 to get it into the same state as ceph-0 (knife ssh name:ceph-1.example.com -x root chef-client). This allowed both nodes to progress to the next problem:

2013-04-11 15:11:28.563438 7f8fa5e67780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-11 15:11:28.563443 7f8fa5e67780 -1 unable to authenticate as client.admin
2013-04-11 15:11:28.563814 7f8fa5e67780 -1 ceph_tool_common_init failed.
2013-04-11 15:11:29.572208 7f2369130780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-11 15:11:29.572210 7f2369130780 -1 unable to authenticate as client.admin
2013-04-11 15:11:29.572527 7f2369130780 -1 ceph_tool_common_init failed.
2013-04-11 15:11:31.380073 7f1907d18780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-11 15:11:31.380078 7f1907d18780 -1 unable to authenticate as client.admin
2013-04-11 15:11:31.380720 7f1907d18780 -1 ceph_tool_common_init failed.
2013-04-11 15:11:32.392345 7fc2bc462780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
[...]

And we’re spinning again.

But that’s enough for one day.