Hack Week 22: An Art Project

Back in 2012, I received a box of eight hundred openSUSE 12.1 promo DVDs, which I then set out to distribute to local Linux users’ groups, tech conferences, other SUSE crew in Australia, and so forth. I didn’t manage to shift all 800 DVDs at the time, and I recently rediscovered the remaining three hundred and eighty four while installing some new shelves. As openSUSE 12.1 went end of life in May 2013, it seemed likely the DVDs were now useless, but I couldn’t bring myself to toss them in landfill. Instead, given last week was Hack Week, I decided to use them for an art project. Here’s the end result:

Geeko mosaic made of cut up openSUSE DVDs, on a 900mm x 600mm piece of plywood

Making that mosaic was extremely fiddly. It’s possibly the most annoying Hack Week project I’ve ever done, but I’m very happy with the outcome 🙂

Hack Week 21: Keeping the Battery Full

As described in some detail in my last post, we have a single 10kWh Redflow ZCell zinc bromine flow battery hooked up to our solar PV via Victron inverter/chargers. This gives us the ability to:

  • Store almost all the excess energy we generate locally for later use.
  • When the sun isn’t shining, grid charge the battery at off-peak times then draw it down at peak times to save on our electricity bill (peak grid power is slightly more than twice as expensive as off-peak grid power).
  • Opportunistically survive grid outages, provided they don’t happen at the wrong time (i.e. when the sun is down and the battery is at 0% state of charge).

By their nature, ZCell flow batteries need to undergo a maintenance cycle at least every three days, where they are discharged completely for a few hours. That’s why the last point above reads “opportunistically survive grid outages”. With a single ZCell, we can’t use the “minimum state of charge” feature of the Victron kit to always keep some charge in the battery in case of outages, because doing so conflicts with the ZCell maintenance cycles. Once we eventually get a second battery, this problem will go away because the maintenance cycles automatically interleave. In the meantime though, as my project for Hack Week 21, I decided to see if I could somehow automate the Victron scheduled charge configuration based on the ZCell maintenance cycle timing, to always keep the battery as full as possible for as long as possible.
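To make the timing constraint concrete, here is a minimal sketch of the kind of calculation involved. It is not the code from this project: the function names, the epoch-second inputs and the "minimum worthwhile window" parameter are all assumptions made up for illustration. The only fact taken from the text is that a maintenance cycle has to happen at least every three days.

```shell
#!/bin/sh
# Sketch only: every name and parameter here is hypothetical.

# A maintenance cycle must happen at least every three days (from the text).
MAINT_INTERVAL_SECS=$((3 * 24 * 3600))

# Print the latest time (epoch seconds) by which the next maintenance
# cycle has to start, given the time of the last one.
# Assumption: the three days are counted from the last cycle.
maintenance_deadline() {
    echo $(( $1 + MAINT_INTERVAL_SECS ))
}

# Print "charge" if more than min_window seconds remain before the
# maintenance deadline, otherwise "skip" (charging just before a forced
# full discharge would be wasted).
# Args: now last_maintenance min_window (all in seconds)
charge_decision() {
    deadline=$(maintenance_deadline "$2")
    if [ $(( deadline - $1 )) -gt "$3" ]; then
        echo charge
    else
        echo skip
    fi
}
```

Something like `charge_decision "$(date +%s)" "$last_maintenance" 3600` would then say whether a scheduled charge is worth setting up right now, where `$last_maintenance` would have to come from the battery itself.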

Hackweek0x10: Fun in the Sun

We recently had a 5.94kW solar PV system installed – twenty-two 270W panels (14 on the northish side of the house, 8 on the eastish side), with an ABB PVI-6000TL-OUTD inverter. Naturally I want to be able to monitor the system, but this model inverter doesn’t have an inbuilt web server (which, given the state of IoT devices, I’m actually kind of happy about); rather, it has an RS-485 serial interface. ABB sell add-on data logger cards for several hundred dollars, but Rick from Affordable Solar Tasmania mentioned he had another client who was doing monitoring with a little Linux box and an RS-485 to USB adapter. As I had a Raspberry Pi 3 handy, I decided to do the same.

‘Sup With The Tablet?

As I mentioned on Twitter last week, I’m very happy SUSE was able to support linux.conf.au 2015 with a keynote giveaway on Wednesday morning and sponsorship of the post-conference Beer O’Clock at Catalyst.

For those who were in attendance, I thought a little explanation of the keynote gift (a Samsung Galaxy Tab 4 8″) might be in order, especially given the winner came up to me during the post-conference drinks and asked “what’s up with the tablet?”

To put this in perspective, I’m in engineering at SUSE (I’ve spent a lot of time working on high availability, distributed storage and cloud software), and while it’s fair to say I represent the company in some sense simply by existing, I do not (and cannot) actually speak on behalf of my employer. Nevertheless, it fell to me to purchase a gift for us to provide to one lucky delegate sensible enough to arrive on time for Wednesday’s keynote.

I like to think we have a distinct engineering culture at SUSE. In particular, we run a hackweek once or twice a year where everyone has a full week to work on something entirely of their own choosing, provided it’s related to Free and Open Source Software. In that spirit (and given that we don’t make hardware ourselves) I thought it would be nice to be able to donate an Android tablet which the winner would either be able to hack on directly, or would be able to use in the course of hacking something else. So I’m not aware of any particular relationship between my employer and that tablet, but as it says on the back of the hackweek t-shirt I was wearing at the time:

Some things have to be done just because they are possible.
Not because they make sense.

Watching Grass Grow

For Hackweek 11 I thought it’d be fun to learn something about creating Android apps. The basic training is pretty straightforward, and the auto-completion (and auto-just-about-everything-else) in Android Studio is excellent. So having created a “hello world” app, and having learned something about activities and application lifecycle, I figured it was time to create something else. Something fun, but something I could reasonably complete in a few days. Given that Android devices are essentially just high res handheld screens with a bit of phone hardware tacked on, it seemed a crime not to write an app that draws something pretty.

A Cosmic Dance in a Little Box

It’s Hack Week again. This time around I decided to look at running TripleO on openSUSE. If you’re not familiar with TripleO, it’s short for OpenStack on OpenStack, i.e. it’s a project to deploy OpenStack clouds on bare metal, using the components of OpenStack itself to do the work. I take some delight in bootstrapping of this nature – I think there’s a nice symmetry to it. Or, possibly, I’m just perverse.

Anyway, onwards. I had a chat to Robert Collins about TripleO while at PyCon AU 2013. He introduced me to diskimage-builder and suggested that making it capable of building openSUSE images would be a good first step. It turned out that making diskimage-builder actually run on openSUSE was probably a better first step, but I managed to get most of that out of the way in a random fit of hackery a couple of months ago. Further testing this week uncovered a few more minor kinks, two of which I’ve fixed here and here. It’s always the cross-distro work that seems to bring out the edge cases.

Then I figured there’s not much point making diskimage-builder create openSUSE images without knowing I can set up some sort of environment to validate them. So I’ve spent large parts of the last couple of days working my way through the TripleO Dev/Test instructions, deploying the default Ubuntu images with my openSUSE 12.3 desktop as VM host. For those following along at home the install-dependencies script doesn’t work on openSUSE (some manual intervention required, which I’ll try to either fix, document, or both, later). Anyway, at some point last night, I had what appeared to be a working seed VM, and a broken undercloud VM which was choking during cloud-init:

Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed
Request timed out

Figuring that out, well…  There I was with a seed VM deployed from an image built with some scripts from several git repositories, automatically configured to run even more pieces of OpenStack than I’ve spoken about before, which in turn had attempted to deploy a second VM, which wanted to connect back to the first over a virtual bridge and via the magic of some iptables rules and I was running tcpdump and tailing logs and all the moving parts were just suddenly this GIANT COSMIC DANCE in a tiny little box on my desk on a hill on an island at the bottom of the world.

It was at this point I realised I had probably been sitting at my computer for too long.

It turns out the problem above was due to my_ip being set to an empty string in /etc/nova/nova.conf on the seed VM. Somehow I didn’t have the fix in my local source repo. An additional problem is that libvirt on openSUSE, like Fedora, doesn’t set uri_default="qemu:///system". This causes nova baremetal calls from the seed VM to the host to fail as mentioned in bug #1226310. This bug is reportedly fixed, but the fix doesn’t seem to work for me (another thing to investigate), so I went with the workaround of putting uri_default="qemu:///system" in ~/.config/libvirt/libvirt.conf.
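For anyone wanting to script that libvirt workaround, a rough sketch follows. The function name and the optional path argument are my own inventions (the argument is only there so it can be tried against a scratch file); the setting and the default path are the ones mentioned above.

```shell
#!/bin/sh
# Make sure the per-user libvirt client config sets uri_default.
# Hypothetical helper: takes an optional config path, defaulting to the
# per-user file named in the text.
set_libvirt_uri_default() {
    conf=${1:-"$HOME/.config/libvirt/libvirt.conf"}
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Only append if there is no uncommented uri_default line already,
    # so running this twice does not duplicate the setting.
    if ! grep -q '^[[:space:]]*uri_default[[:space:]]*=' "$conf"; then
        printf 'uri_default="qemu:///system"\n' >> "$conf"
    fi
}
```

Calling `set_libvirt_uri_default` with no arguments touches the real per-user file; an existing uri_default line pointing somewhere else is deliberately left alone rather than overwritten.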

So now (after a rather spectacular amount of disk and CPU thrashing) there are three OpenStack clouds running on my desktop PC. No smoke has come out.

  • The seed VM has successfully spun up the “baremetal_0” undercloud VM and deployed OpenStack to it.
  • The undercloud VM has successfully spun up the “baremetal_1” and “baremetal_2” VMs and deployed them as the overcloud control and compute nodes.
  • I have apparently booted a demo VM in the overcloud, i.e. I’ve got a VM running inside a VM, although I haven’t quite managed to ssh into the latter yet (I suspect I’m missing a route or a firewall rule somewhere).

I think I had it right last night. There is a giant cosmic dance being performed in a tiny little box on my desk on a hill on an island at the bottom of the world.

Or, I’ve been sitting at my computer for too long again.

One More chef-client Run

Carrying on from my last post, the failed chef-client run came down to the init script in ceph 0.56 not yet knowing how to iterate /var/lib/ceph/{mon,osd,mds} and automatically start the appropriate daemons. This functionality seems to have been introduced in 0.58 or so by commit c8f528a. So I gave it another shot with a build of ceph 0.60.

On each of my ceph nodes, I did a bit of upgrading and cleanup. Note that the choice of ceph 0.60 was mostly arbitrary; I just wanted the latest thing I could find an RPM for in a hurry. Also, some of the rm invocations won’t be necessary, depending on what state things are actually in:

# zypper ar -f http://download.opensuse.org/repositories/home:/dalgaaf:/ceph:/extra/openSUSE_12.3/home:dalgaaf:ceph:extra.repo
# zypper ar -f http://gitbuilder.ceph.com/ceph-rpm-opensuse12-x86_64-basic/ref/next/x86_64/ ceph.com-next_openSUSE_12_x86_64
# zypper in ceph-0.60
# kill $(pidof ceph-mon)
# rm /etc/ceph/*
# rm /var/run/ceph/*
# rm -r /var/lib/ceph/*/*

That last gets rid of any half-created mon directories.

I also edited the Ceph environment to only have one mon (one of my colleagues rightly pointed out that you need an odd number of mons, and I had declared two previously, for no good reason). That’s knife environment edit Ceph on my desktop, and set "mon_initial_members": "ceph-0" instead of "ceph-0,ceph-1".
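The reasoning behind the odd-number advice is that the mons need a strict majority to form a quorum, so an even count buys no extra failure tolerance over the odd count below it. A quick sketch of the arithmetic (the function names are just for illustration):

```shell
#!/bin/sh
# Smallest number of mons that forms a strict majority of n.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# How many mons can be lost while a majority still remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "$n mons: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

With two mons the quorum is two, so losing either one stalls the cluster, which is no better than running a single mon.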

I also had to edit each of the nodes, to add an osd_devices array to each node, and remove the mon role from ceph-1. That’s knife node edit ceph-0.example.com then insert:

  "normal": {
    ...
    "ceph": {
      "osd_devices": [  ]
    }
  ...

Without the osd_devices array defined, the osd recipe fails (“undefined method `each_with_index' for nil:NilClass”). I was kind of hoping an empty osd_devices array would allow ceph to use the root partition. No such luck, the cookbook really does expect you to be doing a sensible deployment with actual separate devices for your OSDs. Oh, well. I’ll try that another time. For now at least I’ve demonstrated that ceph-0.60 does give you what appears to be a clean mon setup when using the upstream cookbooks on openSUSE 12.3:

# knife ssh name:ceph-0.example.com -x root chef-client
[2013-04-15T06:32:13+00:00] INFO: *** Chef 10.24.0 ***
[2013-04-15T06:32:13+00:00] INFO: Run List is [role[ceph-mon], role[ceph-osd], role[ceph-mds]]
[2013-04-15T06:32:13+00:00] INFO: Run List expands to [ceph::mon, ceph::osd, ceph::mds]
[2013-04-15T06:32:13+00:00] INFO: HTTP Request Returned 404 Not Found: No routes match the request: /reports/nodes/ceph-0.example.com/runs
[2013-04-15T06:32:13+00:00] INFO: Starting Chef Run for ceph-0.example.com
[2013-04-15T06:32:13+00:00] INFO: Running start handlers
[2013-04-15T06:32:13+00:00] INFO: Start handlers complete.
[2013-04-15T06:32:13+00:00] INFO: Loading cookbooks [apache2, apt, ceph]
[2013-04-15T06:32:13+00:00] INFO: Processing template[/etc/ceph/ceph.conf] action create (ceph::conf line 6)
[2013-04-15T06:32:13+00:00] INFO: template[/etc/ceph/ceph.conf] updated content
[2013-04-15T06:32:13+00:00] INFO: template[/etc/ceph/ceph.conf] mode changed to 644
[2013-04-15T06:32:13+00:00] INFO: Processing service[ceph_mon] action nothing (ceph::mon line 23)
[2013-04-15T06:32:13+00:00] INFO: Processing execute[ceph-mon mkfs] action run (ceph::mon line 40)
creating /var/lib/ceph/tmp/ceph-ceph-0.mon.keyring
added entity mon. auth auth(auid = 18446744073709551615 key=AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g== with 0 caps)
ceph-mon: mon.noname-a 192.168.4.118:6789/0 is local, renaming to mon.ceph-0
ceph-mon: set fsid to f80aba97-26c5-4aa3-971e-09c5a3afa32f
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph-0 for mon.ceph-0
[2013-04-15T06:32:14+00:00] INFO: execute[ceph-mon mkfs] ran successfully
[2013-04-15T06:32:14+00:00] INFO: execute[ceph-mon mkfs] sending start action to service[ceph_mon] (immediate)
[2013-04-15T06:32:14+00:00] INFO: Processing service[ceph_mon] action start (ceph::mon line 23)
[2013-04-15T06:32:15+00:00] INFO: service[ceph_mon] started
[2013-04-15T06:32:15+00:00] INFO: Processing ruby_block[tell ceph-mon about its peers] action create (ceph::mon line 64)
mon already active; ignoring bootstrap hint

[2013-04-15T06:32:16+00:00] INFO: ruby_block[tell ceph-mon about its peers] called
[2013-04-15T06:32:16+00:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 79)
2013-04-15 06:32:16.872040 7fca8e297780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-15 06:32:16.872042 7fca8e297780 -1 unable to authenticate as client.admin
2013-04-15 06:32:16.872400 7fca8e297780 -1 ceph_tool_common_init failed.
[2013-04-15T06:32:18+00:00] INFO: ruby_block[get osd-bootstrap keyring] called
[2013-04-15T06:32:18+00:00] INFO: Processing package[gdisk] action upgrade (ceph::osd line 37)
[2013-04-15T06:32:27+00:00] INFO: package[gdisk] upgraded from uninstalled to 
[2013-04-15T06:32:27+00:00] INFO: Processing service[ceph_osd] action nothing (ceph::osd line 48)
[2013-04-15T06:32:27+00:00] INFO: Processing directory[/var/lib/ceph/bootstrap-osd] action create (ceph::osd line 67)
[2013-04-15T06:32:27+00:00] INFO: Processing file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] action create (ceph::osd line 76)
[2013-04-15T06:32:27+00:00] INFO: entered create
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] owner changed to 0
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] group changed to 0
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] mode changed to 440
[2013-04-15T06:32:27+00:00] INFO: file[/var/lib/ceph/bootstrap-osd/ceph.keyring.raw] created file /var/lib/ceph/bootstrap-osd/ceph.keyring.raw
[2013-04-15T06:32:27+00:00] INFO: Processing execute[format as keyring] action run (ceph::osd line 83)
creating /var/lib/ceph/bootstrap-osd/ceph.keyring
added entity client.bootstrap-osd auth auth(auid = 18446744073709551615 key=AQAOl2tR0M4bMRAAatSlUh2KP9hGBBAP6u5AUA== with 0 caps)
[2013-04-15T06:32:27+00:00] INFO: execute[format as keyring] ran successfully
[2013-04-15T06:32:28+00:00] INFO: Chef Run complete in 14.479108446 seconds
[2013-04-15T06:32:28+00:00] INFO: Running report handlers
[2013-04-15T06:32:28+00:00] INFO: Report handlers complete

Witness:

ceph-0:~ # rcceph status
=== mon.ceph-0 === 
mon.ceph-0: running {"version":"0.60-468-g98de67d"}

On the note of building an easy-to-deploy Ceph appliance, assuming you’re not using Chef and just want something to play with, I reckon the way to go is to use a config pretty similar to what would be deployed by this Chef cookbook, i.e. an absolutely minimal /etc/ceph/ceph.conf specifying nothing other than the initial mons. Then use the various Ceph CLI tools to create mons and osds on each node, and just rely on the init script in Ceph >= 0.58 to do the right thing with what it finds (having to explicitly specify each mon, osd and mds in the Ceph config by name always bugged me). Bonus points for using csync2 to propagate /etc/ceph/ceph.conf across the cluster.
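As a rough idea of what "absolutely minimal" might look like, here is a sketch that prints such a config. The function name and its three arguments are hypothetical; the keys are the ones the cookbook wrote into the [global] section of my ceph.conf, and whether this is enough for every setup is an assumption on my part.

```shell
#!/bin/sh
# Print a minimal ceph.conf to stdout.
# Args: fsid, comma-separated mon names, comma-separated mon addresses.
# Hypothetical helper, not part of Ceph or the cookbook.
minimal_ceph_conf() {
    fsid=$1
    members=$2
    hosts=$3
    cat <<EOF
[global]
  fsid = $fsid
  mon initial members = $members
  mon host = $hosts
EOF
}
```

Usage would be along the lines of `minimal_ceph_conf "$(uuidgen -r)" ceph-0 192.168.4.118:6789 > /etc/ceph/ceph.conf`, with the resulting file then copied (or csync2'd) to the other nodes.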

The Ceph Chef Experiment

Sometimes it’s most interesting to just dive in and see what breaks. There’s a Chef cookbook for Ceph on github which seems rather more recently developed than the one in SUSE-Cloud/barclamp-ceph, and seeing as its use is documented in the Ceph manual, I reckon that’s the one I want to be using. Of course, the README says “Tested as working: Ubuntu Precise (12.04)”, and I’m using openSUSE 12.3…

First things first, need a Chef server, so I installed openSUSE 12.3 on a VM, then installed Chef 10 on that, roughly following the manual installation instructions. Note for those following along at home – sometimes the blocks I’ve copied here are just commands, sometimes they include command output as well. You’ll figure it out 🙂

# zypper ar -f http://download.opensuse.org/repositories/systemsmanagement:/chef:/10/openSUSE_12.3/systemsmanagement:chef:10.repo
# zypper in rubygem-chef-server
# chkconfig couchdb on
# rccouchdb start
# chkconfig rabbitmq-server on
# rcrabbitmq-server start
# rabbitmqctl add_vhost /chef
# rabbitmqctl add_user chef testing
# rabbitmqctl set_permissions -p /chef chef ".*" ".*" ".*"
# for service in solr expander server server-webui; do
      chkconfig chef-$service on
      rcchef-$service start
  done

I didn’t bother editing /etc/chef/server.rb, the config as shipped works fine (not that the AMQP password is very secure, mind). The only catch is the web UI didn’t start. IIRC this is due to /etc/chef/webui.pem not existing yet (chef-server creates it, but this doesn’t finish until later).

Then configured knife:

# knife configure -i
WARNING: No knife configuration file found
Where should I put the config file? [/root/.chef/knife.rb]
Please enter the chef server URL: [http://os-chef.example.com:4000]
Please enter a clientname for the new client: [root]
Please enter the existing admin clientname: [chef-webui]
Please enter the location of the existing admin client's private key: [/etc/chef/webui.pem]
Please enter the validation clientname: [chef-validator]
Please enter the location of the validation key: [/etc/chef/validation.pem]
Please enter the path to a chef repository (or leave blank):
Creating initial API user...
Created client[root]
Configuration file written to /root/.chef/knife.rb

And make a client for me:

# knife client create tserong -d -a -f /tmp/tserong.pem
Created client[tserong]

Then set up my desktop as a Chef workstation (roughly following these docs, and again pulling Chef from systemsmanagement:chef:10 on OBS):

# sudo zypper in rubygem-chef
# cd ~
# git clone git://github.com/opscode/chef-repo.git
# cd chef-repo
# mkdir -p ~/.chef
# scp root@os-chef:/etc/chef/validation.pem ~/.chef/
# scp root@os-chef:/tmp/tserong.pem ~/.chef/
# knife configure
WARNING: No knife configuration file found
Where should I put the config file? [/home/tserong/.chef/knife.rb]
Please enter the chef server URL: [http://desktop.example.com:4000] http://os-chef.example.com:4000
Please enter an existing username or clientname for the API: [tserong]
Please enter the validation clientname: [chef-validator]
Please enter the location of the validation key: [/etc/chef/validation.pem] /home/tserong/.chef/validation.pem
Please enter the path to a chef repository (or leave blank): /home/tserong/chef-repo
[...]
Configuration file written to /home/tserong/.chef/knife.rb

Make sure it works:

# knife client list
chef-validator
chef-webui
root
tserong

Grab the cookbooks and upload them to the Chef server. The Ceph cookbook claims to depend on apache and apt, although presumably the former is only necessary for RADOSGW, and the latter for Debian-based systems. Anyway:

# cd ~/chef-repo
# git submodule add git@github.com:opscode-cookbooks/apache2.git cookbooks/apache2
# git submodule add git@github.com:opscode-cookbooks/apt.git cookbooks/apt
# git submodule add git@github.com:ceph/ceph-cookbooks.git cookbooks/ceph
# knife cookbook upload apache2
# knife cookbook upload apt
# knife cookbook upload ceph

Boot up a couple more VMs to be Ceph nodes, using the appliance image from last time. These need chef-client installed, and need to be registered with the chef server. knife bootstrap will install chef-client and dependencies for you, but after looking at the source, if /usr/bin/chef doesn’t exist, it actually uses wget or curl to pull http://opscode.com/chef/install.sh and runs that. How this is considered a good idea is completely baffling to me, so again I installed our chef build from OBS on each of my Ceph nodes (note to self: should add this to appliance image on Studio):

# zypper ar -f http://download.opensuse.org/repositories/systemsmanagement:/chef:/10/openSUSE_12.3/systemsmanagement:chef:10.repo
# zypper in rubygem-chef

And ran the now-arguably-safe knife bootstrap from my desktop:

# knife bootstrap ceph-0.example.com
Bootstrapping Chef on ceph-0.example.com
[...]
# knife bootstrap ceph-1.example.com
Bootstrapping Chef on ceph-1.example.com
[...]

Then I roughly followed the Ceph “Deploying with Chef” document.

Generate a UUID and monitor secret (had to do the latter on one of my Ceph VMs, as ceph-authtool is conveniently already installed):

# uuidgen -r
f80aba97-26c5-4aa3-971e-09c5a3afa32f
# ceph-authtool /dev/stdout --name=mon. --gen-key
[mon.]
key = AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g==

Then on my desktop:

# knife environment create Ceph

This I filled in with:

{
  "name": "Ceph",
  "description": "",
  "cookbook_versions": {
  },
  "json_class": "Chef::Environment",
  "chef_type": "environment",
  "default_attributes": {
    "ceph": {
      "monitor-secret": "AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g==",
      "config": {
        "fsid": "f80aba97-26c5-4aa3-971e-09c5a3afa32f",
        "mon_initial_members": "ceph-0,ceph-1",
        "global": {
        },
        "osd": {
          "osd journal size": "1000",
          "filestore xattr use omap": "true"
        }
      }
    }
  },
  "override_attributes": {
  }
}

Uploaded roles:

# knife role from file cookbooks/ceph/roles/ceph-mds.rb
# knife role from file cookbooks/ceph/roles/ceph-mon.rb
# knife role from file cookbooks/ceph/roles/ceph-osd.rb
# knife role from file cookbooks/ceph/roles/ceph-radosgw.rb

Assigned roles to nodes:

# knife node run_list add ceph-0.example.com 'role[ceph-mon],role[ceph-osd],role[ceph-mds]'
# knife node run_list add ceph-1.example.com 'role[ceph-mon],role[ceph-osd],role[ceph-mds]'

I didn’t bother with recipe[ceph::repo] as I don’t care about installation right now (Ceph is already installed in my VM images).

Had to set "chef_environment": "Ceph" for each node by running:

# knife node edit ceph-0.example.com
# knife node edit ceph-1.example.com

Didn’t set Ceph osd_devices per node – I’m just playing, so can sit on top of the root partition.

Now let’s see if it works:

# knife ssh name:ceph-0.example.com -x root chef-client
[2013-04-11T13:44:47+00:00] INFO: *** Chef 10.24.0 ***
[2013-04-11T13:44:48+00:00] INFO: Run List is [role[ceph-mon], role[ceph-osd], role[ceph-mds]]
[2013-04-11T13:44:48+00:00] INFO: Run List expands to [ceph::mon, ceph::osd, ceph::mds]
[2013-04-11T13:44:48+00:00] INFO: HTTP Request Returned 404 Not Found: No routes match the request: /reports/nodes/ceph-0.example.com/runs
[2013-04-11T13:44:48+00:00] INFO: Starting Chef Run for ceph-0.example.com
[2013-04-11T13:44:48+00:00] INFO: Running start handlers
[2013-04-11T13:44:48+00:00] INFO: Start handlers complete.
[2013-04-11T13:44:48+00:00] INFO: Loading cookbooks [apache2, apt, ceph]
No ceph-mon found.

[2013-04-11T13:44:48+00:00] INFO: Processing template[/etc/ceph/ceph.conf] action create (ceph::conf line 6)
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] backed up to /var/chef/backup/etc/ceph/ceph.conf.chef-20130411134448
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] updated content
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] owner changed to 0
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] group changed to 0
[2013-04-11T13:44:48+00:00] INFO: template[/etc/ceph/ceph.conf] mode changed to 644
[2013-04-11T13:44:48+00:00] INFO: Processing service[ceph_mon] action nothing (ceph::mon line 23)
[2013-04-11T13:44:48+00:00] INFO: Processing execute[ceph-mon mkfs] action run (ceph::mon line 40)
creating /var/lib/ceph/tmp/ceph-ceph-0.mon.keyring
added entity mon. auth auth(auid = 18446744073709551615 key=AQC8umZRaDlKKBAAqD8li3u2JObepmzFzDPM3g== with 0 caps)
ceph-mon: mon.noname-a 192.168.4.118:6789/0 is local, renaming to mon.ceph-0
ceph-mon: set fsid to f80aba97-26c5-4aa3-971e-09c5a3afa32f
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph-0 for mon.ceph-0
[2013-04-11T13:44:49+00:00] INFO: execute[ceph-mon mkfs] ran successfully
[2013-04-11T13:44:49+00:00] INFO: execute[ceph-mon mkfs] sending start action to service[ceph_mon] (immediate)
[2013-04-11T13:44:49+00:00] INFO: Processing service[ceph_mon] action start (ceph::mon line 23)
[2013-04-11T13:44:49+00:00] INFO: service[ceph_mon] started
[2013-04-11T13:44:49+00:00] INFO: Processing ruby_block[tell ceph-mon about its peers] action create (ceph::mon line 64)
connect to /var/run/ceph/ceph-mon.ceph-0.asok failed with (2) No such file or directory

connect to /var/run/ceph/ceph-mon.ceph-0.asok failed with (2) No such file or directory

[2013-04-11T13:44:49+00:00] INFO: ruby_block[tell ceph-mon about its peers] called
[2013-04-11T13:44:49+00:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 79)
2013-04-11 13:44:49.928800 7f58e9677700 0 -- :/23863 >> 192.168.4.117:6789/0 pipe(0x18f0d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:44:52.928739 7f58efc1c700 0 -- :/23863 >> 192.168.4.118:6789/0 pipe(0x7f58e0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:44:55.929375 7f58e9677700 0 -- :/23863 >> 192.168.4.117:6789/0 pipe(0x7f58e0003010 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:44:58.929211 7f58efc1c700 0 -- :/23863 >> 192.168.4.118:6789/0 pipe(0x7f58e00039f0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 13:45:01.929787 7f58e9677700 0 -- :/23863 >> 192.168.4.117:6789/0 pipe(0x7f58e00023b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
[...]

And it’s stuck there, trying and failing to talk to something.

See those “no such file or directory” errors after “service[ceph_mon] started”? Yeah? Well, the mon isn’t started, hence the missing sockets in /var/run/ceph.

Why isn’t the mon started? Turns out the ceph init script won’t start any mon (or osd or mds for that matter) if you don’t have entries in the config file with some suffix, e.g. [mon.a]. And all I’ve got is:

[global]
  fsid =  f80aba97-26c5-4aa3-971e-09c5a3afa32f
  mon initial members = ceph-0,ceph-1
  mon host = 192.168.4.118:6789, 192.168.4.117:6789

[osd]
    osd journal size = 1000
    filestore xattr use omap = true
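A quick way to check whether a given ceph.conf has any of those per-daemon sections is sketched below. The function name is made up, and the pattern is my guess at what counts as "some suffix", not the init script's actual logic.

```shell
#!/bin/sh
# Succeed if the config file has at least one named daemon section such
# as [mon.a], [osd.0] or [mds.a]; plain [mon] or [osd] do not count.
# Hypothetical helper, only an approximation of what the init script wants.
has_named_daemon_sections() {
    grep -Eq '^\[(mon|osd|mds)\.[^]]+]' "$1"
}
```

Running `has_named_daemon_sections /etc/ceph/ceph.conf && echo yes || echo no` against the config above prints "no", which matches the init script declining to start anything.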

But given the mon recipe triggers ceph-mon-all-starter if using upstart (which it would be on the “Tested as working” Ubuntu Precise), and ceph-mon-all-starter seems to just ultimately run something like ceph-mon --cluster=ceph -i ceph-0 regardless of what’s in the config file… Maybe I can cheat.

Directly starting ceph-mon from a shell on ceph-0 before the chef-client run turned out to be a bad idea (bit of a chicken and egg problem figuring out what to inject into the “mon host” line of the config file). So I put a bit of evil into the mon recipe:

diff --git a/recipes/mon.rb b/recipes/mon.rb
index 5cd76de..a518830 100644
--- a/recipes/mon.rb
+++ b/recipes/mon.rb
@@ -61,6 +61,10 @@ EOH
   notifies :start, "service[ceph_mon]", :immediately
 end
 
+execute 'hack to force mon start' do
+  command "ceph-mon --cluster=ceph -i #{node['hostname']}"
+end
+
 ruby_block "tell ceph-mon about its peers" do
   block do
     mon_addresses = get_mon_addresses()

Try again:

# knife ssh name:ceph-0.example.com -x root chef-client
[2013-04-11T15:10:43+00:00] INFO: *** Chef 10.24.0 ***
[2013-04-11T15:10:44+00:00] INFO: Run List is [role[ceph-mon], role[ceph-osd], role[ceph-mds]]
[2013-04-11T15:10:44+00:00] INFO: Run List expands to [ceph::mon, ceph::osd, ceph::mds]
[2013-04-11T15:10:44+00:00] INFO: HTTP Request Returned 404 Not Found: No routes match the request: /reports/nodes/ceph-0.example.com/runs
[2013-04-11T15:10:44+00:00] INFO: Starting Chef Run for ceph-0.example.com
[2013-04-11T15:10:44+00:00] INFO: Running start handlers
[2013-04-11T15:10:44+00:00] INFO: Start handlers complete.
[2013-04-11T15:10:44+00:00] INFO: Loading cookbooks [apache2, apt, ceph]
[2013-04-11T15:10:44+00:00] INFO: Storing updated cookbooks/ceph/recipes/mon.rb in the cache.
No ceph-mon found.

[2013-04-11T15:10:44+00:00] INFO: Processing template[/etc/ceph/ceph.conf] action create (ceph::conf line 6)
[2013-04-11T15:10:44+00:00] INFO: Processing service[ceph_mon] action nothing (ceph::mon line 23)
[2013-04-11T15:10:44+00:00] INFO: Processing execute[ceph-mon mkfs] action run (ceph::mon line 40)
[2013-04-11T15:10:44+00:00] INFO: Processing execute[hack to force mon start] action run (ceph::mon line 65)
starting mon.ceph-0 rank 1 at 192.168.4.118:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-0 fsid f80aba97-26c5-4aa3-971e-09c5a3afa32f
[2013-04-11T15:10:44+00:00] INFO: execute[hack to force mon start] ran successfully
[2013-04-11T15:10:44+00:00] INFO: Processing ruby_block[tell ceph-mon about its peers] action create (ceph::mon line 69)
adding peer 192.168.4.118:6789/0 to list: 192.168.4.117:6789/0,192.168.4.118:6789/0

adding peer 192.168.4.117:6789/0 to list: 192.168.4.117:6789/0,192.168.4.118:6789/0

[2013-04-11T15:10:44+00:00] INFO: ruby_block[tell ceph-mon about its peers] called
[2013-04-11T15:10:44+00:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 84)
2013-04-11 15:10:44.432266 7f8f9f8c0700  0 -- :/25965 >> 192.168.4.117:6789/0 pipe(0x16d9d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:10:50.433053 7f8f9f7bf700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94001d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:10:56.433268 7f8fa5e65700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94001d30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:11:02.433987 7f8f9f8c0700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94002db0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-11 15:11:08.434358 7f8f9f7bf700  0 -- 192.168.4.118:0/25965 >> 192.168.4.117:6789/0 pipe(0x7f8f94004fb0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

At this point it’s stalled, presumably waiting to talk to the other mon, so in another terminal window I had to kick off a chef-client run on ceph-1 to get it into the same state as ceph-0 (knife ssh name:ceph-1.example.com -x root chef-client). This allowed both nodes to progress to the next problem:

2013-04-11 15:11:28.563438 7f8fa5e67780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-11 15:11:28.563443 7f8fa5e67780 -1 unable to authenticate as client.admin
2013-04-11 15:11:28.563814 7f8fa5e67780 -1 ceph_tool_common_init failed.
2013-04-11 15:11:29.572208 7f2369130780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-11 15:11:29.572210 7f2369130780 -1 unable to authenticate as client.admin
2013-04-11 15:11:29.572527 7f2369130780 -1 ceph_tool_common_init failed.
2013-04-11 15:11:31.380073 7f1907d18780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2013-04-11 15:11:31.380078 7f1907d18780 -1 unable to authenticate as client.admin
2013-04-11 15:11:31.380720 7f1907d18780 -1 ceph_tool_common_init failed.
2013-04-11 15:11:32.392345 7fc2bc462780 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
[...]

And we’re spinning again.

But that’s enough for one day.

Hackweek 9: Ceph Appliance Odyssey

This week is SUSE Hack Week 9. I wanted to spend some time working on a Ceph appliance image to make it easy to play with Ceph on openSUSE and/or SLES.

I tried making a SLES 11 SP2 appliance with SUSE Studio. I had to add the filesystems and devel:libraries:c_c++ repos from OBS to get reasonably up-to-date Ceph 0.56 and libboost_thread.so.1.49.0, but on boot when the appliance tried to expand its root filesystem, it died claiming it couldn’t load libe2p.so.2. Studio claims to be pulling in e2fsprogs from both the SP2 Updates and filesystems repos, so maybe that’s the problem. It seems impossible to choose one or the other, as they are the same version. (Update: it was just pointed out to me that you can click the little box next to the version number to choose which one is installed – must try again.)

So I left that alone and tried an openSUSE 12.3 appliance. The filesystems/ceph build for 12.3 is disabled, so I branched it and kicked off a build which failed with an exciting OOM error:

[ 3831s] [ 3803.167109] Out of memory: Kill process 16364 (cc1plus) score 254 or sacrifice child
[ 3831s] [ 3803.167959] Killed process 16364 (cc1plus) total-vm:825128kB, anon-rss:168760kB, file-rss:4kB
[ 3831s] g++: internal compiler error: Killed (program cc1plus)
[ 3831s] Please submit a full bug report,
[ 3831s] with preprocessed source if appropriate.
[ 3831s] See  for instructions.

Guess I should do what it says and file a bug. But I really did want something to play with immediately, so I added http://ceph.com/rpm/opensuse12/x86_64/ as a repo, and pulled in the upstream Ceph 0.56 RPMs. This seems to have worked and given me an openSUSE 12.3 image I can use to run through the Ceph 5-Minute Quick Start, Block Device Quick Start and CephFS Quick Start. So, here’s my extremely terse openSUSEified version of those quick start documents:

5-Minute Quick Start

Deploy the Appliance Image

I’m doing this with a couple of VMs, so in my case I make a couple of copies of the image:

# cp ~/openSUSE_12.3_Ceph_0.56.x86_64-0.0.3.qcow2 \
    /var/lib/libvirt/images/ceph-quickstart-server.qcow2
# cp ~/openSUSE_12.3_Ceph_0.56.x86_64-0.0.3.qcow2 \
    /var/lib/libvirt/images/ceph-quickstart-client.qcow2

Then I use virt-manager to create two VMs, backed by those images. Boot ’em up, log in (root password is “linux”), run yast network and set sensible hostnames (“ceph-client” and “ceph-server” instead of “linux-kjqd”, although admittedly those names wouldn’t be very sensible in a real deployment with more than one node).
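If you want more than two VMs, the copy step is easy to loop. Here’s a minimal sketch that only prints the cp commands rather than running them; image_copy_commands is a made-up helper name, and the target path pattern is just the one used above.

```shell
# Print (not run) one cp command per VM role.
# image_copy_commands is a hypothetical helper; the destination path
# pattern matches the two copies made by hand above.
image_copy_commands() {
    src=$1; shift
    for role in "$@"; do
        printf 'cp %s /var/lib/libvirt/images/ceph-quickstart-%s.qcow2\n' "$src" "$role"
    done
}

image_copy_commands ~/openSUSE_12.3_Ceph_0.56.x86_64-0.0.3.qcow2 server client
```

Pipe the output to sh as root once it looks right.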

Edit the Configuration File

The appliance image includes the /etc/ceph/ceph.conf file from the original 5-minute quick start, so log in to ceph-server, edit that file and replace {hostname} and {ip-address} with their real values, then copy the configuration file to ceph-client:

# scp /etc/ceph/ceph.conf ceph-client:/etc/ceph/
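If you’d rather not edit the file by hand, the placeholder substitution can be sketched with sed. This is only an illustration: fill_ceph_conf is a made-up helper, the two-line fragment stands in for the real file, and the IP address is an example value.

```shell
# Fill in the {hostname} and {ip-address} placeholders, stdin to stdout.
# fill_ceph_conf is a hypothetical helper, not part of Ceph.
fill_ceph_conf() {
    sed -e "s/{hostname}/$1/g" -e "s/{ip-address}/$2/g"
}

# Example on a small fragment rather than the real /etc/ceph/ceph.conf:
printf 'host = {hostname}\nmon addr = {ip-address}:6789\n' |
    fill_ceph_conf ceph-server 192.168.4.117
```

For the real file, redirect from a copy and write to a new file; redirecting input and output to the same path would truncate it before sed reads it.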

Deploy the Configuration

On ceph-server, create directories for each daemon:

# mkdir -p /var/lib/ceph/osd/ceph-0
# mkdir -p /var/lib/ceph/osd/ceph-1
# mkdir -p /var/lib/ceph/mon/ceph-a
# mkdir -p /var/lib/ceph/mds/ceph-a

Still on ceph-server, run the following:

# cd /etc/ceph
# mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

Start Ceph

On ceph-server:

# chkconfig ceph on
# rcceph start
# ceph health

This will initially show something like:

HEALTH_ERR 576 pgs stuck inactive; 576 pgs stuck unclean; no osds

Eventually it will say HEALTH_OK and you’re good to go.
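Rather than re-running ceph health by hand, the wait can be sketched as a small polling loop. wait_for_health_ok is a made-up helper; the command to poll is passed in as arguments so the sketch can be tried without a cluster.

```shell
# Poll a health command until its output starts with HEALTH_OK, or give up.
# Usage: wait_for_health_ok TRIES DELAY_SECONDS COMMAND [ARGS...]
# wait_for_health_ok is a hypothetical helper, not a Ceph tool.
wait_for_health_ok() {
    tries=$1; delay=$2; shift 2
    health=
    while [ "$tries" -gt 0 ]; do
        health=$("$@" 2>/dev/null)
        case $health in
            HEALTH_OK*) echo "$health"; return 0 ;;
        esac
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    echo "gave up; last status: $health" >&2
    return 1
}

# Stand-in for "ceph health" so the sketch runs anywhere:
fake_health() { echo "HEALTH_OK"; }
wait_for_health_ok 5 0 fake_health
```

On ceph-server the real call would be something like wait_for_health_ok 60 5 ceph health.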

Copy the Keyring to the Client

This is necessary for authentication:

# scp /etc/ceph/ceph.keyring ceph-client:/etc/ceph/

Block Device Quick Start

On ceph-client:

# rbd create foo --size 4096
# modprobe rbd
# rbd map foo --pool rbd --name client.admin
# mkfs.ext4 -m0 /dev/rbd1
# mkdir /mnt/myrbd
# mount /dev/rbd1 /mnt/myrbd

(Why is this /dev/rbd1, not /dev/rbd/rbd/foo as in the original quick start?)
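One possible explanation, which I haven’t verified on this image, is that the /dev/rbd/<pool>/<image> paths are symlinks created by a udev rule that isn’t installed here. Either way, the kernel driver should expose the mapping in sysfs, so the device can be looked up instead of guessed. A rough sketch, where rbd_device_for is a made-up helper and the sysfs root is a parameter so it can be tried against a mock directory:

```shell
# Find which /dev/rbdN belongs to a pool/image by reading the "pool" and
# "name" files the rbd kernel driver exposes per mapped device.
# rbd_device_for is a hypothetical helper, not part of Ceph.
rbd_device_for() {
    pool=$1; image=$2; sysroot=${3:-/sys/bus/rbd/devices}
    for dev in "$sysroot"/*; do
        [ -r "$dev/pool" ] && [ -r "$dev/name" ] || continue
        if [ "$(cat "$dev/pool")" = "$pool" ] && [ "$(cat "$dev/name")" = "$image" ]; then
            echo "/dev/rbd${dev##*/}"
            return 0
        fi
    done
    return 1
}

# Mock sysfs tree standing in for a host with "foo" mapped as device 1:
mock=$(mktemp -d)
mkdir -p "$mock/1"
echo rbd > "$mock/1/pool"
echo foo > "$mock/1/name"
rbd_device_for rbd foo "$mock"
```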

CephFS Quick Start

On ceph-client (kernel driver, not FUSE):

# mkdir /mnt/mycephfs
# mount -t ceph -o name=admin,secret=$(ceph-authtool \
    --name client.admin /etc/ceph/ceph.keyring --print-key) \
    ceph-server:/ /mnt/mycephfs

Interestingly, this gives “mount: error writing /etc/mtab: Invalid argument”, but still seems to actually mount the filesystem.

Also note that Ceph appears to have 32GB of space to use, even though ceph-server only has a 16GB root partition. I rather think that’s because there are two OSDs, both running off the root filesystem rather than on separate disks/filesystems, so the same space gets counted twice. I assume this is one of those Don’t Try This At Home things.
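If that guess is right, the arithmetic is just the number of OSDs times the size of the filesystem they share. A trivial sketch (reported_capacity_gb is a made-up name, and the assumption that each OSD reports the full size of its backing filesystem is mine):

```shell
# Assumed model: each OSD reports the size of the filesystem its data
# directory lives on, so OSDs sharing one filesystem each count it once.
# reported_capacity_gb is a hypothetical helper for the arithmetic only.
reported_capacity_gb() {
    osds=$1; fs_size_gb=$2
    echo $((osds * fs_size_gb))
}

reported_capacity_gb 2 16   # two OSDs on one 16GB root filesystem
```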

 

A Real openSUSE 12.2 Overo Image

Carrying on from my last post, with advice from Alexander Graf on #opensuse-arm, I was able to put together a reasonable looking pre-built openSUSE 12.2 image for the Gumstix Overo.  This is currently in home:tserong:branches:openSUSE:12.2:ARM on OBS, and will presumably remain there until next time I’m able to hack on this.

The two things that needed doing were:

  1. A u-boot-omap3overo package, which is u-boot-omap4panda with an extra spec file and a small patch to add some load addresses and make the default environment boot off ext2 instead of FAT:
Index: u-boot-2012.04.01/include/configs/omap3_overo.h
===================================================================
--- u-boot-2012.04.01.orig/include/configs/omap3_overo.h
+++ u-boot-2012.04.01/include/configs/omap3_overo.h
@@ -148,6 +148,8 @@

 #define CONFIG_EXTRA_ENV_SETTINGS \
 	"loadaddr=0x82000000\0" \
+	"kerneladdr=0x80200000\0" \
+	"ramdiskaddr=0x81000000\0" \
 	"console=ttyO2,115200n8\0" \
 	"mpurate=500\0" \
 	"optargs=\0" \
@@ -175,10 +177,10 @@
 		"omapdss.def_disp=${defaultdisplay} " \
 		"root=${nandroot} " \
 		"rootfstype=${nandrootfstype}\0" \
-	"loadbootscript=fatload mmc ${mmcdev} ${loadaddr} boot.scr\0" \
+	"loadbootscript=ext2load mmc ${mmcdev} ${loadaddr} boot.scr\0" \
 	"bootscript=echo Running bootscript from mmc ...; " \
 		"source ${loadaddr}\0" \
-	"loaduimage=fatload mmc ${mmcdev} ${loadaddr} uImage\0" \
+	"loaduimage=ext2load mmc ${mmcdev} ${loadaddr} uImage\0" \
 	"mmcboot=echo Booting from mmc ...; " \
 		"run mmcargs; " \
 		"bootm ${loadaddr}\0" \
  2. A JeOS-overo image, which is just another KIWI file for the base ARM JeOS image with a few tweaks to make it use my u-boot-omap3overo package and the right kernel.  To give an idea, the diff between JeOS-panda.kiwi and JeOS-overo.kiwi is:
--- JeOS-panda.kiwi	2012-09-24 22:47:28.448247822 +1000
+++ JeOS-overo.kiwi	2012-09-24 22:49:16.844588795 +1000
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<image schemaversion="5.3" name="openSUSE-12.2-ARM-panda">
+<image schemaversion="5.3" name="openSUSE-12.2-ARM-overo">
   <!--
  *****************************************************************************
  *****************************************************************************
@@ -13,11 +13,11 @@
     <author>Marcus Schäfer</author>
     <contact>ms@novell.com</contact>
     <specification>
-   openSUSE 12.2 image for ARM (panda) boards
+   openSUSE 12.2 image for ARM (overo) boards
   </specification>
   </description>
   <preferences>
-    <type image="oem" filesystem="ext3" boot="oemboot/suse-12.2" bootloader="uboot" bootkernel="omap4panda" kernelcmdline="console=ttyO2 vram=16M">
+    <type image="oem" filesystem="ext3" boot="oemboot/suse-12.2" bootloader="uboot" kernelcmdline="console=ttyO2 vram=12M">
       <oemconfig>
         <oem-swapsize>500</oem-swapsize>
       </oemconfig>
@@ -35,6 +35,9 @@
     <user pwd="$1$wYJUgpM5$RXMMeASDc035eX.NbYWFl0" home="/root" name="root"/>
   </users>
   <repository type="rpm-md">
+    <source path="obs://home:tserong:branches:openSUSE:12.2:ARM/standard"/>
+  </repository>
+  <repository type="rpm-md">
     <source path="obs://openSUSE:12.2:ARM/standard"/>
   </repository>
   <!-- dont remove qemu binfmt helpers from initrd -->
@@ -44,7 +47,7 @@
   </strip>
   <packages type="bootstrap">
     <package name="kernel-omap2plus" bootinclude="true"/>
-    <package name="u-boot-omap4panda"/>
+    <package name="u-boot-omap3overo"/>
     <package name="aaa_base"/>
     <package name="aaa_base-extras"/>
     <package name="branding-openSUSE"/>

I couldn’t specify bootkernel="omap3overo" because that profile doesn’t exist in kiwi. Leaving this out, combined with the reference to my repository and <package name="kernel-omap2plus" bootinclude="true"/> miraculously gave me the right kernel.

To actually get it running, first the image went onto a MicroSD card:

# xzcat openSUSE-12.2-ARM-overo.armv7l-1.12.1-Build1.17.3.raw.xz | dd bs=4M of=/dev/mmcblk0
# sync
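Since dd will happily overwrite whatever it’s pointed at, a small guard before the write doesn’t hurt. This is a sketch only: check_target is a made-up helper, it never writes anything, and the /proc/mounts check is a simple prefix match.

```shell
# Refuse anything that is not a block device, or a device that (or a
# partition of which) currently appears in the mounts table.
# check_target is a hypothetical helper; it only checks, it never writes.
check_target() {
    target=$1; mounts=${2:-/proc/mounts}
    if [ ! -b "$target" ]; then
        echo "$target is not a block device" >&2
        return 1
    fi
    if [ -r "$mounts" ] && grep -q "^$target" "$mounts"; then
        echo "$target (or a partition on it) is mounted" >&2
        return 1
    fi
    return 0
}

# A character device such as /dev/null is rejected:
check_target /dev/null || echo "refused"
```

In real use it would sit in front of the pipeline, along the lines of check_target /dev/mmcblk0 && xzcat image.raw.xz | dd bs=4M of=/dev/mmcblk0.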

Then the card went into the Overo, power was applied and it immediately failed to boot, because X-Loader couldn’t read the boot sector on the MicroSD.

Texas Instruments X-Loader 1.4.4ss (Oct 20 2010 - 10:10:28)
OMAP3530-GP ES3.1
Board revision: 0
Reading boot sector
Error: reading boot sector
Loading u-boot.bin from nand

U-Boot 2010.09 (Oct 20 2010 - 10:11:49)

OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
Gumstix Overo board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
NAND:  256 MiB
In:    serial
Out:   serial
Err:   serial
Board revision: 0
Tranceiver detected on mmc2
timed out in wait_for_bb: I2C_STAT=1000
timed out in wait_for_bb: I2C_STAT=1000
timed out in wait_for_pin: I2C_STAT=1000
I2C read: I/O error
Unrecognized expansion board
Die ID #529e0004000000000403a1f303025013
Net:   smc911x-0
Hit any key to stop autoboot:  0 
Overo #

But, the old u-boot in my Overo’s NAND is sufficiently advanced that it can read from an ext2 partition, so I was able to do this:

Overo # mmc init
mmc1 is available
Overo # setenv kerneladdr 0x80200000
Overo # setenv ramdiskaddr 0x81000000
Overo # ext2load mmc1 0:1 ${loadaddr} boot.scr
Loading file "boot.scr" from mmc1 device 0:1 (xxa1)
541 bytes read
Overo # run bootscript
Running bootscript from mmc ...
## Executing script at 82000000
kerneladdr=0x80200000
ramdiskaddr=0x81000000
Loading file "boot/linux.vmx" from mmc device 0:1 (xxa1)
4063704 bytes read
Loading file "boot/initrd.uboot" from mmc device 0:1 (xxa1)
38289754 bytes read
## Booting kernel from Legacy Image at 80200000 ...
   Image Name:   Linux-3.4.6-2.10-omap2plus
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:    4063640 Bytes = 3.9 MiB
   Load Address: 80008000
   Entry Point:  80008000
   Verifying Checksum ... OK
## Loading init Ramdisk from Legacy Image at 81000000 ...
   Image Name:   Initrd
   Image Type:   ARM Linux RAMDisk Image (uncompressed)
   Data Size:    38289690 Bytes = 36.5 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
   Loading Kernel Image ... OK
OK

Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.4.6-2.10-omap2plus (abuild@build14) (gcc version 4.7.1 20120723 [gcc-4_7-branch revision 189773] (SUSE Linux) ) #2 SMP Sat Sep 8 06:38:16 UTC 2012
[    0.000000] CPU: ARMv7 Processor [411fc083] revision 3 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
[    0.000000] Machine: Gumstix Overo
...
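As a sanity check on those load addresses, here’s a rough sketch that adds each image size from the log above to its load address and compares the result against the next address in use. The helper name fits_below is made up for illustration; the numbers are straight from the log.

```shell
# Report whether a blob of SIZE bytes loaded at START stays below LIMIT.
# Usage: fits_below START SIZE LIMIT (hex or decimal)
# fits_below is a hypothetical helper, not a U-Boot or KIWI tool.
fits_below() {
    end=$(( $1 + $2 ))
    limit=$(( $3 ))
    if [ "$end" -le "$limit" ]; then
        printf 'ok: ends at 0x%x\n' "$end"
    else
        printf 'overlap: ends at 0x%x, past 0x%x\n' "$end" "$limit"
    fi
}

fits_below 0x80200000 4063704 0x81000000    # kernel vs ramdiskaddr
fits_below 0x81000000 38289754 0x82000000   # initrd vs loadaddr
```

With these sizes the kernel fits comfortably below ramdiskaddr, but the 36.5 MiB initrd loaded at ramdiskaddr runs past loadaddr, which is where boot.scr was sitting. The boot evidently got through regardless, so treat this as something to be aware of rather than a diagnosed problem.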

And away we go! Interestingly, while KIWI was doing its bootstrap thing, I got a bunch of I/O errors:

...
[    9.048309] mmc0: new high speed SDHC card at address 0001
[    9.061431] mmcblk0: mmc0:0001 00000 7.41 GiB 
[    9.075683]  mmcblk0: p1 p2
[    9.162902] mmc1: new SDIO card at address 0001
setterm: cannot (un)set powersave mode: Invalid argument
Loading KIWI OEM Boot-System...
-------------------------------
Creating device nodes with udev
[   11.118896] udevd[168]: starting version 182
udevd[172]: ctx=0x4ae4d8 path=/lib/modules/3.4.6-2.10-omap2plus/kernel/drivers/net/wireless/libertas/libertas.ko error=No such file or directory
^M
[   11.648712] twl4030_usb twl4030_usb: Initialized TWL4030 USB module
[    4.950012] Including oem partition info file
[    5.145965] Searching for boot device...
[   16.582397] end_request: I/O error, dev mtdblock0, sector 0
[   16.588317] Buffer I/O error on device mtdblock0, logical block 0
[   16.595031] uncorrectable error : 
[   16.598480] end_request: I/O error, dev mtdblock0, sector 8
[   16.604522] Buffer I/O error on device mtdblock0, logical block 1
[   16.611145] uncorrectable error : 
[   16.614562] end_request: I/O error, dev mtdblock0, sector 16
[   16.620727] Buffer I/O error on device mtdblock0, logical block 2
[   16.627349] end_request: I/O error, dev mtdblock0, sector 24
[   16.633300] Buffer I/O error on device mtdblock0, logical block 3
[   16.640106] end_request: I/O error, dev mtdblock0, sector 0
[   16.645996] Buffer I/O error on device mtdblock0, logical block 0
[   17.727172] kjournald starting.  Commit interval 5 seconds
[   17.733154] EXT3-fs (mmcblk0p1): mounted filesystem with ordered data mode
modprobe: FATAL: Could not read '/lib/modules/3.4.6-2.10-omap2plus/kernel/fs/fat/vfat.ko': No such file or directory
(...modprobe error repeated several times...)
[   20.200988] end_request: I/O error, dev mtdblock0, sector 0
[   20.206939] Buffer I/O error on device mtdblock0, logical block 0
[   20.213684] uncorrectable error : 
[   20.217102] end_request: I/O error, dev mtdblock0, sector 8
[   20.223175] Buffer I/O error on device mtdblock0, logical block 1
[   20.229827] uncorrectable error : 
[   20.233245] end_request: I/O error, dev mtdblock0, sector 16
[   20.239410] Buffer I/O error on device mtdblock0, logical block 2
[   20.246032] end_request: I/O error, dev mtdblock0, sector 24
[   20.251983] Buffer I/O error on device mtdblock0, logical block 3
[   20.258880] end_request: I/O error, dev mtdblock0, sector 0
[   20.264770] Buffer I/O error on device mtdblock0, logical block 0
[    9.848419] Found boot device: /dev/mmcblk0
[   12.598419] Repartition the disk according to real geometry [ parted ]
[   16.442352] Repartition the disk according to real geometry [ parted ]
[   30.380310]  mmcblk0: p1 p2 p3
[   20.821899] Activating swap space on /dev/mmcblk0p3
[   21.109192] Filesystem of OEM system is: ext3 -> /dev/mmcblk0p2
[   21.243927] Resize EXT3 filesystem to full partition space...
/dev/mmcblk0p2: clean, 17708/49056 files, 119867/195839 blocks
Resizing the filesystem on /dev/mmcblk0p2 to 1776688 (4k) blocks.
Begin pass 1 (max = 49)
Extending the inode table     XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/mmcblk0p2 is now 1776688 blocks long.

[   99.565460] kjournald starting.  Commit interval 5 seconds
[   99.578399] EXT3-fs (mmcblk0p2): using internal journal
[   99.584350] EXT3-fs (mmcblk0p2): mounted filesystem with ordered data mode
[  100.448394] kjournald starting.  Commit interval 5 seconds
[  100.458648] EXT3-fs (mmcblk0p1): using internal journal
[  100.464263] EXT3-fs (mmcblk0p1): mounted filesystem with ordered data mode
[  108.013458] kjournald starting.  Commit interval 5 seconds
[  108.025207] EXT3-fs (mmcblk0p1): using internal journal
[  108.030761] EXT3-fs (mmcblk0p1): mounted filesystem with ordered data mode
/dev/mmcblk0p1: LABEL="BOOT" UUID="6505f4d4-6000-4609-899f-9b6634916922" SEC_TYPE="ext2" TYPE="ext3"
[   98.791229] Creating boot loader configuration
[  101.126831] Activating Image: [/dev/mmcblk0p2]

Anyway, it did boot successfully after that and I was able to log in, poke around a bit, and reboot. During reboot I got a couple of kernel oopses, same as (or presumably the same as) with my previous manually constructed image:

[  353.409118] Restarting system.
[  353.412963] Internal error: Oops: 80000007 [#1] SMP ARM
[  353.418640] Modules linked in: af_packet autofs4 dm_mod omapdrm(C) drm_kms_helper snd_soc_twl4030 snd_soc_core drm regmap_spi libertas_sdio snd_pcm fb_sys_fops sysimgblt sysfillrect libertas syscopyarea snd_timer snd cfg80211 soundcore snd_page_alloc rfkill twl4030_wdt lib80211 twl4030_usb
[  353.446960] CPU: 0    Tainted: G         C    (3.4.6-2.10-omap2plus #2)
[  353.454162] PC is at 0x0
[  353.456970] LR is at smp_send_stop+0x50/0xe4
[  353.461639] pc : [<00000000>]    lr : [<c0019378>]    psr: 600f0013
[  353.461669] sp : cc82be60  ip : 00000000  fp : 00012fc0
[  353.474060] r10: c07dd930  r9 : cc82a000  r8 : 4321fedc
[  353.479766] r7 : 45584543  r6 : cc82be64  r5 : c07a8ee0  r4 : 000f4241
[  353.486846] r3 : 00000000  r2 : 00000000  r1 : 00000006  r0 : cc82be64
[  353.493927] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  353.501647] Control: 10c5387d  Table: 838b4019  DAC: 00000015
[  353.507904] Process systemd-shutdow (pid: 1, stack limit = 0xcc82a2f8)
[  353.514984] Stack: (0xcc82be60 to 0xcc82c000)
[  353.519744] be60: 00000000 00000000 00000000 00000000 cc82a000 c00152f4 00000000 00000000
[  353.528656] be80: 01234567 c0051dd4 cc82bea8 c86a9dd8 0000002b 00000005 002dc6c0 00000000
[  353.537536] bea0: 0112a880 00000000 c0a868c8 c0071e48 c07a8f58 00000000 c2c56078 cc81ed4c
[  353.546417] bec0: c0a868c8 cc8bbaf8 c000e788 fffffffe 00000000 c050db8c cc8ca01c cc81eac0
[  353.555328] bee0: cc82a000 c07a9ce8 00000000 c0788880 c8539580 cc82a000 cc82bfac c050ad08
[  353.564208] bf00: cc82bf24 c0069e08 00000001 c07a9ce8 c07a9ce8 00000000 c0013e60 c0788880
[  353.573089] bf20: 42cb86f0 00000052 c0788880 c0788880 00000000 00000000 c0786398 c0788880
[  353.582000] bf40: c53df680 c5377284 c080c5bc c0013fc8 cc82a000 00000000 00012fc0 c538e400
[  353.590881] bf60: c538e400 c538e800 c07e44d8 c011c050 00000001 00000000 00000000 00000000
[  353.599761] bf80: 00000024 c0013fc8 cc82a000 00000000 00000000 00000000 00000058 c0013fc8
[  353.608673] bfa0: 00000000 c0013e00 00000000 00000000 fee1dead 28121969 01234567 45584543
[  353.617553] bfc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00012fc0
[  353.626434] bfe0: b6eda2b0 be862944 0000b7dc b6eda2d0 600f0010 fee1dead 00fbff04 107ffe00
[  353.635437] [<c0019378>] (smp_send_stop+0x50/0xe4) from [<c00152f4>] (machine_restart+0xc/0x4c)
[  353.644958] [<c00152f4>] (machine_restart+0xc/0x4c) from [<c0051dd4>] (sys_reboot+0x174/0x1f4)
[  353.654357] [<c0051dd4>] (sys_reboot+0x174/0x1f4) from [<c0013e00>] (ret_fast_syscall+0x0/0x30)
[  353.663818] Code: bad PC value
[  353.667480] ---[ end trace a8f4050048b60a4e ]---
[  353.690063] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[  353.690093] 
[  353.700286] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[  353.709259] pgd = c0004000
[  353.712341] [00000000] *pgd=00000000
[  353.716308] Internal error: Oops: 80000007 [#2] SMP ARM
[  353.721984] Modules linked in: af_packet autofs4 dm_mod omapdrm(C) drm_kms_helper snd_soc_twl4030 snd_soc_core drm regmap_spi libertas_sdio snd_pcm fb_sys_fops sysimgblt sysfillrect libertas syscopyarea snd_timer snd cfg80211 soundcore snd_page_alloc rfkill twl4030_wdt lib80211 twl4030_usb
[  353.750274] CPU: 0    Tainted: G      D  C    (3.4.6-2.10-omap2plus #2)
[  353.757476] PC is at 0x0
[  353.760253] LR is at smp_send_stop+0x50/0xe4
[  353.764923] pc : [<00000000>]    lr : [<c0019378>]    psr: 600f0113
[  353.764953] sp : cc82bbd0  ip : 00000000  fp : cc82a000
[  353.777374] r10: fffffffc  r9 : cc81eac0  r8 : cc82bc83
[  353.783050] r7 : c078c040  r6 : cc82bbd4  r5 : c07a8ee0  r4 : 000f4241
[  353.790130] r3 : 00000000  r2 : 00000000  r1 : 00000006  r0 : cc82bbd4
[  353.797210] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  353.804962] Control: 10c5387d  Table: 838b4019  DAC: 00000015
[  353.811187] Process systemd-shutdow (pid: 1, stack limit = 0xcc82a2f8)
[  353.818267] Stack: (0xcc82bbd0 to 0xcc82c000)
[  353.823028] bbc0:                                     c078c040 00000000 c0819230 c07dde04
[  353.831939] bbe0: cc81eac0 c05025b8 fffffffc cc82bc14 cc81eac0 c07dde04 c07dde04 cc81eac0
[  353.840820] bc00: c078c040 cc82bc83 fffffffc c0042358 c0651090 0000000b 00000000 cc82dc00
[  353.849700] bc20: c0817a58 cc82bc38 cc82a000 00000001 fffffff8 cc81ed18 cc82bc38 cc82bc38
[  353.858612] bc40: cc82bc83 cc82be18 c0817a58 cc82a000 00000001 cc82bc83 fffffff8 fffffffc
[  353.867492] bc60: cc82a000 c0017e3c cc82a2f8 0000000b 5d383536 00000000 00000008 bf000000
[  353.876373] bc80: 627c3694 50206461 61762043 0065756c 00000000 c3a11000 cc82be18 00000000
[  353.885284] bca0: c8539580 80000007 cc82be18 00000000 00000000 00000028 cc82a000 00000000
[  353.894165] bcc0: c8539580 80000007 cc82be18 00000000 00000000 00000028 cc82a000 c05023e4
[  353.903045] bce0: cc81eac0 c050d8c4 00000000 00000000 00000000 cc892c08 c0313000 c85395b8
[  353.911956] bd00: 00010000 00000000 c0820022 600f0093 00000022 c030bf84 cc892c00 00000004
[  353.920837] bd20: 00000005 00000010 c07a891c c03138c0 cc82bd5c c00659a8 00050000 00000001
[  353.929718] bd40: 00000000 c07f5acc c0819680 000064bc c07dbcec 00000007 c050d6fc c07ae574
[  353.938629] bd60: 00000000 cc82be18 cc82a000 c07dd930 00012fc0 c0008408 000064de 000064de
[  353.947509] bd80: c0819680 600f0093 ffffff96 00000000 000064de c003e93c 00000019 000064de
[  353.956390] bda0: 00000052 0000000f cc82be14 cc82be14 00000028 c0819732 00000000 c0819747
[  353.965301] bdc0: c07dbcec c003eeb0 00000000 c0819732 2d177fff 0000002c 48d331bb 00000052
[  353.974182] bde0: 00000003 c0819680 00000000 3b9aca00 10624dd3 600f0013 00000010 0088de08
[  353.983062] be00: 00000000 600f0013 ffffffff cc82be4c 4321fedc c050c1b8 cc82be64 00000006
[  353.991973] be20: 00000000 00000000 000f4241 c07a8ee0 cc82be64 45584543 4321fedc cc82a000
[  354.000854] be40: c07dd930 00012fc0 00000000 cc82be60 c0019378 00000000 600f0013 ffffffff
[  354.009735] be60: 00000000 00000000 00000000 00000000 cc82a000 c00152f4 00000000 00000000
[  354.018646] be80: 01234567 c0051dd4 cc82bea8 c86a9dd8 0000002b 00000005 002dc6c0 00000000
[  354.027526] bea0: 0112a880 00000000 c0a868c8 c0071e48 c07a8f58 00000000 c2c56078 cc81ed4c
[  354.036437] bec0: c0a868c8 cc8bbaf8 c000e788 fffffffe 00000000 c050db8c cc8ca01c cc81eac0
[  354.045318] bee0: cc82a000 c07a9ce8 00000000 c0788880 c8539580 cc82a000 cc82bfac c050ad08
[  354.054199] bf00: cc82bf24 c0069e08 00000001 c07a9ce8 c07a9ce8 00000000 c0013e60 c0788880
[  354.063110] bf20: 42cb86f0 00000052 c0788880 c0788880 00000000 00000000 c0786398 c0788880
[  354.071990] bf40: c53df680 c5377284 c080c5bc c0013fc8 cc82a000 00000000 00012fc0 c538e400
[  354.080871] bf60: c538e400 c538e800 c07e44d8 c011c050 00000001 00000000 00000000 00000000
[  354.089752] bf80: 00000024 c0013fc8 cc82a000 00000000 00000000 00000000 00000058 c0013fc8
[  354.098663] bfa0: 00000000 c0013e00 00000000 00000000 fee1dead 28121969 01234567 45584543
[  354.107543] bfc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00012fc0
[  354.116424] bfe0: b6eda2b0 be862944 0000b7dc b6eda2d0 600f0010 fee1dead 00fbff04 107ffe00
[  354.125396] [<c0019378>] (smp_send_stop+0x50/0xe4) from [<c05025b8>] (panic+0x98/0x1cc)
[  354.134124] [<c05025b8>] (panic+0x98/0x1cc) from [<c0042358>] (do_exit+0x6f4/0x7f8)
[  354.142486] [<c0042358>] (do_exit+0x6f4/0x7f8) from [<c0017e3c>] (die+0x294/0x320)
[  354.150756] [<c0017e3c>] (die+0x294/0x320) from [<c05023e4>] (__do_kernel_fault.part.8+0x54/0x74)
[  354.160430] [<c05023e4>] (__do_kernel_fault.part.8+0x54/0x74) from [<c050d8c4>] (do_page_fault+0x1c8/0x3ac)
[  354.171020] [<c050d8c4>] (do_page_fault+0x1c8/0x3ac) from [<c0008408>] (do_PrefetchAbort+0x34/0x9c)
[  354.180877] [<c0008408>] (do_PrefetchAbort+0x34/0x9c) from [<c050c1b8>] (__pabt_svc+0x38/0x80)
[  354.190185] Exception stack(0xcc82be18 to 0xcc82be60)
[  354.195678] be00:                                                       cc82be64 00000006
[  354.204589] be20: 00000000 00000000 000f4241 c07a8ee0 cc82be64 45584543 4321fedc cc82a000
[  354.213470] be40: c07dd930 00012fc0 00000000 cc82be60 c0019378 00000000 600f0013 ffffffff
[  354.222381] [<c050c1b8>] (__pabt_svc+0x38/0x80) from [<c0019378>] (smp_send_stop+0x50/0xe4)
[  354.231506] [<c0019378>] (smp_send_stop+0x50/0xe4) from [<c00152f4>] (machine_restart+0xc/0x4c)
[  354.241027] [<c00152f4>] (machine_restart+0xc/0x4c) from [<c0051dd4>] (sys_reboot+0x174/0x1f4)
[  354.250427] [<c0051dd4>] (sys_reboot+0x174/0x1f4) from [<c0013e00>] (ret_fast_syscall+0x0/0x30)
[  354.259857] Code: bad PC value
[  354.263519] ---[ end trace a8f4050048b60a4f ]---
[  354.268707] Fixing recursive fault but reboot is needed!
[  354.383544] omap_i2c omap_i2c.3: timeout waiting for bus ready

So ignoring that for the moment, on the second boot (after KIWI had done its magic to finish setting up the MMC card), my new u-boot loaded. Note though that it’s still using the old environment variables at this point (which try to “fatload” in “loadbootscript”, which isn’t going to work), so that needed resetting:

U-Boot SPL 2012.04.01 (Sep 24 2012 - 12:48:02)
OMAP SD/MMC: 0
mkimage signature not found - ih_magic = ea000014

U-Boot 2012.04.01 (Sep 24 2012 - 12:48:02)

OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
Gumstix Overo board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
NAND:  256 MiB
MMC:   OMAP SD/MMC: 0
In:    serial
Out:   serial
Err:   serial
Board revision: 0
Tranceiver detected on mmc2
No EEPROM on expansion board
Die ID #529e0004000000000403a1f303025013
Net:   smc911x-0
Hit any key to stop autoboot:  0 
Overo # nand erase 240000 20000

NAND erase: device 0 offset 0x240000, size 0x20000
Erasing at 0x240000 -- 100% complete.
OK
Overo # reset
resetting ...

Third time, we can see the new default environment:

U-Boot SPL 2012.04.01 (Sep 24 2012 - 12:48:02)
OMAP SD/MMC: 0
mkimage signature not found - ih_magic = ea000014


U-Boot 2012.04.01 (Sep 24 2012 - 12:48:02)

OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
Gumstix Overo board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
NAND:  256 MiB
MMC:   OMAP SD/MMC: 0
*** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Board revision: 0
Tranceiver detected on mmc2
No EEPROM on expansion board
Die ID #529e0004000000000403a1f303025013
Net:   smc911x-0
Hit any key to stop autoboot:  0 
Overo # printenv
baudrate=115200
bootcmd=if mmc rescan ${mmcdev}; then if run loadbootscript; then run bootscript; else if run loaduimage; then run mmcboot; else run nandboot; fi; fi; else run nandboot; fi
bootdelay=5
bootscript=echo Running bootscript from mmc ...; source ${loadaddr}
console=ttyO2,115200n8
defaultdisplay=dvi
dieid#=529e0004000000000403a1f303025013
dvimode=1024x768MR-16@60
ethact=smc911x-0
kerneladdr=0x80200000
loadaddr=0x82000000
loadbootscript=ext2load mmc ${mmcdev} ${loadaddr} boot.scr
loaduimage=ext2load mmc ${mmcdev} ${loadaddr} uImage
mmcargs=setenv bootargs console=${console} ${optargs} mpurate=${mpurate} vram=${vram} omapfb.mode=dvi:${dvimode} omapdss.def_disp=${defaultdisplay} root=${mmcroot} rootfstype=${mmcrootfstype}
mmcboot=echo Booting from mmc ...; run mmcargs; bootm ${loadaddr}
mmcdev=0
mmcroot=/dev/mmcblk0p2 rw
mmcrootfstype=ext3 rootwait
mpurate=500
nandargs=setenv bootargs console=${console} ${optargs} mpurate=${mpurate} vram=${vram} omapfb.mode=dvi:${dvimode} omapdss.def_disp=${defaultdisplay} root=${nandroot} rootfstype=${nandrootfstype}
nandboot=echo Booting from nand ...; run nandargs; nand read ${loadaddr} 280000 400000; bootm ${loadaddr}
nandroot=ubi0:rootfs ubi.mtd=4
nandrootfstype=ubifs
ramdiskaddr=0x81000000
stderr=serial
stdin=serial
stdout=serial
vram=12M

Environment size: 1363/131068 bytes
Overo # saveenv
Saving Environment to NAND...
Erasing Nand...
Erasing at 0x240000 -- 100% complete.
Writing to Nand... done
Overo # reset
resetting ...
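The numbers here line up with the nand erase from the previous boot: saveenv writes at the same 0x240000 offset, and the 0x20000 byte region minus a 4-byte CRC gives the 131068 shown by printenv. A one-line sketch of that arithmetic (env_usable_bytes is a made-up name, and the 4-byte CRC header is my understanding of U-Boot’s environment layout rather than something shown in the log):

```shell
# Usable environment bytes = size of the NAND region minus the CRC32 header.
# env_usable_bytes is a hypothetical helper; the 4-byte CRC is an assumption
# about U-Boot's (non-redundant) environment layout.
env_usable_bytes() {
    echo $(( $1 - 4 ))
}

env_usable_bytes 0x20000   # prints 131068
```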

Fourth time’s the charm. Straight into loading the kernel and initrd without any manual intervention required:

U-Boot SPL 2012.04.01 (Sep 24 2012 - 12:48:02)
OMAP SD/MMC: 0
mkimage signature not found - ih_magic = ea000014


U-Boot 2012.04.01 (Sep 24 2012 - 12:48:02)

OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
Gumstix Overo board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
NAND:  256 MiB
MMC:   OMAP SD/MMC: 0
In:    serial
Out:   serial
Err:   serial
Board revision: 0
Tranceiver detected on mmc2
No EEPROM on expansion board
Die ID #529e0004000000000403a1f303025013
Net:   smc911x-0
Hit any key to stop autoboot:  0 
Loading file "boot.scr" from mmc device 0:1 (xxa1)
625 bytes read
Running bootscript from mmc ...
## Executing script at 82000000
kerneladdr=0x80200000
ramdiskaddr=0x81000000
Loading file "uImage" from mmc device 0:1 (xxa1)
4063704 bytes read
Loading file "initrd" from mmc device 0:1 (xxa1)
9087228 bytes read
## Booting kernel from Legacy Image at 80200000 ...
   Image Name:   Linux-3.4.6-2.10-omap2plus
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:    4063640 Bytes = 3.9 MiB
   Load Address: 80008000
   Entry Point:  80008000
   Verifying Checksum ... OK
## Loading init Ramdisk from Legacy Image at 81000000 ...
   Image Name:   Initrd
   Image Type:   ARM Linux RAMDisk Image (uncompressed)
   Data Size:    9087164 Bytes = 8.7 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
   Loading Kernel Image ... OK
OK

Starting kernel ...

Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.4.6-2.10-omap2plus (abuild@build14) (gcc version 4.7.1 20120723 [gcc-4_7-branch revision 189773] (SUSE Linux) ) #2 SMP Sat Sep 8 06:38:16 UTC 2012
[    0.000000] CPU: ARMv7 Processor [411fc083] revision 3 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
[    0.000000] Machine: Gumstix Overo
...

Then after a little while…

Welcome to openSUSE 12.2 "Mantis" - Kernel 3.4.6-2.10-omap2plus (ttyO2).

linux login: root
Password: *****
Last login: Sat Jan  1 01:07:59 on ttyO2
This is the Lime-JeOS 12.2 SuSE Linux System.
To upgrade your system call:

    zypper refresh
    zypper install -t product openSUSE-12.2

Have a lot of fun...
linux:~ # uname -a
Linux linux 3.4.6-2.10-omap2plus #2 SMP Sat Sep 8 06:38:16 UTC 2012 armv7l armv7l armv7l GNU/Linux
linux:~ # uptime
 02:01am  up   0:01,  1 user,  load average: 2.23, 0.77, 0.27

So I call that reasonable success. At least now I have a documented point to start from next time. Oh, and that little red LED? It’s a heartbeat indicator. So long as it’s blinking, the kernel hasn’t crashed.