I spent most of the day debugging ganglia when the problem wasn’t even ganglia, but I’m mentioning ganglia, and gmond, here so that if someone else searches google for the same problems that I was having using the same words, they’ll be able to find the appropriate solution.
tl;dr: If you can’t get your gmetric in xen using multicast to cluster across a linux bridged interface, disable multicast snooping in dom0.
echo 0 > /sys/devices/virtual/net/<bridge>/bridge/multicast_snooping
Anyways, the symptoms of the problem were that my ganglia cluster, when using multicast, would fail to glue itself together. It would fire up the first time and stay up for about 5 minutes or so, and then all the hosts would forget about each other, as if multicast worked at first and then stopped working after a bit. The rest of the internet didn’t have any answers to this problem.
If I searched for ‘linux bridging multicast,’ I found lots and lots of problems regarding multicast routing, but I don’t have any multicast routing, and that wasn’t my problem to begin with.
Even more frustrating was that a friend of mine has it all working in a similar, but not the same, xen environment, and he never encountered any of the problems I’m encountering. Turns out his environment was sufficiently different that it wouldn’t have ever happened anyway. His machine didn’t even have a multicast_snooping option. It’s possible that his kernel is too old to have it.
As soon as I turned off multicast snooping in dom0, the rest of the nodes started showing up and everything was working fantastically. On boot, I simply need to add a bit of logic to ensure that multicast_snooping is disabled.
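For the boot-time bit, something like the following is a rough sketch of what I mean, not a tested init script. The function name `disable_snooping` and the `SYSFS_NET` override are my own inventions for illustration. It loops over every bridge under the same sysfs path as the one-liner above and skips anything that doesn't have the option (like my friend's machine) or isn't writable.

```shell
#!/bin/sh
# Sketch: turn off multicast snooping on every Linux bridge in dom0.
# disable_snooping and SYSFS_NET are hypothetical names, not part of
# xen or ganglia; the sysfs path is the one used in the tl;dr.

disable_snooping() {
    net_dir="$1"
    for knob in "$net_dir"/*/bridge/multicast_snooping; do
        # Skip if the glob matched nothing, the kernel has no such
        # option, or we lack permission to write it (e.g. not root).
        [ -w "$knob" ] || continue
        echo 0 > "$knob"
    done
}

# Default to the real sysfs location; override SYSFS_NET for a dry run.
disable_snooping "${SYSFS_NET:-/sys/devices/virtual/net}"
```

It would have to run as root, and after the bridges exist, so wherever you hook it in (rc.local, a network post-up hook, whatever your distro uses) needs to come after xen sets up its bridging.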
Many hours of frustration lost, but I figured it out, and now I’m sharing it with you so you know why as well.