Wednesday, December 22, 2010

Overnight DNS Issues Resolved

Overnight (or in the wee hours of the morning, depending on your time zone) our nameservers failed, apparently due to an upgrade of BIND (the DNS server) pushed by cPanel. The update configuration file (yum.conf) had been written to exclude BIND from automatic updates (so that upgrades can be done while we're watching and any issues fixed immediately), but that configuration file was overwritten at the same time, presumably also by cPanel.
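
For the technically curious, here is a rough sketch of how an overwritten exclusion could be caught automatically. It assumes the exclusion lives in the "exclude" option of the [main] section of yum.conf and uses shell-style patterns such as "bind*". The function name and sample contents are made up for illustration; this is not the actual check running on our servers.

```python
import configparser
import fnmatch


def package_is_excluded(yum_conf_text, package="bind"):
    """Return True if the [main] exclude list in yum.conf covers `package`.

    Assumption: excludes are space- or comma-separated glob patterns.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(yum_conf_text)
    if not parser.has_section("main"):
        return False
    raw = parser.get("main", "exclude", fallback="")
    patterns = raw.replace(",", " ").split()
    # fnmatchcase keeps the match case-sensitive on every platform
    return any(fnmatch.fnmatchcase(package, pattern) for pattern in patterns)


# Hypothetical file contents, before and after being overwritten
before = "[main]\ncachedir=/var/cache/yum\nexclude=bind* kernel*\n"
after = "[main]\ncachedir=/var/cache/yum\n"

print(package_is_excluded(before))
print(package_is_excluded(after))
```

Run on a schedule against the real file, a check like this could raise an alert as soon as the exclusion disappears, rather than after the next automatic update.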

The update erroneously added duplicate entries to the DNS server configuration file, which caused the failure.
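
As an illustration only, the sketch below shows one way duplicate entries could be spotted in a BIND configuration file. It assumes the duplicates are repeated zone declarations and that the file does not use views (with views, the same zone can legitimately appear more than once). The sample text is invented. BIND's own named-checkconf tool is the proper way to validate a real configuration.

```python
import re
from collections import Counter

# Matches the start of a zone declaration such as: zone "example.com" {
ZONE_RE = re.compile(r'^\s*zone\s+"([^"]+)"', re.MULTILINE)


def duplicate_zones(named_conf_text):
    """Return a sorted list of zone names declared more than once."""
    names = (name.lower() for name in ZONE_RE.findall(named_conf_text))
    counts = Counter(names)
    return sorted(name for name, seen in counts.items() if seen > 1)


# Invented sample with one zone declared twice
sample = '''
zone "example.com" {
    type master;
    file "/var/named/example.com.db";
};
zone "example.org" {
    type master;
    file "/var/named/example.org.db";
};
zone "example.com" {
    type master;
    file "/var/named/example.com.db";
};
'''

print(duplicate_zones(sample))
```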

For those unfamiliar with how DNS works, the (DNS) nameservers are what translate your domain name into the physical server's IP address. The physical server was technically up and functioning (email continued to be processed, etc.), which is why the typical error alerts that are sent by SMS after business hours didn't go out. In addition, the overnight support staff weren't set up to monitor the nameservers. Both of those have been rectified (and some new alerts added), so should this ever happen again it can be fixed immediately. A bug report will also be filed with cPanel support.
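
Because the server itself stayed up, a check that only asks "is the machine responding?" would not have caught this. A nameserver check has to ask the nameserver an actual DNS question. The following is a minimal sketch of that idea using only Python's standard library; the function names, the fixed query ID, and the example IP and domain are placeholders, and it is not the monitoring we actually deployed.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary fixed ID, fine for a one-off probe


def build_query(domain, query_id=QUERY_ID):
    """Build a minimal DNS query packet asking for the A record of `domain`."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def nameserver_answers(server_ip, domain, timeout=3.0):
    """Return True if `server_ip` returns at least one answer for `domain`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(domain), (server_ip, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    if len(data) < 12:  # shorter than a DNS header
        return False
    resp_id, flags, _qdcount, ancount = struct.unpack("!HHHH", data[:8])
    rcode = flags & 0x000F  # 0 means "no error"
    return resp_id == QUERY_ID and rcode == 0 and ancount > 0


if __name__ == "__main__":
    # Placeholder values: substitute a real nameserver IP and hosted domain
    if not nameserver_answers("192.0.2.53", "example.com"):
        print("ALERT: nameserver did not answer")
```

A probe like this, run every few minutes from a machine outside the server, could trigger the same SMS alerts that a full server outage does.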

Including backup DNS for all accounts (at no additional cost) has been in the works for some time, and using an external DNS provider has also been considered; it's just been a question of finding the right, reliable DNS provider. Testing and research of these options will be bumped up to a higher place on the "to do" list.

We apologize for the downtime and have taken steps to make sure this doesn't happen again.