You may or may not have heard, but a bash vulnerability called Shellshock was recently uncovered, and it is being actively exploited right now, which is why we patched relatively quickly and without any customer communication.
- Shellshock Bash bug exploitation in full swing, warn researchers
- Cisco, Oracle find dozens of their products affected by Shellshock
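For the curious, the widely circulated one-liner for checking whether a given bash is vulnerable looks like this (it is a generic probe, not anything specific to our servers):

```shell
# Probe for Shellshock (CVE-2014-6271): a vulnerable bash executes the
# code smuggled in after a function definition in an environment variable.
env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
# A patched bash prints only "this is a test"; a vulnerable one first
# prints "vulnerable" as well.
```

The danger is that anything which passes attacker-controlled data into an environment variable before invoking bash (CGI scripts being the classic case) hands the attacker remote code execution, which is why we treated the patch as urgent.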
We patched quickly, but we did test the patch in our dev environment. And actually, we patched several different servers yesterday afternoon, including the one that failed today. None exhibited any problems immediately after applying the patch.
So what happened?
It appears that the patch triggered a problem with a database driver on FreeBSD (which runs all of our DNS servers). Just after noon today we started to see “garbage” records in ns1, and when we looked further we saw that every record in ns1 was corrupt. The name server connects to our database to pull down the current records, and the bug in the database driver corrupted every zone it pulled.
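To illustrate what “garbage records” means in practice, here is a hypothetical sketch of the kind of sanity check that can flag corrupt data before a zone is served. The zone contents and file path are made up for illustration; this is not our actual tooling:

```shell
# Write a small sample of zone data, with one deliberately corrupt
# A record, to stand in for records pulled from the database.
cat > /tmp/zone-sample.db <<'EOF'
www   300 IN A 203.0.113.10
mail  300 IN A 203.0.113.25
ftp   300 IN A garbage^^data
EOF

# Flag any A record whose data field is not made of digits and dots.
bad=$(awk '$4 == "A" && $5 !~ /^[0-9.]+$/ {print $1}' /tmp/zone-sample.db)
echo "corrupt records: $bad"
```

A check along these lines only catches records that are malformed, not records that are well-formed but wrong, which is part of why rebuilding from a known-good source is often the faster path once corruption is suspected.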
We had a pretty good idea that’s what had happened, but since we weren’t completely sure at that point, we made the decision to rebuild ns1 from scratch. One of the benefits of doing that is that we still have the old ns1, so we can do some forensic work on it to determine exactly what happened.
Why do these problems keep happening?
The timing of these things is never good, but it’s worse when they happen in succession, as these have: this came right after the DDoS on September 22nd (and the email data corruption on a few servers a couple of weeks before that). The incidents are not related, but any one of them on its own would have been bad enough; we understand that.
We take a lot of preventative measures that you never see, because…well, they prevent problems from happening. But we cannot prevent every conceivable problem or dodge every bullet. Once in a while we’re going to get hit by something. You asked us to keep you informed via Twitter, Facebook and Google+, and I think we’re doing pretty well there. Keep in mind that the people posting there (myself included) are not system administrators, so we pass along as much information as we can get, but we may not have full details while a problem is happening.
Speaking of Twitter, Facebook and Google+…
The same few questions seemed to be asked by a lot of people, and I can’t answer them all individually, so let me address them here.
“Why are you doing this in the middle of the day?” A few reasons: first, the potential for exploitation is so great that we didn’t want to wait for a scheduled maintenance period. Second, it’s always the middle of someone’s day. Half of our users are outside of the U.S., so whenever we do something, it’s going to be bad timing for a good number of people. Finally, we did test the patch before deploying it, so we didn’t anticipate any issues.
“Why don’t you just roll back the patch?” The patch was applied almost 24 hours before we saw the corruption, so it wasn’t completely clear that the problem was caused by the patch. Our system administrators determined that they could rebuild ns1 in less time than they might spend troubleshooting, so that’s what they did. We can second-guess that decision, but there’s no way to know how long it would have taken to “fix” the old ns1.
“Why would you install an untested patch on a production server?” As I mentioned previously, we tested the patch in our dev environment and saw no problems with it. And it’s worth remembering that the patch worked without issue on every other server we installed it on. We’ll be doing more tests on the old ns1 to see if we can find out why it failed there but nowhere else.
– – –
When there’s an outage that affects a lot of you, we certainly understand that it’s bad news. We don’t take any kind of interruption for any number of users lightly, and there’s a lot of activity (and some shouting) in the halls here during those times. We never want to see anything fail, and when something does, everyone on our system administration team lends their particular expertise and works together as quickly as they can on a fix.
I know I can speak for everyone here when I say we appreciate you hanging in there with us during times when you’d probably rather be throwing rotten tomatoes at us. Hey, I get it. I’ve wanted to throw my share at any number of companies. We really do appreciate your continued loyalty and understanding.