We had an outage on ns1 for almost 100 minutes today, and I thought I’d let you know what happened and answer the questions that came up often on social media.
You may or may not have heard, but there is a recently uncovered bash vulnerability called Shellshock, and it is being actively exploited right now, which is why we patched relatively quickly and without any advance customer communication.
- Shellshock Bash bug exploitation in full swing, warn researchers
- Cisco, Oracle find dozens of their products affected by Shellshock
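If you're curious whether a machine of your own is exposed, the widely published test for the original Shellshock bug (CVE-2014-6271) is to hand bash a crafted environment variable and see whether it runs the command hidden inside it. Here's a rough sketch of that check in Python (nothing specific to our systems, just the standard test; it assumes bash is installed and on the PATH):

```python
import os
import subprocess

# Illustrative check for the original Shellshock bug (CVE-2014-6271):
# a vulnerable bash executes the command smuggled in after the function
# definition in the environment variable; a patched bash does not.
env = dict(os.environ, testvar="() { :;}; echo vulnerable")

result = subprocess.run(
    ["bash", "-c", "echo shellshock check"],
    env=env,
    capture_output=True,
    text=True,
)

if "vulnerable" in result.stdout:
    print("this bash appears vulnerable to CVE-2014-6271")
else:
    print("this bash does not appear vulnerable to CVE-2014-6271")
```

A patched bash only imports the function definition and ignores the trailing command, so the word "vulnerable" never shows up in the output.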
We patched quickly, but we did test the patch in our dev environment. And actually, we patched several different servers yesterday afternoon, including the one that failed today. None exhibited any problems immediately after applying the patch.
So what happened?
It appears that there was a patch-related problem with a database driver on FreeBSD (which runs all of our DNS servers). Just after noon today we started to see “garbage” records in ns1, and when we looked further we saw that every record in ns1 was corrupt. The name server connects to our database to pull down the current records, and the bug in the database driver corrupted all of the zones as they were imported.
We had a pretty good idea that’s what had happened, but since we weren’t completely sure at that point, the decision was made to rebuild ns1 from scratch. One of the benefits of doing that is we still have the old ns1 so we can do some forensic work on it to determine exactly what happened.
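For the curious, and with the caveat that I'm describing this at arm's length: conceptually, the import that went wrong looks something like the sketch below — pull the zone records out of the database, sanity-check them, and only then write out the zone data. The table and column names here are invented for illustration; the real system (and the FreeBSD database driver that misbehaved) is obviously more involved.

```python
# Rough, illustrative sketch of a database-to-zone-file import.
# Table and column names are made up; this is not our actual code.
import sqlite3

def export_zone(db_path: str, zone: str, out_path: str) -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name, ttl, rtype, rdata FROM dns_records WHERE zone = ?",
        (zone,),
    ).fetchall()
    conn.close()

    lines = []
    for name, ttl, rtype, rdata in rows:
        # Basic sanity check: if the driver hands back garbage (empty
        # fields, non-numeric TTLs), refuse to publish the zone rather
        # than serving corrupt records.
        if not name or not rdata or not str(ttl).isdigit():
            raise ValueError(f"corrupt record in zone {zone}: {(name, ttl, rtype, rdata)!r}")
        lines.append(f"{name}\t{ttl}\tIN\t{rtype}\t{rdata}")

    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```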
Why do these problems keep happening?
The timing of these things is never good, but it’s worse when they happen in succession, as this one did, coming right after the DDoS on September 22nd (and the email data corruption on a few servers a couple of weeks before that). The incidents are not related, but any one of them on its own would have been bad enough; we understand that.
We take a lot of preventative measures that you never see, because…well, they prevent problems from happening. But we cannot prevent every conceivable problem or dodge every bullet. Once in a while we’re going to get hit by something. You asked us to keep you informed via Twitter, Facebook and Google+, and I think we’re doing pretty well there. Keep in mind that the people posting there (including myself) are not system administrators, so we’re giving you as much information as we can get, but we may not have full details while a problem is happening.
Speaking of Twitter, Facebook and Google+…
The same few questions seemed to be asked by a lot of people, and I can’t answer them all individually, so let me address them here.
“Why are you doing this in the middle of the day?” A few reasons: first, the potential for exploitation is so serious that we didn’t want to wait for a scheduled maintenance period. Second, it’s always the middle of someone’s day; half of our users are outside of the U.S., so whenever we do something, it’s going to be bad timing for a good number of people. Finally, we did test the patch before deploying it, so we didn’t anticipate any issues.
“Why don’t you just roll back the patch?” The patch was applied almost 24 hours before we saw the corruption, so it wasn’t completely clear that the problem was caused by the patch. System administrators determined that they could rebuild ns1 in less time than they might spend troubleshooting, so that’s what they did. We can second-guess that decision, but there’s no way to know how long it would have taken to “fix” the old ns1.
“Why would you install an untested patch on a production server?” As I mentioned previously, we tested the patch in our dev environment and saw no problems with it. And it’s worth remembering that the patch worked without issue on every other server we installed it on. We’ll be doing more tests on the old ns1 to see if we can find out why it failed there but nowhere else.
– – –
When there’s an outage that affects a lot of you, we certainly understand that it’s bad news. We don’t take any kind of interruption for any number of users lightly, and there’s a lot of activity (and some shouting) in the halls here during those times. We never want to see anything fail, and when something does, everyone on our system administration team lends their particular expertise and works together as quickly as they can on a fix.
I know I can speak for everyone here when I say we appreciate you hanging in there with us during times when you’d probably rather be throwing rotten tomatoes at us. Hey, I get it. I’ve wanted to throw my share at any number of companies. We really do appreciate your continued loyalty and understanding.
Thanks for the prompt post-mortem, which gives an honest, open description of what happened and why DASP took the actions it did; those actions are always easy to criticize from the outside.
You say that only ns1 was affected, but ns2 and ns3 were also unavailable; any one of them should have provided a fallback for name resolution.
So, you couldn’t redirect requests coming in to ns1 to the other name servers? Seems like that would have been a better solution than causing folks’ sites to drop off the ‘net for close to 2 hours.
Douglas, if there had been an option to avoid the downtime we would have taken it.
I’m not a system administrator so I can’t speak to why everything that was done was done. I can find out, but I suspect that ns1 feeds ns2 and ns3, so when they synced to ns1 they lost their zones as well (and when ns1 was fixed it could feed them the zone files properly).
We didn’t have zone file backups because…it just isn’t one of those things that is commonly backed up. If something happens to the name server, you can just pull the records from the database again. The problem yesterday was that ns1 was corrupting the data as it imported it from our database.
Now we have zone file backups, in case anyone was wondering. We’ll probably never need to use them, but then again, we didn’t think there was any need for them before, and we were proven wrong.
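In case anyone is curious what that looks like, a zone file backup doesn’t have to be fancy. Something along these lines keeps timestamped copies you can restore from (the paths are examples, not our actual layout):

```python
# Minimal sketch of a timestamped zone file backup.
# Paths are illustrative examples only.
import shutil
import time
from pathlib import Path

ZONE_DIR = Path("/etc/namedb/master")     # where the live zone files sit (example path)
BACKUP_DIR = Path("/var/backups/zones")   # where dated copies go (example path)

def backup_zones() -> Path:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = BACKUP_DIR / stamp
    dest.mkdir(parents=True, exist_ok=True)
    for zone_file in ZONE_DIR.glob("*.db"):
        shutil.copy2(zone_file, dest / zone_file.name)
    return dest
```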
This impacted client-facing production sites. This is not acceptable.
While I’m very frustrated with you guys as of late, I appreciate this blog entry and we’ll give you guys another shot.
I don’t blame you for being frustrated, and we all appreciate you sticking around.