

Myspace.com is one of the most popular sites on the internet (in the top ten, according to Media Metrix), and in June 2005 it began receiving massive numbers of reports from customers unable to reach the website: the problem seemed far too widespread to be an isolated incident. Many frustrating hours later, it was revealed that the cause was a DNS hijack by none other than SBC Internet.

This Tech Tip recounts the incident and how we got to the bottom of it.

"Something's wrong"

On Friday, 3 June 2005, just before 3 PM, a customer IM'd me to report that his users couldn't resolve his domain name (for this Tech Tip, I'm using example.com throughout rather than the real domain, myspace.com). He passed along a snippet that one of his customers had sent him:

Server:   dns1.1sanca.sbcglobal.net
Address:  206.13.29.12

dns1.1sanca.sbcglobal.net can't find www.example.com: Non-existent domain

At first I thought it was just a typo in the name of the nameserver:

Incorrect: dns1.1sanca.sbcglobal.net   (digit one)
Correct:   dns1.lsanca.sbcglobal.net   (lowercase L)

but even querying the properly-named server gave no answer. I have been an SBC California customer (in Pacific Bell territory) for years and know their DNS infrastructure fairly well; around here, 206.13.29.12 is well known as the LA resolver.

When getting strange answers from a DNS server, one usually starts by querying the SOA (Start of Authority) record. It contains housekeeping data for the zone, including (among other things) where the master data for the zone is kept. Asking the right place (at UltraDNS), we see:

$ dig +short example.com soa
udns1.ultradns.net. hostmaster.example.com. 2005060303 10800 3600 2592000 86400

We also see the serial number -- 2005060303 -- which by convention often encodes the date of the last manual update to the zone file: here, 2005-06-03, revision 03.

Asking the SBC nameserver for this should yield the same data, but it did not:

$ dig +short @dns1.lsanca.sbcglobal.net example.com soa
localhost. postmaster.localhost. 2004052400 3600 1800 604800 3600

Huh? Localhost? Year=2004? This makes no mention of example.com, and it just doesn't smell right even if you're not comparing it with the correct data.

A list of SBC California resolving nameservers made it easy to query them all at once for this SOA data with a short shell script:

checkdns.sh
#!/bin/sh
#
# checkdns.sh - query SBC nameservers for our SOA
#

# Each input line below is an IP address followed by a comment naming the
# server; "read" puts the address in $server and the rest in $junk.
while read server junk
do
        echo "=== $server"

        dig +norecurse +short @"$server" example.com soa

done <<EOF
206.13.28.12     # dns1.snfcca.sbcglobal.net
206.13.29.12     # dns1.lsanca.sbcglobal.net
206.13.30.12     # dns1.sndgca.sbcglobal.net
206.13.31.12     # dns1.scrmca.sbcglobal.net
63.202.63.72     # dns1.frsnca.sbcglobal.net
64.160.192.70    # dns1.bkfdca.sbcglobal.net
64.169.10.7      # dns1.renonv.sbcglobal.net
64.169.140.6     # dns1.sktnca.sbcglobal.net
EOF

Each one gave an identical "localhost" SOA response. Something was very wrong here: perhaps it was some kind of DNS poisoning, perhaps a previous DNS configuration had been rolled back, or maybe the SBC nameservers were getting bad data from UltraDNS.

Checking the root nameservers for the example.com delegation correctly shows two NS entries:

UDNS1.ULTRADNS.NET  204.69.234.1
UDNS2.ULTRADNS.NET  204.74.101.1

and both produced the right answers, but this doesn't go far enough. UltraDNS - like the root servers - uses a technique known as anycast, in which multiple, geographically diverse machines share the same IP address, relying on BGP routing to send each user to the closest physical server. When I query 204.69.234.1 I may reach a different machine than you would from elsewhere on the internet. It's a great system, optimized for both performance and robustness.
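For completeness, a quick loop in the same spirit as checkdns.sh above can repeat this check against both delegated addresses directly, skipping recursion (this is just a sketch of the check we performed):

for ns in 204.69.234.1 204.74.101.1
do
        echo "=== $ns"
        dig +norecurse +short @"$ns" example.com soa
done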

But what if there were some kind of synchronization issue between the real nameservers? This seemed unlikely, mainly because the UltraDNS folks are very much on the ball, but also because I'm on the SBC network myself: my queries should take roughly the same anycast path as the SBC nameservers' queries. If I see correct data from UltraDNS, so should SBC.

Putting our heads together

On a conference call with my manager at Example.com and the support engineer from UltraDNS, we were able to query the individual physical nameservers (each also has its own unique IP address), and it was clear that they were all working correctly. UltraDNS was not the problem here.

So this brings us back squarely to SBC. Through my own networks and via colleagues, we were able to make this same SOA query against other ISPs' nameservers - Cox, Broadwing, Verio, Earthlink, etc. - and in no case got the bogus data. Asking Example.com support to check the reports from their customers, it turned out that none of the complaints came from outside the SBC network.

So what might this be? DNS cache poisoning is an obvious guess, where the bad guy sends an answer to a question not asked, and the nameserver caches it as genuine, but this didn't feel likely either:

There wasn't any redirection to other websites
The usual reason for cache poisoning is to redirect users to a different place, such as a phishing website. But we found no alternate data in this zone - no MX, no A records, only an NS record pointing to localhost. Nothing to indicate any intent to redirect.
The nameservers all responded identically
I have limited experience with this, but these nameservers responded a bit too uniformly; nameservers all get restarted over time (which clears the cache), and I'd have expected to find some variation in the data.
The nameservers all reported authoritatively
Part of a DNS response is the AA ("Authoritative Answer") bit, which should be set (AA=1) only when we're actually querying the master nameservers for the zone: cached data obtained anywhere else should have AA=0. The SBC nameservers thought they were actually in charge of this zone (an example of checking this bit follows the list).
All nameservers reported BIND version 9.2.4
One can make a special kind of query to a BIND nameserver and find the version:
$ dig +short @dns1.lsanca.sbcglobal.net version.bind txt chaos
"9.2.4"
The BIND nameserver software can be configured to report anything to this query, so it may not be genuine, but BIND 9.2.4 is not known to be susceptible to DNS poisoning.
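
Circling back to the authoritative-answer point above: the flags line of a dig response header shows whether the answering server considers itself authoritative. The query below is real dig syntax, though the counts in the sample output are only illustrative:

$ dig +norecurse @dns1.lsanca.sbcglobal.net example.com soa | grep flags
;; flags: qr aa ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

A caching server merely relaying someone else's data would omit the "aa" flag.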

This was completely puzzling, and none of us on the conference call had any real idea what was going on other than to know that it involved SBC. We ended the call with a promise that I'd send the UltraDNS fellow all my detailed notes.

After the call I told the customer how impressed I was with the UltraDNS guy and how nice it was to talk with a truly competent DNS support professional. It turned out to be the company's CTO, Rodney Joffe. Smart, nice guys are a joy to work with.

At just after 5:30 PM, we got some news and the whole picture emerged. Apparently, SBC had identified a DoS attack on its customer-facing nameservers, recognizable by the same record being requested over and over.

I don't recall which resource record was being requested, but my best guess - and I am speculating here - is that SBC was seeing a DNS smurf-style (amplification) attack on another party, with SBC's own nameservers as the unwitting middlemen.

But this is strictly speculation: we have some indications that SBC may have been seeing an attack that more directly targeted example.com, but this is still unclear.
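
For background, the general shape of such an amplification attack is simple: the attacker sends many small queries with the victim's address forged as the source, and the resolver's much larger responses land on the victim. A record set that answers big makes a good lever. The response size is easy to see with dig; the command is real, though the size shown here is a hypothetical illustration rather than something we measured:

$ dig any example.com @206.13.29.12 | grep 'MSG SIZE'
;; MSG SIZE  rcvd: 1584

A response of that size, triggered by a query of a few dozen bytes, multiplies the attacker's bandwidth many times over.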

"Fixing" the problem

If this was the case - SBC saw their own DNS servers under attack - they had an absolutely legitimate interest in protecting their infrastructure. These machines were providing DNS to many thousands of SBC customers, and the attack had to be mitigated somehow.

But their response was as shockingly incompetent as it was disastrous for Example.com:

It appears clear that SBC installed an empty, authoritative zone for example.com and deployed it on their resolving nameservers systemwide.

And they didn't tell anybody.

Installing an empty zone as authoritative means that requests for data in that zone are answered directly ("we have nothing for you") rather than by recursively tracking down the real data from the real nameservers elsewhere. No more www, no mail server, nothing else -- for a property that normally passes hundreds of megabits/sec of traffic to SBC's network.
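
We never saw SBC's actual configuration, but the localhost SOA and NS data we observed are consistent with each resolver carrying a locally-defined master zone along these lines (a sketch only; the file name is made up):

// named.conf fragment on the resolving nameserver (illustrative)
zone "example.com" {
        type master;
        file "empty-example.zone";
};

; empty-example.zone -- nothing useful in it, so every lookup dies locally
$TTL 3600
@       IN  SOA  localhost. postmaster.localhost. (
                 2004052400 3600 1800 604800 3600 )
        IN  NS   localhost.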

This had the effect of completely dropping Example.com from the internet for the great majority of SBC users, without notice or explanation. This is similar to a PC user installing an entry for example.com in his hosts file, but on an ISP-wide basis for all customers.

This was handled atrociously even when viewed in a light most favorable to SBC, though the initial move itself is at least understandable: if the attack started late at night, it's entirely possible that the third shift was simply not equipped to respond properly. Faced with an attack affecting their customers, this very well may have been the only thing they knew to try. It certainly would have mitigated the attack, and it may not be so unreasonable on its face.

But when one drops a huge property off the internet in the middle of the night, even with good reason, one has a duty to attend to it properly once daylight rolls around. Perhaps there are no experts available at midnight, but there surely must be some at 9AM.

It's even worse than this. Once UltraDNS was able to track down the right people at SBC to discuss this, the fix still apparently waited for some DNS engineer to make it through town after being stuck in Northern California traffic. This accounted for almost an extra hour of hijack.

I have a customer with perhaps 200 employees, and he has at least two staffers who could stumble through an important DNS change of this nature. That SBC -- almost three orders of magnitude larger -- has only one engineer who can make DNS changes in an emergency seems very scary and/or lame to me.

I don't know in detail what technical mitigations they reasonably should have taken, but I can think of at least two candidates.

First, and most obviously, notify somebody at example.com that their domain has been taken down on the SBC network. Though SBC may not care all that much, Example.com certainly has an interest in helping SBC find a better mitigation than turning them into a black hole. Example.com wasn't complicit in any of these attacks: they merely had DNS data that amplified well.

Second would have been to provide a clue that they had done this, perhaps by changing the SOA record to something like:

ddos.mitigation. security.sbcglobal.net. 2004052400 3600 1800 604800 3600

It would at least have left some breadcrumbs for us to follow and spared us the horrible game of twenty questions needed to find the responsible party.
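
Had they done that, the very first diagnostic query would have told us who to call. Something like this, using the hypothetical values suggested above:

$ dig +short @dns1.lsanca.sbcglobal.net example.com soa
ddos.mitigation. security.sbcglobal.net. 2004052400 3600 1800 604800 3600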

It was only due to UltraDNS having excellent contacts at SBC (and/or some amazing persistence) that they were able to track this down on a Friday night. I'm reasonably sure that I would have gotten there eventually, but it would have certainly taken me much longer to find the right SBC people.

It's not out of the question that it wouldn't have been resolved until Monday morning: it's my opinion that UltraDNS saved Example.com's weekend.

The fix started phasing in at around 6:15 PM PDT Friday night, and the Example.com traffic graphs showed it; things were fully back to normal before 7 PM.

Countervailing Factors

In the main it seems clear that SBC handled this badly, but it would be unfair to omit countervailing factors that undoubtedly weighed into their decisions.

All network organizations have the right and the duty to protect their own infrastructure, and they are generally answerable only to their own customers. Myspace is not a customer of SBC, so one can make a principled case that SBC owed them no duty of any kind.

Furthermore, and more specifically, DNS blackholing is common practice in a few circumstances.

Summary and Conclusions

Though SBC wasn't responsible for the unknown parties who started this mess with the DoS attack on their servers, it's hard to make the case that they weren't negligent, or at least incompetent, in their response. To silently take down a major internet property with no notice or warning -- even after the fact -- seems patently irresponsible.

This is particularly disappointing considering that Pac*Bell used to have a fantastic DNS department: even today, they still delegate inverse DNS for something as small as a static /29 DSL network. This has been wonderfully helpful.
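
For the curious, delegating reverse DNS for a block that small is usually done RFC 2317-style, roughly like this, with made-up addresses and names standing in for the real ones:

; In the ISP's 113.0.203.in-addr.arpa zone, hand the 203.0.113.40/29
; block to the customer's own nameserver:
40/29   IN  NS     ns1.customer.example.
41      IN  CNAME  41.40/29.113.0.203.in-addr.arpa.
42      IN  CNAME  42.40/29.113.0.203.in-addr.arpa.
; ... and so on through .46; the customer then serves the
; 40/29.113.0.203.in-addr.arpa zone with the actual PTR records.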

It's also not clear what Example.com could have done to mitigate this, either before the fact or after. They had a robust DNS architecture that worked flawlessly throughout the entire episode, and since they did not appear to be a DoS target themselves, they wouldn't even have known that an attack was part of the picture.

There is also no substitute for working with very strong technical people: the folks at UltraDNS could not have been more responsive, helpful, or resourceful. They asked the right questions, maintained a sense of urgency appropriate for the customer's concern, and never seemed out of ideas. Two big thumbs up for UltraDNS.

We all hope this doesn't happen again.

First published: 4 June 2005