All the Perl that's Practical to Extract and Report


Matts (1087)

I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Friday May 24, 2002
08:53 AM

Nightmare

[ #5199 ]

I'm glad I don't work in tech support.

Yesterday our tech support team had about 700 calls in the queue. That's more than we get in a typical month. Customers even made some people on the helpdesk cry (admittedly they were girls ;-)

What led up to this? Well, basically, our system ground to a halt. When you process over 6 million emails a day, grinding to a halt is not good.

The problem was fairly simple: one of our upstream providers (we buy three pipes for resilience; I won't name any names here, since we don't seem to be naming them in our press release) had misconfigured a router. The misconfiguration caused all connections to succeed, but then hang almost immediately. Unfortunately our connections weren't timing out until 20 minutes (!!!) had passed, so a huge queue built up, and eventually we ran out of open filehandles. The lower-level routing protocols couldn't see that there was a problem (packets were getting through most of the time, just not all of the time, so it looked like ordinary packet loss), so they didn't re-route automatically.
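The timeout problem described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not our actual code (which is Perl and C), and the 30-second figure is an assumption, not a real value from our system: the idea is to bound both the connect and every subsequent read with a timeout measured in seconds, so a connection that succeeds but then hangs releases its filehandle quickly instead of pinning it for 20 minutes.

```python
import socket

# Hypothetical value: the point is that waiting 20 minutes before
# giving up lets hung connections exhaust the filehandle pool.
SMTP_TIMEOUT = 30  # seconds

def open_smtp(host, port=25, timeout=SMTP_TIMEOUT):
    """Open a TCP connection that gives up quickly on a hung route."""
    # create_connection() applies the timeout to the connect itself...
    s = socket.create_connection((host, port), timeout=timeout)
    # ...and settimeout() bounds every later read/write on the socket,
    # so a peer that accepts the connection but never responds is
    # dropped fast rather than holding a filehandle open indefinitely.
    s.settimeout(timeout)
    return s
```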

Failures left, right and center resulted.

Of course, while the problem was simple, finding where it actually was was hard. Having just done a major network upgrade and a major software upgrade, we naturally blamed ourselves. We rolled everything back, until eight hours later we were still scratching our heads. We increased the number of available filehandles, but that just let more connections open, and eventually we ran out of memory.

Finally, someone mistyped a traceroute, which revealed where the problem was. Within 5 minutes the problem router was fixed. In 90 seconds everything started rushing through, like water through a freshly un-kinked pipe.

At 1am this morning I received an email sent by my co-worker at 12:31pm.

What lessons can we learn from this, and how can we prevent it from happening again? As yet we're not entirely sure. We're not exactly wet behind the ears in networking and email, but we're still trying to figure out exactly what we can do to prevent a repeat. I think the best we can realistically hope for is better parsing of our email log files, ensuring that anything coming out on stderr gets scanned for serious problems, plus constant monitoring of things like open filehandles and memory usage. I think we could also do with a fairly radical redesign of some of our stuff, but I probably won't be involved in the talks on how to resolve this...
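The filehandle-monitoring idea could be sketched like this. Again a hedged illustration in Python, not our code: the helper names and the 80% threshold are mine, and the /proc lookup is Linux-specific. It compares the process's open filehandles against its rlimit, so an alarm fires well before the pool is exhausted.

```python
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in /proc/self/fd is one open file descriptor.
    open_fds = len(os.listdir('/proc/self/fd'))
    return open_fds, soft

def fd_alarm(threshold=0.8):
    """True once open filehandles pass a fraction of the limit,
    leaving time to act before connections start failing outright."""
    used, limit = fd_usage()
    return used / limit >= threshold
```

A monitoring loop calling something like fd_alarm() every few seconds would have flagged the filehandle exhaustion long before the queues backed up.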

Strangely, The Register were unusually kind to us about it.

All in all a very stressful day for a lot of people here yesterday (though not really too bad for me, because I wasn't in the loop).

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • huh? :-)
    --

    -- ask bjoern hansen [askbjoernhansen.com], !try; do();

    • We do have multiple datacenters, but unfortunately we have clients X, Y and Z on one rack, and clients H, J and K on another. Switching them over between racks (or between datacenters) is never particularly easy, because they only accept email from their particular rack. We realise now we need to tell them ahead of time what their failover rack is, and plan to be able to move them over. But moving 500 customers over from one rack to another isn't going to be plain sailing either ;-) Updating MX records takes time, too.

      • uhmn, your default setup should be to tell them to accept mail from y and x and then have their mx records point to both.

        mail is so easy to balance and failover; it's all built in! :-)
        --

        -- ask bjoern hansen [askbjoernhansen.com], !try; do();
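The dual-MX setup being suggested could look something like this hypothetical zone fragment (the domain and hostnames are illustrative, not real infrastructure): two equal-preference MX records, so sending MTAs pick either rack and automatically retry the other if one is unreachable.

```
; hypothetical zone fragment: both racks advertised at equal
; preference, so senders fall back to the other automatically
example-client.com.   IN  MX  10  rack-x.scanner.example.
example-client.com.   IN  MX  10  rack-y.scanner.example.
```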

        • I'm not entirely sure that would work... Their route to our servers might not have been through the failing provider, so we would have already accepted their mail. It was the outgoing that was failing.

          • well, if you had another datacenter where the customers could get processed, you could just have shut the one that didn't work down until you figured it out. :-)
            --

            -- ask bjoern hansen [askbjoernhansen.com], !try; do();

            • Shutting down a tower with 2 million emails in its queue waiting to be delivered is simply not an option - you have to remember this isn't our email, but our customers', and we're talking big clients like the UK government.

              Really, I don't think there was any way around what happened, except for better early warning systems about which pipes have gone tits up.

              Of course once we get that in place some other thing will fail and we'll spend a day trying to figure that out too. That's just life in the high sca