I'm glad I don't work in tech support.
Yesterday our tech support team had about 700 calls in the queue. That's more than we get in a typical month. Customers even made some people on helpdesk cry (admittedly they were girls).
What led up to this? Well, basically our system ground to a halt. When you process over 6 million emails a day, grinding to a halt is not good.
The problem was fairly simple: one of our upstream providers (we buy three pipes for resilience, and I won't name any names here since we don't seem to be naming them in our press release) had misconfigured a router. This misconfiguration caused all connections to succeed, but then hang almost immediately. Unfortunately our connections were only timing out after 20 minutes (!!!), so a huge queue built up, and eventually we ran out of open filehandles. The lower level routing protocols couldn't see that there was a problem (packets were getting through most of the time, just not all of the time, so it looked like regular packet loss), so they didn't re-route automatically.
Failures left, right and center resulted.
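For the curious, here's roughly what a much shorter timeout looks like in code. This is only a sketch in Python, not what our mail software actually does: the function name `read_greeting` and the 30 second default are my own inventions for illustration. The point is that the timeout has to cover the read as well as the connect, because in our case the connect itself succeeded just fine.

```python
import socket


def read_greeting(host, port, timeout_s=30.0):
    """Connect and read the server's first bytes, giving up after timeout_s.

    Returns the bytes received, or None if the connection attempt or the
    read timed out (for example, the peer accepted the connection and
    then went silent). The name and the default timeout are hypothetical.
    """
    try:
        # The timeout given here applies to the connect and stays set on
        # the socket, so the recv() below is bounded by it as well.
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            return sock.recv(512)
    except socket.timeout:
        # Either the connect or the read took longer than timeout_s.
        return None
```

A caller that gets `None` back can close up and requeue the message straight away, instead of holding a filehandle open for 20 minutes.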
Of course, while the problem was simple, finding where it actually was turned out to be hard. Having just done a major network upgrade and a major software upgrade, we of course blamed ourselves. We rolled everything back, and 8 hours later we were still scratching our heads. We increased the number of available filehandles, but that just let more connections open, and eventually we ran out of memory.
Finally, someone mistyped a traceroute, which revealed where the problem was. Within 5 minutes the problem router was fixed. In 90 seconds everything started rushing through, like water through a freshly un-kinked pipe.
At 1am this morning I received an email sent by my co-worker at 12:31pm.
What lessons can we learn from this, and how can we prevent it from happening again? As yet we're not entirely sure. We're not exactly wet behind the ears in networking and email, but we're still trying to figure out exactly what we can do to prevent this from happening in the future. I think the best we can really hope for is better parsing of our email log files, to ensure that anything coming out on stderr gets parsed for serious problems, and also to ensure we run constant monitoring of things like open filehandles and memory usage. I think we could also do with a fairly radical redesign of some of our stuff, but I probably won't be involved in the talks on how to resolve this...
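As an idea of what the filehandle monitoring might look like, here's a small Python sketch. It is an assumption on my part rather than anything we run: the names `fd_usage` and `fd_alert` and the 80% threshold are made up, and it only works on Unix-like systems that expose `/proc/self/fd` or `/dev/fd`. It checks the current process only, so a real monitor would need to look at the mail processes themselves.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Unix only).

    The count is approximate: listing the directory briefly opens one
    extra descriptor, which is included in the total.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            open_fds = len(os.listdir(fd_dir))
            break
    else:
        raise RuntimeError("no fd directory found on this platform")
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


def fd_alert(open_fds, soft_limit, threshold=0.8):
    """True when open descriptors reach threshold * soft_limit."""
    if soft_limit == resource.RLIM_INFINITY:
        # No limit configured, so there is nothing to run out of.
        return False
    return open_fds >= threshold * soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    if fd_alert(used, limit):
        print(f"WARNING: {used} of {limit} filehandles in use")
```

Run from cron every minute or so, something like this would have started shouting long before the queue hit the limit.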
Strangely, The Register were unusually kind to us about it.
All in all a very stressful day for a lot of people here yesterday (though not really too bad for me, because I wasn't in the loop).