Some day I need to write up a "here's something cool we did with AxKit/Perl" about MessageLabs' spam quarantine system (called "Spam Manager"). But suffice it to say for now that we did it with AxKit and Perl, and it is very cool.
However, we're experiencing some very "interesting" load problems with it. We're seeing the load go up on some servers while the CPU usage sits at no more than 5%.
Of course load average isn't tied to CPU usage, but most people only see the load go above 1.0 when their CPU is fully utilised. Load is a measure of runnable processes - although that's terribly poorly explained practically everywhere. (I did find a good explanation of it but lost the link, so if you don't know what load average really means I can't help you.)
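One way to see what is feeding the load figure is to compare it with a count of process states. This is only a rough sketch, and it assumes the servers run Linux (the post doesn't say), where the load average also counts tasks in uninterruptible sleep (state D) as well as runnable ones (state R). The helper names `parse_state`, `count_states` and `read_stat_lines` are made up for illustration.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(stat_lines):
    """Tally process states (R, S, D, ...) across a set of stat lines."""
    return Counter(parse_state(line) for line in stat_lines)


def read_stat_lines(proc_root="/proc"):
    """Yield the contents of every /proc/<pid>/stat file (Linux only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                yield fh.read()
        except OSError:
            continue  # the process exited while we were looking


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()
    states = count_states(read_stat_lines())
    print("load average (1 min): %.2f on %s CPUs" % (one_min, os.cpu_count()))
    print("runnable (R): %d, uninterruptible (D): %d" % (states["R"], states["D"]))
```

If the D count is high while CPU usage is low, that would point towards processes waiting on I/O rather than on the CPU, though a single snapshot like this can easily miss short-lived spikes.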
So this is something to do with the kernel not being able to context switch in processes fast enough. Usually we've managed to tie this down to bad duplex settings on the network interface (half duplex instead of full duplex). However recently we've seen the problem again with the network interface being just fine.
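For the duplex case, a quick check can at least rule it in or out across a batch of machines. The sketch below again assumes Linux, where each interface exposes its negotiated duplex setting in `/sys/class/net/<iface>/duplex`. The function names are hypothetical, and interfaces that don't report a duplex value (loopback, or a link that is down) are simply skipped.

```python
import os


def read_duplex(iface, sys_root="/sys/class/net"):
    """Return 'full', 'half', etc. for an interface, or None if unavailable."""
    path = os.path.join(sys_root, iface, "duplex")
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        # No duplex file, or the kernel refuses to report it for this link.
        return None


def half_duplex_interfaces(sys_root="/sys/class/net"):
    """List the interfaces currently reporting half duplex."""
    try:
        names = sorted(os.listdir(sys_root))
    except OSError:
        return []  # not a Linux box, or sysfs is not mounted
    return [name for name in names if read_duplex(name, sys_root) == "half"]


if __name__ == "__main__":
    suspects = half_duplex_interfaces()
    if suspects:
        print("half duplex: " + ", ".join(suspects))
    else:
        print("no interface reports half duplex")
```

Running this from cron and logging the result alongside the load average would show whether the load spikes line up with a renegotiated link or not.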
Debugging this is practically impossible - it's not repeatable and I can't isolate it. I welcome any tips from anyone here who has experience with this. My next port of call is to look at the SQL Server that the box is connected to and see if that has any relevance, but I don't hold out much hope of finding anything. I'm kinda stuck.