We've been on this generation of server for many months, running the same codebase. Not the last generation, the current one. Is it reasonable that a codebase that has been working for months suddenly stops working, and that every single site on both dedicated servers suddenly stops working properly? How is this reasonable?
Now that the situation has been more or less resolved, the main issue we have is how the support was handled on this incident.
Specifically, we contacted level 1 support immediately (8am your time), and the tech told us he would create a ticket on our behalf and escalate it, seeing as all our sites were down. We heard nothing for an hour. We put in a high priority ticket at 9am your time. We spoke to five separate level 1 customer support people, in addition to the ticket, trying to get someone to help us with this situation. The first time you saw/touched the ticket was nearly 5.5 hours after we noticed there was a problem and 4.5 hours after the high priority ticket.
The first email we got from you was asking about account validation. Perhaps this could have happened at level 1 support within the first 5 hours, or at the initial contact we had at 8am your time. Then came the email you sent me asking about the "specific problem", which I had already outlined in detail in the ticket, with examples of how to make the problem appear and disappear using specific URLs, showing how the parameters caused mod_security to give us a 403 Forbidden error.
The email I got back had this paragraph:
Before I can add exceptions for the rest of the accounts, I will need to also obtain error messages for every domain that is experiencing this problem. To do so, I will need to replicate the problem. In order for me to replicate the problem, I will need to get step-by-step instructions on how to recreate the problem, including any required login information, for each, and every, domain that is experiencing these issues.
I caused enough of an issue with Steve (level 1 tech) that he patched me directly through to you on the phone, because I had thought I'd outlined the issue very explicitly. I can post the entire ticket with the timestamps here, including our correspondence, if you wish.
In our phone call, you reiterated that you needed to know the exact nature of the problem, and that I would have to supply the username/password for every domain affected (even though it was all of them) and give specific examples of each problem on each domain. I explained that it was in the ticket, and on the phone we went through it so that you understood the problem. Your quote to me on the phone was "this is on every domain?", which was CLEARLY indicated in the ticket hours earlier, SEVERAL TIMES, prior to our phone conversation.
If I did not have experience dealing with this specific issue before, I don't know how long this resolution would have taken. I had to explain it to each of the Level 1 techs, because each tech goes through the same exact script that I had gone through 15 minutes earlier with the previous tech (aside from Steve, who is your best IMHO). I also had to explain the issue to you, because it wasn't an error per se but the result of the mod_security ruleset.
After agreeing on a course of action, you said it would take 1.5-2 hours to go through every domain to make the exceptions, unless you were interrupted by a server going down or something similar, to which I replied, "What do you think we are going through now? ALL OUR DOMAINS ARE DOWN, how is this not the same as a server going down?"
A few minutes after our conversation was over, as I was watching the ticket, it was marked RESOLVED. The sites were still all down. I called customer service and pretty much went off the chain, and they said they messaged you. You messaged us back saying it was in error and you had closed the wrong ticket. Okay, I understand that happens, but it didn't help the situation.
At approximately 8:45pm your time (just under 12 hours from our high priority ticket being submitted) the domains were back online. (And thank you for being prompt once we actually got the issue in front of you, Reed.)
So after causing hell on earth for our own customer service department, losing customers (thankfully we may be able to recover these by explaining the scenario today), I'm hearing you say that it's basically our fault and Westhost is blameless in this situation.
We have been customers of Westhost since 2004. We've suffered through the outage fiasco in 2010 where everything went down horribly for all Westhost customers for 48+ hours. It's not like we aren't loyal customers.
I understand things go wrong from time to time, but seriously this "customer support" that should be coming with "managed servers" has left us faithless. Dedicated server customers are treated no differently from shared hosting customers. We are paying 1.5x to 2x what other webhosts (who are higher ranked in several metrics) are charging for the same dedicated server specification (or better).
I'll end this by saying we went to you guys because you used to be at the top of the game. When we first started with you, the customer support was immediate and FAST. In 2006 you were in the top 10 (maybe top 5) of every webhost ranking. You won awards for several years after that. You famously advertised that Netcraft ranked you #1, I believe in 2008. Today you aren't even on the lists, including Netcraft's. Whether this is because your competitors outpaced you or because you slipped due to transitions/buyout/whatever, it should be clear that this isolated incident isn't the reason.