How Is Your Customer Support?

Here's an excerpt from a customer support thread for a new customer.

After getting their staging slice set up, they ran some load tests and came back with questions. They're planning to add a two-slice production environment as soon as they feel ready to launch.

I'm posting this here because, increasingly, it's clear to us that this is why our customers are very excited about Engine Yard! They're paying for many things, including reliability and scalability, but once they sign up, it's the customer support that blows them away!

Does your Ruby on Rails hosting company provide this level of support?

I tried out Jakarta JMeter (b/c httperf doesn't compile easily on WinXP) and it worked pretty well for delivering basic http performance statistics (pure Java with very simple installer). I monitored VM and ps aux on the box, while it ran and here's what I noticed:

This sounds good.

1) Only minor VM swapping (on a couple of heavily overloaded tests, see below)

Those are only page-ins, which probably means the code path required something off of disk.

Whenever Linux needs to load code into memory from disk, it's a page in, and cannot be avoided other than via caching if it's already been loaded before.

To that extent, page outs are "worse" than page ins, because a page out means the system decided to free up some memory. Interestingly, we've found that Linux is very aggressive about freeing up memory, much more so than a few years ago, and it will occasionally page out a process that isn't getting used even when it has A LOT of free RAM, so even page outs are not necessarily evil.

What is evil is a relentless stream of page ins and page outs. If and when this happens, it'll generally result in nasty performance that is immediately obvious.
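As a rough illustration of the distinction above, here is a minimal Python sketch that flags only the "relentless stream" case: several consecutive sampling intervals in which both page-ins and page-outs are high. The function name, the threshold of 100 pages per interval, and the run length of 5 are all assumptions made for the example, not values we use in production; the samples could come from any tool that reports per-interval paging counts.

```python
def is_thrashing(samples, threshold=100, min_run=5):
    """Return True if paging looks sustained in both directions.

    samples   -- list of (pages_in, pages_out) tuples, one per interval
    threshold -- hypothetical per-interval page count considered "high"
    min_run   -- hypothetical number of consecutive high intervals required
    """
    run = 0
    for pages_in, pages_out in samples:
        # Page-ins alone (code loaded from disk) or an occasional page-out
        # are normal; only count intervals where both are high together.
        if pages_in > threshold and pages_out > threshold:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False


# Page-ins only, as in the load test described here: not thrashing.
print(is_thrashing([(500, 0)] * 10))
# Heavy paging in both directions for five intervals in a row: thrashing.
print(is_thrashing([(500, 400)] * 5))
```

The thresholds would need tuning per machine; the point is only that one-directional or occasional paging should not trip the alarm.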

2) mongrels grow to 100MB total but RAM stays at around 65MB regardless of load

Congratulations! You're in a pretty small group of applications we host that don't appear to leak memory. :-)

3) Response times become unacceptable (>1 sec / page) at around 30-40 users (all hitting the same 5 page sequence, so not exactly a great test but lots of various DB activity though always for the same recordset).

That sounds very good to me, particularly on a single slice. I generally tell people to expect around 38 requests/sec/slice for very lightweight pages (i.e. session + rhtml render), which scales very well with multiple slices.
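To make that rule of thumb concrete, here is a small Python sketch that estimates how many slices a target request rate would need, assuming the roughly 38 requests/sec/slice figure for very lightweight pages and assuming perfectly linear scaling across slices. The function name is made up for the example, and heavier pages will see a lower per-slice rate.

```python
import math


def slices_needed(target_rps, per_slice_rps=38):
    """Estimate slices for a target request rate, assuming linear scaling.

    per_slice_rps defaults to the rough figure for very lightweight pages;
    pass a measured value for a real application.
    """
    if per_slice_rps <= 0:
        raise ValueError("per_slice_rps must be positive")
    # Round up: a fraction of a slice still means one more slice.
    return math.ceil(target_rps / per_slice_rps)


for target in (38, 76, 100):
    print(target, "req/sec ->", slices_needed(target), "slice(s)")
```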

One thing that strikes me as slightly odd is your use of the term "users", which is rarely used by benchmarking software. It's highly unlikely, in my experience, that the sort of performance you're talking about could only handle 30-40 true website users, because website users don't typically POUND the system like a performance testing tool does!

It's possible that JMeter does attempt to simulate users by introducing delays between the requests, but most performance tools would discuss concurrent requests, not "users".

If JMeter was really producing 30-40 concurrent requests as fast as possible, and you received sub-second response times, you can generally support somewhere between 8 and 32 times that number of simultaneous flesh-and-blood users on that same workload.
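The arithmetic behind that estimate is simple enough to sketch in Python. The 8x and 32x multipliers are the rule of thumb from the paragraph above, not a measured property of any particular site, and the function name is invented for the example.

```python
def estimated_real_users(concurrent_requests, low_factor=8, high_factor=32):
    """Return a (low, high) range of simultaneous human users.

    Applies the rough 8x-32x rule of thumb to a count of back-to-back
    concurrent requests that still got sub-second response times.
    """
    return (concurrent_requests * low_factor,
            concurrent_requests * high_factor)


for concurrent in (30, 40):
    low, high = estimated_real_users(concurrent)
    print(f"{concurrent} concurrent requests -> roughly {low} to {high} users")
```

For the 30-40 concurrent requests in this test, that works out to somewhere between 240 and 1280 simultaneous users, provided JMeter really was sending requests with no delay between them.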

4) CPU generally gets pegged completely during peak performance (so it is apparently the bottleneck)

4a) I think it's good news that CPU is the throttle b/c that means more slices = more concurrent users?

Yes, and that's interesting too, because most applications seem to limit out elsewhere, likely in the DB. This could be because Pg is a killer DB, or because we have very few (perhaps no) Pg users in production, so the DB is very lightly loaded.

Your performance should scale very linearly per slice, and we should also see very good results "fattening" slices. We've recently started adding RAM and CPU allotment to existing slices in times of sudden traffic upswings (Digg, Slashdot, etc.) because it's very easy and fast to do so.

We generally recommend this after having 4 true slices, in a case where traffic grows linearly, but we can do it even on a single slice layout. We give you enough memory to double your mongrels, and twice the CPU allotment. This can be accomplished in a few minutes without taking the site offline if you have more than one slice, so this works out really well.

It was a very hodgepodge test but I did run things up from 10 users to 120 users and there were never any errors reported (every request was handled but at > ~40 users the response times began to get very large -> 5-15 sec / page at peak depending on load).

I'm very pleased to hear that!

Thanks also for the tip on up-time monitoring services - we'll investigate those.

No problem. We're here to help, always!