The past couple of weeks of testing and investigation have revealed some interesting things about the Mozilla Developer Center and how we use the MindTouch software on it. I thought I’d share some of what we’ve learned, and what’s being done about it.
The MindTouch software responds to each request for an RSS feed by scouring the database and generating the results on the fly. For small, low-traffic sites, which have few RSS users, that works okay. But MDC has a lot of users, many of whom watch RSS feeds. Generating the results for an RSS feed hit takes time and a lot of database hits.
MindTouch has developed a patch, which we’ve been testing, to enable caching of RSS feeds. The feeds will be cached for some amount of time; the exact duration is still being experimented with, but we’ll be able to change it later. While this means feed updates will show up a little less promptly, it will dramatically reduce load on the site.
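To make the idea concrete, here’s a minimal sketch of time-based feed caching. This is not MindTouch’s actual patch; the FeedCache class, the build_feed callback, and the ttl_seconds value are all made up for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FeedCache:
    """Serve a cached copy of a feed until it is older than ttl_seconds.

    Hypothetical illustration only; not the MindTouch implementation.
    """

    def __init__(self, build_feed: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._build_feed = build_feed  # the expensive step that hits the database
        self._ttl = ttl_seconds        # assumed to be tunable after deployment
        self._clock = clock            # injectable so the cache is easy to test
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, feed_name: str) -> str:
        now = self._clock()
        entry = self._entries.get(feed_name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # still fresh: no database work at all
        body = self._build_feed(feed_name)
        self._entries[feed_name] = (now, body)
        return body
```

The tradeoff is the one described above: a longer ttl_seconds means fewer rebuilds but staler feeds.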
We make extensive use of sometimes complex templates on MDC. For the uninitiated, templates are embeddable snippets that can include script code to generate content on the fly. We have templates that are very simple, generating a quick string based on a parameter, but we also have some that are very involved, doing lookups of potential target pages to generate valid links depending on the availability of different destinations.
Some of our pages use a lot of templates. For example, let’s consider the XUL prefpane element’s document. It directly uses 37 templates, each of which in turn uses others. As a result, on the current MDC site, this page requires (wait for it) over 11,000 database queries to render.
Let me say that again.
Over 11,000 database queries. Now, it’s important to note here that our database server doesn’t care. It’s hardly noticing any load at all. The problem appears to be in the overhead required to handle setting up and issuing the queries, and waiting for the results.
However, with just a handful of pages being loaded at once, the number of connections to the database server can get quite large, and the site is currently not configured to allow more than 50 at a time. So it only takes a few pages loading at once before you start getting “unable to access database” errors. This is also why the site was crashing: when the available connections ran out, pending queries would stack up at a rapid pace until finally the site would just die.
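Here’s a rough sketch of how a fixed connection limit runs out. The limit of 50 comes from our configuration, but the number of connections a single page render holds at once is purely an assumption for illustration, and the ConnectionPool class is hypothetical rather than anything in MindTouch.

```python
import threading


class PoolExhausted(Exception):
    """Raised when every connection slot is already in use."""


class ConnectionPool:
    """Hypothetical fixed-size pool that refuses requests once it is full."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    def acquire(self) -> None:
        # Non-blocking: fail right away instead of waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            raise PoolExhausted("unable to access database")

    def release(self) -> None:
        self._slots.release()


def pages_until_exhausted(pool_size: int, connections_per_page: int) -> int:
    """Count how many simultaneous page renders fit before the pool runs out.

    connections_per_page is an assumed figure, not a measured one.
    """
    pool = ConnectionPool(pool_size)
    pages = 0
    while True:
        try:
            for _ in range(connections_per_page):
                pool.acquire()  # held for the whole render, never released here
        except PoolExhausted:
            return pages
        pages += 1
```

If each render held, say, 12 connections, pages_until_exhausted(50, 12) shows that only four pages fit before the fifth one fails.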
Obviously, this is totally uncool. I don’t think MindTouch ever realized just how much use people would get out of templates.
So what’s being done about it? A few things.
First, MindTouch has created a patch for us that caches the results of database queries for a few minutes. This reduces the number of queries required to render the prefpane page by a couple of orders of magnitude.
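The reason caching helps so much is that nested templates tend to ask the same questions over and over. The sketch below is only an illustration of that effect, not the actual patch: the CachingDatabase class, the pages table, and the counts in the example are hypothetical, and the few-minute expiry is left out to keep it short.

```python
from typing import Callable, Dict, List


class CachingDatabase:
    """Hypothetical wrapper that remembers the result of each distinct query."""

    def __init__(self, run_query: Callable[[str], List[str]]) -> None:
        self._run_query = run_query
        self._results: Dict[str, List[str]] = {}
        self.queries_issued = 0  # how many queries actually reached the database

    def query(self, sql: str) -> List[str]:
        if sql not in self._results:
            self.queries_issued += 1
            self._results[sql] = self._run_query(sql)
        return self._results[sql]


def render_page(db: CachingDatabase, template_uses: int,
                lookups_per_template: int) -> None:
    """Simulate templates that each repeat the same set of page lookups."""
    for _ in range(template_uses):
        for page_id in range(lookups_per_template):
            # Hypothetical table and column names.
            db.query(f"SELECT title FROM pages WHERE id = {page_id}")
```

With 37 template uses that each make the same 300 lookups, the uncached count would be 11,100 queries, while the cached version only sends 300 to the database. The real savings depend on how much the templates’ queries actually overlap.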
Second, we’ll be increasing the number of connections to the database server that can be open at once, by a sizable amount.
Third, we’ll be enabling caching of page diffs. I’m not sure how much win we’ll get from this, but every little bit will help.
Now, to be honest, I’m not sure how much of a performance gain we’ll get out of these changes, but they should dramatically improve reliability, at least.
There is another thing we’re investigating, but it may or may not be implemented due to complexities in terms of site management.
MindTouch offers a caching service that allows templates to cache data for re-use later. Using this would require rewriting a lot of our templates, and possibly even significant revisions to the pages that use them. We’re looking at how difficult this would be, as well as how much performance gain we might get from doing it.
Basically, what it comes down to is this: processing templates is taking a surprising amount of CPU time, and we use a lot of them, resulting in the server becoming overloaded very quickly.
Upgrading to 9.12.2 — Again
We’re planning to attempt this upgrade again on Tuesday, May 18th, at 4 PM Pacific Daylight Time. This time, we’ve done a substantial amount of testing around the areas that failed on us last time, and MindTouch personnel will be available to help if anything does go wrong.
The failure of the last upgrade was caused by the database connection pool configuration issue I mentioned above. We have confidence that the upgrade will work better this time.
We’ll also be increasing the number of hosts driving MDC, although I don’t know what the schedule is for this. Obviously, this will help too.
To be clear: I don’t expect MDC to suddenly become amazingly fast and responsive next Tuesday evening after the upgrade and all these tweaks are put into place. I do expect that it will become more stable, however, and performance will hopefully be better to some extent.
There’s ongoing work to be done to further improve performance, and it’s too early to say exactly when that will all happen.
I know this has all been frustrating. Trust me, I’m vastly more annoyed by it all than you are. Please be nice to me in your comments. I don’t really feel like being yelled at anymore about this. :)