Sep 09 2011

Yesterday, we made substantial progress on fixing the problems we’ve been having with the Mozilla Developer Network wiki. By “we,” of course, I mean our support rep at MindTouch, Brian, and the IT guy at Mozilla whose misfortune it is to have to deal with my constant whining, Jake Maul.

Problem #1: Syntax Errors

First off, Brian figured out why the extensions were failing to have the correct preferences at startup. It turns out that a file permissions problem was preventing this from working properly. That has been fixed, so we should no longer see the syntax errors in pages that were caused when extensions weren’t running properly. This was most commonly seen with the syntax highlighter extension, but it also happened with several other, less frequently used extensions that relied on preferences being configured in order to work.

Problem #2: Stability

Second, and arguably more exciting: Brian figured out why the site has been crashing and requiring frequent restarts in order to keep running. To understand the fix, let’s look at how the site has been configured to date.

To handle all our traffic, devmo runs on three nodes, each running its own copy of the MindTouch software. That includes the Lucene indexer, which builds the search indexes of the database. To keep everything synchronized, all file data was kept on a single NFS share.

And therein lay the problem: we had three MindTouch instances all trying to work with the same Lucene index at the same time. The result: contention, and lots of it, especially when they all tried to change the index at once. Collisions caused delays, delays caused requests to pile up, and eventually either the software would crash or memory use would build to the point that the processes had to be killed and restarted just to clear the bottleneck.
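
To make the contention concrete, here’s a minimal Java sketch of what happens when two processes try to write the same Lucene index. This uses stock Lucene classes rather than MindTouch’s actual indexer code, and the /mnt/nfs path is just a stand-in. Lucene allows only a single IndexWriter per index directory; on local disk a second writer fails fast, but over NFS, where file locking is less dependable, writers can instead stall or step on each other, which is exactly the kind of pile-up described above.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

import java.nio.file.Paths;

public class ContentionDemo {
    public static void main(String[] args) throws Exception {
        // Both "nodes" point at the same index directory, as on the shared NFS mount.
        FSDirectory shared = FSDirectory.open(Paths.get("/mnt/nfs/lucene-index"));

        // The first writer acquires Lucene's write.lock for the directory.
        IndexWriter first = new IndexWriter(shared, new IndexWriterConfig(new StandardAnalyzer()));

        try {
            // A second writer on the same directory -- what each extra node
            // effectively attempted -- cannot obtain the lock.
            IndexWriter second = new IndexWriter(shared, new IndexWriterConfig(new StandardAnalyzer()));
        } catch (LockObtainFailedException e) {
            System.out.println("Second writer blocked: " + e.getMessage());
        } finally {
            first.close();
        }
    }
}
```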

A Trial Fix

So last night, Brian and Jake worked out a trial fix to see what would happen: they took one of the three hosts out of the load balancer’s rotation, leaving it to index the wiki’s content without receiving any web traffic, and then turned off the indexing process on the other two hosts. This, in essence, turned one host into a dedicated indexing box, while the other two machines’ Lucene requests read the same files on the NFS share that the dedicated indexing machine was updating.
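
For the curious, here’s roughly what that split looks like in plain Lucene terms: the dedicated box holds the one and only IndexWriter, while the web hosts just open read-only readers against the shared files and refresh them as new segments show up. This is a sketch assuming the MindTouch instances behave like ordinary Lucene readers and writers; the path is hypothetical.

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class WebHostSearcher {
    private DirectoryReader reader;

    public WebHostSearcher() throws Exception {
        // Web hosts only *read* the index files the dedicated box keeps updated.
        reader = DirectoryReader.open(FSDirectory.open(Paths.get("/mnt/nfs/lucene-index")));
    }

    public IndexSearcher searcher() throws Exception {
        // Pick up whatever the indexing host has committed since we last looked;
        // openIfChanged returns null when nothing new has been written.
        DirectoryReader newer = DirectoryReader.openIfChanged(reader);
        if (newer != null) {
            reader.close();
            reader = newer;
        }
        return new IndexSearcher(reader);
    }
}
```

Because only one process ever holds the write lock, the readers never fight the writer for it; they simply see new commits after the fact.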

This was deployed yesterday evening. Since then, there have been no crashes or automated restarts due to memory usage. Previously, we had frequent restarts when memory use exceeded 700 MB; now, memory usage hasn’t exceeded 400 MB at any time.

This is fantastic news, and bodes well for the long-term fix I will describe momentarily.

A Next Step

Over the next few minutes, as I write this post, Jake and Brian are adding the third machine back to the load balancer pool, so that it will receive web traffic in addition to handling all Lucene indexing work. This should improve responsiveness slightly beyond where we are right now by distributing load further. This should be done by the time you read this post; in fact, as I wrote this paragraph, Jake said “Bringing it back up in the LB now.”

Some additional work is being done to verify the current state of the site to ensure that it will keep working smoothly while we work on the long term solution, which I will now describe.

The Long Term Fix

The long-term solution to this problem: a dedicated Lucene server. This machine will run its own MindTouch instance, but will get no web traffic. Instead, it will handle all API requests for Lucene activity for all three of the web hosts, and will host the Lucene data files on its own local disk instead of on an NFS share. This will have multiple benefits for us:

  1. Indexing of the data will be done by a single machine, preventing contention (this is temporarily solved by our current fix).
  2. Indexing will no longer be done by any of the web hosts, reducing load on those machines (this is partially fixed by our current solution, but one of our hosts is still doing indexing).
  3. All Lucene requests will be directed to a dedicated box, reducing Lucene-related load on our web hosts (this is not addressed at all by the current solution).
  4. With all Lucene processing handled by a dedicated box, there will be no contention at all for the Lucene data files (the current solution addresses this only for writing, and not at all for reading).

There are probably other benefits to this change, but these are the ones I’ve heard bandied about the most.
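
To give a rough feel for what this will mean for the web hosts, here’s a hedged Java sketch: instead of touching index files at all, a web host would hand each search off to the dedicated box over HTTP and get results back. The hostname, port, and endpoint below are made up for illustration; the real MindTouch API for this isn’t shown here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RemoteSearchClient {
    // Hypothetical address for the dedicated Lucene box.
    private static final String SEARCH_HOST = "http://lucene-box.internal:8081/search";

    private final HttpClient client = HttpClient.newHttpClient();

    public String search(String query) throws Exception {
        String url = SEARCH_HOST + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // The web host never opens the index files itself; the dedicated box does
        // all reading and writing against its local disk and returns results.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```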

In Summary

The take-away from all of this: devmo is currently more stable than it’s been in a long time, and is likely to improve further in both stability and performance as we continue to work on the long-term solution. In addition, there are other things we can likely do to improve performance that we’ll look at going forward.
