Yes, I Crashed the Site!

      Comments Off on Yes, I Crashed the Site!

Yesterday, I performed an upgrade to a third-party package used with Zend Developer Zone. It has an automated schema update system which silently performs actions on the database that had a large impact on ZDZ and related sites causing an outage. So, there are good lessons from my post-mortem that I would like to share with the community.

h2. The Start of the Problem

First, let us look at the actual list of actions that started the issue:

# The upgrade does a schema check on first load
# The upgrade then corrects the schema to be valid for the new release (performing table changes via DDL)
# The upgrade then may modify large amounts of existing data, or delete large amounts of old data

These schema and data updates can cause huge potential issues when the database and tables are used concurrently by the online site. First, the DDL changes will lock the affected tables. And for some storage engines (i.e. MyISAM in MySQL) the modifications or deletes will also cause table locks, and in other engines they could cause contention on locked rows, and in other engines cause things like rollback segments to overflow. The online site then waits behind the locks or contention causing threads in the web server to be held until no more threads are left to serve actual user requests. No more threads, no more site.

h2. Learning from the Problem

So here are some things to keep in mind about upgrade such as this, and about designing your site to mitigate the impact of dependencies — both internal and external:

bq. When practicing an upgrade, always log and monitor the impact on your resources.

When I practiced the upgrade on a test database, I did not monitor the SQL it performed but rather whether it worked or not given the same schema currently used in production. By recording the actual SQL performed during the upgrade, I could then do impact analysis and would have noticed the pattern that would have caused locking and decided on the right approach when applying the upgrade in production. Of course if my data wasn’t sized the same, I might not have seen the bulk delete later performed in production; so always size my test data to be similar and always plan for the worse in case it does behave differently.

bq. When writing an auto-update system in your software product, always show the impact analysis before you perform DDL or large scale data changes.

If your auto-update is going to do potentially life-threatening changes to a database, it is a friendly gesture to indicate this beforehand. “Hey! I’m about to lock a table and change its structure, are you ok with that? Here is the SQL for reference…” Most just say “please backup your database before hand, just in case” but that isn’t the same as telling me what it will do to the database when things go right.

It also doesn’t hurt to have a batch version of the upgrade, or even a simulation of the upgrade that outputs the expected changes and even allows them to be applied by hand; after possible tuning.

bq. Ensure that your cross-site dependencies are shielded.

When I dropped ZDZ off the face of the planet (well, I actually left it on the planet, but locked it up nice and tight and allowed nothing to use it) other sites went down that had hard dependencies to ZDZ. These other sites might be directly querying ZDZ frequently for data with online requests. It would be better to have them query via an *offline* process that then makes the most recent data available to the online process. The offline process should have a time-out so that it does not wait indefinitely. Unless they need live up-to-the-request-time data from ZDZ, this is a better model. If they want perfectly live data, then a client-side asynchronous request with reasonable time-out, or a server-side synchronous request with a very short time-out (falling back to cached data on failure) is an improvement as well.

It is also possible to have dependency “modes” where you track the state of a 3rd-party dependency and have a fallback position when that state becomes unusable. So by pinging, monitoring the performance of, or somehow knowing the health of the dependency I can cause my application to change when that dependency temporarily loses my trust. For example, loading data from ZDZ I notice that the average response time has gone way up, so I turn on an application state flag that says “ZDZ-NOT-AVAILABLE” and my code path changes for upcoming requests. My monitoring of ZDZ then later shows me that it is healthy and changes the state back to “ZDZ-AVAILABLE” and my code resumes its normal route of using the resources of ZDZ. This is great for things like a side-bar where I normally display a ZDZ feed, but when it becomes unavailable I then might temporarily show a feed from an alternative source or even Dilbert comic strips — and doing so instantly since the state is known at the start of the request.

In other words, don’t let other sites kill your own.

(_note: It is possible I killed at least one of the other sites due to it being in the same web server with the same thread pool rather than a dependency as stated above, read more about that below…_)

bq. Always serve your users, no matter what!

The actual consumption of threads was an asynchronous callback from the client after they receive the page. This callback has no time-outs and eventually due to blocking behind the database lock, just ate all of the available threads from the same web server which was serving all other user requests. In fact, it could be that other applications and sites are hosted on the same server with the same thread pool and those were also choked by this greedy consumer of all things thready.

Therefore it is best to have a thread pool per application, and within each application have threads dedicated to your main-line user requests different from the threads dealing with client-side callbacks different from threads for things such as REST or other API calls. Piling them together means they all live and die together. But do they really all have the same SLA and priority?

In the case of ZDZ it should have at least 4 thread pools:

# Main-line user requests (what you do when you go to the web page from your browser)
# AJAX style callbacks which are part of the application, but not as important as the main-line user request. Also I should note that for each client request I might have N callbacks, so I should have a proper ratio of main requests to AJAX requests.
# Stats and other tracking which is a lower priority request
# RSS and other feeds which are also a lower priority (and are typically cached anyway)

Then, I would evict whatever other apps that are in the same pool with ZDZ, telling them to go find their own pools to live in to isolate their performance from that of ZDZ. For both ZDZ’s benefit and their own.

bq. Timeouts should exist for everything.

There is never a case in ZDZ where we would want you to wait more than a few seconds (at the worse case) for a page. So why would we do anything that doesn’t have an emergency time-out? If we are taking too long to do an action, something is wrong and we should just return and tell you that something is wrong, or give you partial results (if available) rather than hurting the overall system. The contention caused by the pile-up can make matters worse, so release a bit of the contention by giving up in smart ways.

This includes async callbacks which can just fall back to some other action or display of secondary data, or ads, or whatever makes sense for the application.

bq. Monitor long-running database transactions and locks.

One of the early steps I do in a project, is setup monitoring for transactions and locks. Very early. As a developer I like to see when my online code does anything slow or contends with database locking. So monitoring and alerting when these occur is a very helpful step in ensuring that I detect these problems early, and fix them well in advance of production. And once in production, they help identify problems quickly since the monitoring is already in place.

Now since I have batch processes that may intentionally do long transactions and create locks (in ways that do not interfere with my online system); you will want to monitor differently for an online database user account than for an offline database user account. Oh, I probably just implied that different database accounts are useful and a good idea! And I also implied there is a lot to talk about with online and offline contention of shared resources, but that is out of scope for this article other than to say I could have disabled the online portion of this upgraded product to prevent it from contending with the backend update while that was occurring.

bq. Never poke a production problem randomly with a stick.

When this problem hit, the first call was for a rollback of the upgraded code. But, without knowing the exact problem you do not know if this actually solves anything. In our case, the locking database transaction was already running and unless killed (either directly in the database, or indirectly by terminating its client) the rollback wouldn’t save the day.

By determining the exact cause “A DDL command is causing a table lock on table X blocking all updates to table X from the client requests due to Y, eating all of the web server worker threads while being blocked” we found that we had these options:

* Terminate the long running command (causing the lock) and then do the code rollback if needed (and only if valid after partial database changes were made).
* Let the long running command finish, but in the meanwhile disable the process that was being blocked and eating threads (simple for this particular problem and since it isn’t a customer visible feature). This would let the upgrade complete while not blocking the online application, although it might have been too resource intensive on the database to continue the work.
* Help the bad command along: In our case it was on the “delete every one of the 9 million records in a table the hard way” stage, so stopping the delete and issuing a truncate table to do the same action almost instantly pushed it right along nicely.

The last option turned out to be the best, with the disabling of the process underway in parallel (just in case). Always do intentional fixes to known and understood problems. Panic is not a helpful process.

h2. Summary

Everything “learned” during this problem, were things we already knew. But they were not kept in mind at every stage of the way; both in development changes to our code over 4 years, and in handling the “simple” upgrade.

So while we look at the “revamp of ZDZ”:—Revamp-Rebuild-Redo-Re-Everything we will be sure to keep them all in mind and point out where we are taking them into consideration.

And you be sure to keep them in mind for your own applications and upgrades… (otherwise, feel free to post your lessons learned here so that others can evolve!)