Hi everyone,
I'd like to start by saying thank you to everyone who has been sticking with us through these performance issues, both on the server and on the client. We have been working around the clock for the past two days to get them resolved, and we've been making some excellent progress.
In this update, I'd like to focus on server performance. Today you should notice dramatically better server performance, as we have finally addressed several major issues. There are still client performance issues (bad frame rate, memory leaks, etc.), but we're making substantial progress there and hope to roll out fixes for those today too.
I would like to take you all through the four major server performance issues that we identified and fixed this morning so you all can understand what on earth has been going on and why server performance has been, frankly, garbage, and why we didn't catch these problems before EA release.
#1 - The Tease
When the game went live, right off the bat we saw that the server was scanning tens or even hundreds of millions of rows per second evaluating what we call "one-off" or "remote" queries. Remote queries don't subscribe to data; they only pull down the data once. It turns out that we were missing an optimization for two specific queries that we were running when players opened either the trade finder window or the waystone window. Two changes happened between the demo and EA that made this worse. First, the number of entities in each region increased by a factor of 2. Second, we added the trade finder window to the tutorial. Since our internal testing bots don't open UI or complete the tutorial, we did not see this in our testing. Fixing this improved performance, but did not substantially fix our issues.
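As a rough illustration of why a missing query optimization matters so much, here is a toy Python sketch. It is not the actual server code: the table layout, the `kind` column, and the function names are all hypothetical, and it assumes the missing optimization behaves like an index lookup, which is only one possibility. It compares how many rows a one-off query touches with and without an index.

```python
from collections import defaultdict

# Hypothetical table of (entity_id, region_id, kind) rows.
rows = [(i, i % 10, "waystone" if i % 1000 == 0 else "other")
        for i in range(100_000)]

def scan_query(rows, kind):
    """Unoptimized one-off query: inspects every row in the table."""
    scanned = 0
    hits = []
    for row in rows:
        scanned += 1
        if row[2] == kind:
            hits.append(row)
    return hits, scanned

def build_index(rows):
    """Build a kind -> rows index once, ahead of time."""
    index = defaultdict(list)
    for row in rows:
        index[row[2]].append(row)
    return index

def indexed_query(index, kind):
    """Optimized query: touches only the rows that match."""
    hits = index.get(kind, [])
    return hits, len(hits)

index = build_index(rows)
scan_hits, scan_cost = scan_query(rows, "waystone")
idx_hits, idx_cost = indexed_query(index, "waystone")
print(f"full scan touched {scan_cost} rows, index touched {idx_cost}")
```

Both queries return the same rows, but the scan's cost grows with the size of the table while the indexed lookup's cost grows only with the number of matches. That is why doubling the entity count and running the query for every player in the tutorial would hit an unoptimized query so hard.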
#2 - The Freeze
The next major issue we (and you) noticed was a series of huge, minute-long freezes on the server. Awful. Eventually we determined that this only happened if we had the "mob monitor" running. The mob monitor is the "brain" of all the creatures in the world. It's the AI director. At some point while reviewing the logs, we found that a mob monitor crash and restart corresponded with the freezes. We fixed the mob monitor crash and the freezing went away.
#3 - The Disease
After we fixed the freeze, we found that performance was still not great on the server. Players were reporting substantial lag, and worse, they were reporting that they were disconnecting from the server or had problems connecting. Additionally, after a couple of hours each region server would start to get what we called "Region 8 Disease". The first symptom of Region 8 Disease was elevated reducer execution time, despite no apparent additional workload on the server. After that, many clients would disconnect, and reducer execution time got even worse and remained very bad. The only resolution was to restart the server, which would resolve the issue temporarily. Clients disconnecting seemed to both cause the issue and be caused by the issue.
We had assumed that the freeze was somehow caused by the monitor resubscribing to all the data, but after some experimentation and careful review of timestamps we found that the freeze issue was actually happening when the mob monitor disconnected. Eventually we figured out that when clients were disconnecting we would print an enormous number of logs saying that we couldn't deliver messages to them. The number of logs was proportional to the number of messages, so the more players in the game, the more logs we'd be printing. It turns out that although these logs were not on the reducer code path, they would hold a lock on stdout. That caused the small amount of logging that is on the reducer code path to get gummed up waiting on that lock, which made reducer execution time skyrocket (which caused more disconnections, which caused more logging, and so on).
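To make the "logs proportional to messages" problem concrete, here is a small Python sketch of one common mitigation: log the first delivery failure per client and only count the repeats. This is an illustration under assumptions, not a description of how the server was actually fixed. `CollapsingLogger`, its method names, and the log wording are all hypothetical.

```python
import io
import threading

class CollapsingLogger:
    """Logs the first delivery failure per client, then only counts repeats."""

    def __init__(self, sink):
        self._sink = sink              # any object with a write() method
        self._lock = threading.Lock()  # guards the counters and the sink
        self._suppressed = {}          # client_id -> number of repeats not logged

    def undeliverable(self, client_id):
        with self._lock:
            if client_id in self._suppressed:
                # Repeat failure: bump a counter instead of writing a line.
                self._suppressed[client_id] += 1
                return
            self._suppressed[client_id] = 0
            self._sink.write(f"could not deliver message to client {client_id}\n")

    def flush_summary(self):
        """Write one summary line per client that had suppressed repeats."""
        with self._lock:
            for client_id, count in self._suppressed.items():
                if count:
                    self._sink.write(
                        f"client {client_id}: {count} more undelivered messages\n"
                    )
            self._suppressed.clear()

# Simulate 1000 undeliverable messages for one disconnected client.
sink = io.StringIO()
log = CollapsingLogger(sink)
for _ in range(1000):
    log.undeliverable(42)
log.flush_summary()
lines = sink.getvalue().splitlines()
print(len(lines), "lines written instead of 1000")
```

The point of the design is that the lock is held only long enough to bump a counter on the common path, so the time other threads spend waiting on the shared output no longer scales with the number of queued messages.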
#4 - The Tragedies
Finally, we come to a set of tragedies which led to the biggest performance impact. In the last days before the release, we ran bot tests on the same physical machines that we'd be launching on. When we ran our latest code version, we actually saw the freeze happening. When we rolled back the version, we no longer saw the freeze, or saw it much less often, but we now believe this was simply due to random chance. We therefore tragically decided to ship an older version with early access.
The reason this was so tragic is that in the newest version we had made several query engine optimizations to improve performance of operations that the bots did not do, but that real players did (e.g. lots of crafting).
Last night, given our deeper understanding of the freeze and the disease, we chose to roll out the newest version, and we saw a dramatic improvement in server-side performance.
If you're reading this and you were disappointed with the server performance yesterday, I ask that you please give the game another shot either now, or after we fix the client-side performance and bugs. Not everything is solved, but things are a lot better. I hope this provides better insight into what was going wrong and what we're doing to resolve the issues as they arise.