Server coming back online. It crashed at 21:35, so there's a 35 minute revert, sorry about that. It just came back online at 0:00, so that means a 2 hour and 25 minute downtime.
I fixed 2 parts of the code that caused this, but that's just on the surface... I have 2 more upgrades planned to prevent such deadlocks in the future (or at least to keep the shard working at some half-capacity, so that the world can be saved and regularly restarted without a revert). I hope there are no other similar bugs that would crash the server when I'm not here to fix it, leaving us with a 6-8 hour downtime.
Here's what happened.
The culprits: Pariah and Banethorn, but they were framed by ThreadId 1 and ThreadId 270. All orchestrated by +Colibri, but there's no evidence :p
This is just an example of a deadlock. The image below is a very good metaphor for what happened.
Our server uses a lot of multi-threaded code to keep the lag down. For example, a search with [mystuff would cause a noticeable lag spike if used by a player with a lot of stuff. But it's multi-threaded, so while everyone is attacking monsters, the search algorithm just works on the data from the sidelines.
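To give a rough idea of what "working from the sidelines" looks like, here is a minimal Python sketch. It is not the shard's actual code: the ITEMS table, find_items and search_in_background are made-up names, and the real search certainly looks different. The main thread only takes a quick copy of the data, and the slow scan runs on a worker thread.

```python
import threading

# Hypothetical data: item name -> owner. Not the shard's real data model.
ITEMS = {"katana": "alice", "bandage": "bob", "spellbook": "alice"}

def find_items(owner, snapshot):
    """Scan a private snapshot, so the live data is never touched off-thread."""
    return sorted(name for name, who in snapshot.items() if who == owner)

def search_in_background(owner, on_done):
    """Start the search on a worker thread and return immediately."""
    snapshot = dict(ITEMS)  # quick copy, taken on the calling (main) thread
    worker = threading.Thread(
        target=lambda: on_done(find_items(owner, snapshot))
    )
    worker.start()
    return worker

results = []
worker = search_in_background("alice", results.append)
worker.join()  # a real server would keep running its game loop instead
print(results)  # [['katana', 'spellbook']]
```

Copying the data first is one way to keep the worker from stepping on the live world state. It costs memory, which is part of why not everything can be done this way.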
However, most things cannot just run concurrently. Imagine a bunch of blind people running around, each with a spear aimed ahead of them. For example, one thread might want to sum up a list of 100 numbers, and while it's doing that, another thread removes one number. When the first thread goes to read the last of the 100 numbers, it's no longer there, because the list is now just 99 numbers long, and that causes a crash. But just crashing is the good outcome; a worse scenario is silent data corruption, where you don't know where it's coming from. That's why computers use semaphores to signal who can currently work on a piece of data, so that only one thread at a time does it. No data corruption, no problem. The problem comes when, in a very complex system, all the lights turn red at once: every thread is waiting for another one to go first, and we get a deadlock.
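Here is a small Python sketch of both halves of that story, using the 100 numbers example. All names are invented for illustration, and a plain threading.Lock stands in for the semaphore (it is the simplest "one at a time" signal). Part 1 shows the lock keeping the list safe. Part 2 shows two threads taking two locks in opposite order; the timeout and the barrier are only there so the demo finishes and behaves the same every run, a real deadlock would just wait forever.

```python
import threading

# --- Part 1: one lock guards the list, so readers and writers take turns ---
numbers = list(range(1, 101))      # the list of 100 numbers
numbers_lock = threading.Lock()    # the "traffic light" for this list

def safe_sum():
    with numbers_lock:             # nobody can shrink the list while we read
        return sum(numbers[i] for i in range(len(numbers)))

def safe_remove_last():
    with numbers_lock:             # waits until a running sum has finished
        return numbers.pop()

totals = []
reader = threading.Thread(target=lambda: totals.append(safe_sum()))
writer = threading.Thread(target=safe_remove_last)
reader.start(); writer.start()
reader.join(); writer.join()
# totals[0] is 5050 or 4950 depending on who got the lock first, never a crash.

# --- Part 2: two locks taken in opposite order, all lights red ---
light_a, light_b = threading.Lock(), threading.Lock()
both_waiting = threading.Barrier(2)   # only here to make the demo repeatable
got_second = {}

def worker(name, first, second):
    with first:
        both_waiting.wait()                # each thread now holds one lock
        ok = second.acquire(timeout=0.2)   # a real deadlock has no timeout
        got_second[name] = ok
        both_waiting.wait()                # keep holding until both have tried
        if ok:
            second.release()

t1 = threading.Thread(target=worker, args=("t1", light_a, light_b))
t2 = threading.Thread(target=worker, args=("t2", light_b, light_a))
t1.start(); t2.start()
t1.join(); t2.join()
print(got_second)  # both values are False: neither thread could proceed
```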
This happened twice in the past, always when I'm on vacation.

I remember one time in ... July, probably 2017. Then again on August 1st 2020, last summer. These things are almost impossible to catch in testing; they only show up when the shard is under load from varied activity (everyone doing a lot of different things at once). Well, there are coding practices that prevent such deadlocks, but they make things much harder to code. This is just a game server, not the software that runs the electrical grid.
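One of those coding practices is lock ordering: give every lock a fixed rank and always take them lowest rank first, no matter which order the code asked for them. Below is a minimal Python sketch of that idea, not what the shard actually does. RankedLock, hold_all, bank and backpack are all hypothetical names.

```python
import threading
from contextlib import ExitStack, contextmanager

class RankedLock:
    """A lock with a fixed rank; lower ranks must always be taken first."""
    def __init__(self, rank):
        self.rank = rank
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, *exc):
        self._lock.release()

@contextmanager
def hold_all(*locks):
    """Take the locks in rank order, whatever order the caller listed them."""
    with ExitStack() as stack:
        for lock in sorted(locks, key=lambda l: l.rank):
            stack.enter_context(lock)
        yield  # locks are released in reverse order when the block ends

bank = RankedLock(1)       # hypothetical pieces of world data
backpack = RankedLock(2)
moves = []

def transfer(first, second, label):
    for _ in range(1000):
        with hold_all(first, second):
            moves.append(label)   # safe: we hold both locks here

# The two threads ask for the locks in opposite order, which would be
# the classic deadlock recipe without the sorting in hold_all.
t1 = threading.Thread(target=transfer, args=(bank, backpack, "t1"))
t2 = threading.Thread(target=transfer, args=(backpack, bank, "t2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(moves))  # 2000
```

The catch is exactly the "much harder to code" part: every piece of code that needs more than one lock has to go through something like hold_all, and one forgotten spot is enough to bring the deadlock back.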

[Image: deadlock.png]