OFFICIAL DOWNTIME THREAD

Anything you find suspicious: things that crash your client, things that crash the server, anything that doesn't work as it should.
+Colibri
Administrator
Posts: 4077
Joined: Sat Feb 25, 2006 4:08 pm
Location: static void Main

Re: OFFICIAL DOWNTIME THREAD

Post by +Colibri »

The lag spikes / disconnects are happening with a stable frequency of about 12 minutes.

I assume that the weekend techs are either too busy or unable to fix this issue, so I'm expecting this won't be fixed today (Sunday).
Attachments
2024-09-29 server downtime disconnection pattern - 1 hour.png (21.95 KiB)
+Colibri, Administrator of UO Excelsior Shard

Don't know what the purpose of your life is? Well then make something up! ;)
(Old Colibrian proverb)
Suthain
Apprentice Scribe
Posts: 13
Joined: Fri Apr 09, 2021 7:58 am

Re: OFFICIAL DOWNTIME THREAD

Post by Suthain »

.
Last edited by Suthain on Sun Sep 29, 2024 9:36 pm, edited 1 time in total.
MagicUser
Legendary Scribe
Posts: 206
Joined: Mon Nov 03, 2014 2:24 pm
Location: PST

Re: OFFICIAL DOWNTIME THREAD

Post by MagicUser »

On the actual bright side, I bet there are no afk farmers currently.
+Colibri wrote: Sun Sep 29, 2024 6:59 am The lag spikes / disconnects are happening with a stable frequency of about 12 minutes.

I assume that the weekend techs are either too busy or unable to fix this issue, so I'm expecting this won't be fixed today (Sunday).
Respectfully,
Paroxysmus ILV Master Spellcaster

Blakentar
Passer by
Posts: 4
Joined: Mon Sep 30, 2024 6:47 am

Re: OFFICIAL DOWNTIME THREAD

Post by Blakentar »

Hopefully they will resolve the issue today 😃
+Colibri
Administrator
Posts: 4077
Joined: Sat Feb 25, 2006 4:08 pm
Location: static void Main

Re: OFFICIAL DOWNTIME THREAD

Post by +Colibri »

I tried to move the server to an auxiliary one today, but there was a misconfiguration that I still have to figure out and fix, so we're back on the main one.
It's kind of late right now, and I have something else urgent to do today, so I'll just try to get this thing sorted out, and if there's time we'll move tonight around midnight shard time. If there are complications, we'll move tomorrow.
Dirtybook
Elder Scribe
Posts: 159
Joined: Sat Jan 21, 2023 3:53 pm

Re: OFFICIAL DOWNTIME THREAD

Post by Dirtybook »

Works for me. Gives me time to catch up on other things.
Johnny Warren
Legendary Scribe
Posts: 968
Joined: Mon Oct 11, 2010 11:40 pm

Re: OFFICIAL DOWNTIME THREAD

Post by Johnny Warren »

EDIT:

Lag spikes seem gone for me. Averaging 250ms ping, which is also normal...

*Fingers crossed*
JOHNNY WARREN!
tazlyn50
Novice Scribe
Posts: 9
Joined: Wed Sep 27, 2023 4:22 pm

Re: OFFICIAL DOWNTIME THREAD

Post by tazlyn50 »

Not so lucky here, just got booted.
Alibaster
Legendary Scribe
Posts: 270
Joined: Wed Nov 16, 2016 11:02 am

Re: OFFICIAL DOWNTIME THREAD

Post by Alibaster »

Same here. Booted out 3 times in the last hour
Alibaster in game!!
Wil
Legendary Scribe
Posts: 1228
Joined: Mon Dec 30, 2013 1:19 pm
Location: Seattle, WA, USA

Re: OFFICIAL DOWNTIME THREAD

Post by Wil »

Johnny Warren wrote: Wed Oct 02, 2024 10:23 pm Lag spikes seem gone for me. Averaging 250ms ping, which is also normal...
Nope.
Wil
Legendary Scribe
Posts: 1228
Joined: Mon Dec 30, 2013 1:19 pm
Location: Seattle, WA, USA

Re: OFFICIAL DOWNTIME THREAD

Post by Wil »

+Colibri wrote: Sun Sep 29, 2024 6:59 am I assume that the weekend techs are either too busy or unable to fix this issue, so I'm expecting this won't be fixed today (Sunday).
What's the scoop? I seem to recall this is not the first time your hosting provider has given you the run-around with faults in this particular piece of equipment.
Johnny Warren
Legendary Scribe
Posts: 968
Joined: Mon Oct 11, 2010 11:40 pm

Re: OFFICIAL DOWNTIME THREAD

Post by Johnny Warren »

Yeah, lag was still there. I thought it had smoothed out.

Is it hard to just rent a better server? I hope you're not paying them for this service while it's not working.
Lach
Legendary Scribe
Posts: 416
Joined: Wed Jul 29, 2020 6:47 am

Re: OFFICIAL DOWNTIME THREAD

Post by Lach »

I feel stupid I paid for gold/auction items that could help thousands of people. If only we could tell time.
@ Me Bro
+Colibri
Administrator
Posts: 4077
Joined: Sat Feb 25, 2006 4:08 pm
Location: static void Main

Re: OFFICIAL DOWNTIME THREAD

Post by +Colibri »

It seems like the problem has gone away. It's a bit frustrating, since there was no answer and nothing in particular that I did; maybe our main server just needed some downtime and a couple of restarts.

Timeline of events:
- Saturday, September 28th at 22:15 shard time, server was disconnected. I was still able to access the IPMI management console.
- 23:30-ish shard time, the technician took over. The cable had gotten unplugged or something along those lines, and he plugged it back in. There was a bit of back-and-forth with the tech: although they get alerts when the server is down, I was connected to the management console, so the tech thought the server was offline because I was doing maintenance on it. Half an hour later a similar thing happened again, but because of something else.
- We were back online about 2 and a half hours later (Sunday at 0:52 shard time).

But after this, there was a brief connection degradation about every 12 minutes. Some people got disconnected, but not everyone... and there seemed to be no particular rule about who would get disconnected. It seemed that it wouldn't go away, so I planned to move the game server to an auxiliary/secondary server. Although that server was set up with everything ready to go, it hasn't been used in probably 3 years, so it took a bit of extra work to get it ready. I think it would be good to have something like a fire drill a few times a year, to make sure it's all ready to go in case of an emergency.
On Tuesday morning around 5am, the spikes stopped for about an hour and a half.
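For anyone who wants to check a "roughly every 12 minutes" pattern against their own client logs, here is a minimal sketch in Python. It assumes you have a list of spike or disconnect times in ISO format; the function name and the sample timestamps are hypothetical and made up for illustration, not taken from the shard's real logs. It takes the median gap between consecutive events as the estimated period.

```python
from datetime import datetime
from statistics import median


def estimate_period_minutes(timestamps):
    """Return the median gap, in minutes, between consecutive events."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    if not gaps:
        raise ValueError("need at least two timestamps")
    return median(gaps)


# Hypothetical spike times, invented for this example (not real log data).
example_spikes = [
    "2024-09-29T06:00:10",
    "2024-09-29T06:12:05",
    "2024-09-29T06:24:20",
    "2024-09-29T06:36:00",
    "2024-09-29T06:48:15",
]

if __name__ == "__main__":
    print(f"estimated period: {estimate_period_minutes(example_spikes):.1f} minutes")
```

The median is used rather than the mean so that one missed or extra event does not skew the estimate much.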

Well, on Thursday, October 3rd at 18:20 we moved to the auxiliary server and I took the main one down for some diagnostics (running a bare-minimum Linux environment just to run some commands and see if the problem was still there). However, those lag spikes were gone. Even after going back into Windows, with everything running, there were no lag spikes. So at 22:00, we moved back to the main server (very fast, with just 15 minutes of downtime). And the lag spikes are no longer there.

So there's no telling what exactly happened, only theories:
- Since these lag spikes started happening just after Saturday's disconnect (perhaps a loose cable), there could be some hardware/firmware/software issue that caused some loop or data overflow, going off every 12 minutes, overwhelming the connection, and causing those disconnects. Taking the server offline for a few hours and doing a couple of restarts might have cleared that up.
- A DoS attack, although it would have to be one that targets some vulnerabilities of the RunUO server or one of our services, in a way that wouldn't be detected by the anti-DDoS system in the datacenter. I think this is unlikely, though.
- Similar to a DoS attack, but someone running some very intense script that would overwhelm the game server ... again, I don't think this is likely, but it could be a kind of accidental DoS attack, someone just having their script set with too short a timeout between actions. Though no amount of requests to the game server could cause a gigabit connection to saturate.

It has now been about 12 hours and it seems that the connection is stable.

2024-10-04 uogateway online chart.png (33.79 KiB)
Point A: moving from main server to auxiliary.
Point B: moving back to main
The lag spikes stopped right after point A.


@Wil - it's a pretty good host (OVH). I just looked back: in late August 2021 there was a hardware issue. They replaced the motherboard at the time, but it took some effort to get them to look at it seriously. Then again, they have a huge number of servers and automated monitoring, and there were also times when something was fixed in a very short time with no hassle.
@Lach - not sure what you wanted to say with your post, it doesn't seem to be helping though. If the lag spikes were still happening, I'd be trying to figure it out, or probably just move us to a new server. If the lag spikes continued to happen on another server, then I'd just keep digging. But no amount of money can make the process any faster. (Well, hiring an IT professional could, but since all these systems and software are completely custom, and there's a lot of trust involved, it would be hard to get someone very fast.)
Wil
Legendary Scribe
Posts: 1228
Joined: Mon Dec 30, 2013 1:19 pm
Location: Seattle, WA, USA

Re: OFFICIAL DOWNTIME THREAD

Post by Wil »

+Colibri wrote: Fri Oct 04, 2024 7:24 am they have a huge number of servers and automated monitoring,
As you say, they have plenty of servers. When I worked at Facebook, intermittent issues with specific machines were solved by removing the machine from the service pool and bringing up a replacement. The machine would then run a thorough test suite. If it passed, it'd be returned to the spare pool awaiting the next application while the replacement host would remain online. Even if it passed, it would be tracked. If a machine was removed for odd behavior not tracked to a clear cause a couple of times, it would be declared a lemon and retired from service.

When you're a large hosting company serious about the machines working right, that's how you do it.
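As a rough illustration of the policy described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names (`Machine`, `handle_odd_behavior`), the threshold of two unexplained removals as a reading of "a couple of times", and the `needs_repair` state for a machine that fails the test suite, which the description above does not cover. It is not how Facebook or any hosting company actually implements this.

```python
from dataclasses import dataclass

# Assumption: "a couple of times" is read as two unexplained removals.
LEMON_THRESHOLD = 2


@dataclass
class Machine:
    name: str
    unexplained_removals: int = 0
    state: str = "in_service"  # in_service | spare | needs_repair | retired


def handle_odd_behavior(machine, passed_tests, cause_found):
    """Decide where a machine goes after it was pulled for odd behavior.

    passed_tests: whether the machine passed the test suite after removal.
    cause_found: whether the odd behavior was tracked to a clear cause.
    """
    # The removal is tracked even when the tests pass.
    if not cause_found:
        machine.unexplained_removals += 1

    if machine.unexplained_removals >= LEMON_THRESHOLD:
        machine.state = "retired"  # declared a lemon
    elif passed_tests:
        machine.state = "spare"  # back to the spare pool; the replacement stays online
    else:
        machine.state = "needs_repair"  # assumption: failed machines go to repair
    return machine.state
```

For example, a host pulled twice with no clear cause goes to the spare pool the first time and is retired the second time, even if it passes the test suite both times.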