OFFICIAL DOWNTIME THREAD
Re: OFFICIAL DOWNTIME THREAD
The lag spikes / disconnects are happening with a stable frequency of about 12 minutes.
I assume that the weekend techs are either too busy or unable to fix this issue, so I'm expecting this won't be fixed today (Sunday).
[Attachment: 2024-09-29 server downtime disconnection pattern - 1 hour.png]
+Colibri, Administrator of UO Excelsior Shard
Don't know what the purpose of your life is? Well then make something up!
(Old Colibrian proverb)
Re: OFFICIAL DOWNTIME THREAD
.
Last edited by Suthain on Sun Sep 29, 2024 9:36 pm, edited 1 time in total.
Re: OFFICIAL DOWNTIME THREAD
On the actual bright side, I bet there are no afk farmers currently.
+Colibri wrote: Sun Sep 29, 2024 6:59 am The lag spikes / disconnects are happening with a stable frequency of about 12 minutes.
I assume that the weekend techs are either too busy or unable to fix this issue, so I'm expecting this won't be fixed today (Sunday).
Respectfully,
Paroxysmus ILV Master Spellcaster

Re: OFFICIAL DOWNTIME THREAD
Hopefully they will resolve the issue today.
Re: OFFICIAL DOWNTIME THREAD
I tried to move the server to an auxiliary one today, but there was a misconfiguration which I'll have to figure out and fix, so we're back on the main one.
It's kind of late right now, and I have something else I have to do urgently today, so today I'll just try to get this thing sorted out, and if there's time we'll be moving today around midnight shard time. But if there are complications, we will be moving tomorrow.
+Colibri, Administrator of UO Excelsior Shard
Don't know what the purpose of your life is? Well then make something up!
(Old Colibrian proverb)
Re: OFFICIAL DOWNTIME THREAD
Works for me. Gives me time to catch up on other things.
- Johnny Warren
- Legendary Scribe
- Posts: 968
- Joined: Mon Oct 11, 2010 11:40 pm
Re: OFFICIAL DOWNTIME THREAD
EDIT:
Lag spikes seem gone for me. Averaging 250ms ping, which is also normal...
*Fingers crossed*
Re: OFFICIAL DOWNTIME THREAD
Not so lucky here, just got booted.
- Wil
- Legendary Scribe
- Posts: 1228
- Joined: Mon Dec 30, 2013 1:19 pm
- Location: Seattle, WA, USA
Re: OFFICIAL DOWNTIME THREAD
Johnny Warren wrote: Wed Oct 02, 2024 10:23 pm Lag spikes seem gone for me. Averaging 250ms ping, which is also normal...
Nope.
- Wil
Re: OFFICIAL DOWNTIME THREAD
+Colibri wrote: Sun Sep 29, 2024 6:59 am I assume that the weekend techs are either too busy or unable to fix this issue, so I'm expecting this won't be fixed today (Sunday).
What's the scoop? I seem to recall this is not the first time your hosting provider has given you the run-around with faults in this particular piece of equipment.
- Johnny Warren
Re: OFFICIAL DOWNTIME THREAD
Yeah, lag was still there. I thought it had smoothed out.
Is it hard to just rent a better server? I hope you're not paying them for this service while it's not working.
Re: OFFICIAL DOWNTIME THREAD
I feel stupid I paid for gold/auction items that could help thousands of people. If only we could tell time.
@ Me Bro
Re: OFFICIAL DOWNTIME THREAD
It seems the problem has gone away. It's a bit frustrating, since there was no clear answer and nothing in particular that I did to fix it; maybe our main server just needed some downtime and a couple of restarts.
Timeline of events:
- Saturday, September 28th at 22:15 shard time, server was disconnected. I was still able to access the IPMI management console.
- Around 23:30 shard time, the technician took over. The cable had come unplugged, or something along those lines, and he plugged it back in. There was a bit of back-and-forth with the tech: although they get alerts when a server goes down, I was connected to the management console at the time, so the tech thought the server was offline because I was doing maintenance on it. Half an hour later a similar thing happened again, but for a different reason.
- We were back online about 2 and a half hours later (Sunday at 0:52 shard time).
But after this, there was a brief connection degradation about every 12 minutes. Some people got disconnected, but not everyone... and there seemed to be no particular rule about who would get disconnected. It seemed that it wouldn't go away, so I planned to move the game server to an auxiliary/secondary server. Although that server was set up with everything ready to go, it hadn't been used in probably 3 years, so it took a bit of extra work to get it ready. I think it would be good to have something like a fire drill a few times a year, to make sure it's all ready to go in case of an emergency.
On Tuesday morning around 5am, the spikes stopped for about an hour and a half.
Then, on Thursday, October 3rd at 18:20, we moved to the auxiliary server and I took the main one down for some diagnostics (running a bare-minimum Linux environment just to run some commands and see if the problem was still there). However, the lag spikes were gone. Even after going back into Windows, with everything running, there were no lag spikes. So at 22:00 we moved back to the main server (very fast, with just 15 minutes of downtime), and the lag spikes are no longer there.
So there's no telling what exactly happened, only theories:
- Since these lag spikes started just after Saturday's disconnect (perhaps a loose cable), there could have been some hardware/firmware/software issue that caused a loop or data overflow, going off every 12 minutes, overwhelming the connection and causing those disconnects. Taking the server offline for a few hours and doing a couple of restarts might have cleared that up.
- A DoS attack, although it would have to be one that targets some vulnerability of the RunUO server or one of our services, in a way that wouldn't be detected by the anti-DDoS system in the datacenter. I think this is unlikely.
- Similar to a DoS attack, but someone running a very intense script that overwhelms the game server. Again, I don't think this is likely, but it could be a kind of accidental DoS attack: someone having their script set with too short a delay between actions. Though no amount of requests to the game server could saturate a gigabit connection.
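A minimal sketch of how the "about every 12 minutes" pattern behind these theories could be checked from logged disconnect times. It assumes the disconnects are available as ISO-format timestamps; the function names, the 1-minute tolerance, and the sample data in the usage note are made up for illustration and are not taken from the shard's logs.

```python
from datetime import datetime
from statistics import median


def gaps_in_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive events."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if len(times) < 2:
        raise ValueError("need at least two timestamps")
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


def estimate_period_minutes(timestamps):
    """Estimate the repeat interval as the median gap between events."""
    return median(gaps_in_minutes(timestamps))


def is_periodic(timestamps, tolerance_minutes=1.0):
    """True if every gap lies within the tolerance of the median gap.

    The tolerance is an arbitrary choice for this sketch.
    """
    gaps = gaps_in_minutes(timestamps)
    period = median(gaps)
    return all(abs(g - period) <= tolerance_minutes for g in gaps)
```

Usage with hypothetical timestamps: `estimate_period_minutes(["2024-09-29T01:04:00", "2024-09-29T01:16:10", "2024-09-29T01:27:55"])` gives the median gap in minutes, and `is_periodic(...)` on the same list reports whether the gaps are regular. A regular interval would point more toward a timer or loop on the machine than toward players' scripts, though it would not prove either.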
It has now been about 12 hours and it seems that the connection is stable.
Point A: moving from main server to auxiliary.
Point B: moving back to main
The lag spikes stopped right after point A.
@Wil - it's a pretty good host (OVH). I just looked back: in late August 2021 there was a hardware issue. They replaced the motherboard at the time, but it took some effort to get them to look at it seriously. Then again, they have a huge number of servers and automated monitoring, and there were also times when something was fixed very quickly and with no hassle.
@Lach - I'm not sure what you wanted to say with your post, but it doesn't seem to be helping. If the lag spikes were still happening, I'd be trying to figure it out, or probably just move us to a new server. If the lag spikes continued on another server, I'd just keep digging. But no amount of money can make the process any faster. (Well, hiring an IT professional could, but since all these systems and software are completely custom, and there's a lot of trust involved, it would be hard to get someone very fast.)
+Colibri, Administrator of UO Excelsior Shard
Don't know what the purpose of your life is? Well then make something up!
(Old Colibrian proverb)
- Wil
Re: OFFICIAL DOWNTIME THREAD
+Colibri wrote: Fri Oct 04, 2024 7:24 am they have a huge number of servers and automated monitoring,
As you say, they have plenty of servers. When I worked at Facebook, intermittent issues with specific machines were solved by removing the machine from the service pool and bringing up a replacement. The machine would then run a thorough test suite. If it passed, it'd be returned to the spare pool awaiting the next application while the replacement host would remain online. Even if it passed, it would be tracked. If a machine was removed for odd behavior not tracked to a clear cause a couple of times, it would be declared a lemon and retired from service.
When you're a large hosting company serious about the machines working right, that's how you do it.
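The policy described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Facebook or OVH actually implement it: the names (`Machine`, `handle_odd_behavior`, `LEMON_THRESHOLD`) are hypothetical, "a couple of times" is taken to mean two, and sending a machine that fails its tests to repair is an assumption, since the post does not say what happens in that case.

```python
from dataclasses import dataclass

# "A couple of times" in the post; the exact number is an assumption.
LEMON_THRESHOLD = 2


@dataclass
class Machine:
    name: str
    state: str = "in_service"  # in_service | testing | spare | repair | retired
    unexplained_removals: int = 0


def handle_odd_behavior(machine, spare_pool, run_test_suite):
    """Pull a misbehaving machine, bring up a spare, then test the pulled one.

    run_test_suite is a caller-supplied function returning True if the
    machine passes. Returns the replacement, which stays in service.
    """
    if not spare_pool:
        raise RuntimeError("no spare machine available")
    replacement = spare_pool.pop(0)
    replacement.state = "in_service"
    machine.state = "testing"

    if run_test_suite(machine):
        # Passed, so the odd behavior has no clear cause: keep track of it.
        machine.unexplained_removals += 1
        if machine.unexplained_removals >= LEMON_THRESHOLD:
            machine.state = "retired"  # declared a lemon
        else:
            machine.state = "spare"
            spare_pool.append(machine)
    else:
        # Assumption: a failed test means a clear cause to repair.
        machine.state = "repair"
    return replacement
```

The point of the design is that the service never waits on diagnosis: the replacement goes live immediately, and the suspect machine only earns its way back into the spare pool.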