Author Topic: Lister Engine Forum: UNEXPECTED CHRISTMAS (2011) DOWNTIME  (Read 9403 times)

AdeV

  • Administrator
  • Hero Member
  • *****
  • Posts: 570
Lister Engine Forum: UNEXPECTED CHRISTMAS (2011) DOWNTIME
« on: December 28, 2011, 08:54:24 AM »
From: The Lister Engine Forum. Please do not reply to this e-mail (if you are reading this as an e-mail)


Fellow Lister Engine enthusiasts,

I would like to apologise most sincerely for the unexpected downtime LEF has experienced over the Christmas period. It must have looked like the "bad old days" were back. Well, I'd like to assure you that they most certainly are not back, and that this was one of those really bad luck situations experienced by our VPS host.

I will quote their status message so you can see what happened; this contains techie computer speak, so feel free to skip over the quote, and I will explain in English:

Quote
Update: Wednesday 28th December
The below post is a summary of previous posts regarding the VPS issues being experienced. Further updates will be added here.

On Thursday 22nd December at approximately 2100, we experienced a disk failure in one of our Storage Area Networks (SANs).

We have failover systems in place for an event such as this, as all of our SANs are clustered and have hot-spares (basically spares kept in the SANs solely for failover purposes), meaning an automatic rebuild takes place within minutes onto the other SANs. This process started as expected.

What we then encountered was the extremely rare event of another disk failure on our other live SAN and a failure in the replication cluster at the same time (approximately 2300).

The chance of multiple failures on different systems in such a short period of time is minute, but events like this have severe effects on the workings of SAN systems, and in this case caused them to go offline (and any VPS using the SAN to stop responding).

As such, the only option to restore service at this point was to bring the SAN back online, remove the initial server from the cluster, access the data and transfer the data from this onto the standby SAN.

Our engineers set about restoring the SAN immediately and also contacted Dell (the vendor of the hardware, with whom we have support contracts, and who stated they had never seen an issue like this occur before), and worked throughout the night to bring the SAN back online. This was a very critical process, as a main concern was preserving data so that, when systems were fully back online, customers would still have their VPS in the state it was in prior to the outage.

With assistance from Dell (whose engineers have specific workarounds and tools to bypass some of the automated procedures with the SAN), we managed to bring the SAN in question back online the following afternoon on Friday 23rd December at approximately 1400 with 0% data loss.

We then set about removing this SAN (with the disk issues) from the cluster so we could next perform a data migration onto the active SAN. For this to happen, the disk array on the affected SAN first needed to be rebuilt to allow the data to be accessed. If for any reason the array rebuild failed, the SAN would automatically go into offline mode again. To minimise the chances of this happening and avoid any additional delays from having to restart the process, we fitted additional spare disks into the SAN (having several additional clean disks in a SAN meant that if there was a further error, the SAN would detect the disks and carry on with the process rather than defaulting back to offline mode).

Due to the very large volume of data, the process of evicting the SAN from the cluster, rebuilding the array, performing a data integrity check and migrating the data from it is a long and largely automated one. It is not possible to accurately estimate a duration, as the rate of progress will vary throughout (e.g. 0%-10% could complete twice as fast as 10%-20% or vice versa). We have kept our status page updated with estimates based on the time it had taken thus far. As we have seen though, the final 10% is taking the longest.

The process commenced on Friday 23rd December and ran into Monday 26th December. No further errors were detected, though the overall process stalled at 99% on the evening of Sunday 25th December and remained there into Monday 26th December. The cause was identified as a corrupted partition, so our engineers set about copying this data onto external storage so that the rebuild process could be continued and the partition copied back when the rebuild was complete.

The partition was fully copied off by 1130 on Tuesday 27th December and the rebuild process continued through the afternoon until 1230 when the VPS hosts were powered back up. By 1553 we had restored service to all our Windows Virtuozzo VPS and all but two of our Linux Virtuozzo hosts. Work was continuing throughout the night into Wednesday 28th December for our Xen and Hyper-V VPS hosts and the two remaining Linux Virtuozzo hosts (one of which was going through a final disk check and the other was having the partition copied back).

Further updates will be posted to this page the moment we have them and will continue to do so until we have all services running as normal again.

The whole team here at Daily.co.uk sincerely apologise for any problems an outage of this nature will have inevitably caused our VPS customers. We would like to reassure customers that we are doing everything possible to restore service in the fastest possible manner whilst minimising the chances of any data loss.

A full incident report will also be available once systems are fully back online and all parties have been debriefed. This will clarify the above details along with further information as well, such as future preventative measures.

Or, in English, a disk broke, then a disk controller broke, then some other stuff broke, and the whole system shut itself down to prevent further failures. It then took Daily 3 (4?) days to get everything back up and running, hence the delay. I have also had to hard-reboot the server to get it going again, so there MAY be a post or two gone missing, but I sincerely hope not.

Anyway, apologies again, and hopefully this will never happen again.

Cheers!
Ade.
« Last Edit: January 01, 2014, 02:13:55 PM by AdeV »
Cheers!
Ade.
--------------
1x Lister CS Start-o-Matic (complete, runs)
0x Lister JP4 :( - Sold to go in a canal boat.

dieselgman

  • Hero Member
  • *****
  • Posts: 3148
    • Lister Parts
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #1 on: December 28, 2011, 10:34:27 AM »
Bravo for bringing it back online again! Unfortunate timing for all of this...

I appreciate you posting the details Ade!

dieselgman
Ford Powerstroke, Caterpillar 3304, Cummins M11, Too many Listers to count...

cgwymp

  • Full Member
  • ***
  • Posts: 133
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #2 on: December 28, 2011, 12:43:02 PM »
Thanks for the great work getting everything back up!
Listeroid 8/1

38ac

  • Hero Member
  • *****
  • Posts: 1812
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #3 on: December 28, 2011, 01:36:43 PM »
Thanks for all that you and whoever else do to keep it up and going. I hadn't realized that this place had become my cyber hangout until it was gone for a few days.

Thanks again, Butch
Collector and hoarder of about anything diesel

xyzer

  • Hero Member
  • *****
  • Posts: 1052
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #4 on: December 28, 2011, 02:49:27 PM »
Thanks for being "on top of it"! I began to worry when for the first time in a long time it was gone! The assurance really helped!
Vidhata 6/1 portable
Power Solutions portable 6/1
Z482 KUBOTA

BigGreen

  • Jr. Member
  • **
  • Posts: 80
    • Mach1Pony
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #5 on: December 28, 2011, 11:32:21 PM »
No apology necessary. Thanks for everything you do.

Dave
More Power Ashwamegh 25/2 15kw

Startomatic

  • Jr. Member
  • **
  • Posts: 86
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #6 on: December 29, 2011, 05:10:43 AM »
Many thanks for your effort in keeping the forum going. Great job, well done. Greetings and compliments of the season from the far end of the Far East.
Lister 6/1 with ST5,Lister SL3, JP1
Yanmar Diesel L10,TS65,TS105 with Interpump hp head
Yanmar 3TN72, NT110 with ST5
Andoria S320
Kubota Z400 Twin Diesel with KEW 1500 hp head
Isuzu 25KVA Diesel silenced Genset
ChangChai EV80
JianDong 1115
Shineray250 watercooled- gasoline
ListerHRW3water

Horsepoor

  • Sr. Member
  • ****
  • Posts: 250
  • West Palm Beach, Florida
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #7 on: December 29, 2011, 05:57:57 AM »
Thank you for all your hard work and time. Reading this forum ranks among my greatest pleasures. Thank you.
GTC 20/2 down rated to 850 rpm - ST 15
Metro 6/1 800 rpm on cart - ST 7.5

billswan

  • Sr. Member
  • ****
  • Posts: 439
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #8 on: December 29, 2011, 12:36:19 PM »
Thanks to the admin.....

Billswan
16/1 Metro  in the harness choking on WMO ash!!

10/1 OMEGA failed that nasty WMO ash ate it

By the way what is your cylinder index?

AdeV

  • Administrator
  • Hero Member
  • *****
  • Posts: 570
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #9 on: December 29, 2011, 12:55:46 PM »
In fairness, I did very little - the hosting company did all the hard work; all I had to do was a reboot...
Cheers!
Ade.

Apogee

  • Jr. Member
  • **
  • Posts: 75
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #10 on: December 29, 2011, 09:06:52 PM »
Ade,

I too appreciate the work keeping this site going.

Thanks and Happy Holidays,

Steve

listerboy

  • Jr. Member
  • **
  • Posts: 97
  • I love the smell of diesel in the morning....
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #11 on: December 31, 2013, 08:53:02 PM »
Who are we to complain??  ;D  ;D

Thanks Ade!!  :angel:

AdeV

  • Administrator
  • Hero Member
  • *****
  • Posts: 570
Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
« Reply #12 on: January 01, 2014, 02:13:27 PM »
Quote
Who are we to complain??  ;D  ;D

Thanks Ade!!  :angel:

Heh - this was a couple of years ago now.... and other than some occasional erratic behaviour by the server (requiring a hard reboot), it's been largely OK ever since...

I'd better unsticky this topic, as it's well past its prime. To think, when I wrote that post, I was languishing in hospital with Pneumonia/Pleurisy...

Cheers! And happy new year everyone!