Lister Engine Forum

General Category => General Discussion => Topic started by: AdeV on December 28, 2011, 08:54:24 AM

Title: Lister Engine Forum: UNEXPECTED CHRISTMAS (2011) DOWNTIME
Post by: AdeV on December 28, 2011, 08:54:24 AM
From: The Lister Engine Forum. Please do not reply to this e-mail (if you are reading this as an e-mail)


Fellow Lister Engine enthusiasts,

I would like to apologise most sincerely for the unexpected downtime LEF has experienced over the Christmas period. It must have looked like the "bad old days" were back. Well, I'd like to assure you that they most certainly are not back, and that this was one of those really bad luck situations experienced by our VPS host.

I will quote their status message so you can see what happened; this contains techie computer speak, so feel free to skip over the quote, and I will explain in English:

Quote
Update: Wednesday 28th December
The below post is a summary of previous posts regarding the VPS issues being experienced. Further updates will be added here.

On Thursday 22nd December at approximately 2100, we experienced a disk failure in one of our Storage Area Networks (SANs).

We have failover systems in place for an event such as this, as all of our SANs are clustered and have hot-spares (basically spares kept in the SANs solely for failover purposes), meaning an automatic rebuild takes place within minutes onto the other SANs. This process started as expected.

What we then encountered was the extremely rare event where we experienced both another disk failure on our other live SAN and the replication cluster at the same time (approximately 2300).

The chance of multiple failures on different systems in such a short period of time is minute, but events like this have severe effects on the workings of SAN systems, and in this case caused them to go offline (and any VPS using the SAN to stop responding).

As such, the only option to restore service at this point was to bring the SAN back online, remove the initial server from the cluster, access the data and transfer the data from this onto the standby SAN.

Our engineers set about restoring the SAN immediately and also contacted Dell (the vendors of the hardware who we have support contracts with, who stated they had never seen an issue like this occur before) and worked throughout the night to bring the SAN back online. This was a very critical process as a main concern was with preservation of data so that when systems were fully back online, customers would still have their VPS in the state they were in prior to the outage.

With assistance from Dell (whose engineers have specific workarounds and tools to bypass some of the automated procedures with the SAN), we managed to bring the SAN in question back online the following afternoon on Friday 23rd December at approximately 1400 with 0% data loss.

We then set about removing this SAN (with the disk issues) from the cluster so we could next perform a data migration onto the active SAN. For this to happen, the disk array on the affected SAN first needed to be rebuilt to allow the data to be accessed. If for any reason the array rebuild failed, the SAN would automatically go into offline mode again. To minimise the chances of this happening and avoid any additional delays from having to restart the process, we fitted additional spare disks into the SAN (having several additional clean disks in a SAN meant that if there was a further error, the SAN would detect the disks and carry on with the process rather than defaulting back to offline mode).

Due to the very large volume of data, the process of evicting the SAN from the cluster, rebuilding the array, performing a data integrity check and migrating the data from it is long and largely automated. It is not possible to accurately estimate a duration, as the process will vary throughout (e.g. 0%-10% could complete twice as fast as 10%-20% or vice versa). We have kept our status page updated with estimates based on the time it had taken thus far. As we have seen though, the final 10% is taking the longest.

The process commenced on Friday 23rd December and has so far run into Monday 26th December. No further errors were detected, though the overall process stalled at 99% from the evening of Sunday 25th December into Monday 26th December. The cause was identified as a corrupted partition, so our engineers set about copying this data onto external storage so that the rebuild process could continue and the partition could be copied back once the rebuild was complete.

The partition was fully copied off by 1130 on Tuesday 27th December and the rebuild process continued through the afternoon until 1230 when the VPS hosts were powered back up. By 1553 we had restored service to all our Windows Virtuozzo VPS and all but two of our Linux Virtuozzo hosts. Work was continuing throughout the night into Wednesday 28th December for our Xen and Hyper-V VPS hosts and the two remaining Linux Virtuozzo hosts (one of which was going through a final disk check and the other was having the partition copied back).

Further updates will be posted to this page the moment we have them, and we will continue to post them until all services are running as normal again.

The whole team here at Daily.co.uk sincerely apologise for any problems an outage of this nature will have inevitably caused our VPS customers. We would like to reassure customers that we are doing everything possible to restore service in the fastest possible manner whilst minimising the chances of any data loss.

A full incident report will also be available once systems are fully back online and all parties have been debriefed. This will clarify the above details and provide further information, such as future preventative measures.

Or, in English, a disk broke, then a disk controller broke, then some other stuff broke, and the whole system shut itself down to prevent further failures. It then took Daily 3 (4?) days to get everything back up and running, hence the delay. I have also had to hard-reboot the server to get it going again, so there MAY be a post or two gone missing, but I sincerely hope not.

Anyway, apologies again, and hopefully this will never happen again.

Cheers!
Ade.
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: dieselgman on December 28, 2011, 10:34:27 AM
Bravo for bringing it back online again! Unfortunate timing for all of this...

I appreciate you posting the details Ade!

dieselgman
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: cgwymp on December 28, 2011, 12:43:02 PM
Thanks for the great work getting everything back up!
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: 38ac on December 28, 2011, 01:36:43 PM
Thanks for all you and whoever else does to keep it up and going. I hadn't realized that this place has become my cyber hangout until it was gone for a few days.

Thanks again, Butch
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: xyzer on December 28, 2011, 02:49:27 PM
Thanks for being "on top of it"! I began to worry when for the first time in a long time it was gone! The assurance really helped!
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: BigGreen on December 28, 2011, 11:32:21 PM
No apology necessary. Thanks for everything you do.

Dave
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: Startomatic on December 29, 2011, 05:10:43 AM
many thanks for your effort in keeping the forum going. great job well done. greetings and compliments of the seasons from the far end of the Far East.
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: Horsepoor on December 29, 2011, 05:57:57 AM
Thank you for all your hard work and time. Reading on this forum ranks among my greatest pleasures, thank you.
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: billswan on December 29, 2011, 12:36:19 PM
Thanks to the admin.....

Billswan
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: AdeV on December 29, 2011, 12:55:46 PM
In fairness, I did very little - the hosting company did all the hard work, all I had to do was a reboot...
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: Apogee on December 29, 2011, 09:06:52 PM
Ade,

I too appreciate the work keeping this site going.

Thanks and Happy Holidays,

Steve
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: listerboy on December 31, 2013, 08:53:02 PM
Who are we to complain??  ;D  ;D

Thanks Ade!!  :angel:
Title: Re: Lister Engine Forum: UNEXPECTED CHRISTMAS DOWNTIME
Post by: AdeV on January 01, 2014, 02:13:27 PM
Quote
Who are we to complain??  ;D  ;D

Thanks Ade!!  :angel:

Heh - this was a couple of years ago now.... and other than some occasional erratic behaviour by the server (requiring a hard reboot), it's been largely OK ever since...

I'd better unsticky this topic, as it's well past its prime. To think, when I wrote that post, I was languishing in hospital with Pneumonia/Pleurisy...

Cheers! And happy new year everyone!