[lug-l] VM Host Failure

This is an update to the current issue with LUG's infrastructure (http://status.lug.mtu.edu/incidents/blgb1j2vl2x6).

Over winter break, some users reported to the admin email list lug-server-admins-l@xxxxxxx that they were unable to log in to the shell server, shell.lug.mtu.edu. After some investigation we found that the RAID array on our VM host had gone into a degraded state and taken itself offline. The Xen logs show that the problem started a few hours before the array went offline, beginning with a few I/O errors, which is when we think the first disk failed. A few hours later the second drive failed, leaving the array in a very bad state and forcing it offline.

I have been working on imaging the drives in an attempt to recover what data I can. However, my Fedora workstation was built with btrfs for the OS -- tip: don't use btrfs, ever -- and it died, so I had to rebuild it, and with the new semester starting I had other things to take care of first. I have now restored from my backups and am able to begin imaging the drives.

So far things are not looking very good. About 4 hours ago -- 2016-01-18 06:16 -- I started to image the first disk I picked from the four (9VS2R298), and it keeps disconnecting itself...

```
[270577.469248] sd 124:0:0:0: Attached scsi generic sg6 type 0
[270577.469269] sd 124:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[270577.469713] sd 124:0:0:0: [sdf] Write Protect is off
[270577.469715] sd 124:0:0:0: [sdf] Mode Sense: 23 00 00 00
[270577.470161] sd 124:0:0:0: [sdf] No Caching mode page found
[270577.470164] sd 124:0:0:0: [sdf] Assuming drive cache: write through
[270577.473391] sd 124:0:0:0: [sdf] Attached SCSI disk
[270591.785143] usb 2-2: USB disconnect, device number 110
[270592.013353] usb 2-2: new SuperSpeed USB device number 111 using xhci_hcd
[270592.028740] usb 2-2: New USB device found, idVendor=174c, idProduct=5106
[270592.028743] usb 2-2: New USB device strings: Mfr=2, Product=3, SerialNumber=1
[270592.028744] usb 2-2: Product: AS2105
[270592.028745] usb 2-2: Manufacturer: ASMedia
[270592.028746] usb 2-2: SerialNumber: 9VS2R298
[270592.031032] usb-storage 2-2:1.0: USB Mass Storage device detected
[270592.031122] scsi host125: usb-storage 2-2:1.0
[270593.031687] scsi 125:0:0:0: Direct-Access ST315003 41AS CC1H PQ: 0 ANSI: 5
[270593.032111] sd 125:0:0:0: Attached scsi generic sg6 type 0
[270593.032157] sd 125:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[270593.032688] sd 125:0:0:0: [sdf] Write Protect is off
[270593.032690] sd 125:0:0:0: [sdf] Mode Sense: 23 00 00 00
[270593.033204] sd 125:0:0:0: [sdf] No Caching mode page found
[270593.033206] sd 125:0:0:0: [sdf] Assuming drive cache: write through
[270593.036441] sd 125:0:0:0: [sdf] Attached SCSI disk
```
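For anyone who wants to measure how long the disk stays attached between drops, here is a minimal Python sketch that pulls the attach and disconnect events out of kernel log text like the excerpt above. The function names `summarize_reconnects` and `uptimes` are made up for this illustration, and it assumes the log keeps the `[seconds.micros]` timestamp prefix shown here.

```python
import re

# Matches lines such as:
# "[270591.785143] usb 2-2: USB disconnect, device number 110"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def summarize_reconnects(log_text):
    """Return (timestamp, kind) pairs for disk-attach and USB-disconnect lines."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank line or a line without a timestamp prefix
        timestamp, message = float(match.group(1)), match.group(2)
        if "USB disconnect" in message:
            events.append((timestamp, "disconnect"))
        elif "Attached SCSI disk" in message:
            events.append((timestamp, "attached"))
    return events


def uptimes(events):
    """Seconds each attach lasted before the next disconnect."""
    result = []
    last_attach = None
    for timestamp, kind in events:
        if kind == "attached":
            last_attach = timestamp
        elif kind == "disconnect" and last_attach is not None:
            result.append(round(timestamp - last_attach, 6))
            last_attach = None
    return result
```

Fed the excerpt above, the one complete attach/disconnect pair (270577.473391 to 270591.785143) works out to roughly 14.3 seconds of uptime.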

This repeats over and over. In the hours since, I have moved that drive to a USB3 dock and moved another of the disks to my hotswap bay... The really bad news is that the first disk I started to image was not one of the 2 that the RAID controller reported as failed.
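To illustrate why a drive that keeps dropping off the bus makes imaging so slow, here is a minimal Python sketch of a resumable copy: it stops at the first I/O error and reports the offset reached, so a caller can wait for the device to reappear and pick up where it left off. This is not the tool being used for the recovery; `copy_resumable` and its parameters are hypothetical names for this example, and a dedicated tool such as GNU ddrescue is the usual choice for real recovery work.

```python
import os


def copy_resumable(src_path, dst_path, start=0, chunk_size=1024 * 1024):
    """Copy src to dst from byte offset `start`, stopping at the first I/O error.

    Returns (next_offset, finished). When finished is False, the caller can
    wait for the device to come back and call again with start=next_offset.
    """
    offset = start
    # Open the image without truncating it so earlier progress is kept.
    fd = os.open(dst_path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+b") as dst:
        try:
            with open(src_path, "rb", buffering=0) as src:
                src.seek(offset)
                dst.seek(offset)
                while True:
                    data = src.read(chunk_size)
                    if not data:
                        return offset, True  # reached the end of the source
                    dst.write(data)
                    offset += len(data)
        except OSError:
            # The I/O failed or the device vanished; report how far we got.
            return offset, False
```

A caller would loop on this, sleeping between attempts until `finished` comes back `True`. Note that this sketch never skips past an unreadable region, which a real rescue tool would do.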

I will continue to attempt the data recovery, but I am not holding my breath that I will be able to recover anything useful. I am hoping to be finished with the recovery attempt by the start of next week.

In the meantime, the old adviser for PSG was kind enough to donate their server with a Xeon 5120 @1.86GHz and > 16G of RAM; this should be much better than the old VM host. The downside to that server is that it requires new hard drives -- the current drives are failing. So we are currently looking for decent SATA drives for that box, and once we have them we will install the latest version of Debian and restart the shell server.

If you have any questions for the admins, please send them to lug-server-admins-l@xxxxxxx and someone will get back to you.

Brandon Ingalls
Computer Network & System Administration Major
