Cascading failures (or, why I did nothing this weekend)

This is a fun one.

To set the scene and provide information in temporal order, my Talos and WD Black NVMe device have never “gotten along” well. Frequently, the device would fail to train for whatever reason. Calling reboot from a Petitboot shell with fast-reset enabled was enough to fix this, so I didn’t think all that much about it.

Late in the night Thursday (or early in the morning Friday, if you prefer), I was reading a few articles before I went to bed. At 01:36:13, an animal darted in front of a car about two miles away, causing the car to crash into a power pole. This caused a serious power surge as the lines came down onto each other (and the car’s frame).

My office was protected by an APC BX1500G combination battery backup (SPS, not UPS) and surge protection unit. The power surge was severe enough that the unit failed. The lights, the Power Mac G5, and the Talos in my office went dark immediately, and the alarm sounded continuously while “F04” and “See Manual” flashed on the display. This code means “Clamp Short”: the varistor that is supposed to arrest surges had become permanently ‘stuck’ in arrest mode.

My first priority was to ensure the integrity of my hardware, so I dug out a spare APC SPS and put the battery from the now-failed one in it. I powered on my Talos and while it seemed to IPL fine, lspci in Petitboot did not show my NVMe device as present. Looking in Hostboot, it now wasn’t even failing to train — the slot may as well have been unoccupied. I tried both a fast-reset and a full system power cycle to no avail.
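For anyone who wants to script the "is the NVMe even on the bus?" check rather than eyeball lspci output, here is a minimal sketch. It assumes a Linux system with sysfs mounted and Python available (which is not necessarily true of a Petitboot shell), and it relies on the standard PCI class code for non-volatile memory controllers (base class 0x01, subclass 0x08). The function name and the `sysfs_root` parameter are made up for illustration.

```python
from pathlib import Path

# PCI base class 0x01 (mass storage), subclass 0x08 (non-volatile memory).
# NVMe controllers normally report 0x010802; matching on the prefix is an
# assumption made here so that other programming interfaces are also caught.
NVME_CLASS_PREFIX = "0x0108"


def find_nvme_controllers(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses of devices whose class looks like NVMe."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for dev in sorted(root.iterdir()):
        try:
            pci_class = (dev / "class").read_text().strip().lower()
        except OSError:
            # Entry without a readable class file: skip it.
            continue
        if pci_class.startswith(NVME_CLASS_PREFIX):
            found.append(dev.name)
    return found


if __name__ == "__main__":
    controllers = find_nvme_controllers()
    if controllers:
        print("NVMe controller(s) present:", ", ".join(controllers))
    else:
        print("No NVMe controller enumerated on the PCI bus")
```

An empty result here corresponds to what lspci showed after the surge: the device is not failing, it simply is not enumerated at all.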

The next day, I attempted to swap slots, thinking the PCIe slot that the NVMe device was connected to may have been damaged by the power surge. I swapped the NVMe and the sound card. The sound card worked in the slot formerly occupied by the NVMe, but the NVMe still wouldn’t come up in the slot formerly occupied by the sound card. Now came the worrying part: did the M.2 to PCIe adaptor fail, or did the NVMe media itself fail?

I went to two of our local computer stores to buy some parts. I decided that since I was already going to need to do a full disk swap (even if I could recover the data off the NVMe media), it would make sense to add SATA media as well. I had a 256 GB SATA SSD lying around that was supposed to go in my Xeon until the Xeon failed. I found an “open box” Seagate 1 TB HDD for 30 USD at DISC Surplus Computers in Sand Springs. And I found a new 4-port Marvell chipset based SATA controller for 34 USD at Wholesale Computer Supply in Tulsa. I also had a 3-port USB 3.0 controller card sitting on a shelf, since the slot it was meant to go in was occupied by a 2-slot Radeon that was used for big endian amdgpu.ko porting. I went ahead and shelved that Radeon (and the big endian porting effort, for now) and used the CPU 1 slot for SATA and the CPU 2 slot for the USB card.

I then turned my attention to recovering the NVMe media. I brought out my old Intel Reference Platform board, a DP43TF with developer firmware, and put the NVMe adaptor card with media into it. Unfortunately, the DP43TF only has one PCIe slot larger than x1, and it’s the x16 typically used for a GPU. Since NVMe is x4, I had to find a PCI GPU. I pulled a GeForce 8400GS out of one of our Pentium 4 test boxes and attempted to boot the Adélie 1.0-BETA4 live CD.

Our live CD does not support the JMicron PATA controller that the DP43TF’s DVD drive was connected to. I ended up using a USB optical drive, though a SATA optical drive would also have worked. The CD I was attempting to use was scratched, and it refused to finish booting (the scratched section appears to have contained OpenRC). I had to find a computer capable of burning media, which was no small task since most newer computers don’t support writing optical media and most of my computers have marginal USB support at best.

One of our community members reminded me that the PowerBook G4 has a SuperDrive, which I used to burn a fresh x86_64 BETA4 CD. Finally booted, I noticed the NVMe was present but throwing occasional controller reset errors. I’m not sure if this was due to media degradation or the fact it was a Gen3 NVMe in a Gen1 PCIe slot. At any rate, I used dd to make a full clone of the NVMe to the 1 TB Seagate disk, and then put that in the Talos. A gracious member of the Adélie Linux community donated the funds needed to replace the NVMe with a Samsung 970 EVO Pro of the same size.
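The exact dd invocation used for the clone isn't recorded above. As an illustration of the general technique when the source is throwing intermittent errors, here is a rough Python sketch of a block-by-block copy that fills unreadable blocks with zeros and carries on, similar in spirit to dd's `conv=noerror,sync`. The function name, the 1 MiB default block size, and any paths passed to it are assumptions for the example, not what was actually run.

```python
import os


def clone_device(src_path, dst_path, block_size=1024 * 1024):
    """Copy src to dst block by block.

    Blocks that raise a read error are written out as zeros so that
    everything after them stays at the correct offset.
    Returns (bytes_written, bad_block_count).
    """
    bad_blocks = 0
    offset = 0
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        # Seeking to the end reports the size for regular files and,
        # on Linux, for block devices as well.
        total = src.seek(0, os.SEEK_END)
        while offset < total:
            want = min(block_size, total - offset)
            src.seek(offset)
            try:
                chunk = src.read(want)
            except OSError:
                # Unreadable region: substitute zeros and keep going.
                chunk = b"\0" * want
                bad_blocks += 1
            if not chunk:
                break  # unexpected end of input
            dst.write(chunk)
            offset += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())
    return offset, bad_blocks
```

Pointing this at real devices needs root and will overwrite the destination, so it is worth trying on image files first. A non-zero bad block count means the copy contains zero-filled holes that should be checked before the original media is retired.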

Yesterday I copied the data off the Seagate 1 TB to the new Samsung NVMe. Everything is working quite well, and the Samsung is much faster than the Western Digital: 712 MB/s uncached write vs 303 MB/s on the WD. The additional space on the Seagate can be used for further testing, and for possible expansion of Adélie to more platforms. I may post more on that later. 🙂
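How the write figures were measured isn't stated. One plausible way to get a comparable "uncached" number is to time a large sequential write and include the final fsync in the timing, so the page cache can't flatter the result. The sketch below does that; the function name, sizes, and the choice of fsync rather than direct I/O are all assumptions, and MB here means 10^6 bytes.

```python
import os
import time


def measure_write_mb_per_s(path, total_bytes=256 * 1024 * 1024,
                           block_size=1024 * 1024):
    """Write total_bytes to path and return throughput in MB/s.

    The fsync is inside the timed region, so data still sitting in the
    page cache counts against the result.
    """
    block = os.urandom(block_size)
    written = 0
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            remaining = total_bytes - written
            written += f.write(block[:min(block_size, remaining)])
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return written / elapsed / 1_000_000
```

Run against a scratch file on each drive, this gives two numbers to compare. For the figures quoted above, 712 / 303 comes out to roughly 2.35, so the Samsung writes a bit more than twice as fast as the WD did.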

This was a very interesting experience for me. It’s been many years since I’ve seen a cascade of failures like this: car accident breaks APC SPS, which breaks NVMe marginally, which shows an issue booting our live CD on a specific computer. It also gave me a reason to re-catalogue a lot of the hardware I have on hand for testing purposes, and to know what needs fixing and replacing. And most importantly, it made me realise I need to perform weekly backups instead of semi-annual backups.

I want to especially thank the members of the Adélie Linux community that helped with this process, not only financially but with techniques and ideas to make this go well. My workstation is better than ever, and now I can get even more done for libre software. You rock!

Wednesday: Photos around Tulsa

No tech today. Haven’t been out on the open road for far too long. I took a few nice photos in the passenger seat as we were heading westward. (It was far too dark to take any photos when we went back east.)

[Cute cat, lounging on window sill]
Mr Gaz on his perch in my home office, just before we left
[Clouds over a shopping centre]
I believe these are Stratocumulus clouds, which had a rather striking appearance over Southroads today
[Clouds with a sunset on the horizon, offset by a highway exit sign]
A beautiful sky around twilight, taken from westbound I-44.
[Tulsa skyline with Arkansas River in front]
The downtown Tulsa skyline, as seen from the I-44 bridge over the Arkansas River.
[Adorable cat, but not as adorable as Mr Gaz]
As a bonus, this lovable four month old tabby is Shelby, and she’s currently available for adoption at the Tulsa Hills PetSmart. She is bubbly and loves scritches!

Happy Workaholic Day!

I’ve never been a big fan of stores being open on Thanksgiving Day, because I feel that American culture already emphasises consumerism and unhealthy obsessions with work enough. However, I rarely say anything, because what are you going to do with big-box retailers? They want some of that Black Friday money, and they typically don’t open until 9 PM or later on Thanksgiving — that’s late enough that I could see a reasonable amount of relaxation or family time being spent.

That is, until I opened my inbox yesterday afternoon and found this email from our local “Oklahoma Proud” grocer, Reasor’s:

Open Thanksgiving - Regular Store Hours

I was definitely not Oklahoma Proud. I was Oklahoma Ashamed. I was also appalled and disgusted. They aren’t even treating Thanksgiving as a holiday. It’s just another work day in another work week. Some of their stores are open 24 hours — they won’t close at all for this holiday!

And it just kept coming. I received this email shortly after picking up our family’s meal package at The Fresh Market:

Open until 3pm Thanksgiving

That’s slightly better, but still doesn’t allow employees much freedom to spend Thanksgiving morning and afternoon the way they want to.

American culture already penalises people enough for wanting to have a holiday outside of the federally-recognised ones. Some workplaces do not even allow you holidays (or “vacation days”), and the ones that do typically require you to work for a certain amount of time before receiving any. This is the next level, and in my opinion, going too far. When you start taking away the ability of people to have holidays at all, even when they are federally recognised, that is where I draw the line and say something is wrong. This is unhealthy for all involved, and will only lead to problems.

Saturday: Mozilla and Bixby

This morning, I tried more ideas for fixing the remaining endianness bugs in Mozilla’s graphics engine.  I found a few more leads but so far no progress on cracking the image decoding issue.

It was a beautiful day out and my allergies are waning since it’s finally autumn, so we took my gran out and decided to explore around Bixby.  There’s quite a variety of shops down there; very nice.  Their Super Target is much nicer than the Tulsa one, as well.

[A sunset with many shades of blue, teal, red, and yellow]
Sunset over Bixby

As it became dark, we headed home.  On the way back I stopped in to Best Buy to find a universal remote for the TV I was given second-hand.  Had a nice chat with the cashier about watch bands.

Back at home, Mr Gaz was very affectionate and mrowy.  They say there’s going to be a light frost overnight.  I can’t wait.  The property turned off the air conditioning last week so it’s been uncomfortably warm in my flat.  Bring on the cold weather and warm kitty snuggles ^.^

Now playing: ♫ Heartbeat – Carrie Underwood