Reckless Software Development Must End

On the 6th of November, 2019, I made a comment on Twitter:

Okay, so today’s news isn’t as dramatic as Uber killing a homeless woman by not programming in the fact that pedestrians might not use crosswalks, but it is based in the same mode of thought.

Today’s news is that the US state of Iowa has had issues with its election processes (processes that are a bit too complex for me to give an overview of in this blog). The problem boils down to reckless abandonment of software engineering principles.

As reported in the New York Times and The Verge, in addition to many other outlets, there were a number of failings in the development and deployment of this software package that would have been trivial to prevent.

My personal belief is that the following issues significantly contributed to the failure we have seen.

No test plan

There was no well-defined plan of testing.

The test plan should have covered testing of the back-end (server) portion of the software, including synthetic load testing. My test plan would have included a swarm of all 1600+ precincts reporting all possible data at the same time, using a pool of a few inexpensive systems running multi-connection clients.
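A minimal sketch of such a swarm, assuming a line-based JSON report format and a stand-in server that only acknowledges each report. The names `handle_report`, `send_report`, and `run_swarm`, the payload shape, and the connection cap are all hypothetical; a real test would point the clients at the actual back-end from a few separate machines.

```python
import asyncio
import json

PRECINCTS = 1600      # rough precinct count from the text
MAX_IN_FLIGHT = 100   # assumed cap on simultaneous connections per load generator


async def handle_report(reader, writer):
    """Stand-in back end (hypothetical): read one JSON line, acknowledge, close."""
    line = await reader.readline()
    report = json.loads(line)
    writer.write(f"ACK {report['precinct']}\n".encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def send_report(precinct, host, port, limit):
    """One simulated precinct: connect, send made-up tallies, wait for the ACK."""
    async with limit:
        reader, writer = await asyncio.open_connection(host, port)
        payload = {"precinct": precinct, "tallies": {"A": 10, "B": 7}}
        writer.write((json.dumps(payload) + "\n").encode())
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
        return reply.decode().strip() == f"ACK {precinct}"


async def run_swarm(precincts=PRECINCTS, max_in_flight=MAX_IN_FLIGHT):
    """Start the stub server on a free local port and send every report to it."""
    server = await asyncio.start_server(handle_report, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    limit = asyncio.Semaphore(max_in_flight)
    try:
        results = await asyncio.gather(
            *(send_report(p, host, port, limit) for p in range(1, precincts + 1)),
            return_exceptions=True,
        )
    finally:
        server.close()
    # Count only reports that were positively acknowledged.
    return sum(1 for ok in results if ok is True)


if __name__ == "__main__":
    acknowledged = asyncio.run(run_swarm())
    print(f"{acknowledged} of {PRECINCTS} reports acknowledged")
```

Any report that is dropped, times out, or raises an error shows up as a gap between the acknowledged count and the precinct count, which is the number a load test like this exists to surface.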

The test plan should have also included testing of the deployment of the front-end (user facing) portion of the software. They should have asked at least a few of the precinct staffers to attempt to complete installation of the software.

Ideally, a member of the development team would be present for this, to note where users encounter hesitation or issues. However, we are far from an ideal world. My test plan would have included a simple Skype or FaceTime session with the poll workers, if face-to-face communication had been prohibitive.

These sessions with real-world users can be used to further refine the installation process, and can inform what should be written in documentation to simplify and streamline the experience for the general user population. Then, users should be allowed to input mock test data into the software. This will allow the development team to see any issues with the input routines, and function as an additional real-world test for the back-end portion.

By “installation”, I mean the setup required after the software is installed: for instance, logging in with the unique PIN that reportedly controlled authentication. I am not including the installation of the app software onto the device, which should not have been an issue at all and which is covered in the following section.

Lack of release engineering

Software must be released to be used.

It appears that the developers of this software either did not have the software finished before the Iowa caucus began (requiring them to on-board every user as a beta tester), or they did not intend to have a proper ‘release’ of the software at any time (meaning every user was intended to be a beta tester). I could write a full article on the sad state of software release engineering, but I digress.

The software was distributed to users via a testing system, used for providing pre-release or “beta” versions to testers. Such a system is essential when you have a test plan like the one I described above. It is, however, a bad way to release software for production.

On Apple’s platform, distributing final releases via TestFlight or TestFairy can result in your organisation being permanently banned from accessing any Apple developer material. Not counting the legal (contract law) issues surrounding such a release, on Android this requires your users to enable what is called “side-loading”, or installing software from untrusted third-party repositories.

All of the Iowa caucus precinct workers using the Android OS now have mobile devices configured in a severely vulnerable way, and they have had sideloading normalised as something that could be legitimate. The importance of this cannot be overstated. This is a large security risk, and I am already wondering in the back of my mind how this will affect these same workers if they are involved with the general election in November. The company responsible for telling them to configure their mobile devices in this manner may, and in my opinion should, be liable for any data loss or exploitation that happens to these people.

My release plan document would have involved clearly defined milestones, with allowances for what features would be okay to postpone for later releases. This could include post-Iowa caucus releases, if necessary — the Nevada Democratic Party intended to use this software for their 22nd February caucus. Release planning should include both planned dates and required dates. For example:

  • Alpha release for internal testing. Plan: 6 December. Must: 13 December.
  • Beta release, sent for wider external testing. Plan: 3 January. Must: 10 January.
  • Final release, sent to Apple and Google app stores. Plan: 13 January. Must: 20 January.
  • Iowa Caucus: 3 February (hard).

Such a release plan would have given the respective app stores at least two weeks to approve the app for distribution.
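The two-week figure can be checked with a few lines of date arithmetic. This is only a sketch: the years are assumed from context (December 2019 through February 2020), and the names `MILESTONES`, `review_window`, and `is_consistent` are illustrative rather than part of any real planning tool.

```python
from datetime import date

CAUCUS = date(2020, 2, 3)  # Iowa caucus, the hard deadline

# (milestone, planned date, must-ship date); the years are an assumption
MILESTONES = [
    ("Alpha", date(2019, 12, 6), date(2019, 12, 13)),
    ("Beta", date(2020, 1, 3), date(2020, 1, 10)),
    ("Final", date(2020, 1, 13), date(2020, 1, 20)),
]


def review_window(milestones=MILESTONES, deadline=CAUCUS):
    """Days from the final release to the deadline: (if on plan, if at must)."""
    _, planned, must = milestones[-1]
    return (deadline - planned).days, (deadline - must).days


def is_consistent(milestones=MILESTONES):
    """True if no plan date falls after its must date and milestones stay in order."""
    dates = [d for _, planned, must in milestones for d in (planned, must)]
    return dates == sorted(dates)


if __name__ == "__main__":
    on_plan, worst_case = review_window()
    print(f"Store review window: {on_plan} days on plan, {worst_case} days at worst")
```

Shipping the final release on the planned date leaves 21 days before the caucus, and slipping all the way to the required date still leaves 14.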

Alternatively, if the goal was to avoid deployment to the general app stores of the mobile platforms, they could have used “business-internal” deployment solutions. Apple offers the Apple Business Manager; Google offers Managed Google Play. Both of these services are included with their respective developer subscriptions, so there is no additional cost for the development organisation.

Lack of security processes

Authentication control is important in all software, but especially so in election software. This team demonstrated to me a lack of understanding of proper security processes by providing the PIN on the same sheet of paper that would be used on the night of the election for vote tallying.

I would have had the PIN sent to the precinct workers either via email or on a separate sheet which they could keep in their wallet. Ideally, initial log in and authentication would have taken place on the device before the release, with the credentials stored in the secure portion of device storage (Secure Enclave on iPhone, TrustZone on Android). Even if that was not possible, the PIN could still have been provided to users in a more secure manner.

Apparent lack of clearly defined specification

I have a sneaking suspicion that the combination of these failings mirrors that of the many other development organisations who refuse to apply the discipline of engineering to their software projects. They are encouraged by bad stewards of engineering to “Move Fast and Break Things”. They are told by snake-oil peddlers of “process improvement” that formal specification and testing are unnecessary burdens. And this must change.

I’m not alone in this call. Even the Venture Capitalist section of Harvard Business Review admits that this development culture is irresponsible and outdated. Software developers and project managers must be willing to #Disrupt the current industry norm and be willing to Move Moderately and Fix Things.

Cascading failures (or, why I did nothing this weekend)

This is a fun one.

To set the scene and provide information in temporal order, my Talos and WD Black NVMe device have never “gotten along” well. Frequently, the device would fail to train for whatever reason. Calling reboot from a Petitboot shell with fast-reset enabled was enough to fix this, so I didn’t think all that much about it.

Late in the night Thursday (or early in the morning Friday, if you prefer), I was reading a few articles before I went to bed. At 01:36:13, an animal darted in front of a car about two miles away, causing the car to crash into a power pole. This caused a serious power surge as the lines came down onto each other (and the car’s frame).

My office was protected by an APC BX1500G combination battery backup (SPS, not UPS) and surge protection unit. The power surge was severe enough that the unit failed. The lights, the Power Mac G5, and the Talos in my office went immediately dark as the alarm went off, making a continuous noise while “F04” and “See Manual” flashed on the display. This code means “Clamp Short”, which indicates that the varistor that is supposed to arrest surges had become permanently ‘stuck’ in arrest mode.

My first priority was to ensure the integrity of my hardware, so I dug out a spare APC SPS and put the battery from the now-failed one in it. I powered on my Talos and while it seemed to IPL fine, lspci in Petitboot did not show my NVMe device as present. Looking in Hostboot, it now wasn’t even failing to train — the slot may as well have been unoccupied. I tried both a fast-reset and a full system power cycle to no avail.

The next day, I attempted to swap slots, thinking the PCIe slot that the NVMe device was connected to may have been damaged by the power surge. I swapped the NVMe and the sound card. The sound card worked in the slot formerly occupied by the NVMe, but the NVMe still wouldn’t come up in the slot formerly occupied by the sound card. Now came the worrying part: did the M.2 to PCIe adaptor fail, or did the NVMe media itself fail?

I went to two of our local computer stores to buy some parts. I decided that since I was already going to need to do a full disk swap (even if I could recover the data off the NVMe media), it would make sense to add SATA media as well. I had a 256 GB SATA SSD lying around that was supposed to be put in my Xeon until it failed. I found an “open box” Seagate 1 TB HDD for 30 USD at DISC Surplus Computers in Sand Springs. And I found a new 4-port Marvell chipset based SATA controller for 34 USD at Wholesale Computer Supply in Tulsa. I also had a 3-port USB 3.0 controller card sitting on a shelf since the slot it was meant to go in was occupied by a 2-slot Radeon that was used for big endian amdgpu.ko porting. I went ahead and shelved that Radeon (and the big endian porting effort, for now) and used the CPU 1 slot for SATA and the CPU 2 slot for the USB card.

I then turned my attention to recovering the NVMe media. I brought out my old Intel Reference Platform board, a DP43TF with developer firmware, and put the NVMe adaptor card with media into it. Unfortunately, the DP43TF only has one PCIe slot larger than x1, and it’s the x16 typically used for a GPU. Since NVMe is x4, I had to find a PCI GPU. I pulled a GeForce 8400GS out of one of our Pentium 4 test boxes and attempted to boot the Adélie 1.0-BETA4 live CD.

Our Live CD does not support the JMicron PATA controller that the DP43TF’s DVD drive was connected to. I ended up using a USB optical drive, but I could also have used a SATA one. The CD I was attempting to use was scratched, and it refused to finish booting (the scratched section appears to have contained OpenRC). I had to find a computer capable of burning media, which was no small task since most newer computers don’t support writing optical media and most of my computers have marginal USB support at best.

One of our community members reminded me that the PowerBook G4 has a SuperDrive, which I used to burn a fresh x86_64 BETA4 CD. Finally booted, I noticed the NVMe was present but throwing occasional controller reset errors. I’m not sure if this was due to media degradation or the fact it was a Gen3 NVMe in a Gen1 PCIe slot. At any rate, I used dd to make a full clone of the NVMe to the 1 TB Seagate disk, and then put that in the Talos. A gracious member of the Adélie Linux community donated the funds needed to replace the NVMe with a Samsung 970 EVO Pro of the same size.

Yesterday I copied the data off the Seagate 1 TB to the new Samsung NVMe. Everything is working quite well, and the Samsung is much faster than the Western Digital; 712 MB/s uncached write vs 303 MB/s on the WD. The additional space on the Seagate can be used for further testing, and for possible expansion of Adélie to more platforms — I may post more on that later. 🙂

This was a very interesting experience for me. It’s been many years since I’ve seen a cascade of failures like this: car accident breaks APC SPS, which breaks NVMe marginally, which shows an issue booting our live CD on a specific computer. It also gave me a reason to re-catalogue a lot of the hardware I have on hand for testing purposes, and to know what needs fixing and replacing. And most importantly, it made me realise I need to perform weekly backups instead of semi-annual backups.

I want to especially thank the members of the Adélie Linux community that helped with this process, not only financially but with techniques and ideas to make this go well. My workstation is better than ever, and now I can get even more done for libre software. You rock!

Libre software funding and market abuse

I’ve just read a troubling article from the developer of Aether.

What troubles me is not so much the differences we have, which likely stem from being in vastly different segments of libre software (he’s doing social media, and I’m in low-level systems). What troubles me is that he claims that it is an economic imperative to work at FAANG or a Silicon Valley startup for a number of years before working on libre software full time, and all of this on a false pretense.

Encouraging someone to have a long-term savings and funding plan is a good idea, perhaps even a great idea. It falls apart when he states that working for startups or FAANG is the only or best way you can earn that money, and then claims that you could make 250,000 USD per month working at them[1]. This is flawed mathematics at best, and actively malicious to society at worst.

Most people are going to have to work at a company before founding their own, unless they have funding from external sources (be it angel investors, VC, inheritance, family and friends, etc). This is not what I take issue with. The issue I have is the false dichotomy that you can only make good money by working at FAANG or an abusive startup. As someone who has actually worked at two different startups in my life, I take personal issue with the way startup culture exploits its workers, investors, and society at large. This doesn’t even go into how startup culture can also be bad for business.

This abuse is ingrained into most, if not all, of the industry of Big Tech, à la FAANG. You might be able to wrestle some division of Apple, or the security research division of Netflix, out of this hole and point to them as an example of where I’m wrong. Oh, dear reader: even if you have the privilege of working in an area of the company that isn’t abusing its workers, you’re still complicit in that abuse by furthering the company’s mission and control over some part of the industry at best, and indirectly engaging in it yourself at worst.

It’s time for the computer industry to rise up and work at companies that respect their workers, and society. Quit FAANG like a bad habit, and find a company to work for that doesn’t trade in the abuse of power and users as its main product. And where those don’t yet exist, it’s time to found some. At the end of the day, we are all defined by the actions we take — which side of history do you want to be on?


[1]: And I quote, “If you can make $10,000 a month from donations doing open source work, I can guarantee you that your salary in any large tech company (or even startup) would be much more — to the tune of 10x to 25x.” The firm Indeed claim, at time of writing, that the highest paid research engineers at Google make about 246,000 USD per year; other companies pay even less. That’s 20,500 USD per month, or just about twice the amount he claims you might be able to make on donations doing ‘open source work’. And this doesn’t require you to further Google’s surveillance state.
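For anyone who wants to check the footnote’s arithmetic, here is a short sketch using only the figures quoted above. The variable names are illustrative.

```python
DONATIONS_PER_MONTH = 10_000          # USD, from the quoted article
CLAIMED_MULTIPLIERS = (10, 25)        # the quoted "10x to 25x"
TOP_GOOGLE_SALARY_PER_YEAR = 246_000  # USD, the Indeed figure cited above

# What the quoted claim implies per month, at each end of its range.
claimed_per_month = [DONATIONS_PER_MONTH * m for m in CLAIMED_MULTIPLIERS]

# What the cited salary actually works out to per month.
actual_per_month = TOP_GOOGLE_SALARY_PER_YEAR / 12
actual_multiplier = actual_per_month / DONATIONS_PER_MONTH

print(claimed_per_month)   # [100000, 250000]
print(actual_per_month)    # 20500.0
print(actual_multiplier)   # 2.05
```

The claim implies 100,000 to 250,000 USD per month, while the cited salary works out to roughly 2x the donation figure, not 10x to 25x.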

Mozilla finally disavows Discord

mhoye’s new blog post on the future of Mozilla community chat came out last week. He notes about Discord that “their active hostility towards interoperability and alternative clients has disqualified them as a community platform.”

I am very thankful that the Mozilla brass have realised this, as I pointed out in an earlier installment. Kudos to them that three of their four options are fully libre — I sincerely hope they choose one of those three, and keep the Mozilla community libre and open.