The complexities of enabling OpenCL support

Hello, and welcome back to FOSS Fridays! One of the final preparations for the release of Adélie Linux 1.0-beta6 has been updating the graphical stack to support Wayland and the latest advancements in Linux graphics. This includes updating Mesa 3D. It’s been quite exciting doing the enablement work and seeing Wayfire and Sway running on a wide variety of computers in the Lab. And as part of our effort towards enabling Wayland everywhere, we have added a lot of support for Vulkan and SPIR-V. This has the side-effect of allowing us to build Mesa’s OpenCL support as well.

With Adélie Linux, as with every project we work on at WTI, we are proud to do our very best to offer the same feature set across all of our supported platforms. When we enable a feature, we work hard to enable it everywhere. Since OpenCL is now force-enabled by the Intel “Iris” Gallium driver, in addition to the Vulkan driver, I set off to ensure it was enabled everywhere.

Once again with LLVM

Mesa’s OpenCL support requires libclc, which is an LLVM plugin for OpenCL. This library in turn requires SPIRV-LLVM-Translator, which allows one to translate between LLVM IR and SPIR-V binaries. The Translator uses the familiar Lit test suite, like other components of LLVM. Unfortunately, there were a significant number of test failures on 64-bit PowerPC (big endian):

Total Discovered Tests: 821
  Passed           : 303 (36.91%)
  Failed           : 510 (62.12%)

Further down the rabbit hole

Digging in, I noticed that we had actually skipped tests in glslang because it had one test failure on big endian platforms. And while digging into that failure, I found that the base SPIRV-Tools package was not handling cross-endian binaries correctly. That is, when run on a big endian system, it would fail to validate and disassemble little endian binaries, and when run on a little endian system, it would fail to validate and disassemble big endian binaries.

I found an outstanding merge request from a year ago against SPIRV-Tools which claimed to fix some endian issues. Applying that to our SPIRV-Tools package allowed those tools to function correctly for me. I then turned back to glslang and determined that the standalone remap tool was simply reading in the binary and assuming it would always match the host’s endianness. I’ve written a patch and submitted it in my issue report, but I am not happy with it and hope to improve it before opening a merge request.

Back to the familiar

I began looking around at the failures in SPIRV-LLVM-Translator and noticed that a lot of them seemed to revolve around unsafe assumptions on endianness. There were a lot of errors of the form:

error: line 11: Invalid extended instruction import ‘nepOs.LC’

Note that this string is actually ‘OpenCL.s’, byte-swapped on a 32-bit stride. Consulting the specification, I found that it defines strings as:

A string is interpreted as a nul-terminated stream of characters. All string comparisons are case sensitive. The character set is Unicode in the UTF-8 encoding scheme. The UTF-8 octets (8-bit bytes) are packed four per word, following the little-endian convention (i.e., the first octet is in the lowest-order 8 bits of the word).
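To make the packing rule concrete, here is a minimal sketch of it in C++. The helper names (packString, bigEndianMemoryImage, checkOpenCLPrefix) are made up for illustration and are not taken from the Translator or SPIRV-Tools; the sketch only shows why a big endian machine holding correctly packed words would have ‘nepOs.LC’ in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack a string into 32-bit words as the quoted passage describes:
// four octets per word, first octet in the lowest-order 8 bits.
// Room is left for the terminating nul, and the last word is zero-padded.
std::vector<uint32_t> packString(const std::string &text) {
  std::vector<uint32_t> words((text.size() + 1 + 3) / 4, 0);
  for (std::size_t i = 0; i < text.size(); ++i) {
    const uint32_t octet = static_cast<unsigned char>(text[i]);
    words[i / 4] |= octet << (8 * (i % 4));
  }
  return words;
}

// Produce the bytes a big endian machine would hold in memory for these
// words (highest-order octet of each word first). Illustration only.
std::string bigEndianMemoryImage(const std::vector<uint32_t> &words) {
  std::string image;
  for (uint32_t word : words) {
    for (int shift = 24; shift >= 0; shift -= 8) {
      image.push_back(static_cast<char>((word >> shift) & 0xFF));
    }
  }
  return image;
}

// Self-check mirroring the string seen in the error message.
void checkOpenCLPrefix() {
  const std::string image = bigEndianMemoryImage(packString("OpenCL.s"));
  assert(image.substr(0, 8) == "nepOs.LC");
}
```

Because the packing is done with shifts on values rather than by copying bytes, packString gives the same words on any host; only the in-memory image differs between architectures.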

As an aside, I want to express my deep and earnest gratitude that standards bodies are still paying attention to endianness and ensuring their standards will work on the widest number of platforms. This is a very good thing, and the entire community of software engineering is better for it. This also serves as a great example of the wide scope in which we, as those engineers responsible for portability and maintainability, need to be aware and pay attention.

The standards language quoted above means that on a big endian system, the string ‘OpenCL.s’ should actually be written to disk/memory as ‘nepOs.LC’. The Translator was not doing this encoding and therefore the binaries were not correct. I attempted to look at how the strings were serialised, and I believe I found the answer in lib/SPIRV/SPIRVStream.cpp, but it seemed like it would be a challenge to do things the correct way. I decided that for the moment, it would be enough to make the translator operate only on little endian SPIR-V files. After massaging the SPIRVStream.h file to swap when running on a big endian system, I significantly reduced the count of failing tests:

Total Discovered Tests: 821
  Passed           : 500 (60.90%)
  Failed           : 313 (38.12%)
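The "little endian files only" approach described above can be sketched without any byte-order conditionals at all. This is not the change made to SPIRVStream.h; it is a stand-alone illustration with hypothetical helper names (readWordLE, writeWordLE, decodeWordsLE), assuming the input is a raw byte buffer holding little endian 32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one 32-bit word from four bytes stored in little endian order.
// Shifts act on values, not on memory, so the result is host-independent.
uint32_t readWordLE(const uint8_t *bytes) {
  return (static_cast<uint32_t>(bytes[0])) |
         (static_cast<uint32_t>(bytes[1]) << 8) |
         (static_cast<uint32_t>(bytes[2]) << 16) |
         (static_cast<uint32_t>(bytes[3]) << 24);
}

// Encode one 32-bit word as four bytes in little endian order.
void writeWordLE(uint32_t word, uint8_t *bytes) {
  for (int i = 0; i < 4; ++i) {
    bytes[i] = static_cast<uint8_t>((word >> (8 * i)) & 0xFF);
  }
}

// Decode a whole buffer of little endian words into host integers.
std::vector<uint32_t> decodeWordsLE(const std::vector<uint8_t> &raw) {
  assert(raw.size() % 4 == 0 && "expected a whole number of 32-bit words");
  std::vector<uint32_t> words;
  words.reserve(raw.size() / 4);
  for (std::size_t i = 0; i < raw.size(); i += 4) {
    words.push_back(readWordLE(raw.data() + i));
  }
  return words;
}
```

For example, the bytes 03 02 23 07 decode to 0x07230203, the SPIR-V magic number, on both little endian and big endian hosts.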

However, now we had some interesting looking errors:

test/DebugInfo/X86/static_member_array.ll:50:10: error: CHECK: expected string not found in input
; CHECK: DW_AT_count {{.*}} (0x04)
         ^
<stdin>:73:33: note: scanning from here
0x00000096: DW_TAG_subrange_type [10] (0x00000091)
                                ^
<stdin>:76:2: note: possible intended match here
 DW_AT_count [DW_FORM_data8] (0x0000000400000000)
 ^

You will note that the failure is that 0x04 != 0x04’0000’0000. This is what happens if you store a 32-bit value into the start of a 64-bit object using bad casting: on a big endian system, those four bytes are the high-order half. This is very similar to the Clang bug I found and fixed last month. On a hunch, I decided to look at all of the reinterpret_casts used in the translator’s code base, and I hit something that seemed promising. SPIRVValue::getValue, where they were doing equally questionable things with pointers, was amenable to a quick change:

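#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    if (ValueSize > CopyBytes) {
      if (ValueSize == 8 && CopyBytes == 4) {
        // On big endian, the low-order half of a 64-bit value is its last four bytes.
        uint8_t *foo = reinterpret_cast<uint8_t *>(&TheValue);
        foo += 4;
        // Write through foo into the low-order half of TheValue.
        std::memcpy(foo, Words.data(), CopyBytes);
        return TheValue;
      }
      // Any other size mismatch is not handled here.
      assert(ValueSize == CopyBytes && "Oh no");
    }
#endif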
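The failure mode behind that DW_AT_count mismatch can be reproduced on any machine by simulating big endian memory explicitly. The following is a sketch with made-up helper names (loadBigEndian64, widenAtOffset, widenPortably), not code from the translator. It also shows a host-independent alternative to the offset trick: copy into a temporary of the same width, then widen by value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bytes8 = std::array<uint8_t, 8>;

// Interpret eight bytes the way a big endian CPU loads a 64-bit integer:
// the first byte in memory is the highest-order byte.
uint64_t loadBigEndian64(const Bytes8 &memory) {
  uint64_t value = 0;
  for (uint8_t byte : memory) {
    value = (value << 8) | byte;
  }
  return value;
}

// Copy a 4-byte big endian value into a zeroed 8-byte object at the given
// byte offset, then read the whole object back as a 64-bit integer.
uint64_t widenAtOffset(const std::array<uint8_t, 4> &narrow, std::size_t offset) {
  assert(offset + narrow.size() <= 8);
  Bytes8 memory{};  // zero-initialised
  std::memcpy(memory.data() + offset, narrow.data(), narrow.size());
  return loadBigEndian64(memory);
}

// Host-independent widening: copy into a same-width temporary first, then
// let the language convert the value. No byte-order conditionals needed.
uint64_t widenPortably(const uint32_t *word) {
  uint32_t narrow;
  std::memcpy(&narrow, word, sizeof narrow);
  return narrow;
}
```

With the 32-bit value 4 (bytes 00 00 00 04 on big endian), offset 0 yields 0x0000000400000000, matching the failing test output, while offset 4 yields 4.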

Unfortunately, this only fixed eight tests. The last time there was an endian bug in an LLVM component, it was somewhat easier to find because I could bisect the error. There was an earlier version in which it had worked, so bisecting revealed the change where the error was hiding. Not so with this issue: it seems to have been present since the translator was first written.

Upstream was apprised of this issue nearly a year ago, with no movement since. The amount of time it would take to do a proper root cause analysis on a codebase of this size (65,000+ lines of LLVM plugin style C++) would be prohibitive for the time constraints on beta6. I’d like to work on it, but this is a low priority unless someone wants to contract us to work on it.

The true impact of non-portable code

There is no way to conditionalise a dependency on architecture in an abuild package recipe. This means even if we condition enabling OpenCL support in Mesa on x86_64, Mesa on all other platforms would still pull the broken translator as a build-time dependency. It is likely that libclc itself would not build properly with a broken translator, meaning Mesa’s dependency graph would be incomplete on every other architecture due to libclc as well.

Unfortunately, this means I had to make the difficult decision to disable OpenCL globally, including on x86_64. We never had OpenCL support, so this isn’t a great loss for our users, and we can always come back to it later. However, a few of Mesa’s drivers require OpenCL: namely, the Intel Vulkan driver, and the Intel “Iris” Gallium driver, which supports Broadwell and newer. On these systems, using the iGPU will fall back to software rendering only. We cannot offer hardware acceleration on these systems until we enable OpenCL, for reasons that are not entirely clear to me. It was possible before, and while I understand the performance enhancements with writing shaders this way, not providing a fallback option is really binding us here.

It is my sincere hope that software authors, both corporate and individual, begin to realise the importance of portability and maintainability. If the SPIRV-LLVM-Translator had been written with more portability in mind, this wouldn’t be an issue. If the Intel driver in Mesa had been written with fewer dependencies in mind, this wouldn’t be an issue. The future is bright if we can all work together to create good, clean code. I greatly look forward to that future.

Until then, at least software rendering should work on Broadwell and newer, right?

Target.

On Tuesday, I did something that I have not done for a long while: I stepped into a Walmart.

For as long as I can remember, my go-to retail establishment for groceries, clothing, and various other essentials has been Target.  Last year, during Pride month, I was elated to see a Pride department with merchandise for sale at the front of the store.  I bought a few towels and some clothes, including a shirt I plan to wear to Pride this year.

This year, I was dismayed to read that anti-trans bigots have taken a “bullseye” to Target stores, including smashing products and damaging signage. I was even more disheartened to see that Target’s leadership has let the terrorists win by pulling “some” Pride merchandise from shelves and relocating the rest to the back of the store, showing them a clear sign that violence is the answer. This is a bald-faced rewarding of hate, embracing the worst impulses of society and giving them comfort.

Furthermore, despite these threats being made against specific store locations, this move of erasing Pride merchandise has been done at all Target stores nationwide and on their Web storefront as well.

To add insult to injury, the explanation given was that they are concerned for the “safety” of the employees. Of course, everyone has an unalienable right to a safe workplace, but the implications are disturbing.

Why is a multi-billion-dollar enterprise not using their own security team – one of the most notorious asset protection organisations in the American retail industry – to protect their workers from angry Karens throwing merchandise around? Why weren’t the police involved? Why weren’t these threats of violence responded to with legal action, including prosecution of trespassing and vandalism?

A further-reaching implication is that, by their own admission, their stores are not safe for trans people. If the employees are being attacked for the simple fact they sell Pride merchandise, I cannot imagine the reactions of these same people if they see me wearing a “Trans Rights are Human Rights” shirt.

Their caving to this form of pressure is disturbing on its own, but in the culture that we live in, safety is paramount. Target seem to think that their stores across the nation are no longer safe spaces for trans people. It is for these reasons that I am considering no longer shopping at Target in the future.

Reckless Software Development Must End

On the 6th of November, 2019, I made a comment on Twitter:

Okay, so today’s news isn’t as dramatic as Uber killing a homeless woman by not programming in the fact that pedestrians might not use crosswalks, but it is based in the same mode of thought.

Today’s news is that the US state of Iowa has had issues with their election processes (processes that are a bit too complex for me to provide you an overview in this blog). The problem boils down to reckless abandon of software engineering principles.

As reported in the New York Times and The Verge, in addition to many other outlets, there were a number of failings in the development and deployment of this software package that would have been trivial to prevent.

My personal belief is that the following issues significantly contributed to the failure we have seen.

No test plan

There was no well-defined plan of testing.

The test plan should have covered testing of the back-end (server) portion of the software, including synthetic load testing. My test plan would have included a swarm of all 1600+ precincts reporting all possible data at the same time, using a pool of a few inexpensive systems running multi-connection clients.

The test plan should have also included testing of the deployment of the front-end (user facing) portion of the software. They should have asked at least a few of the precinct staffers to attempt to complete installation of the software.

Ideally, a member of the development team would be present for this, to note where users encounter hesitation or issues. However, we are far from an ideal world. My test plan would have included a simple Skype or FaceTime session with the poll workers, if face-to-face communication would have been prohibitive.

These sessions with real-world users can be used to further refine the installation process, and can inform what should be written in documentation to simplify and streamline the experience for the general user population. Then, users should be allowed to input mock test data into the software. This will allow the development team to see any issues with the input routines, and function as an additional real-world test for the back-end portion.

By “installation”, I mean the set up required after the software is installed. For instance, logging in with the unique PIN that reportedly controlled authentication. I am not including the installation of the app software onto the device, which should not have been an issue at all — and which is covered in the following section.

Lack of release engineering

Software must be released to be used.

It appears that the developers of this software either did not have the software finished before the Iowa caucus began (requiring them to on-board every user as a beta tester), or they did not intend to have a proper ‘release’ of the software at any time (meaning every user was intended to be a beta tester). I could write a full article on the sad state of software release engineering, but I digress.

The software was distributed to users via a testing system, used for providing pre-release or “beta” versions to testers. This is an essential system to use when you have a test plan like what I described above. This is, however, a bad idea to use for releasing software for production.

On Apple’s platform, distributing final releases via TestFlight or TestFairy can result in your organisation being permanently banned from accessing any Apple developer material. Not counting the legal (contract law) issues surrounding such a release, on Android this requires your users to enable what is called “side-loading”, or installing software from untrusted third-party repositories.

All of the Iowa caucus precinct workers using the Android OS now have mobile devices configured in a severely vulnerable way, and they have had sideloading normalised as something that could be legitimate. The importance of this cannot be overstated. This is a large security risk, and I am already wondering in the back of my mind how this will affect these same workers if they are involved with the general election in November. The company responsible for telling them to configure their mobile devices in this manner may, and in my opinion should, be liable for any data loss or exploitation that happens to these people.

My release plan document would have involved clearly defined milestones, with allowances for what features would be okay to postpone for later releases. This could include post-Iowa caucus releases, if necessary — the Nevada Democratic Party intended to use this software for their 22nd February caucus. Release planning should include both planned dates and required dates. For example:

  • Alpha release for internal testing. Plan: 6 December. Must: 13 December.
  • Beta release, sent for wider external testing. Plan: 3 January. Must: 10 January.
  • Final release, sent to Apple and Google app stores. Plan: 13 January. Must: 20 January.
  • Iowa Caucus: 3 February (hard).

Such a release plan would have given the respective app stores at least two weeks to approve the app for distribution.

Alternatively, if the goal was to avoid deployment to the general app stores of the mobile platforms, they could have used “business-internal” deployment solutions. Apple offers the Apple Business Manager; Google offers Managed Google Play. Both of these services are included with their respective developer subscriptions, so there is no additional cost for the development organisation.

Lack of security processes

Authentication control is important in all software, but especially so in election software. This team demonstrated to me a lack of understanding of proper security processes by providing the PIN on the same sheet of paper that would be used on the night of the election for vote tallying.

I would have had the PIN sent to the precinct workers either via email or on a separate sheet which they could have in their wallet. Ideally, initial login and authentication would have taken place on the device before the release, with the credentials stored in the secure portion of device storage (Secure Enclave on iPhone, TrustZone on Android). However, even if this was not possible, it was still possible to provide the PIN to users in a more secure manner.

Apparent lack of clearly defined specification

I have a sneaking suspicion that the combination of these failings mirrors what happens at the many other development organisations who refuse to apply the discipline of engineering to their software projects. They are encouraged by bad stewards of engineering to “Move Fast and Break Things”. They are told by snake-oil peddlers of “process improvement” that formal specification and testing are unnecessary burdens. And this must change.

I’m not alone in this call. Even the Venture Capitalist section of Harvard Business Review admits that this development culture is irresponsible and outdated. Software developers and project managers must be willing to #Disrupt the current industry norm and be willing to Move Moderately and Fix Things.

Libre software funding and market abuse

I’ve just read a troubling article from the developer of Aether.

What troubles me is not so much the differences we have, which likely stem from being in vastly different segments of libre software (he’s doing social media, and I’m in low-level systems). What troubles me is that he claims that it is an economic imperative to work at FAANG or a Silicon Valley startup for a number of years before working on libre software full time, and all of this on a false pretense.

Encouraging someone to have a long-term savings and funding plan is a good idea, perhaps even a great idea. It falls apart when he states that working for startups or FAANG is the only or best way you can earn that money, and then claims that you could make 250,000 USD per month working at them[1]. This is flawed mathematics at best, and actively malicious to society at worst.

Most people are going to have to work at a company before founding their own, unless they have funding from external sources (be it angel investors, VC, inheritance, family and friends, etc). This is not what I take issue with. The issue I have is the false dichotomy that you can only make good money by working at FAANG or an abusive startup. As someone who actually has worked at two different startups in their life, I take personal issue with the way startup culture exploits its workers, investors, and society at large. This doesn’t even go into how startup culture can also be bad for business.

This abuse is ingrained into most, if not all, of the industry of Big Tech, à la FAANG. You might be able to wrestle some division of Apple, or the security research division of Netflix, out of this hole and point to them as an example of where I’m wrong. Oh, dear reader: even if you have the privilege of working in an area of the company that isn’t abusing its workers, you’re still complicit in that abuse by furthering the company’s mission and control over some part of the industry at best, and indirectly engaging in it yourself at worst.

It’s time for the computer industry to rise up and work at companies that respect their workers, and society. Quit FAANG like a bad habit, and find a company to work for that doesn’t trade in the abuse of power and users as its main product. And where those don’t yet exist, it’s time to found some. At the end of the day, we are all defined by the actions we take — which side of history do you want to be on?


[1]: And I quote, “If you can make $10,000 a month from donations doing open source work, I can guarantee you that your salary in any large tech company (or even startup) would be much more — to the tune of 10x to 25x.” The firm Indeed claim, at time of writing, that the highest paid research engineers at Google make about 246,000 USD per year; other companies pay even less. That’s 20,500 USD per month, or just about twice the amount he claims you might be able to make on donations doing ‘open source work’. And this doesn’t require you to further Google’s surveillance state.