Really leaving the Linux desktop behind

I’m excited to start a new chapter of my life tomorrow. I will be starting a new job working at an excellent company with excellent benefits and a comfortable wage.

It also has nothing to do with Linux distributions.

I have asked, and been granted, clearance to work on open source software during my off time. And I do plan on writing libre software. However, I really no longer believe in the dream of the Linux desktop that I set out to create in 2015. And I feel it might be beneficial for everyone if I describe why.

1. Stability.

My goal for the Linux desktop started with stability. Adélie is still dedicated to shipping only LTS releases, and I still feel that is useful. However, it has been made more difficult because Qt has removed LTS from the open source community, plainly admitting they want us to be their beta testers and that paid commercial users are the only ones who deserve stability. This is obviously an antithesis to having a stable libre desktop environment.

Mozilla keeps pushing release cycles closer together, in a desperate attempt to compete with evil G (more on this in the next section). This means that the yearly ESR releases, which Adélie depends on for some modicum of stability, are unfortunately being left behind by whiz bang web developers that don’t understand not everyone wants to run Fx Nightly.

I think that stability may be the point that is the easiest to argue it could still be fixed. You might be able to sway me on that. There are some upstreams finally dedicating themselves to better release engineering. And I’ve been happy to find that even most power users don’t care about running the bleeding edge as long as their computer works correctly.

My overall hope for the future: more libre devs understand the value of stable cycles and release engineering.

My fear for the future: everything is running off Git main forever.

2. Portability.

It’s been harder and harder for me to convince upstreams to support PowerPC, ARM, and other architectures. This even as Microsoft and Apple introduce flagship laptop models based on ARM, and Raptor continues to sell out of their Talos and Blackbird PPC systems.

A significant portion of issues with portability come from Google code. The Go runtime does not support many non-x86 architectures. And the ones it does, it does poorly. PPC support in Golang is 64-bit only and requires a Power8, which is equivalent to an x86 program requiring a Skylake or newer. You could probably get away with it for an end-user application, but no one would, or should, accept that in a systems programming language.

Additionally, the Chromium codebase is not amenable to porting to other architectures. Even when the Talos user community offered a PowerPC port, they rejected it outright. This is in addition to their close ties to glibc which means musl support requires thick patches with thousands and thousands of lines. They won’t accept patches for Skia or WebP for big endian support. They, in general, do not believe in the quality of portability as something desirable.

This would be fine and good since GCC Go works, and we do have Firefox, Otter (which can still use Qt WebKit), and Epiphany for browsers. However, increasingly, important software like KMail is depending on WebEngine, which is a Chromium embedded engine. This means KDE’s email client will not run on anything other than x86_64 and ARMv8, even though the mail client itself is portable.

This also has ramifications of user security and privacy. The Chromium engine regularly has large, high-risk security holes, which means even if you do have a downstream patch set to run on musl or PowerPC, you need to ensure you forward-port as they release. And their release models are insanely paced. They rewrite large portions of the engine with significant, distressing regularity. This makes it unsuitable for tracking in a desktop that requires stability and security, in addition to portability.

And with more and more Qt and KDE apps (IMO, mistakenly) depending on WebEngine, this means more and more other apps are unsuitable for tracking.

My overall hope for the future: more libre devs care about accepting patches for running on non-x86 architectures. The US breaks up Google and kills Chromium for violating antitrust and RICO laws.

My fear for the future: everything is Chrome in the future.

3. The graphics stack.

I’ve made no secret of the fact that my personal opinion is that it would still, even today, be easier to fix X11 than to make Wayland generally acceptable for widespread use. But, let’s put that aside for now. Let’s also put aside the fact that they don’t want to work on making it work on nvidia GPUs, which represent half of the GPU market.

At the behest of one of my friends, who shall remain nameless, I spent part of my December break trying to bring up Wayland on my PowerBook G4. This computer runs KDE Plasma 5.18 (the current LTS release) under X11 with no issues or frameskip. It has a Radeon 9600XT with hardware OpenGL 2.1 support.

It took days to bring up anything on it because wlroots was being excessively difficult with handling the r300 for some reason. Once that was solved, it turned out it was drawing colours wrong. Days of hacking at it revealed that there are likely some issues in Mesa causing this, and that this is likely why Qt Quick requires the Software backend on BE machines.

When I asked the Wayland community for a few pointers at what to look at, since Mesa is far outside of my typical purview of code (graphics code is still intimidating to me, even at 30), I was met with nothing but scorn and criticism.

In addition, I was still unable to find a Wayland compositor that supports framebuffers and/or software mode, which would have removed the need to fix Mesa yet. Framebuffer support would also allow it to run on computers that run LXQt fine, like my Pentium III and iBook G3, both of which have Rage 128 cards that don’t have hardware GL2. This was also met with scorn and criticism.

Why should I bother improving the Wayland ecosystem to support the hardware I care about if they actively work against me, then blame the fact that cards like the S3 Trio64 and Rage128 don’t have DRM2 drivers?

My overall hope for the future: either Wayland compositors supporting more varied kinds of hardware, or X11 being improved and obviating the need for Wayland.

My fear for the future: you need an RX 480 to use a GUI on Linux.

4. Usability.

This is more of an objective point than a subjective one, but the usability of desktop Linux seems to be eternally stuck just below that of other environments. ElementaryOS is closest to fixing this, but there is still much to be desired from my point of view before they’re ready for prime time.

In conclusion.

I still plan to run Linux – likely Adélie – on all servers I use. (My fallback would be Gentoo, even after all these years and disagreements, if you were wondering.)

However, I have been slowly migrating my daily personal life from my Adélie laptop to a Mac running Catalina. And, sad as it is to say, I’ve found myself happier and with more time to do what I want to do.

It is my genuine hope that maybe in a few years, if the Linux ecosystem seems to be learning any of these lessons, maybe I can come back to it and contribute in earnest once again. Until then, it’s system/kernel level work and hacking POSIX conformance into musl for me. The Linux desktop has simply diverged too far from what I need.

Live from Adélie: Streaming Spotify on musl

Over the July 4th holiday weekend, I was working on a secret project. It was a resounding success and I can now announce to the world: Spotify runs on musl distributions!

This article will describe how I went about accomplishing this feat. If you just want to take Spotify for a test drive on your Adélie workstation or Void desktop, scroll to the “Instructions” heading.

Greetz

Thanks to these fine dwellers of IRC for helping make sense of the twisty mazes.

  • [[sroracle]]
  • Aerdan
  • cb
  • dalias
  • skarnet

gcompat 0.4.0: how very cash LC_MONETARY of you

The latest release version of gcompat did not get very far:

awilcox on laptop spotify % ./spotify
Segmentation fault (core dumped)

Inspecting the core file was minimally helpful:

Thread 1 "ld-musl-x86_64." received signal SIGSEGV, Segmentation fault.
0x0000000001d6ff60 in ?? ()
(gdb) bt
#0  0x0000000001d6ff60 in ?? ()
#1  0x00007fffffffd738 in ?? ()
#2  0x0000000001e94f13 in ?? ()
#3  0x00007fffffffd6d0 in ?? ()
#4  0x00007fffffffd738 in ?? ()
#5  0x0000000003e9d691 in ?? ()
#6  0x0000000003e9d698 in ?? ()
#7  0x0000000003e9d691 in ?? ()
#8  0x00007fffffffd738 in ?? ()
#9  0x00007fffffffdc40 in ?? ()
#10 0x0000000001ccd0f0 in ?? ()
#11 0x00007fffffffd7a0 in ?? ()
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffffd720 in ?? ()
#14 0x0000000001e92e92 in ?? ()
#15 0x0000000003e9d691 in ?? ()
#16 0x0000000003e9d698 in ?? ()
#17 0x00007fffffffd738 in ?? ()
#18 0x00007fffffffd738 in ?? ()
#19 0x00007fffffffd760 in ?? ()
#20 0x0000000001e9dd51 in ?? ()
#21 0x00007fffffffdc40 in ?? ()
#22 0x0000000003e9b3e0 in ?? ()
#23 0x00007fffffffd7e8 in ?? ()
#24 0x00007fffffffd7b8 in ?? ()
#25 0x00007fffffffd7b8 in ?? ()
#26 0x00007fffffffd828 in ?? ()
#27 0x00007fffffffd810 in ?? ()
#28 0x0000000001e9df09 in ?? ()
#29 0x612f656d6f682f1a in ?? ()
#30 0x0000786f636c6977 in ?? ()
#31 0x0000000000000000 in ?? ()
(gdb) info registers
rax            0x54454e4f4d5f434c  6072345775086453580
rbx            0x53                83
rcx            0x53                83
rdx            0x2                 2
rsi            0x53                83
rdi            0x3e9b1a0           65647008
rbp            0x7fffffffd6f0      0x7fffffffd6f0
rsp            0x7fffffffd690      0x7fffffffd690
r8             0x0                 0
r9             0x0                 0
r10            0x1                 1
r11            0x7fffffffdb9c      140737488346012
r12            0x7fffffffd6b8      140737488344760
r13            0x7fffffffd6b0      140737488344752
r14            0x7fffffffd6a8      140737488344744
r15            0x7fffffffd6c0      140737488344768
rip            0x1d6ff60           0x1d6ff60
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

What are we trying to do? Looking at symbols present in the Spotify binary, this is actually part of the G++ runtime; specifically, std::ctype::do_tolower:

  1d6ff51:       48 8b 05 18 a8 12 02    mov    0x212a818(%rip),%rax        # 3e9a770 
  1d6ff58:       48 8b 40 70             mov    0x70(%rax),%rax
  1d6ff5c:       48 0f be cb             movsbq %bl,%rcx
=>1d6ff60:       8a 1c 88                mov    (%rax,%rcx,4),%bl
  1d6ff63:       89 d8                   mov    %ebx,%eax
  1d6ff65:       5b                      pop    %rbx
  1d6ff66:       c3                      retq

That rax value looks suspicious, and we can see if we translate it to ASCII that it is the little-endian representation of the string “LC_MONETARY”. We’re trying to reach 0x70 into a structure in %rax for a pointer value, but we’re getting a string instead.
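A triage helper for this kind of crash, as an added example that is not part of the original post: it scans gdb output for 64-bit values that are really ASCII text sitting where a pointer or address was expected. The struct and function names and the four-byte threshold are assumptions; the sample lines are taken from the session above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MIN_RUN 4 /* shorter runs are usually small integers, not text */

struct hit {
    char label[16]; /* first token on the line: register name or frame number */
    uint64_t value;
    char text[9];   /* decoded bytes, NUL-terminated */
};

/* Lines copied from the gdb session above: five registers and two frames. */
static const char SAMPLE[] =
    "rax            0x54454e4f4d5f434c  6072345775086453580\n"
    "rbx            0x53                83\n"
    "rdi            0x3e9b1a0           65647008\n"
    "rsp            0x7fffffffd690      0x7fffffffd690\n"
    "rip            0x1d6ff60           0x1d6ff60\n"
    "#29 0x612f656d6f682f1a in ?? ()\n"
    "#30 0x0000786f636c6977 in ?? ()\n";

/* Read v as little-endian bytes. Returns the text length when v is printable
 * ASCII followed only by NUL padding, otherwise 0. */
static int ascii_run(uint64_t v, char out[9])
{
    int n = 0;
    memset(out, 0, 9);
    for (int i = 0; i < 8; i++) {
        unsigned char b = (unsigned char)(v >> (8 * i));
        if (b == 0) {
            if ((v >> (8 * i)) != 0)
                return 0; /* bytes after the NUL: not a plain string */
            break;
        }
        if (b < 0x20 || b > 0x7e) return 0;
        out[n++] = (char)b;
    }
    return n;
}

/* Take the first 0x... token on each line; keep values that decode to text. */
static int scan_dump(const char *dump, struct hit *hits, int max)
{
    int found = 0;
    for (const char *line = dump; *line && found < max;) {
        size_t len = strcspn(line, "\n");
        char buf[256];
        size_t keep = len < sizeof buf - 1 ? len : sizeof buf - 1;
        memcpy(buf, line, keep);
        buf[keep] = '\0';
        const char *hex = strstr(buf, "0x");
        if (hex) {
            struct hit h;
            h.value = strtoull(hex, NULL, 16);
            if (ascii_run(h.value, h.text) >= MIN_RUN) {
                snprintf(h.label, sizeof h.label, "%.*s", (int)strcspn(buf, " \t"), buf);
                hits[found++] = h;
            }
        }
        line += len + (line[len] == '\n');
    }
    return found;
}

int main(void)
{
    struct hit hits[8];
    int n = scan_dump(SAMPLE, hits, 8);
    for (int i = 0; i < n; i++)
        printf("%-4s 0x%016llx reads as \"%s\"\n", hits[i].label,
               (unsigned long long)hits[i].value, hits[i].text);
}
```

Besides rax, it flags frame #30, whose "address" is text left on the stack rather than a return address.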

It turns out that when libstdc++ is compiled on a glibc system, it will attempt to access the internal __ctype_* members in the locale_t of the current locale. musl’s locale_t is not ABI-compatible with glibc’s. In fact, it is only 48 bytes in length; 0x70 (or 112 bytes) is past the end of the locale object musl has provided it!
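The layout mismatch described above, worked through as an added example (not code from gcompat, libstdc++ or the original post). The two structs mirror the published glibc and musl definitions of struct __locale_struct on an LP64 little-endian machine; the memory image, the stub and all function names are constructed for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Field-for-field mirror of glibc's struct __locale_struct (LP64 assumed). */
struct glibc_locale {
    void *locales[13];              /* per-category data: 13 * 8 = 0x68 bytes */
    const unsigned short *ctype_b;  /* +0x68 */
    const int32_t *ctype_tolower;   /* +0x70, the slot do_tolower loads */
    const int32_t *ctype_toupper;   /* +0x78 */
    const char *names[13];
};

/* Mirror of musl's struct __locale_struct: six category pointers, 48 bytes. */
struct musl_locale {
    const void *cat[6];
};

#define TOLOWER_OFF 0x70

/* The 8 bytes that `mov 0x70(%rax),%rax` loads from whatever rax points at. */
static uint64_t qword_at_0x70(const void *obj)
{
    uint64_t v;
    memcpy(&v, (const unsigned char *)obj + TOLOWER_OFF, sizeof v);
    return v;
}

/* Model of the remaining instructions: sign-extend the char (movsbq), index a
 * table of 4-byte entries, keep the low byte (mov (%rax,%rcx,4),%bl). */
static int lookup_tolower(const void *loc, char c)
{
    const int32_t *table;
    memcpy(&table, (const unsigned char *)loc + TOLOWER_OFF, sizeof table);
    return (unsigned char)table[(signed char)c];
}

/* Constructed image: a zeroed 48-byte musl-sized object followed by unrelated
 * data, arranged so the bytes at +0x70 match the string the page observed. */
static const void *musl_image(void)
{
    static uint64_t image[16];
    memcpy((unsigned char *)image + TOLOWER_OFF, "LC_MONETARY", 11);
    return image;
}

/* Hypothetical glibc-shaped stub: ASCII-only table covering indices -128..255,
 * with the pointer biased by 128 so negative chars stay in bounds. */
static int32_t lower_tab[384];

static void stub_init(struct glibc_locale *loc)
{
    memset(loc, 0, sizeof *loc);
    for (int i = -128; i < 256; i++)
        lower_tab[i + 128] = (i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i;
    loc->ctype_tolower = lower_tab + 128;
}

int main(void)
{
    struct glibc_locale stub;
    stub_init(&stub);
    printf("glibc tolower slot at +0x%zx, musl object is %zu bytes\n",
           offsetof(struct glibc_locale, ctype_tolower), sizeof(struct musl_locale));
    printf("past the musl object: 0x%016llx\n",
           (unsigned long long)qword_at_0x70(musl_image()));
    printf("stub lowers 'S' to '%c'\n", lookup_tolower(&stub, 'S'));
}
```

The character being lowered in the crash was 0x53 ('S', in rbx and rcx), which is why the stub is exercised with it.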

I implemented a stub locale module in gcompat, and… it tried to exec /proc/self/exe, which broke under the gcompat loader. This required me to write a patch interposing the execv* functions to catch this. And suddenly…

The lights that stop me turn to stone

Slight success! We have a Spotify window!

Spotify, but only a white screen

… but a blank white screen only. After some inspecting, I found that one of the many zygotes CEF was forking was segfaulting:

[158358.508029] ThreadPoolForeg[3230]: segfault at 0 ip 0000000000000000 sp 00007fe3203db448 error 14 in spotify[200000+1acd000]
[158365.067313] ThreadPoolForeg[3252]: segfault at 0 ip 0000000000000000 sp 00007f2d69c172e8 error 14 in spotify[200000+1acd000]
[158378.506832] ThreadPoolForeg[3312]: segfault at 0 ip 0000000000000000 sp 00007f52ed7c8448 error 14 in spotify[200000+1acd000]
[158383.654027] ThreadPoolForeg[3339]: segfault at 0 ip 0000000000000000 sp 00007fcb631eb2e8 error 14 in spotify[200000+1acd000]

I replaced libcef.so from the Spotify DEB package with a matched-version libcef.so from Spotify’s Open Source builds page. This allowed me to have more debugging symbols, and generating a core dump revealed:

Core was generated by `ld-linux-x86-64.so.2 --argv0 /usr/share/spotify/spotify --type=utility --field-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000000 in ?? ()
[Current thread is 1 (LWP 12774)]
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007f79a8a3d671 in sqlite3MallocSize () at ../../third_party/sqlite/amalgamation/sqlite3.c:26957
#2  mallocWithAlarm () at ../../third_party/sqlite/amalgamation/sqlite3.c:26891
#3  sqlite3Malloc () at ../../third_party/sqlite/amalgamation/sqlite3.c:26913
#4  0x00007f79a8aff232 in sqlite3MallocZero () at ../../third_party/sqlite/amalgamation/sqlite3.c:27118
#5  pthreadMutexAlloc () at ../../third_party/sqlite/amalgamation/sqlite3.c:25755
#6  0x00007f79a8a4e9b2 in sqlite3MutexAlloc () at ../../third_party/sqlite/amalgamation/sqlite3.c:25298
#7  chrome_sqlite3_initialize () at ../../third_party/sqlite/amalgamation/sqlite3.c:24906
#8  0x00007f79a8a350bd in EnsureSqliteInitialized () at ../../sql/initialization.cc:55
#9  0x00007f79a8a30eb2 in OpenInternal () at ../../sql/database.cc:1357
#10 0x00007f79a8a30dfa in Open () at ../../sql/database.cc:270
#11 0x00007f79a8fb8de6 in InitializeDatabase () at ../../net/extras/sqlite/sqlite_persistent_store_backend_base.cc:99
#12 0x00007f79a8fb9751 in LoadNelPoliciesAndNotifyInBackground () at ../../net/extras/sqlite/sqlite_persistent_reporting_and_nel_store.cc:1041
#13 0x00007f79a5abe25b in Invoke<void (leveldb_proto::ProtoDatabaseSelector::*)(base::OnceCallback), scoped_refptr, base::OnceCallback > () at ../../base/bind_internal.h:498
#14 MakeItSo<void (leveldb_proto::ProtoDatabaseSelector::*)(base::OnceCallback), scoped_refptr, base::OnceCallback > ()
    at ../../base/bind_internal.h:598
#15 RunImpl<void (leveldb_proto::ProtoDatabaseSelector::*)(base::OnceCallback), std::__1::tuple<scoped_refptr, base::OnceCallback >, 0, 1> () at ../../base/bind_internal.h:671
#16 RunOnce () at ../../base/bind_internal.h:640
#17 0x00007f79a7776fa0 in Run () at ../../base/callback.h:98
#18 RunTask () at ../../base/task/common/task_annotator.cc:142
#19 0x00007f79a7792862 in base::internal::TaskTracker::RunBlockShutdown(base::internal::Task*) () at ../../base/task/thread_pool/task_tracker.cc:743
#20 0x00007f79a7792062 in RunTask () at ../../base/task/thread_pool/task_tracker.cc:598
#21 0x00007f79a77d42fb in RunTask () at ../../base/task/thread_pool/task_tracker_posix.cc:23
#22 0x00007f79a7791a43 in RunAndPopNextTask () at ../../base/task/thread_pool/task_tracker.cc:450
#23 0x00007f79a7798386 in RunWorker () at ../../base/task/thread_pool/worker_thread.cc:321
#24 0x00007f79a77980f4 in base::internal::WorkerThread::RunPooledWorker() () at ../../base/task/thread_pool/worker_thread.cc:223
#25 0x00007f79a77d4a05 in ThreadFunc () at ../../base/threading/platform_thread_posix.cc:81
#26 0x00007f79ac9fe2dd in ?? ()
#27 0x00007f79aca799e8 in ?? ()
#28 0x00007f7998247ce0 in ?? ()
#29 0x0000000000000000 in ?? ()
(gdb) frame 1
#1  0x00007f79a8a3d671 in sqlite3MallocSize () at ../../third_party/sqlite/amalgamation/sqlite3.c:26957
26957     return sqlite3GlobalConfig.m.xSize(p);

Inspecting the SQLite3 code, I realised that it was somehow getting a nullptr for the malloc_usable_size pointer. Further inspection revealed that this was not exactly the case:

(gdb) disassemble 0x7f79a77d5520
Dump of assembler code for function malloc_usable_size():
   0x00007f79a77d5520 :     push   %rbp
   0x00007f79a77d5521 :     mov    %rsp,%rbp
   0x00007f79a77d5524 :     mov    %rdi,%rsi
   0x00007f79a77d5527 :     mov    0x484a76a(%rip),%rdi        # 0x7f79ac01fc98 
   0x00007f79a77d552e :    mov    0x28(%rdi),%rax
   0x00007f79a77d5532 :    xor    %edx,%edx
   0x00007f79a77d5534 :    pop    %rbp
   0x00007f79a77d5535 :    jmpq   *%rax
End of assembler dump.

Looking at how the Chromium allocator works internally, the issue is that RTLD_NEXT won’t work on libraries loaded before libcef. And looking at the output of ldd spotify revealed both libm and libdl before libcef; musl always redirects these to libc for glibc ABI compatibility.

Using PatchELF to remove these two DT_NEEDEDs from the binary yielded a surprising result…

Music makes the people come together

Spotify on Adélie Linux
Spotify, playing “Rhinestone Eyes” by Gorillaz, on my Adélie laptop

It works! All the features I tested work: Spotify Connect, which means I can control the laptop’s playback using the iOS and Apple Watch apps; radio playback; Bluetooth speaker support.

Instructions

You will need to download the official Spotify 64-bit DEB. I have not tested this on a 32-bit system yet, but I see no reason it won’t work. Once you have the DEB, extract the data.tar.xz file somewhere. Use PatchELF on the Spotify binary like so:

$ patchelf --remove-needed libm.so.6 usr/share/spotify/spotify
$ patchelf --remove-needed libdl.so.2 usr/share/spotify/spotify

Move the extracted usr/share/spotify directory to your system’s /usr/share directory. For better integration, I moved the /usr/share/spotify/spotify.desktop file to /usr/share/applications. Then move the usr/bin/spotify link to /usr/bin.

Ensure that you have the latest gcompat installed. As I write this, only Adélie has the newest version in the current repo. I’ll be submitting merge requests to the distros I know that ship gcompat this week to ensure everyone has a chance to play around with the new bits.

Have fun!


Do you like running Spotify on musl? Or do you just like reading about fun hacks? Consider donating to Adélie to keep the fun going!

Reckless Software Development Must End

On the 6th of November, 2019, I made a comment on Twitter:

Okay, so today’s news isn’t as dramatic as Uber killing a homeless woman by not programming in the fact that pedestrians might not use crosswalks, but it is based in the same mode of thought.

Today’s news is that the US state of Iowa has had issues with their election processes (processes that are a bit too complex for me to provide you an overview in this blog). The problem boils down to reckless abandon of software engineering principles.

As reported in the New York Times and The Verge, in addition to many other outlets, there were a number of failings in the development and deployment of this software package that would have been trivial to prevent.

My personal belief is that the following issues significantly contributed to the failure we have seen.

No test plan

There was no well-defined plan of testing.

The test plan should have covered testing of the back-end (server) portion of the software, including synthetic load testing. My test plan would have included a swarm of all 1600+ precincts reporting all possible data at the same time, using a pool of a few inexpensive systems running multi-connection clients.

The test plan should have also included testing of the deployment of the front-end (user facing) portion of the software. They should have asked at least a few of the precinct staffers to attempt to complete installation of the software.

Ideally, a member of the development team would be present for this, to note where users encounter hesitation or issues. However, we are far from an ideal world. My test plan would have included a simple Skype or FaceTime session with the poll workers, if face-to-face communication would have been prohibitive.

These sessions with real-world users can be used to further refine the installation process, and can inform what should be written in documentation to simplify and streamline the experience for the general user population. Then, users should be allowed to input mock test data into the software. This will allow the development team to see any issues with the input routines, and function as an additional real-world test for the back-end portion.

By “installation”, I mean the set up required after the software is installed. For instance, logging in with the unique PIN that reportedly controlled authentication. I am not including the installation of the app software onto the device, which should not have been an issue at all — and which is covered in the following section.

Lack of release engineering

Software must be released to be used.

It appears that the developers of this software either did not have the software finished before the Iowa caucus began (requiring them to on-board every user as a beta tester), or they did not intend to have a proper ‘release’ of the software at any time (meaning every user was intended to be a beta tester). I could write a full article on the sad state of software release engineering, but I digress.

The software was distributed to users via a testing system, used for providing pre-release or “beta” versions to testers. This is an essential system to use when you have a test plan like what I described above. This is, however, a bad idea to use for releasing software for production.

On Apple’s platform, distributing final releases via TestFlight or TestFairy can result in your organisation being permanently banned from accessing any Apple developer material. Not counting the legal (contract law) issues surrounding such a release, on Android this requires your users to enable what is called “side-loading”, or installing software from untrusted third-party repositories.

All of the Iowa caucus precinct workers using the Android OS now have mobile devices configured in a severely vulnerable way, and they have had sideloading normalised as something that could be legitimate. The importance of this cannot be understated. This is a large security risk, and I am already wondering in the back of my mind how this will affect these same workers if they are involved with the general election in November. The company responsible for telling them to configure their mobile devices in this manner may, and in my opinion should, be liable for any data loss or exploitation that happens to these people.

My release plan document would have involved clearly defined milestones, with allowances for what features would be okay to postpone for later releases. This could include post-Iowa caucus releases, if necessary — the Nevada Democratic Party intended to use this software for their 22nd February caucus. Release planning should include both planned dates and required dates. For example:

  • Alpha release for internal testing. Plan: 6 December. Must: 13 December.
  • Beta release, sent for wider external testing. Plan: 3 January. Must: 10 January.
  • Final release, sent to Apple and Google app stores. Plan: 13 January. Must: 20 January.
  • Iowa Caucus: 3 February (hard).

Such a release plan would have given the respective app stores at least two weeks to approve the app for distribution.

Alternatively, if the goal was to avoid deployment to the general app stores of the mobile platforms, they could have used “business-internal” deployment solutions. Apple offers the Apple Business Manager; Google offers Managed Google Play. Both of these services are included with their respective developer subscriptions, so there is no additional cost for the development organisation.

Lack of security processes

Authentication control is important in all software, but especially so in election software. This team demonstrated to me a lack of understanding of proper security processes by providing the PIN on the same sheet of paper that would be used on the night of the election for vote tallying.

I would have had the PIN sent to the precinct workers via either email, or using a separate sheet which they could have in their wallet. Ideally, initial log in and authentication would have taken place on the device before the release, with the credentials stored in the secure portion of device storage (Secure Enclave on iPhone, TrustZone on Android). However, even if this is not possible, it was still possible to provide the PIN to users in a more secure manner.

Apparent lack of clearly defined specification

I have a sneaking suspicion that the combination of these failings mirrors the many other development organisations who refuse to apply the discipline of engineering to their software projects. They are encouraged by bad stewards of engineering to “Move Fast and Break Things”. They are encouraged by snake-oil peddlers of “process improvement” that formal specification and testing are unnecessary burdens. And this must change.

I’m not alone in this call. Even the Venture Capitalist section of Harvard Business Review admits that this development culture is irresponsible and outdated. Software developers and project managers must be willing to #Disrupt the current industry norm and be willing to Move Moderately and Fix Things.