It’s pretty easy to annoy me on the internet some days. Post some microbenchmarks, without context, and draw a conclusion based on those without considering all the various details that go into system performance.
One of these things, in the moderate past, has involved Raspberry Pis and 64-bit operating systems, with the claim that of course you want 64-bit OSes on the Pi, especially the Pi4, because look at these little microbenchmarks that show you fancy new hardware features only available in 64-bit (AArch64) mode!
Well… OK, but what if we do something more comprehensive and get a better feel for actual system performance? Of course, you probably know I’m about to do just that…
32-bit vs 64-bit
Fundamentally, the difference between 32-bit and 64-bit processor modes is the register width. Registers are the “core storage” in a processor. On most processors, if you want to add two numbers together, they have to be in registers (or sometimes one has to be in a register and the other register points to a memory address). Results go in registers. Addresses of the data you’d like to load go in registers. Just about everything is in a register.
How big are they? On a 32-bit machine, they’re 32 bits wide. On a 64-bit machine, they’re 64 bits! This works out to being able to handle larger values and virtual address spaces, and generally tends to be nicer to work with.
But there’s more! In the transition to 64-bit, most architectures add registers and clean other things up. For x86 processors, the transition to 64-bit involved going from 8 general purpose registers to 16, and making some of the addressing a bit more flexible (RIP-relative addressing, for one). On ARM, the transition went from about 12 general purpose registers (depending on how you count) to about 30 (and dropped direct PC access). It’s a big jump.
Increasing the number of registers, all other things being equal, means you can keep more data in the highest speed access section of the processor. This tends to improve performance, and a combination of allowing 64-bit values and adding registers tends to mean that 64-bit code runs a good bit faster than 32-bit code!
Plus, as has been shown in some microbenchmarks, there are often new and exciting instructions that do things like AES encryption or SHA hashing in hardware! This is radically faster than software - but does it matter in terms of actual system performance? That’s the open question.
Of course, there’s (usually) no such thing as a free lunch.
Going to 64-bit data types (and 64-bit pointers) doubles the memory footprint for various types of data - and that means you can fit less data in a given amount of memory. You’ve increased the amount of data you can store in registers, but reduced (by up to half) the number of data elements you can fit in a given amount of memory - like the high speed L1/L2 caches on the processor.
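To put rough numbers on that, here is a minimal sketch in Python. The node layout (one 32-bit payload plus two pointers) and the 32 KiB cache size are assumptions picked for illustration, not measurements from either Pi.

```python
import struct

# Hypothetical tree node: one 32-bit payload plus two child pointers.
# "<" selects standard sizes with no padding, so the arithmetic is explicit.
# A real 64-bit compiler would likely pad the second layout even further.
NODE_32 = struct.calcsize("<iII")  # 4-byte payload + two 4-byte pointers
NODE_64 = struct.calcsize("<iQQ")  # 4-byte payload + two 8-byte pointers

CACHE_BYTES = 32 * 1024  # assumed L1 data cache size, illustration only


def nodes_that_fit(node_size: int, cache_bytes: int = CACHE_BYTES) -> int:
    """Return how many whole nodes fit in a cache of the given size."""
    return cache_bytes // node_size


if __name__ == "__main__":
    for label, size in (("32-bit", NODE_32), ("64-bit", NODE_64)):
        print(f"{label}: {size} bytes/node, {nodes_that_fit(size)} nodes in cache")
```

With these assumed sizes, the node grows from 12 to 20 bytes, so the same cache holds roughly 40% fewer of them. Data that is mostly integers or floats, with few pointers, would see much less of a penalty.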
There are also impacts on code density. On both x86 and ARM, 64-bit code tends to be a bit less dense than 32-bit code - which means, again, you can fit less in the various caches. Yes, on ARM the 64-bit instructions are still 32 bits wide, but the encoding drops the nice conditional execution field that 32-bit ARM has on almost everything.
In sum? Performance will vary. And that’s why I’m going through this benchmark suite!
Why not both? Well, if you want to use the x32 ABI, or the ARM equivalent Arm64ilp32Port, yes, you can have most of the benefits of the 64-bit register count without the memory pressure. Unfortunately, nobody outside HPC uses them, and I’m not about to maintain my own x32 distro. Sorry. You might be able to do it with Gentoo.
However, one bit of untruth you’ll hear is that you can’t use the 8GB of RAM in a Pi4 with a 32-bit OS. This is wrong. ARM, like x86, supports ways of using a >32-bit physical address space with a 32-bit OS (LPAE on ARM, PAE on x86). This means that no single process will ever see more than 4GB of virtual address space, but the system can still use all 8GB of physical RAM. Really, it’s fine. The Raspberry Pi Foundation isn’t stupid.
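The arithmetic behind that split between virtual and physical address space is easy to sketch. The 40-bit figure below is the physical address width of ARM's Large Physical Address Extension (LPAE), which is assumed here to be the mechanism the 32-bit Pi4 kernel relies on.

```python
GIB = 1024 ** 3

VIRTUAL_BITS_32 = 32      # per-process virtual address width on a 32-bit OS
LPAE_PHYSICAL_BITS = 40   # physical address width with ARM LPAE (assumed in use)


def addressable_gib(address_bits: int) -> float:
    """Return how many GiB can be addressed with the given number of bits."""
    return (2 ** address_bits) / GIB


if __name__ == "__main__":
    print(f"virtual, per process: {addressable_gib(VIRTUAL_BITS_32):.0f} GiB")
    print(f"physical, whole system: {addressable_gib(LPAE_PHYSICAL_BITS):.0f} GiB")
```

Each process is capped at 4 GiB of address space, but the kernel can hand out different physical pages to different processes, so 8GB of RAM is comfortably within reach.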
Yet More: Userspace vs Kernel
And, just for yet more fun, one can mix and match userspace and kernel bit-ness. You can run a 32-bit userspace (all the application binaries that run) and kernel, a 64-bit userspace and kernel, or a 64-bit kernel and 32-bit userspace. For a variety of quite valid technical reasons, you can’t run a 32-bit kernel and 64-bit userspace, though.
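If you are not sure which combination a given install is running, a quick check along these lines should tell you. This is a sketch that relies on the Python interpreter being built for the same bit-ness as the rest of the userspace, which is an assumption (though a reasonable one for a stock distro install).

```python
import platform
import struct


def userspace_bits() -> int:
    """Pointer width of the running interpreter, a proxy for userspace bit-ness."""
    return struct.calcsize("P") * 8


def kernel_machine() -> str:
    """Machine string reported by the kernel, the same value `uname -m` prints."""
    return platform.machine()


def describe() -> str:
    """One-line summary of kernel architecture and userspace pointer width."""
    return f"kernel reports {kernel_machine()}, userspace is {userspace_bits()}-bit"


if __name__ == "__main__":
    print(describe())
```

On a 64-bit kernel with a 32-bit userspace, the expectation is a machine string of `aarch64` alongside a 32-bit pointer width, but treat that as a prediction rather than a tested result.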
I’ve been curious about the performance deltas between these for a while, but I’ve not wanted to do cross-distro comparisons. There are just too many variables to be able to make strong statements one way or another, and I’m not insane or bored enough to do something like custom Gentoo builds on a Raspberry Pi for the purposes of benchmarking.
Recently enough, the Raspberry Pi Foundation relented and released a 64-bit Raspbian. They’ve shipped 64-bit kernel options for 32-bit userspace for a while, but, finally, I can do a proper shootout with a reasonably flat playing field!
I’ve installed the most recent Raspbian full images, then gone about doing a full
apt update && apt dist-upgrade -y.
This works out, at the time of benchmarking, to a 5.10-63 kernel (v8 & v7l), and Chromium 92.0.4515.98 - on both 32-bit and 64-bit. It doesn’t get much more equal than this. I’m using a pair of new Samsung Evo+ cards for benchmarking, and good thermal management.
I’ve also locked the CPU into maximum performance, with
sudo cpufreq-set -g performance - this cranks up the CPUs, so there are no governor-related throttling deltas. Normally, on the Pi, the CPU remains slow until there’s some demand, then it speeds up. This sort of thing interferes with benchmarking, so I’m just disabling it.
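It is worth confirming that the governor actually stuck before starting a long benchmark run. A minimal sketch, assuming the standard Linux cpufreq sysfs layout under `/sys/devices/system/cpu` (the helper names are made up for this example):

```python
from pathlib import Path
from typing import Dict

CPUFREQ_ROOT = Path("/sys/devices/system/cpu")  # standard sysfs location


def read_governors(root: Path = CPUFREQ_ROOT) -> Dict[str, str]:
    """Map each CPU directory name (cpu0, cpu1, ...) to its active governor."""
    governors = {}
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def all_performance(governors: Dict[str, str]) -> bool:
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())


if __name__ == "__main__":
    found = read_governors()
    print(found)
    print("ready to benchmark" if all_performance(found) else "governor not locked")
```

The `root` parameter exists only so the function can be pointed at a test directory; on a real system the default should be what you want.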
Benchmarking: Real World Tests
If you’ve followed along with the random benchmarks on the Pi, you’ll discover that, yes, wow, 64-bit is way faster - for microbenchmarks and HPC-type code. Unfortunately, most people aren’t using Raspberry Pis to sit around all day running microbenchmarks or as nodes in a high performance compute cluster. If you are, well, you probably don’t need to read the rest of the post, do you?
I’m interested in a bit more of the day-to-day performance - disk IO (on SD and a USB3 SSD), browser performance, compiler performance, etc. Stuff I use that’s typically performance bound on a Raspberry Pi. I daily drive a Raspberry Pi 4 as a desktop, so this sort of thing is interesting to me.
For each of the tests, I’m comparing a Pi3 B+ (with thermals properly managed) and a 4GB Pi4 at stock speeds. Why? This is what I use in my office, so this is what I care about - and my benchmarks are, first and foremost, to inform my decisions about platforms I run. Overclocking may change the absolute performance of a system, but it’s unlikely to change the relative performance of 32-bit vs 64-bit code, so it’s not interesting for this comparison.
I’m running the following suite:
- Chromium browser benchmarks: SunSpider, JetStream, Speedometer, Octane
- File IO with iozone3 (both on the SD card and a USB3 SSD)
- A build time test, building iozone3 (have to build to use it, may as well time it!)
- A FLAC to WAV decode and then encode.
The Raspberry Pi 3B+ and the Pi4 behave differently enough that I’m going to address them separately instead of trying to combine charts. The goal is to compare different OSes on each Pi, not to directly compare them to each other.
Raspberry Pi 3B+: Browser Tests
I’ll start with the browser tests. Quite a bit of what people do with computers these days involves browsers.
Starting with the Octane 2.0 benchmark, the Pi3B+ scores somewhat higher with 32-bit code than with 64-bit. Browsers tend to be exceedingly pointer heavy, and doubling the size of the pointers has a very real impact on how much you can fit in the L1 or L2 cache (and the Pis have tiny little caches). The difference isn’t huge, but for this first benchmark, 32-bit wins out. These compute-bound browser benchmarks aren’t going to be making a ton of kernel round trips, so the 32-bit vs 64-bit kernel doesn’t seem to have any impact.
However, when it comes to the newer Speedometer 2.0 benchmark (where higher bars are better), the 64-bit browser pulls ahead! The 64-bit kernel/32-bit userspace combination comes in last here, by a good bit.
There’s no obvious winner in the browser benchmarks - sometimes 32-bit is faster, sometimes 64-bit is faster.
Raspberry Pi 3B+: Memory Bandwidth and Compute
One of my favorite little benchmark tools for comparing systems is mbw. It’s a memory bandwidth tester, and it’s not trying to be smart. It just tries to copy data around from one buffer to another, as fast as possible, using various techniques that standard code will actually use. Some people, including myself, would argue that for general purpose code, memory bandwidth is king - the fastest CPU cores will be starved with slow RAM, and a fairly slow clocked CPU paired with exceedingly fast memory will actually run workloads quickly. The M1 sure seems to argue for this…
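To give a feel for what a tool like this measures, here is a rough sketch of a buffer-to-buffer copy test in Python. It is not mbw and will not reproduce its numbers (interpreter overhead and a single copy method see to that); the function name and default sizes are made up for this example.

```python
import time


def copy_bandwidth_mib_s(size_mib: int = 64, repeats: int = 5) -> float:
    """Return the best observed buffer-to-buffer copy rate in MiB/s."""
    src = bytearray(size_mib * 1024 * 1024)
    dst = bytearray(len(src))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: one bulk copy in CPython
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer on tiny buffers.
    return size_mib / max(best, 1e-9)


if __name__ == "__main__":
    print(f"{copy_bandwidth_mib_s():.0f} MiB/s")
```

Taking the best of several runs, rather than the average, is a deliberate choice here: it filters out runs that were disturbed by other activity on the system. The buffer should be much larger than the caches, or you end up measuring cache bandwidth instead.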
In any case, on the 3B+, the results are clear. The 32-bit code/kernel manages the most effective memory bandwidth, in all configurations. Why does the kernel make a difference? The kernel mode (ARMv7 or AArch64) determines the page table structure being used for user processes, and this will impact just how much can be kept in the TLBs, as well as how fast the table walks are.
The time to build iozone3 comes in the same as a lot of the other tests here. The 32-bit userspace/kernel is fastest, with 64-bit coming in last. This isn’t strictly a fair comparison between 32-bit and 64-bit, as the 64-bit build is building a 64-bit binary, with the 32-bit tests building a 32-bit binary, but since this is the default for how a system builds things, I think it’s a reasonable test.
If you spend a lot of time compressing and decompressing files, the 7zip benchmark mode might be useful to you as well. It attempts to estimate CPU performance on some 7zip workloads, and I’m inclined to say that the workloads are rather pointer heavy, because the 32-bit code and 64-bit code hold even in the compression tests, but 32-bit dominates in the decompression tests.
Finally, I’ve tested some FLAC compression and decompression - I decompress a flac to wav, then compress it back. These are done on the USB SSD, with the flac and wav file in the disk cache, so it’s a compute test. Here, the 64-bit version of the flac tools does pull ahead of the 32-bit version!
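The "files in the disk cache" part matters: without it, the first run pays for disk reads and the comparison gets muddy. A minimal sketch of the warm-then-time pattern, with helper names invented for this example:

```python
import time
from pathlib import Path
from typing import Callable


def warm_page_cache(path: Path, chunk_size: int = 1 << 20) -> int:
    """Read the whole file once so later reads are likely served from RAM.

    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def timed(work: Callable[[], object]) -> float:
    """Run work() once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    work()
    return time.perf_counter() - start
```

Usage would look something like `warm_page_cache(Path("track.flac"))` followed by `timed(lambda: subprocess.run(["flac", "-d", "-f", "track.flac"], check=True))`, where `track.flac` is a placeholder file name. The kernel is free to evict cached pages under memory pressure, so this makes a warm cache likely rather than guaranteed.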
Raspberry Pi 3B+: Disk IO
My final set of tests for the 3B+ are the iozone3 suite of disk tests. First, I’m testing the SD card, using a higher performance SD card I have lying around. On the 3B+ (and all prior Pis), the SD card interface is rather limited in speed, and that’s on full display here. Regardless of the OS, most of the disk numbers are identical (or nearly so). The SD card interface is the limit, not the SD card or the OS. The 4k read numbers are a bit uneven, with the 32-bit OS underperforming in the initial read, but overall, for the SD card, it’s safe to say that there’s no real difference.
With the USB SSD, the story is a familiar one as well. The Pi3B+ only has a USB2 interface - meaning that the SSD is performance limited to the USB bus bandwidth, and we see that here, with the performance coming in consistently around 35-40MB/s in the larger reads. However, here, the 32-bit OS/kernel has a consistent lead, being the best performer in every single test involving the USB SSD, and I’ll call that a significant result!
Raspberry Pi 3B+: Use 32-bit Raspbian!
While it’s not a win in every single test, the 32-bit OS consistently performs well enough on the 3B+ that, unless you’re doing something that specifically requires 64-bit support (at which point you should be using the Pi4 - see below…), I’ll argue that you should use the 32-bit Raspbian on the 3B+. It comes out ahead in enough tests, by far enough, that it seems a pretty clear win to me. While the Pi3 has cores that can support 64-bit modes, it’s clearly optimized for 32-bit code.
But what about the Pi4? It’s a new design, with new cores, and a rather redesigned internal architecture. Does the same apply?
Raspberry Pi 4: Browser Tests
Starting with Octane on the Pi4, there are some slight differences in performance between the 32-bit and 64-bit installs, but nothing to write home about (beyond over twice the performance of the 3B+ - that was scoring around 3300). I could zoom in on the differences and try to make an argument one way or another (by using deceptive graphs - I like my bars starting at 0 for honesty), but the reality here is that for Octane, everything is just about identical across the board.
With SunSpider, the results are again fairly close, but the 64-bit clean OS does edge out the others (yes, it’s easier to tell with this chart layout). Everything is still close, but 64-bit does get the tests done in the least time.
For Speedometer, the results are quite a bit more dramatic. Not only does the 64-bit install perform the best, it performs the best by quite a good margin - showing about a 40% performance boost over the 32-bit install. That’s significant!
The Pi4 will successfully complete the JetStream 2 benchmark as well - the Pi3B+ won’t finish it without the browser crashing due to RAM limits. Here, again, the 64-bit browser has a significant lead in the benchmark results.
Raspberry Pi 4: Memory Bandwidth and Compute
With the Pi4, the mbw results are flipped as well - now, either the 64-bit build performs best, or the results are very close. You’ll note more than double the memory bandwidth of the Pi3B+ here as well. It would be interesting at some point to dive into these results a bit more - if anyone happens to be bored or has done so before, I’d love to know what makes up the difference here!
For building iozone3, the results are roughly the same as on the 3B+ - just a lot faster. The 64-bit build is slower than the 32-bit builds, though by a smaller margin. It would seem the 64-bit compiler is just doing more work.
The FLAC results are, again, a win for 64-bit, though not by a particularly large margin.
And, finally, the 7zip test shows that, yet again, something about decompression really, really likes the density of 32-bit code in memory.
Raspberry Pi 4: Disk IO
And, finally, the disk IO tests. For the SD card, not only is the absolute performance (on the same card) higher (because the Pi4 has a better SD card interface - the interface on the 3B+ is the limit, not the card), there’s a clear pattern - the 64-bit kernel, with either a 32-bit or 64-bit userspace, performs the best. The difference isn’t huge, but it’s there in almost all the tests.
Moving to the SSD, which is now connected via USB3, the results are even clearer - the 64-bit kernel is quite a bit faster at reading the device. So fast, in fact, that it’s hard to see the differences in the other bars!
Which is why I’ve made another version of that chart, without the high performance reads. Here, the results are a bit more varied, though the 64-bit kernel/32-bit userspace has a consistent lead in the 4k block tests. The rest of the results are mostly even across the kernels.
Raspberry Pi 4: Use 64-bit Raspbian!
As with the 3B+, I think the results here are pretty clear: For the Pi4, you really should be using a 64-bit OS if there’s no reason not to. In some cases, there isn’t a huge performance delta between the two, and in others, there’s a really quite commanding performance lead for the 64-bit OS!
Or you could just use the 32-bit OS for maximum compatibility. You’re giving up some browser performance on the Pi4, but it’s still not a dramatic difference in most of the tests.
But if you read this blog? You probably like the bleeding edge - so 64-bit it up and have a blast!