Using distcc to compile the Linux kernel

Introduction

This is a quick guide on how to use distcc to compile the Linux kernel on a small AArch64 (ARM 64-bit) build farm. The hardware consists of inexpensive and compact 64-bit Android TV boxes repurposed as headless Linux cluster nodes, but it could just as well be built from SBCs, probably with better performance. The nodes are connected through an inexpensive Fast Ethernet (100 Mbps) switch on a separate private network behind an x86-64 development workstation.
All nodes are set up with Armbian Ubuntu 18.04.1.

The main issue that distcc attempts to address is that, up until now, compiling large software projects natively on ARM hardware has been a chore, with long wait times compared to cross-compiling on a modern x86-64 workstation. But as can be seen in the examples below, using distcc, a low-cost, low-power-consumption AArch64 four-node build farm with a total of 28 Cortex-A53 cores provides enough computing power to natively compile the Linux kernel in times at least comparable to cross-compiling on x86-64 hardware.

Master (client) and slave (server) nodes

For distcc, the single master (client) is the node that holds the source code and on which the compilation is launched. The slaves (servers) are the nodes to which pieces of code are sent over the network for compilation. distcc itself runs on the master (client) node. The slave (server) nodes run a daemon, distccd, that communicates with the client and accepts the pieces of code to compile, i.e. it provides the compilation service.

Note that any node can act as a master or a slave. So, summarizing:

  • distcc: runs on the master node.
  • distccd: runs on all the slave nodes.

Installation

Required packages (apart from the packages required to natively compile the kernel, see HOWTO compile the Linux kernel for the "La Frite" on the "Le Potato") are installed with:

sudo apt-get install distccmon-gnome ccache distcc-pump dmucs python3-dev libiberty-dev distcc

IMPORTANT: all the nodes must have the same version of gcc installed!
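A quick way to verify this from the master is to query each node over SSH; the host names below (red, green, blue) are simply the example names used later in DISTCC_HOSTS:

for h in localhost red green blue; do echo "== $h"; ssh $h "gcc --version | head -n1"; done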

Preparing the slave nodes

  • Edit the configuration file /etc/default/distcc (see the example below).
  • Start the distccd daemon: sudo service distcc restart
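As a rough sketch, a minimal /etc/default/distcc for this setup might look like the following; the exact set of variables can vary slightly between Debian/Ubuntu releases, and the ALLOWEDNETS subnet is only an example that must match your private build network:

STARTDISTCC="true"                        # let the init script start distccd
ALLOWEDNETS="127.0.0.1 192.168.51.0/24"   # clients allowed to connect (example subnet)
LISTENER=""                               # leave empty to listen on all interfaces
NICE="10"                                 # run compile jobs at reduced priority
JOBS=""                                   # per-node job limit; empty = distccd default
ZEROCONF="false"                          # do not advertise via avahi/zeroconf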
Preparing the master node

export DISTCC_HOSTS='localhost red green blue' (names or IP addresses of the compile nodes; note that localhost is also specified)
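The names red, green and blue above are placeholders for your own node names or IP addresses. If you prefer not to export the variable in every shell, the same list can be kept in ~/.distcc/hosts, which distcc falls back to when DISTCC_HOSTS is not set:

mkdir -p ~/.distcc
echo 'localhost red green blue' > ~/.distcc/hosts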
To start the distributed compilation of the Linux kernel

cd <path to your linux kernel source directory>/linux
make distclean
make defconfig
make menuconfig (and load whatever kernel configuration file you need)
time make -j16 CC="distcc gcc" Image modules dtbs (the -j16 indicates we have a total of 16 cores available for compilation)

Build farm control and monitoring

I use a combination of SSH and Ksysguard running on my Ubuntu Linux development workstation (actually my old and trusty Thinkpad T420) to launch jobs and control the headless nodes of the build farm. The headless nodes only need to run a minimal daemon called ksysguardd and, of course, an SSH server daemon.

[Screenshot: ksysguard-distcc.png]
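Besides Ksysguard, distcc ships its own monitors, distccmon-text and distccmon-gnome (installed above), which show on the master node which slave each compile job is currently running on. For example, to refresh the text monitor every 5 seconds during a build:

distccmon-text 5

distccmon-gnome presents the same information graphically.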

Thermal throttling

Note that Amlogic S9XXXX SoCs have an on-chip temperature sensor that Ksysguard can monitor and graph. This is very useful to check whether the SoCs are hitting a maximum temperature threshold and "throttling" (reducing their clock rates) during compilation jobs. In my case I have observed thermal throttling a few times on both the S905 and S905X TV boxes, but oddly enough not on the S912 TV boxes.
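If Ksysguard is not at hand, the SoC temperature can also be read directly from sysfs on each node; the thermal zone number may differ between boards and kernel versions, so this is only an illustration:

cat /sys/class/thermal/thermal_zone0/temp    # value is in millidegrees Celsius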

Some numbers, comparing times on individual machines vs distributed compilation

Note: in all the examples below, the real time is the time elapsed between the beginning and the end of the compilation job, in other words, the time we are trying to minimize.

Compiling Linux kernel 4.19.7 on a 3-node AArch64 build farm with a total of 16 cores:

Without distcc, using -j4, 4-core S905X TV box
real 66m42.850s
user 244m46.677s
sys 17m54.767s

Without distcc, using -j8, 8-core S912 TV box
real 44m0.087s
user 311m0.300s
sys 23m14.610s

With distcc, using -j16, master node is 4-core S905X TV box
real 29m48.396s
user 90m36.358s
sys 12m43.706s

With distcc, using -j16, changed master node to 8-core S912 TV box
real 24m22.278s
user 58m12.980s
sys 9m31.670s

More numbers and how to optimize distcc performance

Build farm: 3 x S912 TV boxes on a 100 Mbps network, building Linux kernel 4.19.7 with distcc. The distribution is Armbian Ubuntu AArch64 18.04.1 with all updates applied.

1) export DISTCC_HOSTS='192.168.51.20/8,lzo 192.168.51.17/8,lzo localhost/8'

time make -j20 CC="distcc gcc" Image modules dtbs

real 21m6.824s
user 128m28.990s
sys 12m48.700s

2) export DISTCC_HOSTS='192.168.51.20/12,lzo 192.168.51.17/12,lzo localhost/10'

time make -j32 CC="distcc gcc" Image modules dtbs

real 21m24.377s
user 143m16.370s
sys 14m1.740s

Note that adding threads did NOT improve performance.

3) export DISTCC_HOSTS='192.168.51.20/9,lzo 192.168.51.17/9,lzo localhost/6'

time make -j22 CC="distcc gcc" Image modules dtbs

real 20m4.725s
user 122m58.670s
sys 12m20.230s

Here we decreased the number of threads and performance improved somewhat.

4) export DISTCC_HOSTS='192.168.51.20/8,lzo 192.168.51.17/8,lzo localhost/5'

time make -j21 CC="distcc gcc" Image modules dtbs

real 19m55.687s
user 119m19.470s
sys 12m21.470s

Another decrease in the number of threads and another small performance improvement.

5) export DISTCC_HOSTS='192.168.51.20/8,lzo 192.168.51.17/8,lzo 192.168.51.18/4,lzo localhost/4'

(added a 4-core S905 TV box to the build farm; total Cortex-A53 core count = 28)

time make -j24 CC="distcc gcc" Image modules dtbs

real 16m51.692s
user 99m1.500s
sys 10m55.840s

We are onto something here! Decreasing the number of compile threads on the master (localhost) makes the slaves work harder!

6) export DISTCC_HOSTS='--randomize 192.168.51.23/9,lzo 192.168.51.24/9,lzo 192.168.51.18/5,lzo localhost/3'

(again with a total of 28 cores; here I have tried to decrease the number of compile processes on the master node and increase the load on the slave nodes)

time make -j28 CC="distcc gcc" Image modules dtbs

real 15m31.226s
user 89m33.590s
sys 10m34.380s

Best time yet! A few notes on the host specification: ,lzo enables LZO compression of the data sent between the master and slave nodes; /n after a host name or IP address sets the maximum number of jobs dispatched to that host in parallel (the default limit is really low!); --randomize randomizes the order in which the hosts in the list are picked; localhost is last in the list because it runs other tasks besides compilation (e.g. preprocessing and linking).

Debugging

Distcc-pump problems

Note that, unfortunately, distcc-pump cannot be used to compile the Linux kernel; it fails with errors like the following:

...
  CC      arch/arm64/kernel/asm-offsets.s
  HOSTCC  scripts/mod/file2alias.o
  HOSTLD  scripts/dtc/dtc
distcc[1352] ERROR: compile arch/arm64/kernel/asm-offsets.c on 192.168.51.20,cpp,lzo failed
distcc[1352] (dcc_build_somewhere) Warning: remote compilation of 'arch/arm64/kernel/asm-offsets.c' failed, retrying locally
distcc[1352] Warning: failed to distribute arch/arm64/kernel/asm-offsets.c to 192.168.51.20,cpp,lzo, running locally instead
  HOSTLD  scripts/mod/modpost
distcc[1352] (dcc_please_send_email_after_investigation) Warning: remote compilation of 'arch/arm64/kernel/asm-offsets.c' failed, retried locally and got a different result.
distcc[1352] (dcc_please_send_email_after_investigation) Warning: file 'include/generated/autoconf.h', a dependency of arch/arm64/kernel/asm-offsets.c, changed during the build
distcc[1352] (dcc_note_discrepancy) Warning: now using plain distcc, possibly due to inconsistent file system changes during build
...

This can perhaps be fixed by using the hack explained here: https://github.com/distcc/distcc/issues/54

Maximizing parallelism

The idea is to keep all the cores in all the SoCs in the build farm as busy as possible. See this issue on GitHub: https://github.com/distcc/distcc/issues/136
In particular, note that the default number of parallel processes per server is probably set too low for the 8-core S912 nodes; this can be adjusted using the LIMIT field in the host specification of the DISTCC_HOSTS environment variable, as shown below.
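For example, to raise the per-host limit on the two 8-core S912 slaves while keeping localhost modest (the numbers are only a starting point and, as the runs above show, worth benchmarking):

export DISTCC_HOSTS='192.168.51.20/9,lzo 192.168.51.17/9,lzo localhost/6'

Note that the slaves also enforce their own limit: distccd's --jobs option (the JOBS variable in /etc/default/distcc) caps how many jobs a node will accept regardless of the LIMIT field set on the master.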

Duplicate MAC addresses

Check that each node in the build farm has a unique MAC address. Duplicate MAC addresses will cause the most bizarre and difficult-to-understand problems on any network.
See: https://forum.khadas.com/t/duplicate-mac-addresses-and-serial-numbers/313/9
See also: http://forum.loverpi.com/discussion/418/duplicate-mac-addresses
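A quick way to compare the MAC addresses of all slave nodes from the master (again using the example host names from DISTCC_HOSTS; the interface name eth0 may differ on your boards):

for h in red green blue; do echo "== $h"; ssh $h "cat /sys/class/net/eth0/address"; done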

References