Why reproducible builds? To be able to verify that the binary is really derived from the source. The source can be audited, but the binary can’t. Binaries are signed, but only by one person (or buildbot), so a single compromised signer is enough to ship a malicious binary.
A security hole can be really small in the source; in the binary it can be as little as a single bit of difference. And in the case of Bitcoin, a break has real financial impact.
The Tor Browser project actually started with the idea of reproducible builds, and proved that it really is possible. For Bitcoin, it’s also about protecting the developers: if something is wrong, they can prove they didn’t insert it.
For Debian, if the distribution can be made reproducible, that matters a lot for its many users.
How to make builds reproducible?
- Record the environment. This is tracked in a .buildinfo control file that goes with the binary. The build path is recorded as well, because it is otherwise difficult to make the build deterministic.
- Rebuild the package in the same environment. srebuild looks at the buildinfo file and gets these packages from the snapshot archive.
- Eliminate variations. There are two approaches. One is to virtualize: use a VM, use exactly the same VM for the rebuild, and use libfaketime to make the time reproducible. The other, which Debian follows, is to fix the variations instead of papering over them. Work-arounds are only used as a last resort.
- Normalize the files: ar, gzip, jar, javadoc HTML, zip archives.
- When binaries do differ, try to understand why: unpack archives, uncompress PDF, disassemble, … debbindiff does this: it gives a human-readable diff of two binary files.
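The "normalize the files" step can be illustrated with a small sketch. It is not the tooling Debian uses; it only shows, under the assumption that the inputs are given as an in-memory mapping of names to bytes, how an archive can be written so that its bytes depend on the contents alone. The function name `normalized_tar_gz` is made up for this example.

```python
import gzip
import io
import tarfile


def normalized_tar_gz(files, mtime=0):
    """Return tar.gz bytes that depend only on the names and contents in `files`.

    `files` maps member names to bytes. Hypothetical helper for illustration.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):           # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime               # fixed timestamp
            info.mode = 0o644                # fixed permissions
            info.uid = info.gid = 0          # no builder user or group
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))

    out = io.BytesIO()
    # An empty filename and a fixed mtime keep the gzip header free of
    # data taken from the build environment.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb", mtime=mtime) as gz:
        gz.write(tar_buffer.getvalue())
    return out.getvalue()
```

Two calls with the same files in a different insertion order should then give identical bytes, as long as the same Python and zlib versions are used; that last condition is exactly the kind of thing the recorded build environment is meant to pin down.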
jenkins.debian.org has 14 jobs running to reproduce packages. It builds about 1000 packages per day. The issues found are stored in a git repo. To provoke differences, the hostname, user, etc. are all varied between builds. At the moment, 21557 packages are rebuilt, and 82% of these are reproducible. About 2200 of the roughly 4000 that are not reproducible have been investigated.
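The idea behind these test builds, building under deliberately different environments and checking that the output is identical, can be sketched as follows. This is a toy model, not the Jenkins setup: the `build` callables, the environment dictionaries and all names below are assumptions made for illustration.

```python
import hashlib


def sha256_hex(data):
    """Hex digest of a build artifact given as bytes."""
    return hashlib.sha256(data).hexdigest()


def check_reproducible(build, environments):
    """Run build(env) once per environment; True if every digest matches."""
    digests = {sha256_hex(build(env)) for env in environments}
    return len(digests) == 1


# Two toy "builds": one leaks the hostname into its output, one does not.
def leaky_build(env):
    return b"hello from " + env["HOSTNAME"].encode()


def clean_build(env):
    return b"hello"


# Hypothetical variations, in the spirit of varying hostname, user, etc.
ENVIRONMENTS = [
    {"HOSTNAME": "builder-a", "USER": "alice", "TZ": "UTC"},
    {"HOSTNAME": "builder-b", "USER": "bob", "TZ": "Asia/Tokyo"},
]
```

Here `check_reproducible(clean_build, ENVIRONMENTS)` holds while the leaky variant fails, which is the signal that something from the environment ended up in the artifact.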
An experimental “reproducible” toolchain is being created to make sure that the tools can do things reproducibly.
Common issues:
- Timestamps recorded during the build. Basically anything that is generated carries a timestamp, as does anything that stores files (ar, jar).
- Ordering of files.
- Pseudo-randomness that leaks into the binary.
- Build paths. This will not be solved; it is too difficult.
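The first three items in the list above can be shown side by side in a small sketch. The two manifest generators are invented for this example: one picks up a timestamp, the filesystem's directory order and a random identifier, the other records only what is derived from the inputs.

```python
import os
import random
import tempfile
import time


def leaky_manifest(directory):
    """Manifest that picks up all three sources of variation."""
    lines = [f"generated {time.time()}"]                 # timestamp
    lines += os.listdir(directory)                       # arbitrary order
    lines.append(f"id {random.getrandbits(64):016x}")    # randomness
    return "\n".join(lines) + "\n"


def stable_manifest(directory):
    """Manifest that depends only on the names present in the directory."""
    names = sorted(os.listdir(directory))                # fixed order
    return "\n".join(names) + "\n"


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("b.o", "a.o", "c.o"):
        open(os.path.join(d, name), "w").close()
    first, second = stable_manifest(d), stable_manifest(d)
    leaky_a, leaky_b = leaky_manifest(d), leaky_manifest(d)
```

The stable variant simply drops the timestamp and the random identifier; if a generator really needs such values, the notes' advice is to keep them out of the artifact, for instance in the build log.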
Bottom line: don’t record things from the build environment, or make it optional. Put those things in the log file instead.
What needs to be done?
- Investigate the remaining 2000 packages
- Fix known common issues
- Create a tool to display the reproducibility of the packages installed on your system.