License compliance is a problem for many companies. The Linux Foundation sponsors the SPDX project to standardize communication about licenses, and the Yocto Project is adding functionality to create SPDX files.
When you distribute a product that uses open or closed source software from different sources, you have to verify that you’re legally allowed to distribute it, and under what restrictions. It turns out that a package that is supposedly GPL-2.0 (e.g. busybox) actually contains files with other licenses.
In SPDX, you can express which licenses the files in the package have, which license the maintainer declares, and which license your lawyers have concluded applies based on this information. SPDX allows you to create tools that manage and manipulate this information.
[Explanation of what kind of information you find in SPDX files.] A field can have the value NOASSERTION, which is what an automated tool such as fossology would fill in. It means that the information has not been verified.
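To make the three views of a package's license concrete, here is a minimal Python sketch. The class and its attribute names are hypothetical, and the package name and license values are purely illustrative. The tag names follow the SPDX tag-value format, but the output is only a fragment, not a complete or validated SPDX document.

```python
from dataclasses import dataclass, field

NOASSERTION = "NOASSERTION"  # SPDX value meaning "no claim is made here"


@dataclass
class PackageLicenseInfo:
    """Hypothetical holder for the three license views of one package."""
    name: str
    declared: str = NOASSERTION    # what the maintainer states
    concluded: str = NOASSERTION   # what a reviewer (e.g. a lawyer) decided
    from_files: set = field(default_factory=set)  # licenses found in the files

    def to_tag_value(self):
        """Render the license fields as SPDX tag-value lines (fragment only)."""
        lines = [
            f"PackageName: {self.name}",
            f"PackageLicenseDeclared: {self.declared}",
            f"PackageLicenseConcluded: {self.concluded}",
        ]
        # One line per license seen in the files, or NOASSERTION if none recorded.
        for lic in sorted(self.from_files) or [NOASSERTION]:
            lines.append(f"PackageLicenseInfoFromFiles: {lic}")
        return "\n".join(lines)


# Illustrative values only: declared as GPL-2.0, but a scan found more.
info = PackageLicenseInfo("examplepkg", declared="GPL-2.0",
                          from_files={"GPL-2.0", "MIT"})
print(info.to_tag_value())
```

The concluded license stays at NOASSERTION until a human has actually reviewed the package, which matches the idea that a machine-generated file is only a starting point.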
An SPDX file can be produced at three levels of effort. The starting point is a machine-generated file. Better is for a human who understands how the automation works to review and enhance it. Best would be for a human to go through all the files and evaluate them. This talk covers only the machine-generated part, which uses fossology. Fossology uses pattern matching to detect licenses in files. The University of Nebraska, Omaha has extended it with SPDX output, and also runs a website where you can upload files for evaluation. From a CS perspective, fossology is not very well written: it is completely linear and not parallelized, so analysing a tarball takes a very long time. It also uses a lot of memory, so make sure you run it on a fast machine.
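As a rough illustration of what pattern-based license detection means, the following Python sketch matches a few well-known license phrases with regular expressions. It is not fossology's actual algorithm or pattern set. The pattern table, the function name and the short license labels are assumptions made for this example, and a real scanner needs far more patterns and much more care.

```python
import re

# Hypothetical, deliberately tiny pattern table (not fossology's).
LICENSE_PATTERNS = {
    "GPL-2.0": re.compile(
        r"GNU General Public License.{0,80}version 2", re.I | re.S),
    "MIT": re.compile(
        r"Permission is hereby granted, free of charge", re.I),
}


def scan_text(text):
    """Return the sorted labels of all patterns that match the given text."""
    return sorted(name for name, pattern in LICENSE_PATTERNS.items()
                  if pattern.search(text))


header = """/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License
 * version 2 as published by the Free Software Foundation.
 */"""
print(scan_text(header))
print(scan_text("int main(void) { return 0; }"))
```

An empty result only says that none of the known patterns matched. It says nothing about the file's real license, which is one reason a human review on top of the machine-generated output is valuable.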
In Yocto, an SPDX generation step is inserted right after the source is patched, so that the patches themselves are also checked. The do_spdx task checks whether the SPDX file is already in a cache. If not, it creates a tarball, sends it to the fossology server, waits for the resulting SPDX file, and stores it in the output directory and in the cache. This means that the build blocks while waiting for the response from the server; the long connection time also carries a risk of timeouts and retries. Future work is therefore to run the build in parallel with the license processing. An experiment with this approach gave a pretty good speedup.
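The cache-then-scan flow described above can be sketched in Python as follows. This is not the actual do_spdx implementation. All function names are hypothetical, the `scan` callable is a stand-in for the round trip to the fossology server, and using a SHA-256 digest of the file contents as the cache key is an assumption of this sketch.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def source_digest(source_dir):
    """Hash relative paths and contents of all files (assumed cache key)."""
    h = hashlib.sha256()
    root = Path(source_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()


def make_tarball(source_dir):
    """Pack the (already patched) source tree into an in-memory tar.gz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(str(source_dir), arcname=".")
    return buf.getvalue()


def get_spdx(source_dir, cache_dir, out_dir, scan):
    """Return the path of the SPDX file, scanning only on a cache miss.

    `scan` takes tarball bytes and returns SPDX text; here it stands in
    for uploading to the scanning server and waiting for the answer.
    """
    cache_dir, out_dir = Path(cache_dir), Path(out_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / (source_digest(source_dir) + ".spdx")
    if not cached.exists():
        # Cache miss: this call blocks until the scanner has answered.
        cached.write_text(scan(make_tarball(source_dir)))
    out = out_dir / (Path(source_dir).name + ".spdx")
    out.write_text(cached.read_text())
    return out
```

Because `scan` is the only blocking step, it is also the natural candidate to hand off to a background worker when running the license processing in parallel with the rest of the build.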
What is currently released doesn’t work: you have to apply some patches (which will be in Yocto 1.5.1). Also see the slides for how to set it up. Whenever the build has to access the server, it becomes really slow, because the fossology server takes a long time. When the cache can be used, the build doesn’t slow down at all. Without a cache, however, a large image is estimated to take 3-4 days to process.
On the fossology-spdx side, improvements are planned to make the service really usable. One of them is making sure the API can be used by other build systems and integrated into other build processes. For humans, there will be a dashboard that allows you to make changes to the SPDX file on the fossology server, so you can go through the review process directly on the server. This also makes it possible to combine the SPDX data with data coming from other places, e.g. BlackDuck. To improve performance, a global SPDX package cache should be used and the long connection time should be avoided.
For Yocto, a next step is to trace the target back to the licenses of the source. Ideally, you should be able to match each file in the target to the source files that influence it. This will give you the ability to make a definitive statement about all the licenses involved in the target, and in the end to generate an SPDX file for the target. Note that this will also need to handle linking with libraries, both static and dynamic.
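Assuming such a mapping from target files to source files were available, combining it with per-file license data is straightforward. The Python sketch below is only an illustration of the idea: the function, the two dictionaries and all file names are hypothetical, and how Yocto would actually record which sources influence which target file is not covered by the talk.

```python
NOASSERTION = "NOASSERTION"


def target_licenses(target_sources, source_licenses):
    """Map each target file to the union of its source files' licenses.

    target_sources:  {target file: iterable of contributing source files}
    source_licenses: {source file: set of licenses found in that file}
    A source file without license data contributes NOASSERTION.
    """
    result = {}
    for target, sources in target_sources.items():
        licenses = set()
        for src in sources:
            licenses |= source_licenses.get(src, {NOASSERTION})
        result[target] = licenses
    return result


# Hypothetical example: a binary built from its own source file plus a
# statically linked library object, and one file nobody has scanned yet.
target_sources = {
    "/usr/bin/tool": ["tool/main.c", "libfoo/foo.c"],
    "/usr/lib/libbar.so": ["libbar/bar.c"],
}
source_licenses = {
    "tool/main.c": {"GPL-2.0"},
    "libfoo/foo.c": {"MIT"},
}
print(target_licenses(target_sources, source_licenses))
```

Static linking fits this model directly, since the library's source files simply appear among the contributors of the binary. Dynamic linking would need an additional relation between target files, which this sketch does not model.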
Comparison with BlackDuck: where fossology matches license patterns in files, BlackDuck is an anti-plagiarism tool. It checks whether the code looks similar to anything it already knows, and if so, it assigns the license accordingly.