This talk is about getting the best performance out of USB devices.
USB speeds: we’ll talk about full speed (12Mbps) and high speed (480Mbps).
The logical USB device has configurations, which have interfaces, which have endpoints. An endpoint is an addressable source or sink of data; compare a socket, but unidirectional. An interface is a related set of endpoints that together provide a function, e.g. mass storage, HID, … . Multiple interfaces in a configuration are active at the same time => composite device. Out of multiple configurations, only 1 is active at a time. Most devices have only 1 configuration.
4 types of endpoint – an endpoint has exactly one type. The control endpoint is mostly used during enumeration. Interrupt and bulk endpoints are used for the actual data. Interrupt is for small amounts of low-latency data – these endpoints reserve bandwidth to guarantee latency. Bulk endpoints transfer large amounts of data but have no bandwidth or latency guarantees. Isochronous endpoints are for large amounts of time-sensitive data. Delivery is not guaranteed; instead the data is dropped if it is late.
Endpoint length = max amount of data per transaction, e.g. 64 bytes for a bulk full-speed endpoint.
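As a rough mental model, the hierarchy above can be written down as plain data structures. This is only an illustrative sketch: the class and field names are made up for this example and are not the kernel's or any library's API, and the example device is hypothetical. The one USB fact it relies on is that bit 7 of an endpoint address marks the IN (device-to-host) direction.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EndpointType(Enum):
    CONTROL = "control"
    ISOCHRONOUS = "isochronous"
    BULK = "bulk"
    INTERRUPT = "interrupt"


@dataclass
class Endpoint:
    address: int           # bit 7 set means IN (device to host)
    ep_type: EndpointType  # an endpoint has exactly one type
    length: int            # the "endpoint length" in bytes

    @property
    def is_in(self) -> bool:
        return bool(self.address & 0x80)


@dataclass
class Interface:
    function: str
    endpoints: List[Endpoint] = field(default_factory=list)


@dataclass
class Configuration:
    interfaces: List[Interface] = field(default_factory=list)

    @property
    def is_composite(self) -> bool:
        # All interfaces of the active configuration are active together.
        return len(self.interfaces) > 1


@dataclass
class Device:
    configurations: List[Configuration]
    active_index: int = 0  # only one configuration is active at a time

    @property
    def active(self) -> Configuration:
        return self.configurations[self.active_index]


# Hypothetical full-speed composite device: mass storage plus HID.
storage = Interface("mass storage", [
    Endpoint(0x81, EndpointType.BULK, 64),
    Endpoint(0x01, EndpointType.BULK, 64),
])
hid = Interface("HID", [Endpoint(0x82, EndpointType.INTERRUPT, 8)])
example = Device([Configuration([storage, hid])])
```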
Transaction = basic unit of data across the bus, up to the endpoint length; transfer = one or more transactions in order to move a chunk of data from one side to the other. A transfer is ended by a short transaction (less than endpoint length), or when the desired amount of data is reached – but that’s determined by the protocol, and e.g. a USB analyser may not know about that.
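To make the transaction/transfer distinction concrete, here is a small sketch that splits a transfer into transaction sizes. The function names are invented for this example. It also shows why an analyser cannot always see where a transfer ends: when the size is an exact multiple of the endpoint length, there is no short transaction to mark the end.

```python
def split_transfer(total_bytes, endpoint_length):
    """Return the sizes of the transactions needed to move total_bytes."""
    if total_bytes < 0 or endpoint_length <= 0:
        raise ValueError("sizes must be non-negative, endpoint length positive")
    full, remainder = divmod(total_bytes, endpoint_length)
    sizes = [endpoint_length] * full
    # A partial last transaction, or a zero-length one for an empty transfer.
    if remainder or total_bytes == 0:
        sizes.append(remainder)
    return sizes


def ends_with_short_transaction(sizes, endpoint_length):
    """True if the last transaction is shorter than the endpoint length."""
    return bool(sizes) and sizes[-1] < endpoint_length


print(split_transfer(150, 64))  # [64, 64, 22]
print(split_transfer(128, 64))  # [64, 64]
```

In the second case only the protocol (the agreed transfer size) tells both sides that the transfer is complete.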
USB is controlled by the host, so the host always initiates transfers, and the host polls devices to check if they want to send data.
IN transaction = host sends IN token to device, if device has data it sends it, host sends ACK; if device does not have data, it sends NAK. NAK just means “not ready yet”, not an error. If the device NAKs, the host keeps on trying until it times out.
OUT: host sends OUT token, host sends data up to endpoint length, device sends ACK or NAK. So the data is sent before the device has had a chance to respond at all. If the device NAKs, the host retries all of this, data included, until it times out.
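The IN and OUT handshakes as described here can be mimicked with a toy model. This is a simplified sketch, not a bus-accurate simulation: attempts are counted rather than timed, and the function names are made up. It mainly illustrates that a NAKed OUT transaction costs bus time for the data each time it is retried.

```python
def poll_in(device_ready_at, max_attempts):
    """Host polls an IN endpoint; the device has data from attempt
    number device_ready_at onward. Returns one log entry per attempt."""
    log = []
    for attempt in range(max_attempts):
        if attempt >= device_ready_at:
            log.append("DATA+ACK")  # device sends data, host ACKs
            return log
        log.append("NAK")           # not ready yet, not an error
    log.append("TIMEOUT")
    return log


def out_bytes_on_bus(payload_len, naks_before_ack):
    """Bytes of payload put on the bus for one OUT transaction,
    given that the data is sent again on every retry."""
    return payload_len * (naks_before_ack + 1)


print(poll_in(2, 5))            # ['NAK', 'NAK', 'DATA+ACK']
print(out_bytes_on_bus(64, 3))  # 256
```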
IN and OUT are typically fully handled by hardware.
In Linux, the gadget framework for handling UDCs (USB Device Controllers) is largely separate from the host USB stack. Unlike OHCI/EHCI on the host side, the device controller interface is not standardized.
musb = IP block from Mentor
EG20T Platform Controller Hub = on embedded Intel SoMs
PIC32 non-Linux device, with M-Stack developed by Alan.
Why do you make a USB device?
- Easy, well-supported connection to PC
- Make use of an existing device class so you don’t have to write drivers
- Want to connect to PC and move a lot of data quickly (where you control both host and device)
For cases 1 and 2, naive implementations can work. You can use configfs to dynamically create the USB device from userspace, with no kernel driver. But if you really need performance (case 3), you’re going to have to do something more.
Synchronous API: the hardware only moves data on the bus while a transfer is active. So after a transfer completes, the bus sits idle until your software finally gets to the next iteration of the loop and starts the next transfer. Therefore, use the async API and submit multiple transfers. The HW will jump to the next transfer as soon as the first one has finished. This is true both on the host and on the device side.
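The effect of keeping several transfers queued can be seen in a small timing model. This is a sketch under stated assumptions, not a measurement: the 100 us bus time and 50 us software turnaround are invented numbers, and the function is hypothetical. A queue depth of 1 stands in for the synchronous API, a depth of 2 or more for the async API with multiple submitted transfers.

```python
def simulate(num_transfers, bus_us, turnaround_us, queue_depth):
    """Return the time in microseconds at which the last transfer completes.

    bus_us:        time the hardware needs on the bus per transfer
    turnaround_us: time software needs to resubmit after a completion
    queue_depth:   number of transfers kept submitted at once
    """
    completions = []
    hw_free_at = 0
    for i in range(num_transfers):
        if i < queue_depth:
            submitted_at = 0  # initial batch is queued up front
        else:
            # A slot frees up when transfer i - queue_depth completes;
            # software then needs turnaround_us to submit the next one.
            submitted_at = completions[i - queue_depth] + turnaround_us
        start = max(hw_free_at, submitted_at)  # bus idles if nothing queued
        hw_free_at = start + bus_us
        completions.append(hw_free_at)
    return completions[-1] if completions else 0


sync_us = simulate(100, bus_us=100, turnaround_us=50, queue_depth=1)
async_us = simulate(100, bus_us=100, turnaround_us=50, queue_depth=2)
print(sync_us, async_us)  # 14950 10000
```

In this model the synchronous loop pays the turnaround between every pair of transfers, while with two transfers in flight the turnaround is completely hidden behind the bus time of the other transfer.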
Transfers should be large enough. At high speed, the max endpoint length for bulk is 512 bytes – so try to use all of that to reduce overhead.
If you need to optimize, use a USB analyser to see what’s going on. But when looking at NAKs coming from a device that is too slow, don’t just count the NAKs, because the host controller will adapt to the latency of the device and wait a little before its next attempt.
Increasing transfer size allows the USB controller to handle transactions back-to-back, avoiding any latency between them. Measured on BeagleBoneBlack: first transaction of a transfer has 40us latency, after that it’s only 6us. 64Kbyte transfers seem to work well. However, with musb, it turns out that very large OUT transfers are actually a little bit slower because the DMA is done at the transaction level.
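A quick calculation shows why larger transfers help. It uses the BeagleBone Black numbers quoted above (40 us before the first transaction, 6 us before each later one) and assumes, as a simplification, that these latencies simply add up and that the endpoint length is 512 bytes. The function name is made up for this example.

```python
import math


def transfer_latency_us(transfer_bytes, endpoint_length=512,
                        first_us=40, next_us=6):
    """Return (total latency in us, number of transactions) for a transfer."""
    transactions = max(1, math.ceil(transfer_bytes / endpoint_length))
    total = first_us + next_us * (transactions - 1)
    return total, transactions


for size in (512, 4096, 65536):
    total, count = transfer_latency_us(size)
    print(size, count, total, round(total / count, 2))
# 512 1 40 40.0
# 4096 8 82 10.25
# 65536 128 802 6.27
```

With 64 Kbyte transfers the average latency per transaction is already close to the 6 us floor, so going much larger gains little in this model.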
However, since USB is message based, it’s convenient to put each application message in its own transfer, because otherwise you have to add message boundaries yourself. Queuing messages up to fill a larger transfer can also increase the latency.
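If several application messages do share one transfer, the boundaries have to be carried in the data. One common way to do that is a length prefix, sketched below. The 2-byte little-endian prefix is an arbitrary choice for this example, not something prescribed by USB or by the talk.

```python
import struct


def pack_messages(messages):
    """Concatenate messages, each preceded by its length as a 16-bit
    little-endian integer (so each message must be under 64 KiB)."""
    return b"".join(struct.pack("<H", len(m)) + m for m in messages)


def unpack_messages(buffer):
    """Split a buffer produced by pack_messages back into messages."""
    messages = []
    offset = 0
    while offset < len(buffer):
        (length,) = struct.unpack_from("<H", buffer, offset)
        offset += 2
        if offset + length > len(buffer):
            raise ValueError("truncated message")
        messages.append(buffer[offset:offset + length])
        offset += length
    return messages


payload = pack_messages([b"start", b"", b"stop"])
print(unpack_messages(payload))  # [b'start', b'', b'stop']
```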
Putting the protocol in the kernel rather than in userspace slightly increases performance (7%), because it keeps the userspace boundary out of the latency-sensitive transfer-to-transfer hand-off.
Multiple bulk endpoints could increase performance, because you get extra DMA concurrency. It makes the protocol more complex to manage, and it also depends on host performance.
A high-bandwidth interrupt endpoint gives you reserved bandwidth; the endpoint length can go up to 3072 bytes at high speed. But if the bandwidth is not available, the device doesn’t enumerate! Same for isochronous, which supports even larger endpoint lengths.
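To get a feel for how much bandwidth such an endpoint reserves, here is a back-of-the-envelope calculation. It relies on two facts from the USB 2.0 specification rather than from the talk: a high-speed microframe lasts 125 us (8000 per second), and a high-speed interrupt endpoint is polled every 2^(bInterval-1) microframes. The function name is invented for this example.

```python
MICROFRAMES_PER_SECOND = 8000  # one high-speed microframe is 125 us


def reserved_bytes_per_second(bytes_per_interval, b_interval=1):
    """Bandwidth reserved by a high-speed periodic endpoint that moves
    bytes_per_interval every 2**(b_interval - 1) microframes."""
    if not 1 <= b_interval <= 16:
        raise ValueError("bInterval must be in 1..16 at high speed")
    period_microframes = 2 ** (b_interval - 1)
    return bytes_per_interval * MICROFRAMES_PER_SECOND / period_microframes


every_microframe = reserved_bytes_per_second(3072, b_interval=1)
print(every_microframe)               # 24576000.0
print(every_microframe * 8 / 480e6)   # 0.4096
```

So one such endpoint polled every microframe claims roughly 41% of the raw 480 Mbps signalling rate, which helps explain why enumeration can fail when that bandwidth is not available.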
Remember that hubs have an influence: they translate between high speed and full speed, thereby hiding some of the latency when using the synchronous API.
Serial gadget is pretty suboptimal because it goes over the tty framework, which breaks it into small transfers.
To find performance issues in the kernel, use ftrace and kernelshark.