Linux Storage System Bottleneck for eMMC/UFS – Bean Huo & Zoltan Szubbocsev, Micron

Bean and Zoltan (the speaker) work on storage software in Micron's embedded business unit, often for automotive. As part of this work they have quantified the storage system overhead in embedded systems for access to eMMC, UFS and NVMe, i.e. how much of the speed provided by these storage technologies can actually be achieved by userspace. They also tried to quantify the overall performance improvement of NVMe over UFS, but that comparison is difficult since you can't have an NVMe device that is fully equivalent to a UFS device.

All three technologies consist of a NAND chip with a controller and firmware. eMMC can reach 400MB/s at its interface, and may go up to 566MB/s in the next generation. UFS Gear3 can have two lanes of up to 728MB/s each. NVMe (on PCIe Gen3) reaches about 1000MB/s per lane.
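To get a feeling for these numbers, here is a minimal sketch (mine, not from the talk) that turns the quoted interface rates into the ideal time a request spends on the wire. It assumes 1MB = 10^6 bytes, ignores protocol overhead and NAND access time, and uses a single lane for NVMe purely as an illustration; the function name `wire_time_us` is made up.

```python
# Peak interface rates quoted in the talk, in MB/s per lane.
# For eMMC the "lane" is simply the whole interface.
PER_LANE_MB_S = {"eMMC": 400, "UFS Gear3": 728, "NVMe Gen3": 1000}

# Lane counts: two for UFS Gear3 as in the talk; one for NVMe is an
# assumption made only to keep the example small.
LANES = {"eMMC": 1, "UFS Gear3": 2, "NVMe Gen3": 1}


def wire_time_us(size_bytes, per_lane_mb_s, lanes=1):
    """Ideal transfer time in microseconds (1 MB/s equals 1 byte/us)."""
    bytes_per_us = per_lane_mb_s * lanes
    return size_bytes / bytes_per_us


if __name__ == "__main__":
    for name, rate in PER_LANE_MB_S.items():
        for size in (4 * 1024, 128 * 1024):
            t = wire_time_us(size, rate, LANES[name])
            print(f"{name:10s} {size // 1024:4d}KB: {t:7.2f} us on the interface")
```

Even on eMMC a 4KB request needs only about ten microseconds of interface time under these assumptions, so any fixed per-request software cost weighs much more heavily on small requests than on large ones.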

For testing they use fio in single- and multi-threaded mode, always with direct I/O or sync I/O, and trace with the function_graph tracer and blktrace. The trace points make it possible to measure latency in four segments: from user space submission to BIO submission; from BIO submission to storage device submission; the data transfer itself, up to the completion interrupt; and from completion interrupt to block layer completion.
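The latency segments above can be sketched as simple differences between timestamps. This is my own illustration, not the speakers' tooling: the field names are hypothetical, the sample timestamps are made up, and counting only the device transfer segment as "hardware" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IoTimestamps:
    """Timestamps in microseconds for one request (hypothetical field names)."""
    user_submit: float     # request issued from user space
    bio_submit: float      # BIO submitted to the block layer
    device_submit: float   # command handed to the storage device
    completion_irq: float  # completion interrupt raised
    block_complete: float  # block layer completion


def breakdown(ts):
    """Return the four latency segments and the share spent in software."""
    segments = {
        "user_to_bio": ts.bio_submit - ts.user_submit,
        "bio_to_device": ts.device_submit - ts.bio_submit,
        "device_transfer": ts.completion_irq - ts.device_submit,
        "irq_to_block_done": ts.block_complete - ts.completion_irq,
    }
    total = sum(segments.values())
    # Assumption: everything except the device transfer is software overhead.
    software_share = (total - segments["device_transfer"]) / total
    return segments, software_share


if __name__ == "__main__":
    # Made-up numbers, only to show how the function is used.
    sample = IoTimestamps(0.0, 20.0, 50.0, 150.0, 180.0)
    segments, share = breakdown(sample)
    for name, value in segments.items():
        print(f"{name:18s} {value:6.1f} us")
    print(f"software share: {share:.0%}")
```

In practice the timestamps would come from parsing the trace output; that parsing is left out here because it depends on the exact trace format.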

They did experiments on two boards: a somewhat older 2xCortex-A9 Zedboard, and a newer 4xCortex-A57 Nvidia board. On the Zedboard, eMMC performance is completely dominated by software overhead, with 63% to 92% of the total latency spent in software. To some extent this is caused by cache invalidation, which takes a long time on the Cortex-A9. On the Nvidia board, the overhead is a lot lower for large request sizes (12% to 39%), but still significant for small 4KB requests (up to 72%).
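To put these percentages in perspective, here is a small sketch (my own arithmetic, not from the slides) that converts a software share of total latency into a slowdown relative to the device-only latency. It assumes one request in flight at a time, so software and device time do not overlap; the function name `slowdown` is made up.

```python
def slowdown(software_share):
    """Total latency divided by device-only latency, one request at a time."""
    if not 0 <= software_share < 1:
        raise ValueError("software_share must be in [0, 1)")
    return 1.0 / (1.0 - software_share)


# Software shares of total latency reported in the talk.
CASES = {
    "Zedboard eMMC, low end": 0.63,
    "Zedboard eMMC, high end": 0.92,
    "Nvidia eMMC, large requests, low end": 0.12,
    "Nvidia eMMC, large requests, high end": 0.39,
    "Nvidia eMMC, 4KB requests": 0.72,
}

if __name__ == "__main__":
    for label, share in CASES.items():
        print(f"{label}: {share:.0%} in software, "
              f"{slowdown(share):.1f}x the device-only latency")
```

Under this assumption, a 92% software share means a request takes more than ten times as long as the device alone would need.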

Experiments for UFS and NVMe have to be done on different boards. For 4K writes there is still significant (60-74%) overhead. The graphs showed that with 8 threads the latency decreases significantly; someone in the audience suggested that this is due to interrupt coalescing, which amortizes the interrupt time over all threads. Even for 128K accesses the overhead is non-negligible. The results show that the overhead of NVMe is indeed significantly lower, e.g. for 4K random write the overhead of NVMe is only 66% that of UFS. To estimate the system-level performance, they used some formulas that I didn't understand but that result in an 8-25% speed difference between UFS and NVMe. This difference is not so much because NVMe itself is faster, but mostly because its Linux stack is better. However, the hardware queue size is also a factor: NVMe can support a lot more outstanding tasks. For very high thread counts, UFS performance starts to drop while NVMe performance holds up.












