Main use: mobile/embedded using SD cards and eMMC, but also applicable to SSD.
Flash has wear and an erase-before-write cycle => an FTL (flash translation layer) to deal with it. But this makes performance vary with how the flash is accessed, and the access patterns of traditional FSes don’t match it well. Cheap FTL devices in particular degrade quickly (because doing the translation well requires a lot of memory).
Basically, you have an FS on top of an FTL, and both do vaguely similar things (e.g. journalling). Ideally, we would design the FS and FTL together so they collaborate well. However, since we can’t modify the FTL, we write an FS that creates access patterns the FTL handles well. That means sequential writes, hence a log-structured FS, because that writes metadata sequentially to the log.
Conventional LFS problems:
- Cleaning when the disk fills up with old log data.
- The wandering tree: updating a data block means rewriting the node that points to it, and so on up the tree, which causes non-sequential access patterns.
F2FS (flash-friendly filesystem) design points:
- Align data structures to the alignment the FTL uses (e.g. 128K).
- … (sorry but I couldn’t follow)
A typical FTL has a hybrid mapping of addresses: a kind of set-associative remapping scheme. Every block is remapped at a coarse grain; every page is also remapped, but only into one of 4 or 8 log blocks, and the FTL just checks all 4 possibilities in parallel to find the right one (no need to store a full address per page). The implication is that when a block is moved, you may need to merge two blocks (a large part of which has been emptied), but since pages can’t be mapped to just any place, you still need a second empty block for the conflicting pages. By writing (and cleaning) sequentially, you reduce the need for remapping and for merging blocks. This should benefit both performance and longevity.
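A toy model of why sequential writes are cheaper under such a hybrid mapping (the block/log sizes and merge policy here are illustrative assumptions, not any real device’s scheme): sequential updates fill a log block with exactly one logical block’s pages in order, so the FTL can just switch blocks; random updates scatter pages from many logical blocks into the log, forcing copy-heavy merges.

```python
import random

PAGES_PER_BLOCK = 8

class HybridFTL:
    """Toy hybrid-mapping FTL: updated pages go into a shared log block;
    when it fills, the FTL merges the log pages back into data blocks."""

    def __init__(self):
        self.current = []   # (logical_block, page_offset) in the open log block
        self.copies = 0     # pages copied during full merges (the costly part)

    def _close_log(self, pages):
        lbns = {lbn for lbn, _ in pages}
        offs = [off for _, off in pages]
        if len(lbns) == 1 and offs == list(range(PAGES_PER_BLOCK)):
            return  # "switch merge": the log block simply becomes the data block
        # full merge: for every logical block touched, the pages NOT updated
        # in the log must be copied from the old data block into a fresh one
        for lbn in lbns:
            updated = len({off for l, off in pages if l == lbn})
            self.copies += PAGES_PER_BLOCK - updated

    def write(self, addr):
        self.current.append((addr // PAGES_PER_BLOCK, addr % PAGES_PER_BLOCK))
        if len(self.current) == PAGES_PER_BLOCK:
            self._close_log(self.current)
            self.current = []

seq, rnd = HybridFTL(), HybridFTL()
for a in range(64):
    seq.write(a)              # sequential: only switch merges, no copies
random.seed(1)
for a in random.sample(range(64), 64):
    rnd.write(a)              # random: full merges copy untouched pages
print(seq.copies, rnd.copies)
```

The exact numbers depend on the parameters, but the shape holds: the sequential workload finishes with zero copy traffic, while the random one pays for merges on nearly every log block.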
The FTL has the concept of a superblock and superpage: there are several parallel NAND banks on the chip, and all of them can be erased (a block) or written (a page) in parallel. This increases sequential-write performance, but also increases cleaning cost. The concurrent banks also mean that strictly sequential writes aren’t needed for optimal performance: you can have multiple concurrent sequential write streams and still get good performance (each stream goes to a different bank). Only when there are more concurrent write streams than banks will you see performance degrade.
F2FS puts all FS metadata together, so there is spatial locality. Within this area, F2FS does random reads/writes so the FTL’s cache comes into play.
F2FS aligns the main area (containing the data) with the FTL’s zone size (but how to find out the FTL’s zone size???).
F2FS keeps several log heads inside the main area, distinguishing hot/warm/cold data and hot/warm/cold metadata. Each log is written sequentially, concurrently with the others.
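The six logs match the hot/warm/cold × data/metadata split above; the classification heuristics in this sketch are illustrative assumptions (e.g. treating directory entries as hot data and known-cold files as cold data), not the exact F2FS rules.

```python
# Sketch of multi-head log selection: each write is routed to one of six
# sequentially-written logs based on its kind and expected update rate.
def pick_log_head(is_node, is_dir_entry=False, is_cold_hint=False):
    kind = "node" if is_node else "data"
    if is_dir_entry:
        temp = "hot"      # directory blocks change often (assumed heuristic)
    elif is_cold_hint:
        temp = "cold"     # e.g. rarely-rewritten files (assumed heuristic)
    else:
        temp = "warm"
    return (kind, temp)

print(pick_log_head(is_node=False, is_dir_entry=True))   # ('data', 'hot')
print(pick_log_head(is_node=True))                       # ('node', 'warm')
```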
Normally, updating file data means the metadata pointing at it has to be updated and propagated upwards, which leads to many non-sequential writes. To avoid that, only the direct node is rewritten (to a different place, sequentially in the log), and a global node address table (NAT) remaps the direct node addresses. The NAT updates are random writes in the metadata area. The NAT is checkpointed (so not synced on every write) and has a backup copy to allow fs recovery.
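In other words, the indirection breaks the wandering tree: the inode refers to its direct node by a stable node id, so moving the node in the log updates one table entry instead of rewriting the whole pointer chain. A minimal sketch (the structures are illustrative, not the on-disk format):

```python
# Node address table: node id -> current physical block address.
nat = {}

def rewrite_node(nid, log_head_addr):
    nat[nid] = log_head_addr   # one small random write in the metadata area

def resolve(nid):
    return nat[nid]

inode = {"direct_node_ids": [7]}   # the inode stores ids, not addresses
rewrite_node(7, 1000)              # direct node initially written at 1000
rewrite_node(7, 2048)              # data update: node moves to the log head
print(resolve(inode["direct_node_ids"][0]))   # 2048; inode itself unchanged
```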
Because flash makes multiple concurrent log heads possible (as opposed to hard disks, where accesses really have to be sequential), hot/warm/cold (meta)data can be kept separate. This makes it possible to optimize cleaning. (But I didn’t understand more of it.) Cleaning is done partly in the foreground (when the disk is getting full, using quick-and-dirty cleaning that does random writes) and partly in the background (when I/O is idle).
Directory structure is optimized for graceful performance degradation when directories get larger (i.e. small directories are relatively fast, very large directories are slower but not too much so).
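One way to get that graceful degradation (F2FS uses a multi-level hash table for directories; the sizing below is an illustrative assumption): each level doubles the bucket count, and a lookup scans one small bucket per level, so lookup cost grows roughly with the log of the directory size rather than linearly.

```python
# Toy multi-level directory hash: level l holds 2**l buckets of fixed size.
def total_capacity(levels, entries_per_bucket=16):
    # entries a directory can hold once it has grown to `levels` levels
    return sum((2 ** l) * entries_per_bucket for l in range(levels))

def buckets_scanned(levels):
    # a lookup hashes the name once per level and scans a single bucket
    return levels

print(total_capacity(1), buckets_scanned(1))   # 16 entries, 1 bucket scanned
print(total_capacity(6), buckets_scanned(6))   # 1008 entries, 6 buckets scanned
```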
Remaining problem: how to know the FTL characteristics… There is no way to retrieve that, except to ask the manufacturer or to reverse-engineer it by testing.
Upstream status: v3 of the series is under review now.
See also the LWN article