|NAND flash technical background|
|Summary||NAND flash is an attractive and popular form of data storage, but is not without pitfalls. The following information was written with CompactFlash cards in mind, but you can expect the controllers in SD cards to have similar limitations. SATA controllers as used in mSATA and 2.5" SSDs have more memory, and don't suffer so much from slow random writes / write amplification.|
|NAND flash structure||Typical flash devices as used in cf2slc / cf4slc / cf8slc are structured as 2KB pages (with 64 bytes spare data for ECC + management data). 64 of these pages make up an erase block of 128 KB. Pages can be written individually, but an entire block must be erased at a time.|
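The geometry above can be checked with a little arithmetic. This is only a sketch: the constant and function names are made up for illustration, and the 2 GB device size is the one mentioned in the ECC section.

```python
# Geometry of the flash device described above (names are illustrative).
PAGE_DATA_BYTES = 2 * 1024   # 2 KB of user data per page
PAGE_SPARE_BYTES = 64        # spare area per page for ECC + management data
PAGES_PER_BLOCK = 64         # pages per erase block


def erase_block_bytes():
    """User-data size of one erase block in bytes."""
    return PAGE_DATA_BYTES * PAGES_PER_BLOCK


def erase_blocks_in(device_bytes):
    """Number of whole erase blocks in a device of the given size."""
    return device_bytes // erase_block_bytes()


def spare_overhead():
    """Spare bytes as a fraction of the user data in a page."""
    return PAGE_SPARE_BYTES / PAGE_DATA_BYTES


print(erase_block_bytes() // 1024)     # 128 (KB per erase block)
print(erase_blocks_in(2 * 1024**3))    # 16384 (erase blocks in a 2 GB device)
print(f"{spare_overhead():.3%}")       # 3.125%
```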
|Write endurance||NAND flash must be erased before it can be rewritten. The cycle life depends
on the flash technology, and becomes worse with each process generation.
|Why not use SLC flash for large cards ?||SLC flash is about 3.5x the price per bit for large capacities. MLC flash can be used in pseudo SLC mode (one bit per cell, half capacity) to get most of the benefits of SLC.|
|Write amplification||Erase blocks are 64 KB or more, so with a simple-minded controller a single-byte change can require a full block erase. This means that random or piecemeal writes can be quite expensive in terms of flash wear.|
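To put a number on this, here is a rough worst-case estimate. It assumes a controller that rewrites every erase block it touches in full and a write that starts on a block boundary; real controllers are usually smarter, and the function name is hypothetical.

```python
def worst_case_write_amplification(changed_bytes, erase_block_bytes):
    """Bytes physically rewritten per byte the host changed.

    Assumes a simple-minded controller that rewrites whole erase blocks
    and a write that starts on an erase block boundary.
    """
    if changed_bytes <= 0 or erase_block_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: number of erase blocks the write touches.
    blocks_touched = -(-changed_bytes // erase_block_bytes)
    return blocks_touched * erase_block_bytes / changed_bytes


print(worst_case_write_amplification(1, 64 * 1024))             # 65536.0
print(worst_case_write_amplification(512, 128 * 1024))          # 256.0
print(worst_case_write_amplification(128 * 1024, 128 * 1024))   # 1.0
```

The last line shows why large, block-sized writes are cheap in terms of wear: every byte erased is a byte the host actually wanted to change.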
|Wear leveling||CF controllers perform wear leveling to spread the erase cycles across multiple blocks, so frequently written blocks such as directories or file allocation tables don't wear out prematurely. Wear leveling algorithms are proprietary and undocumented - "secret sauce".|
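Since the real algorithms are undocumented, the following is only a toy illustration of the general idea, not what any particular CF controller does. It assumes every rewrite of a logical block goes to the least-erased free physical block, and that the old block is erased right away. The class and method names are invented.

```python
class ToyWearLeveler:
    """Toy model: remap each rewritten logical block to the least-worn free block."""

    def __init__(self, num_physical_blocks):
        self.erase_counts = [0] * num_physical_blocks
        self.mapping = {}  # logical block number -> physical block number

    def rewrite(self, logical_block):
        """Rewrite a logical block and return the physical block now holding it."""
        in_use = set(self.mapping.values())
        free = [p for p in range(len(self.erase_counts)) if p not in in_use]
        # Pick the free block with the fewest erase cycles so far
        # (raises ValueError if no free block is left).
        target = min(free, key=lambda p: self.erase_counts[p])
        old = self.mapping.get(logical_block)
        self.mapping[logical_block] = target
        if old is not None:
            # The previous copy is erased and returns to the free pool.
            self.erase_counts[old] += 1
        return target


leveler = ToyWearLeveler(8)
for _ in range(1000):
    leveler.rewrite(0)   # hammer one logical block, e.g. a FAT sector
print(leveler.erase_counts)
```

Without remapping, a single physical block would have taken all 999 erases; in this model they are spread almost evenly across the 8 blocks.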
|Read disturb||Reading is normally considered not to wear out the flash device. On recent MLC devices this is no longer a safe assumption - read disturb may require occasional rewriting of data. So much for the "read only" file system...|
|ECC||NAND flash is NOT guaranteed to be error free. The controller must implement an error-correcting code (ECC) to allow error recovery, and replace bad blocks with spares. For example, on the 2 GB flash device used in cf2slc / cf4slc / cf8slc, out of 16384 erase blocks only 16064 are guaranteed to be good at the end of the life cycle of the chip. The controller reserves additional spare sectors for mapping and internal operation - this is why you never get the full capacity.|
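The block counts above translate into capacity as follows. This is a small sketch using the 128 KB erase block size from the flash structure section; the names are illustrative, and the controller's own reserve is not included because its size is not published.

```python
TOTAL_BLOCKS = 16384            # erase blocks on the 2 GB device
GUARANTEED_GOOD_BLOCKS = 16064  # guaranteed good at end of life
BLOCK_KB = 128                  # erase block size in KB


def bad_block_allowance():
    """Erase blocks that are allowed to go bad over the chip's life."""
    return TOTAL_BLOCKS - GUARANTEED_GOOD_BLOCKS


def guaranteed_capacity_mb():
    """Capacity backed by guaranteed-good blocks, in MB (1 MB = 1024 KB)."""
    return GUARANTEED_GOOD_BLOCKS * BLOCK_KB // 1024


print(bad_block_allowance())                          # 320
print(f"{bad_block_allowance() / TOTAL_BLOCKS:.2%}")  # 1.95%
print(guaranteed_capacity_mb())                       # 2008
```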
|How to corrupt CF cards ?||Easy...
Start a write, and immediately (less than 1 second or so) do a system reset.
Even if the sector gets written very quickly, the controller inside the CF
card may still be busy with internal housekeeping. If you are unlucky some
internal data structures will end up in a corrupted state. To avoid this,
please ensure some delay (a few seconds) between sync and reboot.
The flash controller may have some recovery procedure to clean up such an inconsistent state, but this takes time and may result in a BIOS time-out (error message: no boot device found).
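One way to follow this advice from a program is sketched below. The function name and the default delay are assumptions ("a few seconds" is all the text above specifies). `os.fsync` asks the kernel to push that file's data to the device; it says nothing about when the card's controller has finished its own housekeeping, which is what the delay is for.

```python
import os
import time


def write_and_settle(path, data, settle_seconds=3.0):
    """Write data, flush it to the device, then give the card time to settle.

    settle_seconds is a guess at "a few seconds"; tune it for your card.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to write the file to the device
    time.sleep(settle_seconds)  # let the CF controller finish internal work


if __name__ == "__main__":
    write_and_settle("example.cfg", b"setting=1\n")
    # Only now would it be reasonably safe to trigger a reboot.
```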
|CF card performance||Test results for our cards measured using HD Tune Pro software. As you can see, read performance is excellent, over 3000 IOPS for 4 KB random reads on the 4 GB / 8 GB cards. Sequential write performance is adequate (over 10 MB/s for 1 MB transfers). On the other hand, random write performance is poor (on the order of 18 to 24 IOPS for 4 KB random writes). Interestingly, the 8 GB card is slower than the 4 GB card; I believe this is caused by larger management blocks to keep the allocation map size within reason. By the way, we measured between 2 and 5 random write IOPS on MLC based cards - which is why we no longer sell them.|
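To compare the IOPS figures with the sequential numbers, it helps to convert them to throughput. A minimal sketch, using only the figures quoted above (the function name is made up):

```python
def iops_to_kb_per_s(iops, transfer_kb=4):
    """Throughput in KB/s for a given IOPS rate and transfer size."""
    return iops * transfer_kb


print(iops_to_kb_per_s(3000))   # 12000 KB/s for 4 KB random reads
print(iops_to_kb_per_s(18))     # 72 KB/s, slow end of SLC random writes
print(iops_to_kb_per_s(24))     # 96 KB/s, fast end of SLC random writes
print(iops_to_kb_per_s(2))      # 8 KB/s, slow end of MLC random writes
```

So 4 KB random writes run at well under 100 KB/s, against more than 10 MB/s for sequential writes on the same cards.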
|Why are sequential writes fast ?||Most flash cards are optimized for this. Typical scenario - store a photo or video to flash. Besides, writing large amounts of data is easy - take a free block, write data to it, erase the block that was replaced.|
|Why are random writes so slow ?||The flash controller on a CF card has a very small RAM buffer, and thus
cannot store a fine grain block allocation map. Typically the management
block equals one erase block, or even 2 or 4 blocks if multiple flash
devices are interleaved for faster sequential performance.
For a single sector write, the following may happen: Write the new data to a new block (often called the child block). The previous version is called the mother block. These blocks can coexist for a while, but have to be consolidated at some point by copying unmodified pages from the mother block over to the child block. Then the mother block can be erased and reused.
Please note that piecemeal writing (e.g. log file) in units of less than 512 bytes (could also be 2KB depending on the controller and flash) can get very time consuming.
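The mother / child scheme can be illustrated with a toy model. This is a simplification under stated assumptions: it works at page granularity, uses 64 pages per block as in the flash structure section, and consolidates immediately. The function names are invented, and real controllers may behave differently.

```python
PAGES_PER_BLOCK = 64


def write_page(mother, page_index, new_data):
    """Write new data for one page into a freshly erased child block."""
    child = [None] * len(mother)   # None marks an erased (unwritten) page
    child[page_index] = new_data
    return child


def consolidate(mother, child):
    """Copy unmodified pages from mother to child, then erase the mother.

    Returns (merged_block, erased_mother, pages_copied).
    """
    merged = list(child)
    pages_copied = 0
    for i, page in enumerate(merged):
        if page is None:           # page was not rewritten, take the old copy
            merged[i] = mother[i]
            pages_copied += 1
    erased_mother = [None] * len(mother)   # mother can now be reused
    return merged, erased_mother, pages_copied


mother = [f"old{i}" for i in range(PAGES_PER_BLOCK)]
child = write_page(mother, 5, "new")
merged, erased, copied = consolidate(mother, child)
print(copied)   # 63 pages copied to update a single page
```

In this model one small write costs 64 page programs and one block erase, which is consistent with the low random write IOPS figures.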
|Unaligned writes||Logical sectors are 512 bytes. Flash pages are 2 or 4 KB. With a typical geometry of 63 sectors per track, the first partition traditionally starts at sector 63 (byte offset 32256), which is not a multiple of the page size, so partitions may end up misaligned. Not much of a penalty for read access, but on write access the performance hit may be substantial.
Modern hard drives with 4 KB physical blocks run into the same issue with older operating systems such as Windows XP.
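A quick alignment check for a partition start sector might look like this. It is a sketch with a hypothetical function name, and it assumes that logical byte 0 of the card sits on a page boundary, which is up to the controller.

```python
SECTOR_BYTES = 512


def partition_is_aligned(start_sector, page_bytes):
    """True if the partition's byte offset is a multiple of the page size."""
    return (start_sector * SECTOR_BYTES) % page_bytes == 0


print(partition_is_aligned(63, 2048))     # False (offset 32256)
print(partition_is_aligned(63, 4096))     # False
print(partition_is_aligned(64, 4096))     # True  (offset 32768)
print(partition_is_aligned(2048, 4096))   # True  (offset 1 MiB)
```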
|How to work with flash, rather than against it ?|
|Are you in sync ?||Common lore says that after sync all updated buffers are written to disk. That may not be the case - if you look at the man page for sync you will find that the standard only requires sync to schedule the dirty blocks for writing, not to wait for these writes to complete. Recent Linux kernels also seem to have some problems in this area...
Doing a simple sleep 5 after sync may not be enough to ensure proper writeback. Writing back 1 MB of dirty data as 4 KB random writes takes about 10 seconds on a flash card that can handle 25 IOPS. If the flash card is based on MLC flash, and your buffer cache is sufficiently big and dirty, sync could easily take an hour in pathological cases (all random accesses).|
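The writeback time can be estimated with a rough calculation. This sketch assumes every dirty block goes out as a separate 4 KB random write, which is the pathological case; the function name is made up.

```python
def writeback_seconds(dirty_bytes, random_write_iops, io_bytes=4096):
    """Worst-case time to flush dirty data as individual random writes."""
    # Ceiling division: number of write operations needed.
    writes = -(-dirty_bytes // io_bytes)
    return writes / random_write_iops


print(writeback_seconds(1024 * 1024, 25))       # 10.24 s: 1 MB at 25 IOPS
print(writeback_seconds(1024 * 1024, 2))        # 128.0 s: 1 MB on a slow MLC card
print(writeback_seconds(32 * 1024 * 1024, 2))   # 4096.0 s: 32 MB, over an hour
```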
|pdflush configuration||When disk blocks are updated, they are first written to the buffer cache
in memory. They are then written back to disk by a kernel process (pdflush).
This process scans the buffer cache and looks for "dirty" (modified) blocks
that have been sitting around for a certain time. The advantage of this is
that blocks can be combined for sequential writes, or might never make it
to disk in the first place. A good description can be found
/proc/sys/vm/dirty_expire_centisecs -> default = 3000. This means that dirty blocks expire after 30 seconds. I think 500 (5 seconds) would be more reasonable for a small system running on a flash disk. Do you really want your data to be exposed to power failures for 30 seconds ?
/proc/sys/vm/dirty_bytes -> default = 0. I don't think it is a good idea to let too much dirty data accumulate in the buffer cache. Something like 1000000 (1 MB) seems more reasonable to me.
/proc/sys/vm/dirty_writeback_centisecs -> default = 500. This means that the pdflush process is kicked off every 5 seconds. I think 100 (1 second) is a bit more reasonable.
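The three settings above could be inspected and turned into sysctl configuration lines like this. It is a sketch only: the helper names are invented, the values are simply the suggestions from the text, and nothing is written to the system. Note that, as far as I know, setting dirty_bytes to a nonzero value takes the place of the percentage-based dirty_ratio limit.

```python
from pathlib import Path

# Values suggested in the text above for a small flash-based system.
SUGGESTED = {
    "dirty_expire_centisecs": 500,      # expire dirty blocks after 5 s
    "dirty_bytes": 1000000,             # allow about 1 MB of dirty data
    "dirty_writeback_centisecs": 100,   # wake the flusher every 1 s
}


def read_vm_setting(name, base="/proc/sys/vm"):
    """Read the current value of one vm setting (Linux only)."""
    return int(Path(base, name).read_text().strip())


def sysctl_conf_lines(settings=SUGGESTED):
    """Render settings as 'vm.<name> = <value>' lines for a sysctl config file."""
    return [f"vm.{name} = {value}" for name, value in settings.items()]


if __name__ == "__main__":
    for name, wanted in SUGGESTED.items():
        print(name, "current:", read_vm_setting(name), "suggested:", wanted)
    print("\n".join(sysctl_conf_lines()))
```

Printing the lines rather than writing /proc directly keeps the sketch harmless; applying them requires root and is left to the reader.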
If in doubt, consider adding a disk activity LED to your ALIX board (can be added quite easily, see page 6 of the board schematics).
|Inspiration for flash file systems||LogFS
Article on Microsoft FlashStore
Self destructing flash drives (interesting article about forensics issues on SSDs)
NVM Express - coming soon to a PCI slot near you.
|© 2002-2016 PC Engines GmbH. All rights reserved.|