Flashcache : A Write Back Block Cache for Linux
Author: Mohan Srinivasan
-----------------------------------------------

Introduction :
============

Flashcache is a write back block cache Linux kernel module. This document
describes the design, future ideas, configuration and tuning of flashcache,
and concludes with a note covering the testability hooks within flashcache
and the testing that we did. Flashcache was built primarily as a block cache
for InnoDB, but it is general purpose and can be used by other applications
as well.

Design :
======

Flashcache is built using the Linux Device Mapper (DM), part of the Linux
Storage Stack infrastructure that facilitates building SW-RAID and other
components. LVM, for example, is built using the DM.

The cache is structured as a set associative hash, where the cache is
divided up into a number of fixed size sets (buckets), with linear probing
within a set to find blocks. The set associative hash has a number of
advantages (called out in the sections below) and works very well in
practice.

The block size, set size and cache size are configurable parameters,
specified at cache creation. The default set size is 512 (blocks) and there
is little reason to change this. In what follows, dbn refers to "disk block
number", the logical device block number in sectors.

To compute the target set for a given dbn

	target set = (dbn / block size / set size) mod (number of sets)

Once we have the target set, a linear probe within the set finds the block.
Note that a sequential range of disk blocks will all map onto a given set.

The DM layer breaks up all IOs into blocksize chunks before passing the IOs
down to the cache layer. By default, flashcache caches all full blocksize
IOs, but it can be configured to only cache random IO whilst ignoring
sequential IO.

The replacement policy is either FIFO or LRU within a cache set. The default
is FIFO, but the policy can be switched at any point at run time via a
sysctl (see the configuration and tuning section).

To handle a cache read, we compute the target set (from the dbn) and do a
linear search for the dbn in the set. In the case of a cache hit, the read
is serviced from flash. For a cache miss, the data is read from disk,
populated into flash, and then returned to the reader.

Since the cache is write back, a write only writes to flash, synchronously
updates the cache metadata (to mark the cache block as dirty) and completes
the write. On a block re-dirty, the metadata update is skipped.

It is important to note that in this first cut, cache writes are non-atomic,
ie, the "Torn Page Problem" exists. In the event of a power failure or a
failed write, part of the block could be written, resulting in a partial
write. We have ideas on how to fix this and provide atomic cache writes (see
the Futures section).

Each cache block has on-flash metadata associated with it for cache
persistence. This per-block metadata consists of the dbn (the disk block
cached in this slot) and flags (DIRTY, VALID, INVALID). Cache metadata is
only updated on a write or when a cache block is cleaned: the former marks
the state DIRTY and the latter marks it ~DIRTY. To minimize small flash
writes, cache block metadata is not updated in the read path. In addition,
we also have an on-flash cache superblock, which contains the cache
parameters (read on a cache reload) and whether the cache shutdown was clean
(orderly) or unclean (node crash, power failure etc.).

On a clean cache shutdown, metadata for all cache blocks is written out to
flash. After an orderly shutdown, both VALID and DIRTY blocks will persist
on a subsequent cache reload. After a node crash or a power failure, only
DIRTY cache blocks will persist on a subsequent cache reload. Node crashes
or power failures will not result in data loss, but they will result in the
cache losing VALID and non-DIRTY cached blocks.
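To make the set mapping and the per-block state described above concrete,
here is a minimal sketch in C. The structure and function names are
hypothetical and heavily simplified (they are not the actual flashcache data
structures); the sketch only illustrates the set arithmetic and the linear
probe within a set.

/*
 * Illustrative sketch only (not flashcache code). It shows the mapping
 *
 *	target set = (dbn / block size / set size) mod (number of sets)
 *
 * and the linear probe within a set used to find a cached block.
 */
#include <stdint.h>

typedef uint64_t sector_t;	/* dbn: disk block number, in sectors */

#define INVALID	0x0
#define VALID	0x1
#define DIRTY	0x2

/* Simplified per-cacheblock state: the cached dbn plus state flags. */
struct cacheblock_sketch {
	sector_t dbn;
	uint32_t state;
};

/* Simplified cache: num_sets * set_size cacheblocks, laid out linearly. */
struct cache_sketch {
	struct cacheblock_sketch *blocks;
	unsigned int block_size;	/* cache block size, in sectors */
	unsigned int set_size;		/* blocks per set (default 512) */
	unsigned int num_sets;
};

/* Map a dbn to its one possible target set. */
static unsigned int target_set(const struct cache_sketch *c, sector_t dbn)
{
	return (dbn / c->block_size / c->set_size) % c->num_sets;
}

/* Linear probe within the target set; returns the slot index, or -1 on
 * a cache miss. */
static int cache_lookup(const struct cache_sketch *c, sector_t dbn)
{
	unsigned int start = target_set(c, dbn) * c->set_size;

	for (unsigned int i = 0; i < c->set_size; i++) {
		const struct cacheblock_sketch *b = &c->blocks[start + i];

		if ((b->state & VALID) && b->dbn == dbn)
			return (int)(start + i);
	}
	return -1;
}

Because the division by the set size happens before the modulo, a sequential
run of disk blocks lands in the same set, which is what allows contiguous
dirty blocks to be merged into larger IOs when a set is cleaned (see below).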
Cache metadata updates are "batched" when possible. If we have pending
metadata updates for multiple cache blocks that fall on the same metadata
sector, we batch these updates into 1 flash metadata write. When a file is
written sequentially, we will commonly be able to batch several metadata
updates (resulting from sequential block writes) into 1 cache metadata
write.

Dirty cache blocks are written lazily to disk in the background.
Flashcache's lazy writing is controlled by a configurable dirty threshold
(see the configuration and tuning section). Flashcache strives to keep the
percentage of dirty blocks in each set below the dirty threshold. When the
dirty blocks in a set exceed the dirty threshold, the set becomes eligible
for cleaning.

Dirty blocks are also cleaned based on "idleness". By default, a dirty block
that has not been read or written for 15 minutes
(dev.flashcache.fallow_delay) will be cleaned. To disable idle cleaning, set
that value to 0. A two-handed clock-like algorithm is used to pick off
fallow dirty blocks to clean.

DIRTY blocks are selected for cleaning based on the replacement policy (FIFO
vs LRU). Once we have a target set of blocks to clean, we sort these blocks,
search for other contiguous dirty blocks in the set (which can be cleaned
for free, since they will be merged into a larger IO) and send the writes
down to the disk.

As mentioned earlier, the DM will break IOs into blocksize pieces before
passing them on to flashcache. For smaller (than blocksize) IOs, or IOs that
straddle 2 cache blocks, we pass the IO directly to disk. But before doing
so, we invalidate any cacheblocks that overlap the IO. If the overlapping
cacheblocks are DIRTY, we clean those cacheblocks and pass the new
overlapping IO to disk only after they are successfully cleaned.
Invalidating cacheblocks for an IO that straddles 2 cache blocks is easy
with a set associative hash: we need to search for overlaps in precisely 2
cache sets.

Flashcache has support for block checksums, which are computed on cache
population and validated on every cache read. Block checksums are a compile
time switch, turned off by default because of the "Torn Page" problem: if a
cache write fails after part of the block was committed to flash, the block
checksum will be wrong and any subsequent attempt to read that block will
fail (because of the checksum mismatch).

How much cache metadata overhead do we incur ? For each cache block, we have
24 bytes of in-memory state (on 64 bit architectures) and 16 bytes of
on-flash metadata. For a 300GB cache with 16KB blocks, we have approximately
20 million cacheblocks, resulting in an in-memory metadata footprint of
about 480MB. If we were to configure a 300GB cache with 4KB blocks, that
would quadruple to roughly 1.8GB.
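As a quick sanity check of those figures, here is a small stand-alone
calculation using the per-block costs quoted above. The program is purely
illustrative and is not part of flashcache.

/* Metadata overhead arithmetic: 24 bytes of in-memory state and 16 bytes
 * of on-flash metadata per cache block (the figures quoted above). */
#include <stdio.h>

int main(void)
{
	unsigned long long cache_bytes = 300ULL << 30;	/* 300GB cache      */
	unsigned long long block_bytes = 16ULL << 10;	/* 16KB cache block */
	unsigned long long nblocks = cache_bytes / block_bytes;

	/* ~20 million blocks; compare with the ~480MB figure quoted above. */
	printf("cache blocks       : %llu\n", nblocks);
	printf("in-memory metadata : %llu MB\n", nblocks * 24 / 1000000);
	printf("on-flash metadata  : %llu MB\n", nblocks * 16 / 1000000);

	/* With 4KB blocks there are 4x as many cacheblocks, so the in-memory
	 * footprint grows to roughly 1.8GB. */
	return 0;
}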
It is possible to mark IOs issued by particular pids as non-cacheable via
flashcache ioctls. If a process is about to scan a large table sequentially
(for a backup, say), it can mark itself as non-cacheable. For a read issued
by a "non cacheable" process, if the read results in a cache hit, the data
is served from cache. If the read results in a cache miss, the read is
served directly from disk (without a cache population). For a write issued
by a non cacheable process, the write is sent directly to disk. But before
that is done, we invalidate any overlapping cache blocks (cleaning them
first if necessary).

A few things to note about tagging pids non-cacheable. First, this only
really works reliably with Direct IO. For buffered IO, writes will almost
always happen from kernel threads (eg pdflush), so writes will continue to
be cached. For most filesystems, these ioctls will still make buffered reads
uncached: readaheads are kicked off from the filemap code, so the readaheads
are issued from the same context as the reads. Second, if a process that
marked itself non-cacheable dies, flashcache has no way of cleaning up (the
Linux kernel doesn't have an at_exit() hook), so applications have to work
around this (see the configuration section below). The cleanup issue can be
fixed by making the cache control aspect of flashcache a pseudo-filesystem,
so that the last close of the fd on process exit cleans things up (see
Futures for more details). In spite of these limitations, we think the
ability to mark Direct IOs issued by a pid as non-cacheable will be valuable
to prevent backups from wiping out the cache.

Alternatively, rather than specifically marking pids as non-cacheable, users
may wish to experiment with the sysctl 'skip_seq_thresh_kb', which disables
caching of IO determined to be sequential, above a configurable threshold of
consecutive reads or writes. The algorithm to spot sequential IO has some
ability to handle multiple 'flows' of IO, so it should, for example, be able
to skip caching of two flows of sequential reads or writes while still
caching IOs from a third, random IO flow. Note that multiple small files may
be written to consecutive blocks. If these are written out in a batch (e.g.
by an untar), this may appear as a single sequential write, and hence these
multiple small files will not be cached. The categorization of IO as
sequential or random occurs purely at the block level, not the file level.
(For a more detailed discussion of caching controls, see the SA Guide.)

Futures and Features :
====================

Cache Mirroring :
---------------
Mirroring the cache across 2 physical flash devices should work without any
code changes. Since the cache device is a block device, we can build a
RAID-1 block device out of the 2 physical flash devices and use that as our
cache device. (I have not yet tested this.)

Cache Resizing :
--------------
The easiest way to resize the cache is to bring the cache offline and then
resize it. Resizing the cache while it is active is complicated and bug
prone.

Integration with ATA TRIM Command :
---------------------------------
The ATA TRIM command was introduced as a way for the filesystem to inform
the ssd that certain blocks were no longer in use, to facilitate improved
wear levelling algorithms in the ssd controller. Flashcache can leverage
this as well. We can simply discard all blocks falling within a TRIM block
range from the cache, regardless of their state, since they are no longer
needed.
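Building on the hypothetical sketch from the Design section, the TRIM
integration could look roughly like the following. This is only an
illustration of the idea (it assumes cached dbns are block aligned and
reuses the made-up cache_sketch/cache_lookup names from above); it is not an
actual flashcache implementation.

/* Discard every cache block that falls inside a TRIM'ed sector range,
 * whether or not it is DIRTY, since the data is no longer needed.
 * Hypothetical sketch only (not flashcache code). */
static void trim_discard_range(struct cache_sketch *c, sector_t start,
			       sector_t nr_sectors)
{
	sector_t first = start - (start % c->block_size);
	sector_t end = start + nr_sectors;

	for (sector_t dbn = first; dbn < end; dbn += c->block_size) {
		int slot = cache_lookup(c, dbn);

		if (slot >= 0)
			c->blocks[slot].state = INVALID;  /* no writeback needed */
	}
}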
Deeper integration with filesystems :
-----------------------------------
Non-cacheability could be much better implemented with a deeper integration
of flashcache and the filesystem. The filesystem could easily tag IOs as
non-cacheable, based on user actions.

Fixing the "Torn Page Problem" (make Cache Writes atomic) :
---------------------------------------------------------
As mentioned above, cache block writes are non-atomic. If we have a power
failure or if the flash write fails, part of the block (a few sectors) could
be written out, corrupting the block. In this respect, flashcache behaves no
differently from a disk. We have ideas on how to fix this and achieve atomic
cache block writes using shadow paging techniques.

Mark C. says that we could avoid doublebuffer writes if we had atomic cache
block writes. However, with flashcache, doublebuffer writes will all be
absorbed by the flash (and we should get excellent write hits/overwrites for
doublebuffer blocks, so they would never hit disk). So it is not clear how
much of a win atomic cache block writes would be. It is, however, a handy
feature to provide, and if we had atomic block writes we could also enable
cache block checksums.

There are broadly 3 ways to fix this :

1) If the flash device offers configurable sector sizes, configure it to
   match the cache block size (FusionIO offers up to a 4KB configurable
   sector size).
2) If we choose a shadow page that falls in the same metadata sector as the
   page being overwritten, we can do the shadow page write and switch the
   metadata atomically.
3) If we don't want the restriction that the shadow page and the page being
   overwritten are part of the same metadata sector (to allow us to pick a
   shadow page more freely across the cache set), we would need to introduce
   a monotonically increasing timestamp per write in the cache metadata,
   which would let us disambiguate dirty blocks in the event of a crash.

Breaking up the cache spinlock :
------------------------------
All cache state is protected by a single spinlock. Currently, CPU
utilization in the cache routines is very low and there is no contention on
this spinlock. That may change in the future.

Make non-cacheability more robust :
---------------------------------
The non-cacheability support needs fixing in terms of cleanup when a process
dies. Probably the best way to do this is to turn the cache control aspect
of flashcache into a pseudo-filesystem.

Several other implementation TODOs/Futures are documented in the code.

Testing and Testability :
=======================

Stress Tester :
-------------
I modified NetApp's open source sio load generator, adding support for data
verification with block checksums maintained in an mmap'ed file, and I have
been stress testing the cache with this tool. With it we can vary the
read/write mix, the sequential/random IO mix, the block size, direct IO vs
buffered IO, the number of IO threads and so on. In addition, I've used
other workloads to stress test flashcache.

Error Injection :
---------------
I've added hooks for injecting all kinds of errors into the flashcache code
(flash IO errors, disk IO errors, and various kernel memory allocation
errors). Error injection is controlled by the sysctl "error_inject". Writing
one of the flags listed below into "error_inject" causes the next event of
that type to result in an error. The flag is cleared after the error is
simulated, so the flag must be set again for each error we want to simulate.
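Conceptually, each injection point in the code checks its flag, clears it
(so only the next event of that type fails), and then simulates the failure.
The snippet below is a hypothetical illustration of that one-shot behaviour;
the variable and function names are made up and it is not the actual
flashcache code.

/* Hypothetical illustration of an injection point's one-shot behaviour. */
#include <errno.h>

#define READDISK_ERROR	0x00000001	/* from the flag list below */

static unsigned int error_inject;	/* mirrors the "error_inject" sysctl */

static int maybe_inject_readdisk_error(void)
{
	if (error_inject & READDISK_ERROR) {
		error_inject &= ~READDISK_ERROR;  /* one shot: clear the flag */
		return -EIO;	/* pretend the disk read failed */
	}
	return 0;
}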
/* Error injection flags */
#define READDISK_ERROR				0x00000001
#define READCACHE_ERROR				0x00000002
#define READFILL_ERROR				0x00000004
#define WRITECACHE_ERROR			0x00000008
#define WRITECACHE_MD_ERROR			0x00000010
#define WRITEDISK_MD_ERROR			0x00000020
#define KCOPYD_CALLBACK_ERROR			0x00000040
#define DIRTY_WRITEBACK_JOB_ALLOC_FAIL		0x00000080
#define READ_MISS_JOB_ALLOC_FAIL		0x00000100
#define READ_HIT_JOB_ALLOC_FAIL			0x00000200
#define READ_HIT_PENDING_JOB_ALLOC_FAIL		0x00000400
#define INVAL_PENDING_JOB_ALLOC_FAIL		0x00000800
#define WRITE_HIT_JOB_ALLOC_FAIL		0x00001000
#define WRITE_HIT_PENDING_JOB_ALLOC_FAIL	0x00002000
#define WRITE_MISS_JOB_ALLOC_FAIL		0x00004000
#define WRITES_LIST_ALLOC_FAIL			0x00008000
#define MD_ALLOC_SECTOR_ERROR			0x00010000

I then use a script like this to simulate errors under heavy IO load.

#!/bin/bash
for ((debug = 0x00000001 ; debug <= 0x00010000 ; debug = debug*2))
do
	echo $debug > /proc/sys/dev/flashcache/error_inject
	sleep 1
done

Acknowledgements :
================

I would like to thank Bob English for doing a critical review of the design
and the code of flashcache, for discussing this in detail with me and
providing valuable suggestions.

The option to detect and skip sequential IO was added by Will Smith.