FlashCache System Administration Guide
--------------------------------------

Introduction :
============
Flashcache is a block cache for Linux, built as a kernel module,
using the Device Mapper. Flashcache supports writeback, writethrough
and writearound caching modes. This document is a quick administration
guide to flashcache.

Requirements :
============
Flashcache has been tested on a variety of kernels between 2.6.18 and
2.6.38. If you'd like to build and use it on a newer kernel, please send
me an email and I can help. I will not support kernels older than 2.6.18.

Choice of Caching Modes :
=========================
Writethrough - safest, all writes are cached to ssd but also written to
disk immediately. If your ssd has slower write performance than your disk
(likely for early generation SSDs purchased in 2008-2010), this may limit
your system write performance. All disk reads are cached (tunable).

Writearound - again, very safe, writes are not written to ssd but directly
to disk. Disk blocks will only be cached after they are read. All disk
reads are cached (tunable).

Writeback - fastest but less safe. Writes only go to the ssd initially, and
based on various policies are written to disk later. All disk reads are
cached (tunable).

Writeonly - variant of writeback caching. In this mode, only incoming
writes are cached. No reads are ever cached.

Cache Persistence :
=================
Writethrough and Writearound caches are not persistent across a device
removal or a reboot. Only Writeback caches are persistent across device
removals and reboots. This reinforces 'writeback is fastest',
'writethrough is safest'.

Known Bugs :
============
See https://github.com/facebook/flashcache/issues and report new issues
there please.

Data corruption has been reported when using a loopback device for the
cache device. See also the 'Futures and Features' section of the design
document, flashcache-doc.txt.

Cache creation and loading using the flashcache utilities :
=========================================================
Included are 3 utilities - flashcache_create, flashcache_load and
flashcache_destroy. These utilities use dmsetup internally, presenting
a simpler interface to create, load and destroy flashcache volumes.
It is expected that the majority of users can use these utilities
instead of using dmsetup.

flashcache_create : Create a new flashcache volume.

flashcache_create [-v] -p back|around|thru [-s cache size] [-w] [-b block size] cachedevname ssd_devname disk_devname
-v : verbose.
-p : cache mode (writeback/writethrough/writearound).
-s : cache size. Optional. If this is not specified, the entire ssd device
     is used as cache. The default unit is sectors, but you can specify
     k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2.
     The default unit is sectors, but you can specify k as the unit as
     well. (A 4KB blocksize is the correct choice for the vast majority of
     applications. But see the section "Cache Blocksize selection" below).
-f : force create. Bypass checks (e.g. for ssd sectorsize).
-w : write cache mode. Only writes are cached, not reads.
-d : disk associativity. Within each cache set, we store several
     contiguous disk extents. Defaults to off.

Examples :
flashcache_create -p back -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB writeback cache volume with a 4KB block size on ssd
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device
created is "cachedev".

flashcache_create -p thru -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb
Same as above but creates a write through cache with units specified in
sectors instead. The name of the device created is "cachedev".

flashcache_load : Load an existing writeback cache volume.

flashcache_load ssd_devname [cachedev_name]

Example :
flashcache_load /dev/sd
Load the existing writeback cache on /dev/sdc, using the virtual
cachedev_name from when the device was created. If you're upgrading from
an older flashcache device format that didn't store the cachedev name
internally, or you want to change the cachedev name used, you can specify
it as an optional second argument to flashcache_load.

For writethrough and writearound caches flashcache_load is not needed;
flashcache_create should be used each time.

flashcache_destroy : Destroy an existing writeback flashcache. All data
will be lost !!!

flashcache_destroy ssd_devname

Example :
flashcache_destroy /dev/sdc
Destroy the existing cache on /dev/sdc. All data is lost !!!

For writethrough and writearound caches this is not necessary.

Removing a flashcache volume :
============================
Use dmsetup remove to remove a flashcache volume. For writeback
cache mode, the default behavior on a remove is to clean all dirty
cache blocks to disk. The remove will not return until all blocks
are cleaned. Progress on disk cleaning is reported on the console
(also see the "fast_remove" flashcache sysctl).

A reboot of the node will also result in all dirty cache blocks being
cleaned synchronously (again see the note about "fast_remove" in the
sysctls section).

For writethrough and writearound caches, the device removal or reboot
results in the cache being destroyed. However, there is no harm in
doing a 'dmsetup remove' to tidy up before boot, and indeed
this will be needed if you ever need to unload the flashcache kernel
module (for example to load a new version into a running system).

Example:
dmsetup remove cachedev

This removes the flashcache volume named cachedev, cleaning all dirty
blocks prior to removal.

Cache Stats :
===========
Use 'dmsetup status' for cache statistics.

'dmsetup table' also dumps a number of cache related statistics.

Examples :
dmsetup status cachedev
dmsetup table cachedev

Flashcache errors are reported in
/proc/flashcache//flashcache_errors

Flashcache stats are also reported in
/proc/flashcache//flashcache_stats
for easier parseability.

Using Flashcache sysVinit script (Redhat based systems):
=======================================================
Note that this section applies only to Redhat based systems. Use
'utils/flashcache' from the repository as the sysvinit script.
This script loads, unloads and reports statistics for an existing
flashcache writeback cache volume. It loads the already created cachedev
during system boot and removes the flashcache volume before the system
halts. This script is necessary because a kernel panic occurs when a
flashcache volume is not removed before the system halts.

Configuring the script using chkconfig:

1. Copy 'utils/flashcache' from the repo to '/etc/init.d/flashcache'
2. Make sure this file has execute permissions,
   'sudo chmod +x /etc/init.d/flashcache'.
3. Edit this file and specify the values for the following variables
   SSD_DISK, BACKEND_DISK, CACHEDEV_NAME, MOUNTPOINT, FLASHCACHE_NAME
4. Modify the headers in the file if necessary.
   By default, it starts in runlevel 3, with start-stop priority 90-10
5. Register this file using chkconfig
   'chkconfig --add /etc/init.d/flashcache'

Cache Blocksize selection :
=========================
Cache blocksize selection is critical for good cache utilization and
performance.

A 4KB cache blocksize is the right choice for the vast majority of
workloads (and filesystems).

Cache Metadata Blocksize selection :
==================================
This section only applies to the writeback cache mode. Writethrough and
writearound modes store no cache metadata at all.

In Flashcache version 1, the metadata blocksize was fixed at 1 sector
(512 bytes). Flashcache version 2 removes this limitation. In version 2,
we can configure a larger flashcache metadata blocksize.
Version 2 maintains backwards compatibility for caches created with
Version 1. For these cases, a metadata blocksize of 512 bytes will
continue to be used.

flashcache_create -m can be used to optionally configure the metadata
blocksize. Defaults to 4KB.

Ideal choices for the metadata blocksize are 4KB (default) or 8KB. There
is little benefit to choosing a metadata blocksize greater than 8KB. The
choice of metadata blocksize is subject to the following rules :

1) Metadata blocksize must be a power of 2.
2) Metadata blocksize cannot be smaller than the sector size configured
   on the ssd device.
3) A single metadata block cannot contain metadata for 2 cache sets. In
   other words, with the default associativity of 512 (with each cache
   metadata slot sizing at 16 bytes), the entire metadata for a given set
   fits in 8KB (512*16b). For an associativity of 512, we cannot configure
   a metadata blocksize greater than 8KB.

Advantages of choosing a larger (than 512b) metadata blocksize :
- Allows the ssd to be configured to larger sectors. For example, some
  ssds allow choosing a 4KB sector, often a more performant choice.
- Allows flashcache to do better batching of metadata updates, potentially
  reducing metadata updates and small ssd writes, which reduces write
  amplification and increases ssd lifetime.

Thanks due to Earle Philhower of Virident for this feature !

FlashCache Sysctls :
==================
Flashcache sysctls operate on a per-cache device basis. A couple of
examples first.

Sysctls for a writearound or writethrough mode cache :
cache device /dev/ram3, disk device /dev/ram4

dev.flashcache.ram3+ram4.cache_all = 1
dev.flashcache.ram3+ram4.zero_stats = 0
dev.flashcache.ram3+ram4.reclaim_policy = 0
dev.flashcache.ram3+ram4.pid_expiry_secs = 60
dev.flashcache.ram3+ram4.max_pids = 100
dev.flashcache.ram3+ram4.do_pid_expiry = 0
dev.flashcache.ram3+ram4.io_latency_hist = 0
dev.flashcache.ram3+ram4.skip_seq_thresh_kb = 0

Sysctls for a writeback mode cache :
cache device /dev/sdb, disk device /dev/cciss/c0d2

dev.flashcache.sdb+c0d2.fallow_delay = 900
dev.flashcache.sdb+c0d2.fallow_clean_speed = 2
dev.flashcache.sdb+c0d2.cache_all = 1
dev.flashcache.sdb+c0d2.fast_remove = 0
dev.flashcache.sdb+c0d2.zero_stats = 0
dev.flashcache.sdb+c0d2.reclaim_policy = 0
dev.flashcache.sdb+c0d2.pid_expiry_secs = 60
dev.flashcache.sdb+c0d2.max_pids = 100
dev.flashcache.sdb+c0d2.do_pid_expiry = 0
dev.flashcache.sdb+c0d2.max_clean_ios_set = 2
dev.flashcache.sdb+c0d2.max_clean_ios_total = 4
dev.flashcache.sdb+c0d2.dirty_thresh_pct = 20
dev.flashcache.sdb+c0d2.stop_sync = 0
dev.flashcache.sdb+c0d2.do_sync = 0
dev.flashcache.sdb+c0d2.io_latency_hist = 0
dev.flashcache.sdb+c0d2.skip_seq_thresh_kb = 0

Sysctls common to all cache modes :

dev.flashcache..cache_all:
    Global caching mode to cache everything or cache nothing.
    See section on Caching Controls. Defaults to "cache everything".
dev.flashcache..zero_stats:
    Zero stats (once).
dev.flashcache..reclaim_policy:
    FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at runtime.
dev.flashcache..io_latency_hist:
    Compute IO latencies and plot these out on a histogram.
    The scale is 250 usecs. This is disabled by default since
    internally flashcache uses gettimeofday() to compute latency
    and this can get expensive depending on the clocksource used.
    Setting this to 1 enables computation of IO latencies.
    The IO latency histogram is appended to 'dmsetup status'.

(There is little reason to tune these)
dev.flashcache..max_pids:
    Maximum number of pids in the white/black lists.
dev.flashcache..do_pid_expiry:
    Enable expiry on the list of pids in the white/black lists.
dev.flashcache..pid_expiry_secs:
    Set the expiry on the pid white/black lists.
dev.flashcache..skip_seq_thresh_kb:
    Skip (don't cache) sequential IO larger than this number (in kb).
    0 (default) means cache all IO, both sequential and random.
    Sequential IO can only be determined 'after the fact', so
    this much of each sequential I/O will be cached before we skip
    the rest. Does not affect searching for IO in an existing cache.

Sysctls for writeback mode only :

dev.flashcache..fallow_delay = 900
    In seconds. Clean dirty blocks that have been "idle" (not
    read or written) for fallow_delay seconds. Default is 15
    minutes. Setting this to 0 disables idle cleaning completely.
dev.flashcache..fallow_clean_speed = 2
    The maximum number of "fallow clean" disk writes per set
    per second. Defaults to 2.
dev.flashcache..fast_remove = 0
    Don't sync dirty blocks when removing cache. On a reload
    both DIRTY and CLEAN blocks persist in the cache. This
    option can be used to do a quick cache remove.
    CAUTION: The cache still has uncommitted (to disk) dirty
    blocks after a fast_remove.
dev.flashcache..dirty_thresh_pct = 20
    Flashcache will attempt to keep the dirty blocks in each set
    under this %. A lower dirty threshold increases disk writes,
    and reduces block overwrites, but increases the blocks
    available for read caching.
dev.flashcache..stop_sync = 0
    Stop the sync in progress.
dev.flashcache..do_sync = 0
    Schedule cleaning of all dirty blocks in the cache.

(There is little reason to tune these)
dev.flashcache..max_clean_ios_set = 2
    Maximum writes that can be issued per set when cleaning
    blocks.
dev.flashcache..max_clean_ios_total = 4
    Maximum writes that can be issued when syncing all blocks.
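
The two listings at the start of this section suggest that the per-cache
part of each sysctl name is the cache device basename, a '+', and the disk
device basename (ram3+ram4, sdb+c0d2). The Python sketch below builds a key
under that assumption. The helper name flashcache_sysctl_key is hypothetical
and is not part of flashcache; check the names actually present under
dev.flashcache on your system before relying on it.

```python
import os


def flashcache_sysctl_key(cache_dev, disk_dev, name):
    """Build a key such as dev.flashcache.sdb+c0d2.fast_remove.

    Assumption (inferred from the example listings, not from the
    flashcache source): the per-cache component is
    "<cache device basename>+<disk device basename>".
    """
    cache = os.path.basename(cache_dev)
    disk = os.path.basename(disk_dev)
    return "dev.flashcache.{}+{}.{}".format(cache, disk, name)


if __name__ == "__main__":
    # Devices taken from the writeback example listing in this section.
    print(flashcache_sysctl_key("/dev/sdb", "/dev/cciss/c0d2", "fast_remove"))
```

The resulting string can then be passed to the usual sysctl tooling to read
or set the value.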

Using dmsetup to create and load flashcache volumes :
===================================================
Few users will need to use dmsetup natively to create and load
flashcache volumes. This section covers that.

dmsetup create device_name table_file

where

device_name: name of the flashcache device being created or loaded.
table_file : other cache args (format below). If this is omitted, dmsetup
             attempts to read this from stdin.

table_file format :
0 flashcache [size of cache in sectors] [cache set size]

cache mode:
    1: Write Back
    2: Write Through
    3: Write Around

flashcache cmd:
    1: load existing cache
    2: create cache
    3: force create cache (overwriting existing cache). USE WITH CAUTION

blksize in sectors:
    4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.
    See the note on block size selection above.
    Unused (can be omitted) for cache loads.

size of cache in sectors:
    Optional. If size is not specified, the entire ssd device is used as
    cache. Needs to be a power of 2.
    Unused (can be omitted) for cache loads.

cache set size:
    Optional. The default set size is 512, which works well for most
    applications. Little reason to change this. Needs to be a
    power of 2.
    Unused (can be omitted) for cache loads.

Example :

echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 cachedev 1 2 8 522000000 | dmsetup create cachedev

This creates a writeback cache device called "cachedev"
(/dev/mapper/cachedev) with a 4KB blocksize to cache /dev/cciss/c0d1p2 on
/dev/fioa2. The size of the cache is 522000000 sectors.

(TODO : Change loading of the cache to happen via "dmsetup load" instead
of "dmsetup create").

Caching Controls
================
Flashcache can be put in one of 2 modes - Cache Everything or
Cache Nothing (dev.flashcache.cache_all). The default is to "cache
everything".

These 2 modes have a blacklist and a whitelist.

The tgid (thread group id) for a group of pthreads can be used as a
shorthand to tag all threads in an application.
The tgid for a pthread is returned by getpid() and the pid of the
individual thread is returned by gettid().

The algorithm works as follows :

In "cache everything" mode,
1) If the pid of the process issuing the IO is in the blacklist, do
   not cache the IO. ELSE,
2) If the tgid is in the blacklist, don't cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
   whitelist, which makes the IO cacheable).
4) Finally, even if IO is cacheable up to this point, skip sequential IO
   if configured by the sysctl.

Conversely, in "cache nothing" mode,
1) If the pid of the process issuing the IO is in the whitelist,
   cache the IO. ELSE,
2) If the tgid is in the whitelist, cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
   blacklist, which makes the IO non-cacheable).
4) Anything whitelisted is cached, regardless of sequential or random
   IO.

Examples :
--------
1) You can make the global cache setting "cache nothing", and add the
   tgid of your pthreaded application to the whitelist. Which makes only
   IOs issued by your application cacheable by Flashcache.
2) You can make the global cache setting "cache everything" and add
   tgids (or pids) of other applications that may issue IOs on this
   volume to the blacklist, which will make those un-interesting IOs not
   cacheable.

Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush,
kswapd would also do the writes, with flashcache caching those.

The following cacheability ioctls are supported on /dev/mapper/

FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to
    cleanup if a process dies.

FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to
    cleanup if a process dies.
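
The cacheability rules described earlier in this section can be modelled
in a few lines of Python. This is only an illustrative sketch of the
decision order as the text describes it, not the kernel module's code. The
function name is_cacheable and its parameters are hypothetical, and the
sequential-IO check is reduced to a boolean supplied by the caller.

```python
def is_cacheable(cache_all, pid, tgid, blacklist, whitelist,
                 skip_sequential=False):
    """Model of the documented decision order (illustrative only).

    cache_all       -- True for "cache everything", False for "cache nothing"
    blacklist       -- set of pids/tgids on the blacklist
    whitelist       -- set of pids/tgids on the whitelist
    skip_sequential -- True if this IO was classified as sequential and
                       exceeds skip_seq_thresh_kb (assumed to be decided
                       elsewhere)
    """
    if cache_all:
        if pid in blacklist:
            cacheable = False            # rule 1: pid is blacklisted
        elif tgid in blacklist:
            cacheable = pid in whitelist  # rules 2 and 3: tgid blacklisted,
                                          # unless this pid is an exception
        else:
            cacheable = True
        # rule 4: sequential skipping applies to otherwise cacheable IO
        return cacheable and not skip_sequential

    # "cache nothing" mode
    if pid in whitelist:
        return True                      # rule 1: pid is whitelisted
    if tgid in whitelist:
        return pid not in blacklist      # rules 2 and 3
    return False                         # nothing else is cached


if __name__ == "__main__":
    # "cache nothing" with the application's tgid (10) whitelisted:
    # a thread with pid 11 in that thread group is cached.
    print(is_cacheable(False, 11, 10, set(), {10}))
```

Note that in "cache nothing" mode the sketch ignores skip_sequential,
matching rule 4 of that mode (anything whitelisted is cached).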

/proc/flashcache_pidlists shows the list of pids on the whitelist
and the blacklist.

Security Note :
=============
With Flashcache, it is possible for a malicious user process to
corrupt data in files with only read access. In a future revision
of flashcache, this will be addressed (with an extra data copy).

Not documenting the mechanics of how a malicious process could
corrupt data here.

You can work around this by setting file permissions on files in
the flashcache volume appropriately.

Why is my cache only (<< 100%) utilized ?
=======================================
(Answer contributed by Will Smith)

- There is essentially a 1:many mapping between SSD blocks and HDD blocks.
- In more detail, a HDD block gets hashed to a set on SSD which contains
  by default 512 blocks. It can only be stored in that set on SSD, nowhere
  else.

So with a simplified SSD containing only 3 sets:
SSD = 1 2 3 , and
a HDD with 9 sets worth of data, the HDD sets would map to the SSD sets
like this:

HDD: 1 2 3 4 5 6 7 8 9
SSD: 1 2 3 1 2 3 1 2 3

So if your data only happens to live in HDD sets 1 and 4, they will
compete for SSD set 1 and your SSD will at most become 33% utilized.

If you use XFS you can tune the XFS agsize/agcount to try and mitigate
this (described in the next section).

Tuning XFS for better flashcache performance :
============================================
If you run XFS/Flashcache, it is worth tuning XFS' allocation group
parameters (agsize/agcount) to achieve better flashcache performance.
XFS allocates blocks for files in a given directory in a new
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),
we can achieve much better distribution of blocks across
flashcache. Better distribution of blocks across flashcache will
decrease collisions on flashcache sets considerably, increase cache
hit rates significantly and result in lower IO latencies.

We can achieve this by computing agsize (and implicitly agcount) using
these equations,

C = Cache size,
V = Size of filesystem Volume.

agsize % C = (1/agcount)*C
agsize * agcount ~= V

where agsize <= 1000g (XFS limits on agsize).

A couple of examples that illustrate the formula,

For agcount = 4, let's divide up the cache into 4 equal parts (each
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One
ideal way to map the allocation groups onto the cache is as follows.

Ag1   Ag2   Ag3   Ag4
--    --    --    --
C1    C2    C3    C4    (stripe 1)
C2    C3    C4    C1    (stripe 2)
C3    C4    C1    C2    (stripe 3)
C4    C1    C2    C3    (stripe 4)
C1    C2    C3    C4    (stripe 5)

In this simple example, note that each "stripe" has 2 properties
1) Each element of the stripe is a unique part of the cache.
2) The union of all the parts for a stripe gives us the entire cache.

Clearly, this is an ideal mapping, from a distribution across the
cache point of view.

Another example, this time with agcount = 5, the cache is divided into
5 equal parts C1, .. C5.

Ag1   Ag2   Ag3   Ag4   Ag5
--    --    --    --    --
C1    C2    C3    C4    C5    (stripe 1)
C2    C3    C4    C5    C1    (stripe 2)
C3    C4    C5    C1    C2    (stripe 3)
C4    C5    C1    C2    C3    (stripe 4)
C5    C1    C2    C3    C4    (stripe 5)
C1    C2    C3    C4    C5    (stripe 6)

A couple of examples that compute the optimal agsize for a given
Cachesize and Filesystem volume size.

a) C = 600g, V = 3.5TB
Consider agcount = 5

agsize % 600 = (1/5)*600
agsize % 600 = 120

So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)

b) C = 150g, V = 3.5TB
Consider agcount = 4

agsize % 150 = (1/4)*150
agsize % 150 = 37.5

So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)

As an alternative,

agsize % C = (1 - (1/agcount))*C
agsize * agcount ~= V

works just as well as the formula above.

This computation has been implemented in the utils/get_agsize utility.
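
To make the arithmetic in examples a) and b) concrete, here is a minimal
Python sketch that applies the first formula. It is not the
utils/get_agsize implementation. The function name suggest_agsize, its
parameters, and the rule of picking the candidate closest to V/agcount are
assumptions made for illustration. Sizes are in gigabytes, and 3.5TB is
taken as 3500g.

```python
def suggest_agsize(cache_gb, volume_gb, agcount, max_agsize_gb=1000.0):
    """Return an agsize (in GB) with agsize % C == C/agcount.

    Among all candidates k*C + C/agcount that respect the agsize limit,
    pick the one closest to V/agcount, so that agsize * agcount ~= V.
    """
    remainder = cache_gb / agcount   # required value of agsize % C
    target = volume_gb / agcount     # agsize that would give exactly V
    if remainder > max_agsize_gb:
        raise ValueError("no agsize within the limit satisfies the formula")
    # Largest k for which k*C + remainder still respects the agsize limit.
    k_max = int((max_agsize_gb - remainder) // cache_gb)
    candidates = [k * cache_gb + remainder for k in range(k_max + 1)]
    return min(candidates, key=lambda agsize: abs(agsize - target))


if __name__ == "__main__":
    print(suggest_agsize(600, 3500, 5))   # example a)
    print(suggest_agsize(150, 3500, 4))   # example b)
```

For example a) the sketch selects 720, and for example b) it selects
937.5, which the text rounds to 937g.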

Tuning Sequential IO Skipping for better flashcache performance
===============================================================
Skipping sequential IO makes sense in two cases:

1) The sequential write speed of your SSD is slower than
   the sequential write speed or read speed of your disk. In
   particular, for implementations with RAID disks (especially
   modes 0, 10 or 5) sequential reads may be very fast. If
   'cache_all' mode is used, every disk read miss must also be
   written to SSD. If you notice slower sequential reads and writes
   after enabling flashcache, this is likely your problem.

2) Your 'resident set' of disk blocks that you want cached, i.e.
   those that you would hope to keep in cache, is smaller
   than the size of your SSD. You can check this by monitoring
   how quickly your cache fills up ('dmsetup table'). If this
   is the case, it makes sense to prioritize caching of random IO,
   since SSD performance vastly exceeds disk performance for
   random IO, but is typically not much better for sequential IO.

In the above cases, start with a high value (say 1024k) for
sysctl dev.flashcache..skip_seq_thresh_kb, so only the
largest sequential IOs are skipped, and gradually reduce it
if benchmarks show it's helping. If it is not helping, don't leave it
set to a very high value; return it to 0 (the default), since there is
some overhead in categorizing IO as random or sequential.

If neither of the above hold, continue to cache all IO
(the default); you will likely benefit from it.

Further Information
===================
Git repository : https://github.com/facebook/flashcache/
Developer mailing list : http://groups.google.com/group/flashcache-dev/