STEC EnhanceIO SSD Caching Software
25th December, 2012
1. WHAT IS ENHANCEIO?
The EnhanceIO driver is based on the EnhanceIO SSD caching software product
developed by STEC Inc. EnhanceIO was derived from Facebook's open source
Flashcache project. EnhanceIO uses SSDs as cache devices for
traditional rotating hard disk drives (referred to as source volumes
throughout this document).
EnhanceIO can work with any block device, be it an entire physical
disk, an individual disk partition, a RAIDed DAS device, a SAN volume,
a device mapper volume or a software RAID (md) device.
The source volume to SSD mapping is a set-associative mapping based on
the source volume sector number with a default set size
(aka associativity) of 512 blocks and a default block size of 4 KB.
Partial cache blocks are not used. The default value of 4 KB is chosen
because it is the common I/O block size of most storage systems. With
these default values, each cache set is 2 MB (512 * 4 KB). Therefore,
a 400 GB SSD will have a little less than 200,000 cache sets because a
little space is used for storing the meta data on the SSD.
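The arithmetic above can be sketched in a few lines of Python. This is only
an illustration: the function names are hypothetical, the set count ignores
the SSD space reserved for meta data (so it is an upper bound), and the
sector-to-set mapping shown is a simple modulo scheme assumed for the
example, not necessarily the mapping the driver uses.

```python
BLOCK_SIZE = 4 * 1024   # default cache block size in bytes (4 KB)
ASSOCIATIVITY = 512     # default set size in blocks
SECTOR_SIZE = 512       # bytes per sector (assumed for this sketch)


def cache_set_count(ssd_bytes, block_size=BLOCK_SIZE, assoc=ASSOCIATIVITY):
    """Upper bound on the number of cache sets (meta data space ignored)."""
    return ssd_bytes // (block_size * assoc)


def set_index(sector, num_sets, block_size=BLOCK_SIZE, assoc=ASSOCIATIVITY):
    """Hypothetical mapping: each set-sized region of the source volume
    maps to one cache set, wrapping around modulo the number of sets."""
    block = sector * SECTOR_SIZE // block_size
    return (block // assoc) % num_sets


if __name__ == "__main__":
    sets = cache_set_count(400 * 10**9)   # a 400 GB SSD, decimal gigabytes
    print("cache sets (upper bound):", sets)
    print("set for sector 4096:", set_index(4096, sets))
```

Whether "400 GB" is read as decimal or binary gigabytes shifts the result by
a few percent, which is why the text above only gives an approximate figure.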
EnhanceIO supports three caching modes: read-only, write-through, and
write-back, and three cache replacement policies: random, FIFO, and LRU.
Read-only caching mode causes EnhanceIO to direct write IO requests only
to HDD. Read IO requests are issued to HDD and the data read from HDD is
stored on SSD. Subsequent Read requests for the same blocks are carried
out from SSD, thus reducing their latency by a substantial amount.
In Write-through mode, reads are handled as in Read-only mode.
Write-through mode causes EnhanceIO to write application data to both
HDD and SSD. Subsequent reads of the same data benefit because they can
be served from SSD.
Write-back improves write latency by writing application requested data
only to SSD. This data, referred to as dirty data, is copied later to
HDD asynchronously. Reads are handled as in Read-only and
Write-through modes.
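The three modes can be summarized as a small routing table. The following
Python sketch is a toy model of the description above, not driver code; the
function name, the mode abbreviations ("ro", "wt", "wb") and the operation
labels are all made up for the illustration.

```python
def devices_touched(mode, op):
    """Return the devices an I/O touches before it completes.

    mode: "ro" (read-only), "wt" (write-through) or "wb" (write-back)
    op:   "read_hit", "read_miss" or "write"
    """
    if mode not in ("ro", "wt", "wb"):
        raise ValueError("unknown mode: %r" % mode)
    if op == "read_hit":
        return ["SSD"]            # served from the cache
    if op == "read_miss":
        return ["HDD", "SSD"]     # read from HDD, then stored on SSD
    if op == "write":
        return {
            "ro": ["HDD"],         # writes go to HDD only
            "wt": ["HDD", "SSD"],  # writes go to both devices
            "wb": ["SSD"],         # dirty data is copied to HDD later
        }[mode]
    raise ValueError("unknown op: %r" % op)


if __name__ == "__main__":
    for mode in ("ro", "wt", "wb"):
        print(mode, devices_touched(mode, "write"))
```

Reads behave the same way in all three modes; only the write path differs.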
2. WHAT HAS ENHANCEIO CHANGED FROM FLASHCACHE?
2.1. A new write-back engine
The write-back engine in EnhanceIO has been designed from scratch.
Several optimizations have been made. IO completion guarantees have
been improved. We have defined limits to let a user control the amount
of dirty data in a cache. Clean-up of dirty data is stopped by default
under a high load; this can be overridden if required. A user can
control the extent to which a single cache set can be filled with dirty
data. A background thread cleans up dirty data at regular intervals.
Clean-up is also done at regular intervals by identifying cache sets
which have been written least recently.
2.2. Transparent cache
EnhanceIO does not use device mapper. This enables creation and
deletion of caches while a source volume is being used. It's possible
to either create or delete a cache while a partition is mounted.
EnhanceIO also supports creation of a cache for a device which contains
partitions. With this feature it's possible to create a cache without
worrying about having to create several SSD partitions and many
separate caches.
2.3. Large I/O Support
Unlike Flashcache, EnhanceIO does not cause source volume I/O requests
to be split into cache block size pieces. For the typical SSD cache
block size of 4 KB, this means that a write I/O request size of, say,
64 KB to the source volume is not split into 16 individual requests of
4 KB each. This is a performance improvement over Flashcache. IO
codepaths have been substantially modified for this improvement.
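The 64 KB example works out as follows. This Python sketch only counts
requests under the two behaviours described above; the function name and the
`split` flag are hypothetical, and the model ignores any other reasons a
request might be divided.

```python
def requests_issued(io_bytes, block_size=4096, split=True):
    """Number of requests issued for one I/O of io_bytes.

    split=True models splitting into cache-block-size pieces;
    split=False models passing the request through as a single I/O.
    """
    if not split:
        return 1
    # Ceiling division: a partial trailing block still needs a request.
    return -(-io_bytes // block_size)


if __name__ == "__main__":
    print("with splitting:   ", requests_issued(64 * 1024, split=True))
    print("without splitting:", requests_issued(64 * 1024, split=False))
```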
2.4. Small Memory Footprint
Through a special compression algorithm, the meta data RAM usage has
been reduced to only 4 bytes for each SSD cache block (versus 16 bytes
in Flashcache). Since the most typical SSD cache block size is 4 KB,
this means that RAM usage is 0.1% (1/1000) of SSD capacity.
For example, for a 400 GB SSD, EnhanceIO will need only 400 MB to keep
all meta data in RAM.
For an SSD cache block size of 8 KB, RAM usage is 0.05% (1/2000) of SSD
capacity.
The compression algorithm needs at least 32,768 cache sets
(i.e., 16 bits to encode the set number). If the SSD capacity is small
and there are not at least 32,768 cache sets, EnhanceIO uses 8 bytes of
RAM for each SSD cache block. In this case, RAM usage is 0.2% (2/1000)
of SSD capacity for a cache block size of 4K.
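The rule above can be written as a short estimate. This Python sketch is an
approximation under stated assumptions: the function and constant names are
hypothetical, and the whole SSD is treated as cache blocks, so the space
taken by on-SSD meta data is ignored.

```python
MIN_SETS_FOR_COMPRESSION = 32768  # threshold quoted in the text above


def metadata_ram_bytes(ssd_bytes, block_size=4096, assoc=512):
    """Estimate the RAM needed for cache meta data.

    Uses 4 bytes per cache block when there are enough cache sets for
    the compressed format, and 8 bytes per cache block otherwise.
    """
    blocks = ssd_bytes // block_size
    sets = blocks // assoc
    per_block = 4 if sets >= MIN_SETS_FOR_COMPRESSION else 8
    return blocks * per_block


if __name__ == "__main__":
    GIB = 2**30
    for size_gib in (32, 64, 400):
        ram_mib = metadata_ram_bytes(size_gib * GIB) / 2**20
        print("%d GiB SSD -> %.0f MiB of meta data RAM" % (size_gib, ram_mib))
```

With the default 2 MB set size, the 32,768-set threshold corresponds to an
SSD of about 64 GB; below that size the estimate doubles per block.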
2.5. Loadable Replacement Policies
Since the SSD cache size is typically 10%-20% of the source volume
size, the set-associative nature of EnhanceIO necessitates cache
block replacement.
The main EnhanceIO kernel module that implements the caching engine
uses a random (actually, almost like round-robin) replacement policy
that does not require any additional RAM and has the least CPU
overhead. However, there are two additional kernel modules that
implement FIFO and LRU replacement policies. FIFO is the default cache
replacement policy because it uses less RAM than LRU. The FIFO and LRU
kernel modules are independent of each other and do not have to be
loaded if they are not needed.
Since the replacement policy modules do not consume much RAM when not
used, both modules are typically loaded after the main caching engine
is loaded. RAM is used only after a cache has been instantiated to use
either the FIFO or the LRU replacement policy.
Please note that the RAM used for replacement policies is in addition
to the RAM used for meta data (mentioned in Section 2.4). The table
below shows how much RAM each cache replacement policy uses:
POLICY    RAM USAGE
------    ---------
Random    0
FIFO      4 bytes per cache set
LRU       4 bytes per cache set + 4 bytes per cache block
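The table translates directly into a small calculation. In this Python
sketch the function name and the lowercase policy strings are hypothetical;
only the per-set and per-block byte counts come from the table.

```python
def policy_ram_bytes(policy, num_sets, assoc=512):
    """Extra RAM used by a replacement policy, per the table above."""
    if policy == "random":
        return 0
    if policy == "fifo":
        return 4 * num_sets
    if policy == "lru":
        num_blocks = num_sets * assoc
        return 4 * num_sets + 4 * num_blocks
    raise ValueError("unknown policy: %r" % policy)


if __name__ == "__main__":
    num_sets = 204800  # e.g. 400 GiB of cache with 2 MiB sets
    for policy in ("random", "fifo", "lru"):
        print(policy, policy_ram_bytes(policy, num_sets), "bytes")
```

With the default associativity, LRU costs roughly as much RAM again as the
compressed meta data itself, while FIFO adds well under 1 MB for this size.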
2.6. Optimal Alignment of Data Blocks on SSD
EnhanceIO writes all meta data and data blocks on 4K-aligned blocks
on the SSD. This minimizes write amplification and flash wear.
It also improves performance.
2.7. Improved device failure handling
Failure of an SSD device in read-only and write-through modes is
handled gracefully by allowing I/O to continue to/from the
source volume. An application may notice a drop in performance but it
will not receive any I/O errors.
Failure of an SSD device in write-back mode obviously results in the
loss of dirty blocks in the cache. To guard against this data loss, two
SSD devices can be mirrored via RAID 1.
EnhanceIO identifies device failures based on error codes. Depending on
whether the failure is likely to be intermittent or permanent, it takes
the best suited action.
2.8. Coding optimizations
Several coding optimizations have been made to reduce CPU usage. These
include removing queues which are not required for write-through and
read-only cache modes, splitting of a single large spinlock, and more.
Most of the code paths in Flashcache have been substantially
restructured.
2.9. Sequential I/O bypass
EnhanceIO has removed the bypass of sequential IO available in Flashcache.
The sequential detection logic has a limited use case, especially in a
reasonably multithreaded scenario.
3. ENHANCEIO USAGE
3.1. Cache creation, deletion and editing properties
The eio_cli utility is used for creating and deleting caches and editing
their properties. The man page for this utility, eio_cli(8), provides more
information.
3.2. Making a cache configuration persistent
It's essential that a cache be resumed before any applications or a
filesystem use the source volume during a bootup. If a cache is enabled
after a source volume is written to, stale data may be present in the
cache. It may cause data corruption. The document Persistent.txt
describes how to enable a cache during bootup using udev scripts.
In case an SSD does not come up during a bootup, it's ok to allow read
and write access to HDD only in the case of a Write-through or a
read-only cache. A cache should be created again when SSD becomes
available. If a previous cache configuration is resumed, it may cause
stale data to be read.
3.3. Using a Write-back cache
It's absolutely necessary to make a Write-back cache configuration
persistent. This is required particularly in the case of an OS crash or
a power failure. A Write-back cache may contain dirty blocks which
haven't been written to HDD yet. Reading the source volume without
enabling the cache will cause incorrect data to be read.
In case an SSD does not come up during a bootup, access to HDD should
be stopped. It should be enabled only after the SSD comes up and a cache is
enabled.
A Write-back cache needs to perform a clean operation in order to flush the
dirty data to the source device (HDD). The clean can be either triggered
by the user or automatically initiated, based on preconfigured
thresholds. These thresholds are described below. They can be set using
sysctl calls.
a) Dirty high threshold (%): The upper limit on percentage of dirty
   blocks in the entire cache.
b) Dirty low threshold (%): The lower limit on percentage of dirty
   blocks in the entire cache.
c) Dirty set high threshold (%): The upper limit on percentage of dirty
   blocks in a set.
d) Dirty set low threshold (%): The lower limit on percentage of dirty
   blocks in a set.
e) Automatic clean-up threshold: An automatic clean-up of the cache
   will occur only if the number of outstanding I/O requests from the
   HDD is below the threshold.
f) Time based clean-up interval (minutes): This option allows you to
   specify an interval between each clean-up process.
A clean is triggered when one of the upper thresholds or the time-based
clean-up interval is reached, and it stops when all the lower thresholds
are met.
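The start and stop rule forms a hysteresis loop, which the following Python
sketch models. It is a simplified illustration, not the driver's logic: the
class, field and function names are hypothetical, the threshold values in
the usage example are invented, a single "most dirty set" percentage stands
in for per-set tracking, and the automatic clean-up threshold (item e) is
left out.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    dirty_high: int    # a) % dirty blocks in the cache that starts a clean
    dirty_low: int     # b) % dirty blocks in the cache needed to stop
    set_high: int      # c) % dirty blocks in a set that starts a clean
    set_low: int       # d) % dirty blocks in a set needed to stop
    interval_min: int  # f) minutes between time-based cleans


def next_cleaning_state(cleaning, cache_dirty_pct, max_set_dirty_pct,
                        minutes_since_clean, t):
    """Return True if cleaning should be running after this check."""
    if not cleaning:
        # Start when any upper threshold or the time interval is reached.
        return (cache_dirty_pct >= t.dirty_high
                or max_set_dirty_pct >= t.set_high
                or minutes_since_clean >= t.interval_min)
    # Keep going until all lower thresholds are met.
    all_low_met = (cache_dirty_pct <= t.dirty_low
                   and max_set_dirty_pct <= t.set_low)
    return not all_low_met


if __name__ == "__main__":
    t = Thresholds(dirty_high=40, dirty_low=30, set_high=30, set_low=10,
                   interval_min=15)   # example values only
    print(next_cleaning_state(False, 45, 5, 0, t))  # cache above high mark
    print(next_cleaning_state(True, 25, 5, 0, t))   # all low marks met
```

The gap between each high and low threshold keeps the cleaner from starting
and stopping repeatedly when the dirty percentage hovers near one value.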
4. ACKNOWLEDGEMENTS
STEC acknowledges Facebook and in particular Mohan Srinivasan
for the design, development, and release of Flashcache as an
open source project.
Flashcache, in turn, is based on DM-Cache by Ming Zhao.