btrfs-progs: docs: add more chapters (part 3)
All main pages have some content and many typos have been fixed.

Signed-off-by: David Sterba <[email protected]>
kdave committed Dec 17, 2021
1 parent c6be848 commit 208aed2
Showing 26 changed files with 561 additions and 417 deletions.
7 changes: 6 additions & 1 deletion Documentation/Balance.rst
@@ -1,4 +1,9 @@
Balance
=======

...
.. include:: ch-balance-intro.rst

Filters
-------

.. include:: ch-balance-filters.rst
54 changes: 39 additions & 15 deletions Documentation/Common-features.rst
@@ -1,20 +1,44 @@
Common Linux features
=====================

Anything that's standard and also supported

- statx

- fallocate modes

- birth/origin inode time

- filesystem label

- xattr, acl

- FIEMAP

- O_TMPFILE
The Linux operating system implements the POSIX standard interfaces and APIs,
along with additional interfaces. Many of them have become common in other
filesystems. The ones listed below have been added relatively recently and are
considered interesting for users:

birth/origin inode time
        a timestamp recording when an inode was created; it cannot be
        changed and requires the *statx* syscall to be read

statx
        an extended version of the *stat* syscall that provides an extensible
        interface for reading information that is not available through the
        original *stat*

fallocate modes
        the *fallocate* syscall manipulates file extents, for example punching
        holes, preallocating or zeroing a range

FIEMAP
        an ioctl that enumerates file extents; the related tool is ``filefrag``

filesystem label
        another filesystem identifier that can be used for mounting or for
        easier recognition; it can be set or read by an ioctl or by the
        command ``btrfs filesystem label``

O_TMPFILE
        a mode of the open() syscall that creates a file with no associated
        directory entry, so other processes cannot find it by name, which
        makes it safe to use as a temporary file
        (https://lwn.net/Articles/619146/)

xattr, acl
        extended attributes (xattr) are a list of *key=value* pairs associated
        with a file, usually storing additional metadata related to security,
        access control lists (ACL) in particular, or properties (``btrfs
        property``)

- XFLAGS, fileattr

- cross-rename
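
The O_TMPFILE behaviour described above can be tried out with a minimal Python sketch. It assumes a Linux system where ``os.O_TMPFILE`` is available. The helper name ``open_anonymous_file`` is made up for illustration, and the create-then-unlink fallback is an addition for kernels or filesystems without O_TMPFILE support, not something the documentation prescribes.

```python
import os
import tempfile


def open_anonymous_file(directory):
    """Return a read/write file descriptor for a file with no name in `directory`."""
    tmpfile_flag = getattr(os, "O_TMPFILE", None)  # only defined on Linux
    if tmpfile_flag is not None:
        try:
            # With O_TMPFILE the path names the directory, not the file.
            return os.open(directory, os.O_RDWR | tmpfile_flag, 0o600)
        except OSError:
            pass  # kernel or filesystem without O_TMPFILE support
    # Fallback (an assumption of this sketch): create a named file and
    # unlink it right away, which leaves a short window where it is visible.
    fd, path = tempfile.mkstemp(dir=directory)
    os.unlink(path)
    return fd


fd = open_anonymous_file(tempfile.gettempdir())
os.write(fd, b"scratch data")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)
os.close(fd)  # the storage is released once the last descriptor is closed
```

Unlike the fallback path, the O_TMPFILE path never exposes a name in the directory, so there is no window in which another process could open the file by path.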
21 changes: 13 additions & 8 deletions Documentation/Custom-ioctls.rst
@@ -1,16 +1,21 @@
Custom ioctls
=============

Anything that's not doing the other features and stands on it's own
Filesystems are usually extended by custom ioctls beyond the standard system
call interface to let user applications access the advanced features. They're
low level and the following list gives only an overview of the capabilities,
with the corresponding command where one is available:

- reverse lookup, from file offset to inode
- reverse lookup, from file offset to inode, ``btrfs inspect-internal
logical-resolve``

- resolve inode number -> name
- resolve inode number to a list of names, ``btrfs inspect-internal inode-resolve``

- file offset -> all inodes that share it
- tree search, given a key range and tree id, look up and return all b-tree items
found in that range, basically all metadata at hand, but you need to know
what to do with it

- tree search, all the metadata at your hand (if you know what to do with them)
- informative, about devices, space allocation or the whole filesystem, much of
which is also exported in ``/sys/fs/btrfs``

- informative (device, fs, space)

- query/set a subset of features on a mounted fs
- query/set a subset of features on a mounted filesystem
2 changes: 1 addition & 1 deletion Documentation/Defragmentation.rst
@@ -18,5 +18,5 @@ happens inside the page cache, that is the central point caching the file data
and takes care of synchronization. Once a filesystem sync or flush is started
(either manually or automatically) all the dirty data get written to the
devices. This however reduces the chances to find optimal layout as the writes
happen together with other data and the result depens on the remaining free
happen together with other data and the result depends on the remaining free
space layout and fragmentation.
2 changes: 1 addition & 1 deletion Documentation/Reflink.rst
@@ -14,7 +14,7 @@ also copied, though there are no ready-made tools for that.
cp --reflink=always source target
There are some constaints:
There are some constraints:

- cross-filesystem reflink is not possible, there's nothing in common between
so the block sharing can't work
4 changes: 2 additions & 2 deletions Documentation/Resize.rst
@@ -3,8 +3,8 @@ Resize

A BTRFS mounted filesystem can be resized after creation, grown or shrunk. On a
multi device filesystem the space occupied on each device can be resized
independently. Data tha reside in the are that would be out of the new size are
relocated to the remaining space below the limit, so this constrains the
independently. Data that reside in the area that would be out of the new size
are relocated to the remaining space below the limit, so this constrains the
minimum size to which a filesystem can be shrunk.

Growing a filesystem is quick as it only needs to take note of the available
2 changes: 1 addition & 1 deletion Documentation/Subvolumes.rst
@@ -1,4 +1,4 @@
Subvolumes
==========

...
.. include:: ch-subvolume-intro.rst
47 changes: 47 additions & 0 deletions Documentation/Tree-checker.rst
@@ -1,6 +1,53 @@
Tree checker
============

Metadata blocks that have just been read from devices or are just about to be
written are verified and sanity checked by the so-called **tree checker**. The
b-tree nodes contain several items describing the filesystem structure and to
some degree can be verified for consistency or validity. This is an additional
check on top of the checksums, which only verify the overall block status, while
the tree checker tries to validate and cross-reference the logical structure.
The checks cost a little performance, comparable to calculating the checksum,
with no noticeable impact, while catching all sorts of errors.

There are two occasions when the checks are done:

Pre-write checks
----------------

When metadata blocks are in memory and about to be written to permanent storage,
the checks are performed before the checksums are calculated. This can catch
random corruption of the blocks (or pages) caused by bugs, by other parts of the
system or by hardware errors (namely faulty RAM).

Once a block does not pass the checks, the filesystem refuses to write more data
and switches to read-only mode to prevent further damage. At this point some of
the recent metadata updates are held *only* in memory so it's best to not panic
and try to remember what files could be affected and copy them elsewhere. Once
the filesystem gets unmounted, the most recent changes are unfortunately lost.
The filesystem that is stored on the device is still consistent and should mount
fine.

Post-read checks
----------------

Metadata blocks get verified right after they're read from devices and the
checksum is found to be valid. This protects against changes to the metadata
that also update the checksum, which are unlikely to happen accidentally and
are rather the result of intentional corruption or fuzzing.

The checks
----------

As implemented right now, the metadata consistency check is limited to one
b-tree node and the items stored there, i.e. there's no extensive or broad check
done, e.g. against other data structures in other b-tree nodes. This still
provides enough opportunities to verify consistency of individual items, besides
verifying general validity of the items like the length or offset. The b-tree
items are also coupled with a key so proper key ordering is also part of the
check and can reveal random bitflips in the sequence (this has been the most
successful detector of faulty RAM).
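
The key ordering check described above can be illustrated with a minimal Python sketch. It assumes keys are compared as *(objectid, type, offset)* tuples in that order, and that keys within one node must be strictly ascending. The function name ``first_misordered_key`` and the sample keys are made up for illustration; this is not the kernel implementation.

```python
from typing import List, Optional, Tuple

Key = Tuple[int, int, int]  # (objectid, type, offset)


def first_misordered_key(keys: List[Key]) -> Optional[int]:
    """Return the index of the first key that is not strictly greater
    than its predecessor, or None if the sequence is properly ordered."""
    for i in range(1, len(keys)):
        # Python compares tuples field by field, which matches the
        # objectid, then type, then offset ordering assumed here.
        if keys[i] <= keys[i - 1]:
            return i
    return None


# Made-up sample keys in ascending order.
keys = [(256, 1, 0), (256, 12, 256), (257, 1, 0), (258, 1, 0)]
ok = first_misordered_key(keys)

# Simulate a single bit flip in the objectid of the third key.
corrupted = list(keys)
corrupted[2] = (257 ^ (1 << 8), 1, 0)
bad = first_misordered_key(corrupted)
```

Because every key has to fit between its neighbours, a single flipped bit in any of the three fields is likely to break the ordering, which is why such a cheap check is effective against faulty memory.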

The capabilities of tree checker have been improved over time and it's possible
that a filesystem created on an older kernel may trigger warnings or fail some
checks on a new one.
43 changes: 40 additions & 3 deletions Documentation/Trim.rst
@@ -1,4 +1,41 @@
Trim
====
Trim/discard
============

...
Trim or discard is an operation on a storage device based on flash technology
(SSD, NVMe or similar) or on a thin-provisioned device, and it can also be
emulated on top of other block device types. On real hardware, the memory cells
have differing lifetime spans and the drive firmware usually tries to optimize
for that. The trim operation issued by the user provides hints about what data
are unused and allows the device to reclaim the memory cells. On thin-provisioned
or emulated devices this could simply free the space.

There are three main uses of trim that BTRFS supports:

synchronous
        enabled by mounting the filesystem with ``-o discard`` or ``-o
        discard=sync``; the trim is done right after the file extents get
        freed. This can however have a severe performance hit and is not
        recommended, as the ranges to be trimmed could be too fragmented

asynchronous
        enabled by mounting the filesystem with ``-o discard=async``, which is
        an improved version of the synchronous trim where the freed file
        extents are first tracked in memory and the trim is started after a
        period or once enough ranges accumulate, expecting the ranges to be
        much larger and allowing the number of IO requests to be throttled so
        that they do not interfere with the rest of the filesystem activity

manually by fstrim
        the tool ``fstrim`` starts a trim operation on the whole filesystem; no
        mount options need to be specified, so it's up to the filesystem to
        traverse the free space and start the trim. This is suitable for
        running as a periodic service
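
The benefit of collecting freed ranges before trimming, as the asynchronous mode does, can be illustrated with a minimal Python sketch. It only shows how adjacent or overlapping *(start, length)* ranges merge into fewer, larger ones. The function name ``coalesce_ranges`` and the sample offsets are made up for illustration, and the actual kernel bookkeeping and throttling are not modelled.

```python
def coalesce_ranges(ranges):
    """Merge overlapping or adjacent (start, length) byte ranges."""
    merged = []  # list of [start, end) pairs, kept sorted by start
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


# Four small extents freed in arbitrary order (made-up offsets).
freed = [(4096, 4096), (0, 4096), (16384, 8192), (8192, 4096)]
batched = coalesce_ranges(freed)
```

In this example four freed extents collapse into two trim requests, whereas a synchronous trim would have issued one request per freed extent.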

The trim is considered only a hint to the device, which could ignore it
completely, start it only on ranges meeting some criteria, or decide not to do
it because of other factors affecting the memory cells. The device itself could
internally relocate the data, however this leads to an unexpected performance
drop. Running trim periodically could prevent that too.

When a filesystem is created by ``mkfs.btrfs`` on devices capable of trim, the
trim is by default performed on all devices.
2 changes: 1 addition & 1 deletion Documentation/Volume-management.rst
@@ -1,4 +1,4 @@
Volume management
=================

...
.. include:: ch-volume-management-intro.rst
2 changes: 1 addition & 1 deletion Documentation/Zoned-mode.rst
@@ -1,4 +1,4 @@
Zoned mode
==========

...
.. include:: ch-zoned-intro.rst