-
Notifications
You must be signed in to change notification settings - Fork 1
Index file format v5
Thomas Gummerer edited this page Jul 24, 2012
·
4 revisions
GIT index format
================
== The git index file format
The git index file (.git/index) documents the status of the files
in the git staging area.
The staging area is used for preparing commits, merging, etc.
All binary numbers are in network byte order. Version 5 is described
here.
- A 20-byte header consisting of
sig (32-bits): Signature:
The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
vnr (32-bits): Version number:
The current supported versions are 2, 3, 4 and 5.
ndir (32-bits): number of directories in the index.
nfile (32-bits): number of file entries in the index.
fblockoffset (32-bits): offset to the file block, relative to the
beginning of the file.
- Offset to the extensions.
nextensions (32-bits): number of extensions.
extoffset (32-bits): offset to the extension. (Possibly none, as
many as indicated in the 4-byte number of extensions)
headercrc (32-bits): crc checksum for the header and extension
offsets
- diroffsets (ndir * directory offsets): A directory offset for each
of the ndir directories in the index, sorted by pathname (of the
directory it's pointing to) (see below). The diroffsets are relative
to the beginning of the direntries block. [1]
- direntries (ndir * directory entries): A directory entry for each
of the ndir directories in the index, sorted by pathname (see
below). [2]
- fileoffsets (nfile * file offsets): A file offset for each of the
nfile files in the index (see below). The file offsets are relative
to the beginning of the fileentries block. [1]
- fileentries (nfile * file entries): A file entry for each of the
nfile files in the index (see below).
- crdata: A number of entries for conflicted data/resolved conflicts
(see below).
- Extensions (Currently none, see below in the future)
Extensions are identified by signature. Optional extensions can
be ignored if GIT does not understand them.
GIT supports an arbitrary number of extension, but currently none
is implemented. [3]
extsig (32-bits): extension signature. If the first byte is 'A'..'Z'
the extension is optional and can be ignored.
extsize (32-bits): size of the extension, excluding the header
(extsig, extsize, extchecksum).
extchecksum (32-bits): crc32 checksum of the extension signature
and size.
- Extension data.
== Directory offsets (diroffsets)
diroffset (32-bits): offset to the directory relative to the beginning
of the index file. There are ndir + 1 offsets in the diroffset table,
the last is pointing to the end of the last direntry. With this last
entry, we can replace the strlen when reading each filename, by
calculating its length with the offsets.
This part is needed for making the directory entries bisectable and
thus allowing a binary search.
== Directory entry (direntries)
Directory entries are sorted in lexicographic order by the name
of their path starting with the root.
pathname (variable length, nul terminated): relative to top level
directory (without the leading slash). '/' is used as path
separator. A string of length 0 ('') indicates the root directory.
The special path components ".", and ".." (without quotes) are
disallowed. The path also includes a trailing slash. [9]
foffset (32-bits): offset to the lexicographically first file in
the file offsets (fileoffsets), relative to the beginning of
the fileoffset block.
cr (32-bits): offset to conflicted/resolved data at the end of the
index. 0 if there is no such data. [4]
ncr (32-bits): number of conflicted/resolved data entries at the
end of the index if the offset is non 0. If cr is 0, ncr is
also 0.
nsubtrees (32-bits): number of subtrees this tree has in the index.
nfiles (32-bits): number of files in the directory, that are in
the index.
nentries (32-bits): number of entries in the index that is covered
by the tree this entry represents. (-1 if the entry is invalid).
This number includes all the files in this tree, recursively.
objname (160-bits): object name for the object that would result
from writing this span of index as a tree. This is only valid
if nentries is valid, meaning the cache-tree is valid.
flags (16-bits): 'flags' field split into (high to low bits) (For
D/F conflicts)
stage (2-bits): stage of the directory during merge
14-bit unused
dircrc (32-bits): crc32 checksum for each directory entry.
The last 24 bytes (4-byte number of entries + 160-bit object name) are
for the cache tree. An entry can be in an invalidated state which is
represented by having -1 in the entry_count field.
The entries are written out in the top-down, depth-first order. The
first entry represents the root level of the repository, followed by
the first subtree - let's call it A - of the root level, followed by
the first subtree of A, ... There is no prefix compression for
directories.
== File offsets (fileoffsets)
fileoffset (32-bits): offset to the file.
This part is needed for making the file entries bisectable and
thus allowing a binary search. There are nfile + 1 offsets in the
fileoffset table, the last is pointing to the end of the last
fileentry. With this last entry, we can replace the strlen when
reading each filename, by calculating its length with the offsets.
== File entry (fileentries)
File entries are sorted in ascending order on the name field, after the
respective offset given by the directory entries. All file names are
prefix compressed, meaning the file name is relative to the directory.
filename (variable length, nul terminated). The exact encoding is
undefined, but the filename cannot contain a NUL byte (iow, the same
encoding as a UNIX pathname).
flags (16-bits): 'flags' field split into (high to low bits)
assumevalid (1-bit): assume-valid flag
intenttoadd (1-bit): intent-to-add flag, used by "git add -N".
Extended flag in index v3.
stage (2-bit): stage of the file during merge
skipworktree (1-bit): skip-worktree flag, used by sparse checkout.
Extended flag in index v3.
11-bit unused, must be zero [6]
mode (16-bits): file mode, split into (high to low bits)
objtype (4-bits): object type
valid values in binary are 1000 (regular file), 1010 (symbolic
link) and 1110 (gitlink)
3-bit unused
permission (9-bits): unix permission. Only 0755 and 0644 are valid
for regular files. Symbolic links and gitlinks have value 0 in
this field.
mtimes (32-bits): mtime seconds, the last time a file's data changed
this is stat(2) data
mtimens (32-bits): mtime nanosecond fractions
this is stat(2) data
statcrc (32-bits): crc32 checksum over ctime seconds, ctime
nanoseconds, ino, file size, dev, uid, gid (All stat(2) data
except mtime) [7]
objhash (160-bits): SHA-1 for the represented object
# This will probably be changed in future versions as discussed here: http://colabti.org/irclogger/irclogger_log/git-devel?date=2012-06-21
entrycrc (32-bits): crc32 checksum for the file entry. The crc code
includes the offset to the file.
== Conflict data
A conflict is represented in the index as a set of higher stage entries.
These entries are stored at the end of the index. When a conflict is
resolved (e.g. with "git add path"). A bit is flipped, to indicate that
the conflict is resolved, but the entries will be kept, so that
conflicts can be recreated (e.g. with "git checkout -m", in case users
want to redo a conflict resolution from scratch.
The first part of a conflict (usually stage 1) will be stored both in
the entries part of the index and in the conflict part. All other parts
will only be stored in the conflict part.
filename (variable length, nul terminated): filename of the entry,
relative to its containing directory).
nfileconflicts (32-bits): number of conflicts for the file [8]
flags (nfileconflicts * flags) (16-bits): 'flags' field split into:
conflicted (1-bit): conflicted state (conflicted/resolved) (1 if
conflicted)
stage (2-bits): stage during merge.
13-bit unused
entry_mode (nfileconflicts * entry mode) (16-bits): octal numbers, entry
mode of eache entry in the different stages. (How many is defined by
the 4-byte number before)
objectnames (nfileconflicts * object names) (160-bits): object names
of the different stages.
conflictcrc (32-bits): crc32 checksum over conflict data.
== Design explanations
[1] The directory and file offsets are included in the index format
to enable bisectability of the index, for binary searches.Updating
a single entry and partial reading will benefit from this.
[2] The directories are saved in their own block, to be able to
quickly search for a directory in the index. They include a
offset to the (lexically) first file in the directory.
[3] The data of the cache-tree extension and the resolve undo
extension is now part of the index itself, but if other extensions
come up in the future, there is no need to change the index, they
can simply be added at the end.
[4] To avoid rewrites of the whole index when there are conflicts or
conflicts are being resolved, conflicted data will be stored at
the end of the index. To mark the conflict resolved, just a bit
has to be flipped. The data will still be there, if a user wants
to redo the conflict resolution.
[5] Since only 4 modes are effectively allowed in git but 32-bit are
used to store them, having a two bit flag for the mode is enough
and saves 4 byte per entry.
[6] The length of the file name was dropped, since each file name is
nul terminated anyway.
[7] Since all stat data (except mtime and ctime) is just used for
checking if a file has changed a checksum of the data is enough.
In addition to that Thomas Rast suggested ctime could be ditched
completely (core.trustctime=false) and thus included in the
checksum. This would save 24 bytes per index entry, which would
be about 4 MB on the Webkit index.
(Thanks for the suggestion to Michael Haggerty)
[8] Since there can be more stage #1 entries, it is necessary to know
the number of conflict data entries there are.
[9] As Michael Haggerty pointed out on the mailing list, storing the
trailing slash will simplify a few operations.
== Order of relevance of the stat data (Thanks to jrnieder)
1. usually mtime changes
2. if mtime didn't change, presumably the ctime would have changed
(in that world where ctime is meaningful)
3. the ctime is only meaningful within the context of looking at
a particular inode. So if the ctimes are the same, that still
doesn't mean "no change" --- it could be a different inode
4. the inode number is only meaningful within the context of a
particular device. So if the inode numbers are the same,
that _still_ doesn't mean "no change" --- it could be a
different device
== Flat loading
Since internally git expects and works with lexicografic ordering,
a simple linear scan throught the subdirectories doesn't give
the right internal sorting. To achieve the right internal sorting
the loading will be done in the following way:
The data structure is a stack of queues, to allow continous reading
of the file.
s -> queue1
t -> queue2
a -> queue3
c -> queue4
k -> queue5
dirs = read_all_directories
foreach dir in dirs do
file = read_next_file
while element_on_top_of_stack.first_element < nextdir
indexentries.append(dequeue(element_on_top_of_stack))
if element_on_top_of_stack == emtpy:
remove_element_on_top_of_stack
if file[filename] < nextdir
indexentries.append(file)
else
queue.add(file)
foreach f in rest_of_files_in_directory:
queue.add(f)
stack.push(queue)
foreach queue in stack:
foreach entry in queue:
indexentry.append(entry)