Are Git's pack files deltas rather than snapshots?

Summary:
Git’s pack files are carefully constructed to make effective use of disk caches and to provide “nice” access patterns for common commands and for reading recently referenced objects.


Git’s pack file format is quite flexible (see Documentation/technical/pack-format.txt, or The Packfile in The Git Community Book). The pack files store objects in two main ways: “undeltified” (take the raw object data and deflate-compress it), or “deltified” (form a delta against some other object, then deflate-compress the resulting delta data). The objects stored in a pack can be in any order (they need not be sorted by object type, object name, or any other attribute), and deltified objects can be made against any other suitable object of the same type.

Git’s pack-objects command uses several heuristics to provide excellent locality of reference for common commands. These heuristics control both the selection of base objects for deltified objects and the order of the objects. The two mechanisms are mostly independent, but they share some goals.

Git does form long chains of delta compressed objects, but the heuristics try to make sure that only “old” objects are at the ends of the long chains. The delta base cache (whose size is controlled by the core.deltaBaseCacheLimit configuration variable) is automatically used and can greatly reduce the number of “rebuilds” required for commands that need to read a large number of objects (e.g. git log -p).
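
For intuition, here is a minimal sketch of delta-chain resolution with such a cache. It is not Git's code: the object store layout, the toy delta format, and all the names are invented for illustration, and this cache is unbounded rather than capped by core.deltaBaseCacheLimit like Git's.

    # Toy sketch of delta-chain resolution with a base cache (not Git's code).
    # `store` maps oid -> ('full', data) or ('delta', (base_oid, instructions)).

    base_cache = {}   # oid -> reconstructed bytes; Git bounds this cache by
                      # core.deltaBaseCacheLimit, this sketch does not

    def apply_toy_delta(base: bytes, instructions) -> bytes:
        # Two instruction kinds, loosely mirroring Git's real delta format:
        # ('copy', offset, size) copies from the base, ('insert', data) adds bytes.
        out = bytearray()
        for op in instructions:
            if op[0] == 'copy':
                out += base[op[1]:op[1] + op[2]]
            else:
                out += op[1]
        return bytes(out)

    def resolve(oid, store):
        """Return the full content of `oid`, rebuilding its delta chain as needed."""
        if oid in base_cache:
            return base_cache[oid]                    # rebuild avoided entirely
        kind, payload = store[oid]
        if kind == 'full':
            data = payload
        else:
            base_oid, instructions = payload
            data = apply_toy_delta(resolve(base_oid, store), instructions)
        base_cache[oid] = data                        # later readers reuse this base
        return data

    store = {
        'A': ('full', b'hello world\n'),
        'B': ('delta', ('A', [('copy', 0, 6), ('insert', b'git\n')])),
    }
    assert resolve('B', store) == b'hello git\n'      # 'hello ' copied, 'git\n' inserted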

Delta Compression Heuristic

A typical Git repository stores a very large number of objects, so it cannot reasonably compare them all to find the pairs (and chains) that will yield the smallest delta representations.

The delta base selection heuristic is based on the idea that good delta bases will be found among objects with similar filenames and sizes. Each type of object is processed separately (i.e. an object of one type will never be used as the delta base for an object of another type).

For the purposes of delta base selection, the objects are sorted (primarily) by filename and then size. A window into this sorted list is used to limit the number of objects that are considered as potential delta bases. If a “good enough”¹ delta representation is not found for an object among the objects in its window, then the object will not be delta compressed.

The size of the window is controlled by the --window= option of git pack-objects, or the pack.window configuration variable. The maximum depth of a delta chain is controlled by the --depth= option of git pack-objects, or the pack.depth configuration variable. The --aggressive option of git gc greatly enlarges both the window size and the maximum depth to attempt to create a smaller pack file.

The filename sort clumps together the objects for entries with identical names (or at least similar endings, e.g. .c). The size sort is from largest to smallest so that deltas that remove data are preferred to deltas that add data (since removal deltas have shorter representations) and so that the earlier, larger objects (usually newer) tend to be represented with plain compression.

¹ What qualifies as “good enough” depends on the size of the object in question and its potential delta base, as well as how deep its resulting delta chain would be.
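
For concreteness, here is a heavily simplified sketch of that window scan. The sort keys match the description above, but the delta cost model and the “good enough” cutoff are invented for illustration; real Git computes actual binary deltas (in builtin/pack-objects.c) and its acceptance test is more involved, as the footnote says.

    # Sketch of the sliding-window delta-base search (heavily simplified).
    # Each object is a hypothetical dict with 'name', 'size' and 'data'.

    def chain_depth(obj):
        # Depth of the delta chain ending at obj (0 if undeltified).
        return 0 if obj.get('base') is None else 1 + chain_depth(obj['base'])

    def toy_delta_cost(target: bytes, base: bytes) -> int:
        # Invented cost model: bytes of `target` not covered by its common
        # prefix with `base`. Git builds a real delta and measures its size.
        n = 0
        while n < min(len(target), len(base)) and target[n] == base[n]:
            n += 1
        return len(target) - n

    def pick_delta_bases(objects, window=10, max_depth=50):
        # Sort as described above: by filename, then by size, largest first.
        objects.sort(key=lambda o: (o['name'], -o['size']))
        for i, obj in enumerate(objects):
            best = None
            # Only the `window` objects preceding obj are candidate bases.
            for cand in objects[max(0, i - window):i]:
                if chain_depth(cand) + 1 > max_depth:
                    continue                  # chain would grow too deep
                cost = toy_delta_cost(obj['data'], cand['data'])
                # Crude stand-in for the "good enough" test (footnote ¹).
                if cost < obj['size'] // 2 and (best is None or cost < best[1]):
                    best = (cand, cost)
            if best is not None:
                obj['base'] = best[0]         # else obj stays undeltified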

Object Ordering Heuristic

Objects are stored in the pack files in a “most recently referenced” order. The objects needed to reconstruct the most recent history are placed earlier in the pack and they will be close together. This usually works well for OS disk caches.

All the commit objects are sorted by commit date (most recent first) and stored together. This placement and ordering optimizes the disk accesses needed to walk the history graph and extract basic commit information (e.g. git log).

The tree and blob objects are stored starting with the tree from the first stored (most recent) commit. Each tree is processed in a depth-first fashion, storing any objects that have not already been stored. This puts all the trees and blobs required to reconstruct the most recent commit together in one place. Any trees and blobs that have not yet been saved but that are required for later commits are stored next, in the sorted commit order.

The final object ordering is slightly affected by the delta base selection in that if an object is selected for delta representation and its base object has not been stored yet, then its base object is stored immediately before the deltified object itself. This prevents likely disk cache misses due to the non-linear access required to read a base object that would have “naturally” been stored later in the pack file.
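
Putting the pieces above together in sketch form (with hypothetical dict-based objects; Git's real traversal is C code, and the base-before-delta adjustment just described is omitted here):

    # Sketch of the object ordering heuristic (simplified).
    # Commits are dicts with 'oid', 'date' and 'tree'; trees have 'oid' and
    # 'entries' (name -> subtree or blob dict); blobs have just an 'oid'.

    def pack_order(commits):
        order, seen = [], set()

        def store(obj):
            if obj['oid'] not in seen:
                seen.add(obj['oid'])
                order.append(obj)

        def visit(tree):
            if tree['oid'] in seen:
                return                    # identical subtree already stored
            store(tree)
            for child in tree['entries'].values():
                if 'entries' in child:    # subtree: recurse depth first
                    visit(child)
                else:                     # blob
                    store(child)

        recent_first = sorted(commits, key=lambda c: c['date'], reverse=True)
        # 1. All commits together, most recent first.
        for commit in recent_first:
            store(commit)
        # 2. Trees and blobs, depth first from the most recent commit's
        #    tree; later commits add only objects not already stored.
        for commit in recent_first:
            visit(commit['tree'])
        return order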


The use of delta storage in the pack file is just an implementation detail. At that level, Git doesn't know why or how something changed from one revision to the next; it just knows that blob B is pretty similar to blob A except for some set of changes C. So it can store only blob A and the changes C (if it chooses to do so; it could also store both blob A and blob B in full).

When retrieving objects from the pack file, the delta storage is not exposed to the caller: the caller always sees complete blobs. So Git behaves exactly as it would without the delta storage optimisation.
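
As a toy illustration of “blob A plus changes C” (not Git's actual delta algorithm, which can copy arbitrary chunks of the base rather than just a common prefix and suffix):

    # Toy "changes C" between two blobs: keep the longest common prefix and
    # suffix of A, store only the differing middle.

    def make_changes(a: bytes, b: bytes):
        p = 0
        while p < min(len(a), len(b)) and a[p] == b[p]:
            p += 1
        s = 0
        while s < min(len(a), len(b)) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
            s += 1
        return (p, b[p:len(b) - s], s)   # (prefix kept, new middle, suffix kept)

    def apply_changes(a: bytes, changes) -> bytes:
        p, middle, s = changes
        return a[:p] + middle + a[len(a) - s:]

    blob_a = b"int main() { return 0; }\n"
    blob_b = b"int main() { return 1; }\n"
    changes_c = make_changes(blob_a, blob_b)        # stores just (20, b'1', 4)
    assert apply_changes(blob_a, changes_c) == blob_b   # caller sees the full blob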


As I mentioned in "What are git's thin packs?"

Git does deltification only in packfiles

I detailed the delta encoding used for pack files in "Is the git binary diff algorithm (delta storage) standardized?".
See also "When and how does git use deltas for storage?".

Note that the default value of the core.deltaBaseCacheLimit config, which controls the size of the in-memory cache used for delta base objects, will soon be bumped from 16MB to 96MB, for Git 2.0.x/2.1 (Q3 2014).

See commit 4874f54 by David Kastrup (May 2014):

Bump core.deltaBaseCacheLimit to 96m

The default of 16m causes serious thrashing for large delta chains combined with large files.

Here are some benchmarks (pu variant of git blame):

time git blame -C src/xdisp.c >/dev/null

for a repository of Emacs repacked with git gc --aggressive (v1.9, resulting in a window size of 250) located on an SSD drive.
The file in question has about 30000 lines, is about 1 MB in size, and has a history of about 2500 commits.

16m (previous default):
  real  3m33.936s
  user  2m15.396s
  sys   1m17.352s

96m:
  real  2m5.668s
  user  1m50.784s
  sys   0m14.288s

This is further optimized with Git 2.29 (Q4 2020), where git index-pack learned to resolve deltified objects with greater parallelism.

See commit f08cbf6 (08 Sep 2020), and commit ee6f058, commit b4718ca, commit a7f7e84, commit 46e6fb1, commit fc968e2, commit 009be0d (24 Aug 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit b7e65b5, 22 Sep 2020)

index-pack: make quantum of work smaller

Signed-off-by: Jonathan Tan

Currently, when index-pack resolves deltas, it does not split up delta trees into threads: each delta base root (an object that is not a REF_DELTA or OFS_DELTA) can go into its own thread, but all deltas on that root (direct or indirect) are processed in the same thread.

This is a problem when a repository contains a large text file (thus, delta-able) that is modified many times - delta resolution time during fetching is dominated by processing the deltas corresponding to that text file.

This patch contains a solution to that.
When cloning using

git -c core.deltabasecachelimit=1g clone \
  https://fuchsia.googlesource.com/third_party/vulkan-cts  

on my laptop, clone time improved from 3m2s to 2m5s (using 3 threads, which is the default).

The solution is to have a global work stack. This stack contains delta bases (objects, whether appearing directly in the packfile or generated by delta resolution, that themselves have delta children) that need to be processed; whenever a thread needs work, it peeks at the top of the stack and processes its next unprocessed child. If a thread finds the stack empty, it will look for more delta base roots to push on the stack instead.

The main weakness of having a global work stack is that more time is spent in the mutex, but profiling has shown that most time is spent in the resolution of the deltas themselves, so this shouldn't be an issue in practice. In any case, experimentation (as described in the clone command above) shows that this patch is a net improvement.
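
In sketch form, the work-stack scheme reads roughly like this. This is a conceptual Python rendering, not the C code in builtin/index-pack.c, and it glosses over the real implementation's shutdown handling and memory management:

    import threading

    # Conceptual sketch of the global work stack described above.
    # A base is a hypothetical dict with 'data' (bytes) and 'pending'
    # (children not yet resolved); a child also carries its 'delta'.

    work_stack = []      # bases that still have unprocessed children
    roots = []           # delta base roots not yet pushed onto the stack
    mutex = threading.Lock()

    def next_task():
        """Pop one (base, child) pair, or None when no work remains."""
        with mutex:
            while True:
                if work_stack:
                    base = work_stack[-1]           # peek at the top
                    if base['pending']:
                        return base, base['pending'].pop()
                    work_stack.pop()                # base fully dispatched
                elif roots:
                    work_stack.append(roots.pop())  # look for more roots
                else:
                    return None

    def worker(apply_delta):
        while (task := next_task()) is not None:
            base, child = task
            # The expensive delta application runs outside the mutex,
            # which is why mutex contention stays low in practice.
            child['data'] = apply_delta(base['data'], child['delta'])
            if child['pending']:                    # child is itself a base
                with mutex:
                    work_stack.append(child)

    # e.g. run three workers, matching index-pack's default thread count:
    # threads = [threading.Thread(target=worker, args=(my_apply_delta,))
    #            for _ in range(3)]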


With Git 2.31 (Q1 2021), you have more details about the format.

See commit 7b77f5a (29 Dec 2020) by Martin Ågren.
(Merged by Junio C Hamano -- gitster -- in commit 16a8055, 15 Jan 2021)

pack-format.txt: document sizes at start of delta data

Reported-by: Ross Light
Signed-off-by: Martin Ågren

We document the delta data as a set of instructions, but forget to document the two sizes that precede those instructions: the size of the base object and the size of the object to be reconstructed.
Fix this omission.

Rather than cramming all the details about the encoding into the running text, introduce a separate section detailing our "size encoding" and refer to it.

technical/pack-format now includes in its man page:

Size encoding

This document uses the following "size encoding" of non-negative integers: From each byte, the seven least significant bits are used to form the resulting integer.
As long as the most significant bit is 1, this process continues; the byte with MSB 0 provides the last seven bits.

The seven-bit chunks are concatenated.
Later values are more significant.

This size encoding should not be confused with the "offset encoding", which is also used in this document.
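
Read literally, this “size encoding” is a little-endian base-128 varint. A sketch of an encoder and decoder under that reading (not Git's C code):

    # Sketch of the "size encoding" above: seven low bits per byte, MSB set
    # while more bytes follow, later bytes more significant.

    def encode_size(n: int) -> bytes:
        out = bytearray()
        while True:
            byte = n & 0x7F                          # seven least significant bits
            n >>= 7
            out.append(byte | (0x80 if n else 0))    # MSB 1 means "more follows"
            if not n:
                return bytes(out)

    def decode_size(data: bytes, pos: int = 0):
        value, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            value |= (byte & 0x7F) << shift          # later values more significant
            shift += 7
            if not byte & 0x80:                      # MSB 0: last seven bits
                return value, pos

    assert decode_size(encode_size(1000000)) == (1000000, 3)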

technical/pack-format now includes in its man page:

The delta data starts with the size of the base object and the size of the object to be reconstructed. These sizes are encoded using the size encoding from above.

The remainder of the delta data is a sequence of instructions to reconstruct the object.
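
Combined with the decoder sketched above, reading the start of the delta data is then just two size decodes (again a sketch, not Git's C code):

    # Sketch: parse the two sizes at the start of the delta data,
    # using decode_size() from the previous example.

    def parse_delta_header(delta: bytes):
        base_size, pos = decode_size(delta, 0)       # size of the base object
        result_size, pos = decode_size(delta, pos)   # size of the reconstructed object
        return base_size, result_size, pos           # instructions start at pos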