What types of binary files does Git keep deltas for?
First, let's get a bit of terminology out of the way. Files are stored as blob objects. A blob is one of Git's four object types, the other three being commit, tree, and annotated tag.
Git's model is that all objects are logically independent. Everything is stored by its hash ID key, in a database. To retrieve any object, you start by knowing its hash ID, which you get from something or someone else.1 You feed that hash ID to an object-getter, and it either looks up the object where it is stored directly, with no chance at delta compression at all—this is what Git calls a loose object—or, failing that, Git looks inside pack files, which pack multiple separate objects together and provide the opportunity for delta compression.2
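As a concrete illustration, the stock plumbing commands below show the loose/packed split and retrieve an object by its hash ID (the current commit, obtained via `git rev-parse HEAD`, is used as the example key):

```sh
# Loose vs. packed object counts: "count" is the number of loose
# objects, "in-pack" the number of objects inside pack files.
git count-objects -v

# Retrieve an object by hash ID; the current commit's ID serves as
# an example key here.
git cat-file -t "$(git rev-parse HEAD)"   # prints the type: commit
git cat-file -p "$(git rev-parse HEAD)"   # prints the object's content
```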
What you're looking for, then, is information about which blob objects Git chooses to delta-compress against which other blob objects inside these pack files. The answer has evolved somewhat over time, so there is no single correct answer, but there are certain control knobs, including the `.gitattributes` one you mentioned.
The actual delta format is a modification of xdelta. It can, literally, compress (or "deltify") any binary data against any other binary data—but the results will be poor unless the inputs are well-chosen. It's the input choices that are the real key here. Git also has a technical documentation file describing how objects are chosen for deltification. This takes file path names, and especially final path component names, into account.
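If you want to see which choices Git actually made in your own repository, `git verify-pack` dumps each packed object together with its delta depth and base object. A minimal sketch, assuming the repository has at least one pack (the `git gc` step ensures that):

```sh
# Ensure everything is packed, then list each packed object:
# SHA-1, type, unpacked size, size in pack, offset, and, for
# deltified objects, the delta depth and the base object's SHA-1.
git gc
git verify-pack -v .git/objects/pack/pack-*.idx
```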
Note that if deltification fails to make the object smaller, the object is simply not delta-compressed. The object's original file size is also an input here, and `core.bigFileThreshold` (introduced in Git 1.7.6) sets a size threshold: files above this level are never deltified at all.
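For instance, a sketch of turning that knob down (the 5 MiB value below is an arbitrary example; the default is 512 MiB):

```sh
# Store anything over 5 MiB whole: still zlib-compressed, but never
# considered for deltification. (5m is an arbitrary example value;
# the default is 512m.)
git config core.bigFileThreshold 5m

# Existing packs keep their old deltas until rewritten, so force a
# full repack to apply the new threshold retroactively.
git repack -a -d -f
```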
Hence, you can prevent Git from considering a file (object, really) for deltification in either of two ways:

- set `core.bigFileThreshold` so that the object is too big, or
- make the object's path name match a `.gitattributes` line that has `-delta` specified (see the sketch after this list).
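The `-delta` route might look like this; the `*.psd` pattern is just an illustrative choice:

```sh
# Mark Photoshop files as never-deltify via .gitattributes.
echo '*.psd -delta' >> .gitattributes
git add .gitattributes
git commit -m 'disable delta compression for *.psd'

# As with core.bigFileThreshold, already-packed objects are only
# affected once the packs are rewritten.
git repack -a -d -f
```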
Note that when using Git-LFS, large files are not stored in Git at all. Instead, a large file (as defined by the Git-LFS settings) is replaced, at `git add` time, by an indirect name. Git then stores this indirect name as the blob object (using the original file's path). When Git extracts the object, Git-LFS inspects it before allowing it to go into your work-tree. Git-LFS detects that the object's data were replaced with an indirect name, and retrieves the "real" data from another (separate, not-Git-at-all) server using the indirect name. So Git never sees the large file's data at all: instead, it sees only these indirect names.
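You can inspect these indirect names directly, since to Git they are just ordinary (small) blobs. In a hypothetical repository where `bigfile.bin` is tracked by Git-LFS:

```sh
git cat-file -p HEAD:bigfile.bin
# Typical pointer-file content (oid and size vary per file):
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:<64-hex-digit hash of the real content>
#   size <byte count of the real content>
```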
1For instance, we might start with a branch name like `master`, which gets us the latest (or tip) commit hash ID. That hash ID gives us access to the commit object. The commit lists the hash ID of a tree. The tree, once we obtain it, lists the hash ID of some blob, along with the file's name. So now we know the hash ID for the version of `README` in the tip commit of `master`, if that's what we're looking for. Or, we use the commit data to find an older commit, which we use to find another even-older commit, and so on, until we arrive at the commit we want; and then we use the tree to find the blob IDs (and names) of files.
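That walk maps directly onto plumbing commands; a sketch using `master` and `README` as in the text:

```sh
git rev-parse master          # branch name -> tip commit hash ID
git cat-file -p master        # the commit object: its tree hash, parents, etc.
git ls-tree master            # the tree: blob hash IDs paired with file names
git rev-parse master:README   # the whole walk in one step: README's blob ID
```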
2Normally, an object can only be "deltified" against other objects in the same pack. For transport purposes, Git provides what it calls a thin pack in which objects can be delta-compressed against other objects that are omitted, but are assumed to be available on the other side of the transport mechanism. The other Git must "fatten up" the thin pack.
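Both halves of that exchange can be reproduced by hand; a rough sketch, assuming `origin/master` names what the receiving side already has:

```sh
# Sender: pack what master has beyond origin/master; --thin allows
# deltas against base objects that are omitted from the pack.
git rev-list --objects --objects-edge origin/master..master |
  git pack-objects --thin --stdout > thin.pack

# Receiver: index the pack and "fatten" it by appending the missing
# base objects (--fix-thin only works together with --stdin).
git index-pack --stdin --fix-thin < thin.pack
```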
Git will happily store every file you feed it in the repository, be it binary or not, of any size. It computes deltas where it can, including for binary files (you have only limited control over this; see the other answer's notes on `core.bigFileThreshold` and the `-delta` attribute). This is even true if you're not using git-lfs. It is not a recommended practice, because it makes your repository large. Since a git clone always contains the whole history, the repository will stay large forever. If you use git-lfs, at least you will only have the latest versions of the large files in your working copy (with the downside that you need a connection to the server for many more operations, as with Subversion).
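A typical Git-LFS setup, assuming the git-lfs extension is installed (`*.iso` and `big-image.iso` are placeholder names), looks roughly like this:

```sh
git lfs install        # enable the LFS filters for this user/repo
git lfs track '*.iso'  # record a filter rule in .gitattributes
git add .gitattributes big-image.iso
git commit -m 'track ISO images with Git-LFS'
```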
Would it be an option to split the binaries into their own repository and embed it as a submodule?
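If so, the mechanics would be roughly as follows (the URLs and the `assets` path are placeholders):

```sh
# In the main repository: mount the binaries repository as a submodule.
git submodule add https://example.com/assets.git assets
git commit -m 'add binary assets as a submodule'

# Consumers then clone both in one go:
git clone --recurse-submodules https://example.com/main.git
```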