How does git store duplicate files?

I am probably not going to explain this perfectly, but my understanding is that every commit stores only a tree structure representing the file layout of your project, with pointers to the actual files, which are stored in the objects subfolder. Git uses a SHA-1 hash of the file contents to derive the subfolder and file name, so for example if a file's contents hashed to:

0b064b56112cc80495ba59e2ef63ffc9e9ef0c77

It would be stored as:

.git/objects/0b/064b56112cc80495ba59e2ef63ffc9e9ef0c77

The first two characters are used as a directory name and the rest as the file name.

The result is that even if you have multiple files with the same contents, whether under different names, in different locations, or in different commits, only one copy is ever saved, with each commit tree simply pointing to it.
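You can see this for yourself with a throwaway repository; this is just a minimal sketch, and the file names are made up for illustration:

# throwaway repo for illustration
git init dedup-demo && cd dedup-demo

# two files, different names, identical contents
echo "hello" > a.txt
cp a.txt b.txt

# hashing both yields the same object id
git hash-object a.txt b.txt

git add a.txt b.txt
git commit -m "two identical files"

# the commit's tree has two entries pointing at the same blob hash
git ls-tree HEAD

# and only one loose blob exists under .git/objects for that content
# (the other files listed are the tree and commit objects)
find .git/objects -type f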


By default/itself: No. (Edit: actually, yes; see the edit below.)

Git works on the basis that it creates snapshots of files, not incremental differences like some other VCSs do.

EDIT

As mentioned by Dave and opatut, my understanding of how git stores files was incorrect, and I apologize for the confusion caused. After doing more research: Git does store duplicated files as pointers to a single copy. Quoting VonC in the accepted answer to this question:

... several files with the same content are stored only once.

Please also note that as mentioned in that answer, conceptually ...

Referencing the git-scm documentation:

Git thinks of its data more like a set of snapshots of a miniature filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored. Git thinks about its data more like a stream of snapshots.

However, at the storage level deltas are still used: when packing, Git tries to generate the smallest possible deltas by heuristically selecting similar blobs as quickly as possible, and there are options that optimize for compression, which will reduce the size of the repository.
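For example, you can repack and ask Git to spend more effort finding good delta bases; this is only a sketch, and the window/depth numbers are illustrative rather than recommendations:

# show how much space the object database currently uses
git count-objects -v

# repack everything, searching harder for good deltas
git repack -a -d -f --window=250 --depth=50

# or let git gc do the housekeeping with more aggressive settings
git gc --aggressive --prune=now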

Also, as opatut verified in the pastebin of outputs he linked in the comments, duplicate objects are stored only once. That means git will recognize duplicate binary files and store them only once, which is what the original question asked for. The following are other options for handling duplicate files.

One alternative: symlinks

You could set up symlinks to the existing files; that way, when you work on them, they all point to the same large file. Note, however, that git does not track the files the symlinks point to; it only stores the symlinks themselves. This satisfies your need to reduce space, but at the cost of portability: if you move to another dev machine, you'll have to make sure the target files are where the symlinks expect them, which might not be what you want. See this very good SO Q&A on what git does with symlinks.
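For instance (the paths here are made up), Git records a symlink as a mode-120000 entry whose blob contains only the target path:

# keep one real copy outside (or elsewhere inside) the work tree
ln -s ../shared-assets/big-video.bin big-video.bin
git add big-video.bin
git commit -m "link to shared asset"

# mode 120000 marks a symlink; the stored blob holds just the target
# path, not the contents of big-video.bin
git ls-tree HEAD big-video.bin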

Another alternative: tools!

I've found multiple tools that might help you manage binary files the way you need.

You can try git-annex, which basically tracks only the most recent version of binary files and maintains the rest through symlinks, so in a way it is a more automatic way of handling symbolic links. Here's their project site.
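A rough sketch of the usual git-annex workflow, using its documented commands (the file name is made up):

git annex init                # turn the repo into an annex
git annex add big-video.bin   # content moves into the annex, a symlink is staged
git commit -m "add big-video.bin via git-annex"

# on another clone that has this repo as a remote:
git annex get big-video.bin   # fetch the actual content
git annex drop big-video.bin  # free the local copy again, metadata stays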

Or you could use the built-in git submodules with a separate repo to achieve what you want, fetching the large binary files only when you need to use them.
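Roughly, assuming the large binaries live in their own repository (the URL below is just a placeholder):

# in the main project, mount the binaries repo as a submodule
git submodule add https://example.com/yourname/big-assets.git assets
git commit -m "add big-assets submodule"

# collaborators who actually need the binaries fetch them explicitly
git submodule update --init assets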

Admittedly, I have not attempted these options myself, so here is a reference with more detailed explanations of them. Reference: this SO question
