Git + a large data set?

This sounds like the perfect occasion to try git-annex:

git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, checksumming time, or disk space.
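For instance, a minimal workflow could look something like the sketch below (the file path and repository description are just placeholders):

    # inside an existing git repository
    git annex init "my laptop"

    # add the large file; git tracks a small pointer/symlink,
    # the content itself goes into .git/annex/objects
    git annex add data/samples.bin
    git commit -m "Add data sample via git-annex"

    # in another clone, fetch the content only when you actually need it
    git annex get data/samples.bin

    # reclaim local disk space later (the content remains in other clones)
    git annex drop data/samples.bin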


Use submodules to isolate your giant files from your source code. More on that here:

http://git-scm.com/book/en/v2/Git-Tools-Submodules

The examples talk about libraries, but this works for large bloated things like data samples for testing, images, movies, etc.
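As a rough sketch (the repository URLs and the data/ path are hypothetical), the setup could look like this:

    # add the data repository as a submodule of your source repository
    git submodule add https://example.com/big-data.git data
    git commit -m "Track test data as a submodule"

    # on a fresh checkout, clone the source and pull the data only if needed
    git clone https://example.com/project.git
    cd project
    git submodule update --init data

    # later, pick up a newer version of the data
    git submodule update --remote data
    git commit -am "Bump data submodule"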

You should be able to fly while developing, only pausing here and there when you need to pull in new versions of the giant data.

Sometimes it's not even worthwhile tracking changes to such things.

To address your concern about getting more clones of the data: if your git implementation supports hard links on your OS, this should be a breeze.
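For example, a clone from a local path on the same filesystem hard-links the files under .git/objects instead of copying them, so extra working copies cost very little (the paths below are illustrative):

    # objects are hard-linked, not duplicated
    git clone /srv/repos/big-data.git ~/work/big-data-copy

    # or share the object store outright via alternates
    # (careful: this clone breaks if the source repo prunes objects)
    git clone --shared /srv/repos/big-data.git ~/work/big-data-copy2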

The nature of your giant dataset is also at play. If you change some of it, are you changing giant blobs or a few rows in a set of millions? This should determine how effective a VCS will be as a change-notification mechanism for it.

Hope this helps.


bup claims to do a good job of incrementally backing up large files.

I think bup assumes a separate repository to do its work, so you'd end up using submodules anyway. However, if you want good bandwidth reduction, this is the thing to use.
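A rough sketch of that workflow, assuming a dedicated bup repository next to your source tree (the directory names are made up):

    # keep bup's storage outside your source repository
    export BUP_DIR=~/bup-data.git
    bup init

    # index and save the data directory; content is chunked and deduplicated
    bup index ~/project/data
    bup save -n project-data ~/project/data

    # subsequent runs only store the chunks that actually changed
    bup index ~/project/data
    bup save -n project-data ~/project/data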