Deduplicate Git forks on a server

Now the same 'huge' file resides in both forks on the server; Git does not create a hard link between the duplicated objects automatically.

Actually, with Git 2.20 (Q4 2018), that issue might disappear, thanks to delta islands: a new way of computing deltas, so that an object that exists in one fork is not made into a delta against another object that does not appear in the same forked repository.

See commit fe0ac2f, commit 108f530, commit f64ba53 (16 Aug 2018) by Christian Couder (chriscool).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
See commit 9eb0986, commit 16d75fa, commit 28b8a73, commit c8d521f (16 Aug 2018) by Jeff King (peff).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit f3504ea, 17 Sep 2018)

Add delta-islands.{c,h}

Hosting providers that allow users to "fork" existing repositories want those forks to share as much disk space as possible.

Alternates are an existing solution for keeping all the objects from all the forks in a single central repository, but this can have some drawbacks.
In particular, when packing the central repository, deltas will be created between objects from different forks.

This can make cloning or fetching a fork much slower and much more CPU intensive as Git might have to compute new deltas for many objects to avoid sending objects from a different fork.

Because the inefficiency primarily arises when an object is deltified against another object that does not exist in the same fork, we partition objects into sets that appear in the same fork, and define "delta islands".
When finding a delta base, we do not allow an object outside the same island to be considered as its base.

So "delta islands" is a way to store objects from different forks in the same repository and packfile without having deltas between objects from different forks.

This patch implements the delta islands mechanism in "delta-islands.{c,h}", but does not yet make use of it.

A few new fields are added in 'struct object_entry' in "pack-objects.h" though.

See Documentation/git-pack-objects.txt: Delta Islands:

DELTA ISLANDS

When possible, pack-objects tries to reuse existing on-disk deltas to avoid having to search for new ones on the fly. This is an important optimization for serving fetches, because it means the server can avoid inflating most objects at all and just send the bytes directly from disk.

This optimization can't work when an object is stored as a delta against a base which the receiver does not have (and which we are not already sending). In that case the server "breaks" the delta and has to find a new one, which has a high CPU cost. Therefore it's important for performance that the set of objects in on-disk delta relationships match what a client would fetch.

In a normal repository, this tends to work automatically.
The objects are mostly reachable from the branches and tags, and that's what clients fetch. Any deltas we find on the server are likely to be between objects the client has or will have.

But in some repository setups, you may have several related but separate groups of ref tips, with clients tending to fetch those groups independently.

For example, imagine that you are hosting several "forks" of a repository in a single shared object store, and letting clients view them as separate repositories through GIT_NAMESPACE or separate repositories using the alternates mechanism.

A naive repack may find that the optimal delta for an object is against a base that is only found in another fork.
But when a client fetches, they will not have the base object, and we'll have to find a new delta on the fly.

A similar situation may exist if you have many refs outside of refs/heads/ and refs/tags/ that point to related objects (e.g., refs/pull or refs/changes used by some hosting providers). By default, clients fetch only heads and tags, and deltas against objects found only in those other groups cannot be sent as-is.

Delta islands solve this problem by allowing you to group your refs into distinct "islands".

Pack-objects computes which objects are reachable from which islands, and refuses to make a delta from an object A against a base which is not present in all of A's islands. This results in slightly larger packs (because we miss some delta opportunities), but guarantees that a fetch of one island will not have to recompute deltas on the fly due to crossing island boundaries.
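
In practice, islands are defined by the pack.island regular expressions (the capture group determines which island a ref belongs to), and packing is told to use them with --delta-islands. A minimal sketch, assuming the forks' refs are mapped under refs/virtual/<fork-id>/ in the shared repository (a hypothetical layout, not something from this answer):

    # Each regex capture group names an island, so every fork gets its own island
    # (hypothetical refs/virtual/<fork-id>/ layout).
    git config --add pack.island 'refs/virtual/([0-9]+)/heads/'
    git config --add pack.island 'refs/virtual/([0-9]+)/tags/'

    # Optionally pack one island (e.g. the main project, hypothetical id 1234) first.
    git config pack.islandCore '1234'

    # Repack using delta islands (repack.useDeltaIslands=true makes this the default).
    git repack -a -d --delta-islands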


A side effect, though: some commands became more verbose. Git 2.23 (Q3 2019) fixes this.

See commit bdbdf42 (20 Jun 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a4c8352, 09 Jul 2019)

delta-islands: respect progress flag

The delta island code always prints "Marked %d islands", even if progress has been suppressed with --no-progress or by sending stderr to a non-tty.

Let's pass a progress boolean to load_delta_islands().
We already do the same thing for the progress meter in resolve_tree_islands().


I have decided to do this:

    shared-objects-database.git/
    foo.git/
        objects/info/alternates   (will contain ../../shared-objects-database.git/objects)
    bar.git/
        objects/info/alternates   (will contain ../../shared-objects-database.git/objects)
    baz.git/
        objects/info/alternates   (will contain ../../shared-objects-database.git/objects)

All the forks will have an entry in their objects/info/alternates file that gives the path, relative to their own objects/ directory, to the shared object database repository.

It is important to make the shared object database a repository in its own right, because we can then keep the objects and refs of different users, even when they have forks with the same repository name.

Steps:

  1. git init --bare shared-objects-database.git
  2. I run the following lines either every time there is a push to any fork (via a post-receive hook) or from a cron job:

    for r in list-of-forks
    do
        (
            cd "$r" &&
            git push ../shared-objects-database.git "refs/*:refs/remotes/$r/*" &&
            # To be safe, the "fat" objects repository is added to alternates every time.
            echo ../../shared-objects-database.git/objects > objects/info/alternates
        )
    done

Then, on the next "git gc" in each fork, all the objects that already exist in the alternate (the shared object database) will be deleted.

git repack -adl is also an option!
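
As an explicit sketch (the fork names are just the ones from the layout above), repacking each fork with -a -d -l drops everything that is already reachable through the alternate:

    # Sketch: repack every fork; --local (-l) leaves out objects that are
    # borrowed from the shared alternate, and -d deletes the now-redundant packs.
    for r in foo.git bar.git baz.git
    do
        git -C "$r" repack -a -d -l
    done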

This way we save space: two users pushing the same data to their respective forks on the server will share the objects.
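
To check that it worked, note that git count-objects only reports objects stored in the repository itself, not the ones borrowed through alternates, so a deduplicated fork should report very little of its own (a verification sketch, not part of the original recipe):

    # The counts and sizes below exclude objects borrowed from the alternate,
    # so small numbers in a fork mean its data now lives in the shared repository.
    git -C foo.git count-objects -v
    git -C shared-objects-database.git count-objects -v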

We need to set the gc.pruneExpire variable to never in the shared objects database. Just to be safe!
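
For example:

    # Never let "git gc" in the shared object database prune unreachable objects:
    # a fork may still need them even if no ref in this repository points to them.
    git -C shared-objects-database.git config gc.pruneExpire never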

To occasionally prune objects, add all the forks as remotes of the shared object database, fetch, and prune! Git will do the rest!
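
A minimal sketch of that pruning pass (the remote names and the grace period are assumptions, not part of the original recipe):

    cd shared-objects-database.git

    # Mirror every fork's refs (remote names are hypothetical).
    git remote add foo ../foo.git
    git remote add bar ../bar.git
    git remote add baz ../baz.git

    # Update the remote-tracking refs and drop the ones whose branches are gone.
    git fetch --all --prune

    # Prune objects that no fork references any more, with a 24-hour grace period;
    # the explicit --prune overrides gc.pruneExpire=never for this deliberate run.
    git gc --prune=24.hours.ago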

I finally found a solution that works for me! (Not tested in production! :p) Thanks to this post.