New repo with copied history of only currently tracked files

Run git filter-branch only once

The script in the question processes thousands of commits, thousands of times, and on every iteration it does several (very slow) things that you would ordinarily do only once, at the end. That really is going to take forever.

Instead run the script once, removing all files in one go:

del=`cat deleted.txt`
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch $del" \
  --prune-empty --tag-name-filter cat -- --all

Once the process has finished, clean up:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now

# optional extra gc. Slow and may not further-reduce the repo size
git gc --aggressive --prune=now
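To try the single-pass approach safely before touching a real repository, here is a minimal sketch that runs the same filter-branch command in a throwaway repository. The file names (keep.txt, old.bin) and the temporary directory are made up for illustration, and the check at the end only looks at history reachable from HEAD, because the refs/original/ backups still hold the old commits until the cleanup steps are run.

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning pause in newer git versions

repo=$(mktemp -d)                        # throwaway repository
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Two commits: one adds both files, one edits the file we want to keep
echo keep > keep.txt
echo secret > old.bin
git add keep.txt old.bin
git commit -q -m "add both files"
echo more >> keep.txt
git commit -q -am "edit keep.txt"

# The list of paths to purge from history (left untracked)
echo old.bin > deleted.txt

# Same command as in the answer: one pass, all files removed in one go
del=$(cat deleted.txt)
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch $del" \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# Every path that still appears in the rewritten history of HEAD
remaining=$(git log --name-only --pretty=format: HEAD | sed '/^$/d' | sort -u)
echo "$remaining"
```

If the rewrite worked, only keep.txt should be listed.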

If the above fails due to the number of files

If deleted.txt lists so many files that the above command is too long to run, it can be rewritten as something like this:

git filter-branch --force --index-filter \
  'cat /abs/path/to/deleted.txt | xargs git rm --cached --ignore-unmatch' \
  --prune-empty --tag-name-filter cat -- --all

(cleanup steps are the same)

This is equivalent to the version above, but xargs splits the file list across as many git rm invocations as needed to stay under the command-line length limit, instead of passing every file in a single command.
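As a small illustration of how xargs groups its input into separate command invocations, the sketch below uses echo in place of the real git rm so nothing is modified. The five file names are hypothetical, and -n 2 is only there to force tiny batches so the grouping is visible; without -n, xargs fills each command line up to the system limit.

```shell
# Hypothetical list of five paths standing in for deleted.txt
list=$(printf '%s\n' a.txt b.txt c.txt d.txt e.txt)

# echo shows the commands xargs would run, one line per invocation
batches=$(printf '%s\n' "$list" | xargs -n 2 echo git rm --cached --ignore-unmatch)
printf '%s\n' "$batches"
```

This prints three command lines: two with two paths each and a final one with the remaining path.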


As of April 2020, git produces the following warning when using git filter-branch:

WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.

I'm sure there's a safe way to use git filter-branch, but for those (like myself) unaware of how to avoid the gotchas mentioned above, git-filter-repo makes it pretty easy to retain the history of only currently tracked files:

$ git checkout master
$ git ls-files > /tmp/keep-these.txt
$ git filter-repo --paths-from-file /tmp/keep-these.txt

While git filter-branch took about 5 minutes to run on my repo, git filter-repo ran and repacked the repo in a little under a second!

It can be installed by following the instructions on its GitHub page. Alternatively, on a Mac you can just run brew install git-filter-repo.


Based on AD7six's answer, with the history of renamed files preserved. (You can skip the preliminary optional section.)

Optional

remove all remotes:

git remote | while read -r line; do (git remote rm "$line"); done

remove all tags:

git tag | xargs git tag -d

remove all other branches:

git branch | grep -v \* | xargs git branch -D

remove all stashes:

git stash clear

remove all submodules configuration and cache:

git config --local -l | grep submodule | sed -e 's/^\(submodule\.[^.]*\)\(.*\)/\1/g' | while read -r line; do (git config --local --remove-section "$line"); done
rm -rf .git/modules/

Pruning untracked files history, keeping tracked files history & renames

git ls-files | sed -e 's/^/"/g' -e 's/$/"/g' > keep-these.txt
git ls-files | while read -r line; do (git log --follow --raw --diff-filter=R --pretty=format:%H "$line" | while true; do if ! read hash; then break; fi; IFS=$'\t' read mode_etc oldname newname; read blankline; echo $oldname; done); done | sed -e 's/^/"/g' -e 's/$/"/g' >> keep-these.txt
git filter-branch --force --index-filter "git rm --ignore-unmatch --cached -qr .; cat \"$PWD/keep-these.txt\" | xargs git reset -q \$GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all
rm keep-these.txt
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
  • First two commands are to list tracked files and tracked files old names, using quotes to preserve paths with spaces.
  • Third command is to rewrite the commits for those files only.
  • Subsequent commands are to clean the history.
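The rename-collecting step is the densest part of the commands above, so here is a simplified sketch of the same idea in a throwaway repository. It is a variant, not the exact one-liner: it reads --name-status output instead of --raw, and the file names are hypothetical. Paths containing tabs or newlines are not handled.

```shell
repo=$(mktemp -d)            # throwaway repository
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# One commit adds a file, the next renames it
printf 'line one\nline two\n' > old-name.txt
git add old-name.txt
git commit -q -m "add file"
git mv old-name.txt new-name.txt
git commit -q -m "rename file"

# For each tracked file, follow its history and print its earlier names.
# --name-status emits lines like "R100<TAB>old<TAB>new"; awk keeps column 2.
old_names=$(git ls-files | while read -r path; do
  git log --follow --diff-filter=R --name-status --pretty=format: -- "$path" |
    awk -F'\t' '/^R/ { print $2 }'
done)
echo "$old_names"
```

Here the only earlier name found is old-name.txt, which is the kind of entry that gets appended to keep-these.txt so the pre-rename history survives the rewrite.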

Optional (not recommended)

repack (from the-woes-of-git-gc-aggressive):

git repack -a -d --depth=250 --window=250

Delete everything and restore what you want

Rather than delete this-list-of-files one at a time, do the almost-opposite: delete everything and just restore the files you want to keep.

Like so:

# for Linux (GNU xargs)

$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter \
  "git rm  --ignore-unmatch --cached -qr . ; \
  cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -d '\0' git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all
# for macOS (BSD xargs)

$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter \
  "git rm  --ignore-unmatch --cached -qr . ; \
  cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -0 git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all

It may be faster to execute.
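The tr and xargs pairing in these commands matters when paths contain spaces. The sketch below, using two hypothetical paths and printf in place of git reset, shows the difference between plain xargs and the NUL-delimited form.

```shell
# Two hypothetical paths, one containing a space
paths=$(printf '%s\n' 'docs/read me.txt' 'src/main.c')

# Plain xargs splits on any whitespace, so the first path is broken in two
naive=$(printf '%s\n' "$paths" | xargs -n 1 printf '[%s]\n')

# Converting newlines to NUL and using -0 keeps each line as one argument
safe=$(printf '%s\n' "$paths" | tr '\n' '\0' | xargs -0 -n 1 printf '[%s]\n')

printf 'naive:\n%s\nsafe:\n%s\n' "$naive" "$safe"
```

The naive form yields three arguments, while the NUL-delimited form yields two, with "docs/read me.txt" intact.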

Cleanup steps

Once the whole process has finished, clean up:

$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now

# optional extra gc. Slow and may not further-reduce the repo size
$ git gc --aggressive --prune=now

Comparing the repository size before and after should show a sizable reduction. The history will contain only commits that touch the kept files, plus merge commits, which are kept even when empty (because that's how --prune-empty works).
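One way to check the result is to list any path that still appears in history but is no longer tracked, and to record the repository size. The sketch below is an assumption-laden helper, not part of the original answer: stale_paths is a hypothetical function name, and the demo runs in a throwaway repository that has not been rewritten, so it reports the removed file.

```shell
# Print paths that appear in the history reachable from HEAD but are not
# tracked in the current checkout. Old names of renamed files show up too.
stale_paths() {
  hist=$(mktemp) && tracked=$(mktemp) || return 1
  git log --name-only --pretty=format: HEAD | sed '/^$/d' | sort -u > "$hist"
  git ls-files | sort -u > "$tracked"
  comm -23 "$hist" "$tracked"       # lines only in the history list
  rm -f "$hist" "$tracked"
}

# Demo in a throwaway repository with hypothetical file names
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"
echo a > kept.txt
echo b > gone.txt
git add kept.txt gone.txt
git commit -q -m "add two files"
git rm -q gone.txt
git commit -q -m "remove gone.txt"

stale=$(stale_paths)
echo "$stale"

# Report on-disk size; run before and after the rewrite to compare
git count-objects -vH
```

In this un-rewritten demo the helper reports gone.txt. After a successful rewrite and cleanup, it should print nothing, apart from old names of renamed files if those were deliberately kept.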

$GIT_COMMIT?

The use of $GIT_COMMIT seems to have caused some confusion. From the git filter-branch documentation (emphasis added):

The argument is always evaluated in the shell context using the eval command (with the notable exception of the commit filter, for technical reasons). Prior to that, the $GIT_COMMIT environment variable will be set to contain the id of the commit being rewritten.

That means git filter-branch provides the variable at run time; you do not set it beforehand. If there's any doubt, this can be demonstrated with a no-op filter-branch command:

$ git filter-branch --index-filter "echo current commit is \$GIT_COMMIT"
Rewrite d832800a85be9ef4ee6fda2fe4b3b6715c8bb860 (1/xxxxx)current commit is d832800a85be9ef4ee6fda2fe4b3b6715c8bb860
Rewrite cd86555549ac17aeaa28abecaf450b49ce5ae663 (2/xxxxx)current commit is cd86555549ac17aeaa28abecaf450b49ce5ae663
...
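The backslash in \$GIT_COMMIT is what defers the expansion. The following sketch simulates the set-then-eval sequence in plain shell, without running filter-branch at all; the placeholder value and variable names are made up for illustration.

```shell
# The variable has some unrelated value when the filter string is written
GIT_COMMIT="value-when-string-was-written"

escaped="echo current commit is \$GIT_COMMIT"   # stores the literal text $GIT_COMMIT
unescaped="echo current commit is $GIT_COMMIT"  # expanded immediately by the shell

# Simulate one rewrite step: set the variable, then eval the filter string
GIT_COMMIT="d832800a85be9ef4ee6fda2fe4b3b6715c8bb860"
late=$(eval "$escaped")
early=$(eval "$unescaped")

echo "$late"
echo "$early"
```

The escaped form picks up the commit id that is set just before eval, while the unescaped form is stuck with whatever the variable held when the string was written, which in a real shell session is usually empty.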
