How to improve git log performance?
You are correct: it does take somewhere between 20 and 35 seconds to generate the report on 56'000 commits, producing 224'000 lines (15 MiB) of output. I actually think that's pretty decent performance, but you don't; okay.
Because you are generating a report using a constant format from an unchanging database, you only have to do it once. Afterwards, you can use the cached result of git log
and skip the time-consuming generation. For example:
git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat > log-pretty.txt
You might wonder how long it takes to search that entire report for data of interest. That's a worthy question:
$ tail -1 log-pretty.txt
30 0 railties/test/webrick_dispatcher_test.rb
$ time grep railties/test/webrick_dispatcher_test.rb log-pretty.txt
…
30 0 railties/test/webrick_dispatcher_test.rb
real 0m0.012s
…
Not bad, the introduction of a "cache" has reduced the time needed from 35+ seconds to a dozen milliseconds. That's almost 3000 times as fast.
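If the repository does keep growing, you can also refresh that cache incrementally instead of regenerating it from scratch. A minimal sketch, assuming the cache was produced by the command above (file names here are just illustrative):

# the newest cached commit is on the first line, in the first tab-separated field
last=$(head -1 log-pretty.txt | cut -f1)
# regenerate only the commits added since then, newest first, and prepend them
git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat "$last"..HEAD > log-new.txt
cat log-new.txt log-pretty.txt > log-merged.txt && mv log-merged.txt log-pretty.txt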
TL;DR: as mentioned in GitMerge 2019:
git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write
Actually (see below), the first two config settings are not needed with Git 2.24+ (Q4 2019): they are true by default.
As T4cC0re mentions in the comments:
If you are on Git version 2.29 or above, you should rather run:
git commit-graph write --reachable --changed-paths
This will pre-compute file paths, so that git log commands that are scoped to files also benefit from this cache.
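To confirm the graph was actually written and is healthy, you can inspect the files Git uses by default and run the built-in verifier (whether you see a single commit-graph file or a commit-graphs/ chain depends on whether it was written in split form):

cd /path/to/repo
ls -lh .git/objects/info/commit-graph .git/objects/info/commit-graphs/ 2>/dev/null
git commit-graph verify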
Git 2.18 (Q2 2018) will improve git log performance:
See commit 902f5a2 (24 Mar 2018) by René Scharfe (rscharfe).
See commit 0aaf05b, commit 3d475f4 (22 Mar 2018) by Derrick Stolee (derrickstolee).
See commit 626fd98 (22 Mar 2018) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 51f813c, 10 Apr 2018)
sha1_name: use bsearch_pack() for abbreviations

When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search.
This is one of several implementations.
One issue with this implementation is that it ignores the fanout table in the pack-index.
Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table.
Due to the use of the fanout table, the abbreviation computation is slightly faster than before.
For a fully-repacked copy of the Linux repo, the following 'git log' commands improved:
* git log --oneline --parents --raw
  Before: 59.2s  After: 56.9s  Rel %: -3.8%
* git log --oneline --parents
  Before: 6.48s  After: 5.91s  Rel %: -8.9%
The same Git 2.18 adds a commit graph: precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking.
See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)
commit: integrate commit graph with commit parsing

Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently().
This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date.
If core.commitGraph is false, then do not check graph files.
In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations.
By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks.
Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 | 8.31s  | 0.94s  | -88%  |
| branch -vv                       | 1.02s  | 0.14s  | -86%  |
| rev-list --all                   | 5.89s  | 1.07s  | -81%  |
| rev-list --all --objects         | 66.15s | 58.45s | -11%  |
To know more about the commit graph, see "How does git log --graph work?".
The same Git 2.18 (Q2 2018) adds lazy-loading of trees.
The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so.
See commit 279ffad (30 Apr 2018) by SZEDER Gábor (szeder).
See commit 7b8a21d, commit 2e27bd7, commit 5bb03de, commit 891435d (06 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c89b6e1, 23 May 2018)
commit-graph: lazy-load trees for commits

The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits.
Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree().
Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative performance improvement when trees are not needed is small.
On the Linux repository, performance tests were run for the following command:
git log --graph --oneline -1000
Before: 0.92s  After: 0.66s  Rel %: -28.3%
Git 2.21 (Q1 2019) adds a loose objects cache.
See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)
object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.
Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.
The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:

$ git count-objects
26733 objects, 68808 kilobytes

Test                        HEAD^             HEAD
--------------------------------------------------------------------
4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%
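In practice, that means log formats using the abbreviating placeholders %h, %t and %p are the ones that benefit; for example:

git log -100 --pretty=format:'%h %t %p %s'   # abbreviated hashes trigger the collision checks the cache speeds up
git log -100 --pretty=format:'%H %s'         # full hashes (%H, %T, %P) never needed those checks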
Git 2.22 (Apr. 2019) checks errors before using data read from the commit-graph file.
See commit 93b4405, commit 43d3561, commit 7b8ce9c, commit 67a530f, commit 61df89c, commit 2ac138d (25 Mar 2019), and commit 945944c, commit f6761fa (21 Feb 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit a5e4be2, 25 Apr 2019)
commit-graph write: don't die if the existing graph is corrupt

When the commit-graph is written we end up calling parse_commit(). This will in turn invoke code that'll consult the existing commit-graph about the commit; if the graph is corrupted we die.
We thus get into a state where a failing "commit-graph verify" can't be followed-up with a "commit-graph write" if core.commitGraph=true is set; the graph either needs to be manually removed to proceed, or core.commitGraph needs to be set to "false".
Change the "commit-graph write" codepath to use a new parse_commit_no_graph() helper instead of parse_commit() to avoid this.
The latter will call repo_parse_commit_internal() with use_commit_graph=1 as seen in 177722b ("commit: integrate commit graph with commit parsing", 2018-04-10, Git v2.18.0-rc0).
Not using the old graph at all slows down the writing of the new graph by some small amount, but is a sensible way to prevent an error in the existing commit-graph from spreading.
With Git 2.24+ (Q4 2019), the commit-graph is active by default:
See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)
commit-graph: turn on commit-graph by default

The commit-graph feature has seen a lot of activity in the past year or so since it was introduced.
The feature is a critical performance enhancement for medium- to large-sized repos, and does not significantly hurt small repos.
Change the defaults for core.commitGraph and gc.writeCommitGraph to true so users benefit from this feature by default.
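Since the defaults changed, you can check what your Git effectively uses and override it explicitly if you ever need to (with 2.24+, an unset value means the built-in default, true, applies):

git config --get core.commitGraph            # empty output: the built-in default applies
git config --get gc.writeCommitGraph
git config --global core.commitGraph false   # explicit opt-out, if needed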
Still with Git 2.24 (Q4 2019), a configuration variable tells "git fetch" to write the commit graph after finishing.
See commit 50f26bd (03 Sep 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5a53509, 30 Sep 2019)
fetch: add fetch.writeCommitGraph config setting

The commit-graph feature is now on by default, and is being written during 'git gc' by default.
Typically, Git only writes a commit-graph when a 'git gc --auto' command passes the gc.auto setting to actually do work. This means that a commit-graph will typically fall behind the commits that are being used every day.
To stay updated with the latest commits, add a step to 'git fetch' to write a commit-graph after fetching new objects.
The fetch.writeCommitGraph config setting enables writing a split commit-graph, so on average the cost of writing this file is very small. Occasionally, the commit-graph chain will collapse to a single level, and this could be slow for very large repos.
For additional use, adjust the default to be true when feature.experimental is enabled.
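A minimal way to try it on an existing clone (assuming Git 2.24+; "origin" is just the usual remote name):

cd /path/to/repo
git config fetch.writeCommitGraph true
git fetch origin   # the split commit-graph chain under .git/objects/info/commit-graphs/ is updated after the fetch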
And still with Git 2.24 (Q4 2019), the commit-graph is more robust.
See commit 6abada1, commit fbab552 (12 Sep 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 098e8c6, 07 Oct 2019)
commit-graph: bump DIE_ON_LOAD check to actual load-time

Commit 43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25, Git v2.22.0-rc0) added an environment variable we use only in the test suite, $GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD.
But it put the check for this variable at the very top of prepare_commit_graph(), which is called every time we want to use the commit graph.
Most importantly, it comes before we check the fast-path "did we already try to load?", meaning we end up calling getenv() for every single use of the commit graph, rather than just when we load.
getenv() is allowed to have unexpected side effects, but that shouldn't be a problem here; we're lazy-loading the graph so it's clear that at least one invocation of this function is going to call it. But it is inefficient.
getenv() typically has to do a linear search through the environment space.
We could memoize the call, but it's simpler still to just bump the check down to the actual loading step. That's fine for our sole user in t5318, and produces this minor real-world speedup:

[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ):   1.460 s ± 0.017 s  [User: 1.174 s, System: 0.285 s]
  Range (min … max): 1.440 s … 1.491 s  10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ):   1.391 s ± 0.005 s  [User: 1.118 s, System: 0.273 s]
  Range (min … max): 1.385 s … 1.399 s  10 runs
Git 2.24 (Q4 2019) also includes a regression fix.
See commit cb99a34, commit e88aab9 (24 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit dac1d83, 04 Nov 2019)
commit-graph: fix writing first commit-graph during fetch

Reported-by: Johannes Schindelin
Helped-by: Jeff King
Helped-by: Szeder Gábor
Signed-off-by: Derrick Stolee

The previous commit includes a failing test for an issue around fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we fix that bug and set the test to "test_expect_success".
The problem arises with this set of commands when the remote repo at <url> has a submodule.
Note that --recurse-submodules is not needed to demonstrate the bug.

$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)
As an initial fix, I converted the code in builtin/fetch.c that calls write_commit_graph_reachable() to instead launch a "git commit-graph write --reachable --split" process. That code worked, but is not how we want the feature to work long-term.
That test did demonstrate that the issue must be something to do with internal state of the 'git fetch' process.
The write_commit_graph() method in commit-graph.c ensures the commits we plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING flag to mark which commits have already been visited. This allows the walk to take O(N) time, where N is the number of commits, instead of O(P) time, where P is the number of paths. (The number of paths can be exponential in the number of commits.)
However, the UNINTERESTING flag is used in lots of places in the codebase. This flag usually means some barrier to stop a commit walk, such as in revision-walking to compare histories.
It is not often cleared after the walk completes because the starting points of those walks do not have the UNINTERESTING flag, and clear_commit_marks() would stop immediately.
This is happening during a 'git fetch' call with a remote. The fetch negotiation is comparing the remote refs with the local refs and marking some commits as UNINTERESTING.
I tested running clear_commit_marks_many() to clear the UNINTERESTING flag inside close_reachable(), but the tips did not have the flag, so that did nothing.
It turns out that the calculate_changed_submodule_paths() method is at fault. Thanks, Peff, for pointing out this detail! More specifically, for each submodule, the collect_changed_submodules() runs a revision walk to essentially do file-history on the list of submodules. That revision walk marks commits UNINTERESTING if they are simplified away by not changing the submodule.
Instead, I finally arrived on the conclusion that I should use a flag that is not used in any other part of the code. In commit-reach.c, a number of flags were defined for commit walk algorithms. The REACHABLE flag seemed like it made the most sense, and it seems it was not actually used in the file.
The REACHABLE flag was used in early versions of commit-reach.c, but was removed by 4fbcca4 ("commit-reach: make can_all_from_reach... linear", 2018-07-20, v2.20.0-rc0).
Add the REACHABLE flag to commit-graph.c and use it instead of UNINTERESTING in close_reachable().
This fixes the bug in manual testing.
Fetching from multiple remotes into the same repository in parallel had a bad interaction with the recent change to (optionally) update the commit-graph after a fetch job finishes, as these parallel fetches compete with each other.
That has been corrected with Git 2.25 (Q1 2020).
See commit 7d8e72b, commit c14e6e7 (03 Nov 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit bcb06e2, 01 Dec 2019)
fetch: add the command-line option --write-commit-graph

Signed-off-by: Johannes Schindelin

This option overrides the config setting fetch.writeCommitGraph, if both are set.
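In practice that means a one-off fetch can force or skip the graph update regardless of the config; the negated form below is the usual --no- convention Git applies to boolean options:

git fetch --write-commit-graph origin      # write the commit-graph even if fetch.writeCommitGraph is unset or false
git fetch --no-write-commit-graph origin   # skip it even if the config enables it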
And:
fetch: avoid locking issues between fetch.jobs/fetch.writeCommitGraph

Signed-off-by: Johannes Schindelin

When both fetch.jobs and fetch.writeCommitGraph are set, we currently try to write the commit graph in each of the concurrent fetch jobs, which frequently leads to error messages like this one:

fatal: Unable to create '.../.git/objects/info/commit-graphs/commit-graph-chain.lock': File exists.
Let's avoid this by holding off from writing the commit graph until all fetch jobs are done.
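With that fix (Git 2.25+), parallel fetching and the commit-graph update can be combined safely; a sketch, where the remote names are placeholders:

git config fetch.writeCommitGraph true
git fetch --multiple --jobs=4 origin upstream   # fetch remotes in parallel; the graph is written once, at the end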
The code to write split commit-graph file(s) upon fetching computed a bogus value for the parameter used in splitting the resulting files, which has been corrected with Git 2.25 (Q1 2020).
See commit 63020f1 (02 Jan 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 037f067, 06 Jan 2020)
commit-graph: prefer default size_mult when given zero

Signed-off-by: Derrick Stolee

In 50f26bd ("fetch: add fetch.writeCommitGraph config setting", 2019-09-02, Git v2.24.0-rc0 -- merge listed in batch #4), the fetch builtin added the capability to write a commit-graph using the "--split" feature.
This feature creates multiple commit-graph files, and those can merge based on a set of "split options" including a size multiple.
The default size multiple is 2, which intends to provide a log_2 N depth of the commit-graph chain where N is the number of commits.
However, I noticed during dogfooding that my commit-graph chains were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the size_mult variable to 2, except we override it with the context's split_opts if they exist.
In builtin/fetch.c, we create such a split_opts, but do not populate it with values.

This problem is due to two failures:

- It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT with a NULL split_opts.
- If we have a non-NULL split_opts, then we override the default values even if a zero value is given.

Correct both of these issues:

- First, do not override size_mult when the options provide a zero value.
- Second, stop creating a split_opts in the fetch builtin.
Note that git log was broken between Git 2.22 (May 2019) and Git 2.27 (Q2 2020), when using a magic pathspec.
The command-line parsing of "git log :/a/b/" was broken for about a full year without anybody noticing, which has been corrected.
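As a reminder of what that syntax means (a/b/ is only a placeholder path), a ':/…' argument can be read either as a revision or as a top-level pathspec, and you can disambiguate explicitly with --:

git log ':/a/b/'      # revision form: the youngest commit whose message matches the regex a/b/
git log -- ':/a/b/'   # pathspec form: commits touching a/b/ relative to the top of the working tree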
See commit 0220461 (10 Apr 2020) by Jeff King (peff).
See commit 5ff4b92 (10 Apr 2020) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 95ca489, 22 Apr 2020)
sha1-name: do not assume that the ref store is initialized

Reported-by: Érico Rolim

c931ba4e ("sha1-name.c: remove the_repo from handle_one_ref()", 2019-04-16, Git v2.22.0-rc0 -- merge listed in batch #8) replaced the use of for_each_ref() helper, which works with the main ref store of the default repository instance, with refs_for_each_ref(), which can work on any ref store instance, by assuming that the repository instance the function is given has its ref store already initialized.
But it is possible that nobody has initialized it, in which case the code ends up dereferencing a NULL pointer.
And:
repository: mark the "refs" pointer as private

Signed-off-by: Jeff King

The "refs" pointer in a struct repository starts life as NULL, but then is lazily initialized when it is accessed via get_main_ref_store().
However, it's easy for calling code to forget this and access it directly, leading to code which works some of the time, but fails if it is called before anybody else accesses the refs.
This was the cause of the bug fixed by 5ff4b920eb ("sha1-name: do not assume that the ref store is initialized", 2020-04-09, Git v2.27.0 -- merge listed in batch #3). In order to prevent similar bugs, let's more clearly mark the "refs" field as private.
My first thought was to improve your IO, but I tested against the rails repository using an SSD and got a similar result: 30 seconds.
--numstat is what's slowing everything down; otherwise git-log can complete in 1 second even with the formatting. Doing a diff is expensive, so if you can remove that from your process, that will speed things up immensely. Perhaps do it after the fact.
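For the "after the fact" idea, one approach (the file name and <commit-hash> are placeholders) is to produce the cheap, format-only log first and then request the diffstat for only the commits you actually care about:

git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' > log-only.txt   # fast: no diffs computed
git show --numstat --format= <commit-hash>                               # slow part, but only for selected commits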
Otherwise, if you filter the log entries using git-log's own search facilities, that will reduce the number of entries which need a diff. For example, git log --grep=foo --numstat takes just one second. These options are in the docs under "Commit Limiting". They can greatly reduce the number of entries Git has to format. Revision ranges, date filters, author filters, log message grepping... all of this can improve the performance of git-log on a large repository while doing an expensive operation.
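For instance, combining a few of those "Commit Limiting" options keeps the expensive --numstat work restricted to a much smaller set of commits (the author, pattern, revision range and path below are only examples):

git log --since="1 year ago" --author="Jane" --grep="fix" --numstat
git log --numstat v5.0..HEAD -- app/models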