
> Even though we’re telling Git to do a shallow clone, to not fetch any tags, and to fetch the last 50 commits ...

What is the reason for cloning 50 commits? Whenever I clone a repo off GitHub for a quick build and don't care about sending patches back, I always use --depth=1 to avoid any history or stale assets. Is there a reason to get more commits if you don't care about having a local copy of the history? Do automated build pipelines need more info?
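For concreteness, the two variants being compared look roughly like this (the URL is just a placeholder):

  # What the quoted CI setup amounts to: shallow, no tags, last 50 commits.
  git clone --depth=50 --no-tags https://github.com/example/repo.git

  # The minimal quick-build variant: a single commit, nothing else.
  git clone --depth=1 https://github.com/example/repo.git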



Some tools (like linters) might need to look at the actual changes, for example to avoid redundant work on unmodified files. To do that you need the merge bases... which presents a chicken-and-egg problem: to figure them out with git, the commits have to be there locally to begin with. I'm sure you could work around it by scripting against the remote git server, but you might end up dealing with git internals in the process, and it's a pain compared to just cloning the whole repo.
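The workaround I've seen in CI scripts is roughly this sketch (the target branch name and the deepening step size are assumptions):

  # Keep deepening a shallow clone until the merge base with the target branch
  # exists locally, then list only the files that actually changed.
  git fetch origin main
  until git merge-base HEAD origin/main >/dev/null 2>&1; do
    git fetch --deepen=50 origin
  done
  git diff --name-only origin/main...HEAD   # candidate files for the linter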


If you're interested in metadata, you can use --filter=blob:none to get the commit history but without any file contents.
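That's a blobless partial clone, roughly (placeholder URL; the server has to support partial clone):

  # Full commit and tree history, but blobs are only downloaded when checked out.
  git clone --filter=blob:none https://github.com/example/repo.git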


Did not know, that's great, thanks! Seems this is a relatively recent feature?


I can’t speak for the original post, but I’ve seen other people[1] increase the commit count because part of the build process looks for a specific commit to check out after cloning. If you have pull requests landing concurrently and you only clone the most recent commit, there is a race condition between when you queue the build with a specific commit id and when you start the clone.
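Roughly, such a pipeline does something like this (the URL, the depth, and the $COMMIT_SHA variable are placeholders):

  # Fetch enough history that the commit the build was queued against exists,
  # then check it out explicitly instead of building whatever the tip happens to be.
  git clone --depth=50 https://github.com/example/repo.git repo
  git -C repo checkout "$COMMIT_SHA"
  # If the commit is older than the shallow window, deepen until it appears:
  #   git -C repo fetch --deepen=50 origin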

All that being said, I don’t know why you would need your build agents to clone the whole damn repo for every build. Why not keep a copy around? That’s what TFS does.

One other thing I've seen to reduce the Git clone bottleneck is to clone from Git once, create a Git bundle from the clone, upload the bundle to cloud storage, and then have the subsequent steps use the bundle instead of cloning directly. See these two files for the .NET Runtime repo[2][3]. I assume they do this because the clone step is slow or unreliable and then the subsequent moving around of the bundle is faster and more reliable. It also makes every node get the exact same clone (they build on macOS, Windows, and Linux).
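A rough sketch of that pattern (the names, the branch, and the upload step are placeholders, not what the dotnet/runtime scripts literally do):

  # On one machine: clone once and package the whole repo into a single file.
  git clone --mirror https://github.com/example/repo.git src.git
  git -C src.git bundle create repo.bundle --all
  # ...upload repo.bundle to cloud storage here...

  # On each build agent: download the bundle and clone from it locally.
  git clone -b main repo.bundle repo   # "main" is an assumed branch name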

Lastly, be careful with the depth option when cloning. It causes a higher CPU burden on the remote. You can see this in the console output when the remote says it is compressing objects. And if you subsequently do a normal fetch after a shallow clone, you can cause the server to do even more work[4].

1: https://github.com/dotnet/runtime/pull/35109

2: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

3: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

4: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...


Also worth noting that git is pretty efficient at cloning a bunch of subsequent commits, due to delta encoding.

edit: looks like git doesn't implement fetching thin packs when populating a shallow clone. It will still avoid fetching unnecessary packs, so the efficiency is still high for most software repositories.


Does git do delta encoding during clones? I know it doesn’t use deltas for most things.


I am fairly sure it usually uses thin packs during a clone, though the docs at https://www.git-scm.com/docs/shallow say:

> There are some unfinished ends of the whole shallow business:

> - maybe we have to force non-thin packs when fetching into a shallow repo (ATM they are forced non-thin).


Tags. All of my builds use `git describe` to get a meaningful version number for the build.
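So the clone has to be deep enough (and keep its tags) for describe to find something, e.g. (placeholder URL and depth):

  # A shallow clone only picks up tags pointing at commits it actually fetched,
  # so the nearest annotated tag must be inside the fetched window.
  git clone --depth=50 https://github.com/example/repo.git repo
  git -C repo describe --tags --long   # e.g. v1.2.3-14-gabc1234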



