
> Even though we’re telling Git to do a shallow clone, to not fetch any tags, and to fetch the last 50 commits ...

What is the reason for cloning 50 commits? Whenever I clone a repo off GitHub for a quick build and don't care about sending patches back, I always use --depth=1 to avoid any history or stale assets. Is there a reason to get more commits if you don't care about having a local copy of the history? Do automated build pipelines need more info?
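For concreteness, the two variants being compared look roughly like this (the URL is just a placeholder):

  # What the quoted CI setup amounts to: shallow, no tags, last 50 commits.
  git clone --depth=50 --no-tags https://github.com/example/repo.git

  # The minimal quick-build variant: a single commit, nothing else.
  git clone --depth=1 https://github.com/example/repo.git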



Some tools (like linters) might need to look at the actual changes, for example to avoid redundant work on unmodified files. To do that you need the merge bases... which presents a chicken-and-egg problem: to figure them out with git, the commits have to be there locally to begin with. I'm sure you could work around it by scripting against the remote git server, but you might end up dealing with git internals in the process, and it's a pain compared to just cloning the whole repo.
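The workaround I've seen in CI scripts is roughly this sketch (the target branch name and the deepening step size are assumptions):

  # Keep deepening a shallow clone until the merge base with the target branch
  # exists locally, then list only the files that actually changed.
  git fetch origin main
  until git merge-base HEAD origin/main >/dev/null 2>&1; do
    git fetch --deepen=50 origin
  done
  git diff --name-only origin/main...HEAD   # candidate files for the linter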


If you're interested in metadata, you can use --filter=blob:none to get the commit history but without any file contents.
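That's a blobless partial clone, roughly (placeholder URL; the server has to support partial clone):

  # Full commit and tree history, but blobs are only downloaded when checked out.
  git clone --filter=blob:none https://github.com/example/repo.git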


Did not know, that's great, thanks! Seems this is a relatively recent feature?


I can’t speak for the original post, but I’ve seen other people[1] increase the commit count because part of the build process looks for a specific commit to check out after cloning. If you have pull requests landing concurrently and you only clone the most recent commit, there is a race condition between when you queue the build with a specific commit id and when you start the clone.
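Roughly, such a pipeline does something like this (the URL, the depth, and the $COMMIT_SHA variable are placeholders):

  # Fetch enough history that the commit the build was queued against exists,
  # then check it out explicitly instead of building whatever the tip happens to be.
  git clone --depth=50 https://github.com/example/repo.git repo
  git -C repo checkout "$COMMIT_SHA"
  # If the commit is older than the shallow window, deepen until it appears:
  #   git -C repo fetch --deepen=50 origin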

All that being said, I don’t know why you would need your build agents to clone the whole damn repo for every build. Why not keep a copy around? That’s what TFS does.

One other thing I've seen to reduce the Git clone bottleneck is to clone from Git once, create a Git bundle from the clone, upload the bundle to cloud storage, and then have the subsequent steps use the bundle instead of cloning directly. See these two files for the .NET Runtime repo[2][3]. I assume they do this because the clone step is slow or unreliable and then the subsequent moving around of the bundle is faster and more reliable. It also makes every node get the exact same clone (they build on macOS, Windows, and Linux).
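A rough sketch of that pattern (the names, the branch, and the upload step are placeholders, not what the dotnet/runtime scripts literally do):

  # On one machine: clone once and package the whole repo into a single file.
  git clone --mirror https://github.com/example/repo.git src.git
  git -C src.git bundle create repo.bundle --all
  # ...upload repo.bundle to cloud storage here...

  # On each build agent: download the bundle and clone from it locally.
  git clone -b main repo.bundle repo   # "main" is an assumed branch name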

Lastly, be careful with the depth option when cloning. It causes a higher CPU burden on the remote. You can see this in the console output when the remote says it is compressing objects. And if you subsequently do a normal fetch after a shallow clone, you can cause the server to do even more work[4].

1: https://github.com/dotnet/runtime/pull/35109

2: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

3: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

4: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...


Also worth noting that git is pretty efficient at cloning a bunch of subsequent commits, due to delta encoding.

edit: looks like git doesn't implement fetching thin packs when populating a shallow clone. It will still avoid fetching unnecessary packs, so the efficiency is still high for most software repositories.


Does git do delta encoding during clones? I know it doesn’t use deltas for most things.


I am fairly sure it usually uses thin packs during a clone, though the docs at https://www.git-scm.com/docs/shallow say:

> There are some unfinished ends of the whole shallow business:

> - maybe we have to force non-thin packs when fetching into a shallow repo (ATM they are forced non-thin).


Tags. All of my builds use `git describe` to get a meaningful version number for the build.
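So the clone has to be deep enough (and keep its tags) for describe to find something, e.g. (placeholder URL and depth):

  # A shallow clone only picks up tags pointing at commits it actually fetched,
  # so the nearest annotated tag must be inside the fetched window.
  git clone --depth=50 https://github.com/example/repo.git repo
  git -C repo describe --tags --long   # e.g. v1.2.3-14-gabc1234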



