Explore using Nerdbank.GitVersioning's ManagedGit implementation for reading Git repositories #343

Closed
mscottford opened this issue Jul 6, 2021 · 4 comments
Labels
design For design discussion issues

Comments

@mscottford
Member

As documented in #243, there are going to be some benefits to moving away from libgit2sharp and towards a different solution.

The Nerdbank.GitVersioning project has experimented with this in the past as well. Their discussion in dotnet/Nerdbank.GitVersioning#505 is a very interesting read. Pretty much all of the objections that I have to us continuing to use libgit2sharp are well stated in that discussion. While reading through what they implemented as part of dotnet/Nerdbank.GitVersioning#521, I realized that their implementation is marked public, which means we can try using it for our needs.

Their intended use for the git repository is very similar to ours (just reading through the history, with no interest in making changes or commits to the repository), so I think there's a really good chance that we could use their code to pull out the contents of dependency manifest files along with their history.

Our current usage of git requires us to make a clone of the repository if it doesn't exist already. The Nerdbank.GitVersioning ManagedGit code does not implement the git clone command, because for its use case it makes sense to assume that a clone is already available. So that's something that we'd have to build ourselves.

Our current usage also assumes that the history is stored directly on the filesystem, and that's something that I'd love to break our dependency on. We could try to avoid the need to rely on the filesystem by instead implementing our own object/pack storage mechanism in memory (similar to what the C library libgit2 and Go library go-git support).
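To make the in-memory storage idea concrete, here is a minimal sketch of a content-addressed object store held in a dictionary. It is written in Python for brevity and is not taken from ManagedGit, libgit2, or go-git; the class name `InMemoryObjectStore` and its methods are hypothetical. It relies only on Git's loose-object format, where the object id is the SHA-1 of `<type> <size>\0<body>`, and it leaves out packs entirely.

```python
import hashlib
import zlib


class InMemoryObjectStore:
    """Hypothetical in-memory stand-in for loose objects under .git/objects."""

    def __init__(self):
        # hex object id -> zlib-compressed "<type> <size>\0<body>"
        self._objects = {}

    def put(self, obj_type: str, body: bytes) -> str:
        # Git derives the object id from the header plus the body.
        raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
        oid = hashlib.sha1(raw).hexdigest()
        self._objects[oid] = zlib.compress(raw)
        return oid

    def get(self, oid: str):
        raw = zlib.decompress(self._objects[oid])
        # The header ends at the first NUL byte; the rest is the body.
        header, _, body = raw.partition(b"\x00")
        obj_type, _size = header.decode().split(" ")
        return obj_type, body


store = InMemoryObjectStore()
oid = store.put("blob", b"hello\n")
obj_type, body = store.get(oid)
```

A real implementation would also need to read pack files and resolve deltas, which is where most of the work would likely be.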

Since we're going to have to build our own equivalent of the clone command and the git data transfer protocols, starting out with support for performing a clone operation directly into an in-memory store is something that we should consider including.

It's worth noting that since libgit2 has functionality for using custom Git object storage mechanisms, it might be exposed via libgit2sharp as well. Even if this is the case, I'm still not a fan of us depending on libgit2sharp, if we can avoid it. I'd rather keep our dependency on the filesystem and remove our dependency on libgit2sharp than the other way around.

Implementing our own clone command does open other possibilities that are worth considering. We could make our implementation smart enough to only grab objects, packs, commits, and trees that contain the files that we're interested in reading. We don't need all of the source code, just the dependency manifest information. This would potentially save us a bunch of time that we're currently spending waiting for a full git clone command to complete. If we could walk the commits on the remote to find ones that reference dependency manifest files, then we could just request the objects/packs that contain the versions of those files that we need. This would result in a much smaller data transfer in terms of the raw number of bytes. It is possible that the extra processing that we'd have to do would negate any performance benefit as measured in seconds. We could do some profiling to assess that, though. And I suspect that transferring less data would be a big win for really large repositories.
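The selection step described above could look roughly like the following sketch, again in Python for brevity. Everything here is an assumption for illustration: the `MANIFEST_NAMES` set, the `wanted_blobs` function, and the shape of the `history` data, which stands in for whatever a walk of the remote's commits and trees would produce. It does not address how that listing would be obtained from the remote.

```python
from pathlib import PurePosixPath

# Assumed manifest file names; the real list depends on the supported ecosystems.
MANIFEST_NAMES = {"Gemfile.lock", "package.json", "composer.json"}


def wanted_blobs(commits):
    """Return the ids of the blobs worth requesting.

    `commits` is a hypothetical list of (commit_id, {path: blob_id}) pairs.
    """
    wanted = set()
    for _commit_id, tree in commits:
        for path, blob_id in tree.items():
            # Match on the file name so manifests in subdirectories count too.
            if PurePosixPath(path).name in MANIFEST_NAMES:
                wanted.add(blob_id)
    return wanted


# Two commits: the second one adds a nested package.json.
history = [
    ("c1", {"README.md": "b1", "Gemfile.lock": "b2"}),
    ("c2", {"README.md": "b3", "Gemfile.lock": "b2", "web/package.json": "b4"}),
]
blobs = wanted_blobs(history)
```

In this toy history only two of the four distinct blobs would need to be transferred, which is the kind of saving that profiling on a large repository would have to confirm.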

There is a potential risk that needs to be noted if we go forward with this idea. The Nerdbank.GitVersioning team might not be excited to learn that we plan on using their ManagedGit code directly. They might react by marking those classes internal. The discussion in dotnet/Nerdbank.GitVersioning#505 included some back-and-forth about where the ManagedGit implementation should live, with one of the options being to move it into a separate package. Perhaps that's an extraction effort that we could assist them with in the event that they object to us using Nerdbank.GitVersioning as a dependency just for the purpose of consuming the ManagedGit code that it contains.

@mscottford
Member Author

I've also discovered that JGit has support for performing an in-memory clone operation. https://github.com/centic9/jgit-cookbook/blob/master/src/main/java/org/dstadler/jgit/porcelain/CloneRemoteRepositoryIntoMemoryAndReadFile.java

@mrbiggred mrbiggred added the design For design discussion issues label Jul 14, 2021
@mscottford
Member Author

I think that a good start in this direction would be to add an alternative to LibGit2Sharp instead of just replacing our current implementation. It could be behavior that gets turned on via an environment variable, or we could even leverage https://github.com/scientistproject/Scientist.net to run both libraries in parallel.
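A rough sketch of that side-by-side approach, in Python for brevity: the existing implementation always supplies the returned value, and the alternative only runs when an environment variable is set, with any disagreement recorded. This is not Scientist.net's API; the function `run_experiment` and the variable name `FRESHLI_GIT_EXPERIMENT` are hypothetical.

```python
import os


def run_experiment(control, candidate, path, env=os.environ, mismatches=None):
    """Return control(path); optionally compare it against candidate(path)."""
    expected = control(path)
    if env.get("FRESHLI_GIT_EXPERIMENT") == "1":
        try:
            observed = candidate(path)
        except Exception as error:
            # A failure in the alternative must never affect the caller.
            observed = repr(error)
        if observed != expected and mismatches is not None:
            mismatches.append((path, expected, observed))
    return expected


# Stand-ins for the two Git readers.
def current_reader(path):
    return f"contents of {path}"


def alternative_reader(path):
    return f"CONTENTS OF {path}"


log = []
result = run_experiment(
    current_reader,
    alternative_reader,
    "Gemfile.lock",
    env={"FRESHLI_GIT_EXPERIMENT": "1"},
    mismatches=log,
)
```

Because the caller only ever sees the existing implementation's result, the alternative can be trialled without changing behavior.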

@mscottford mscottford moved this to Backlog in Freshli Apr 29, 2022
@mscottford
Member Author

Freshli-CLI is wrapping the Git executable directly.
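For readers unfamiliar with the approach, wrapping the executable generally means spawning `git` as a child process and reading its standard output. The sketch below shows the general shape in Python; it is not taken from Freshli-CLI, the helper name `run_git` is hypothetical, and it assumes `git` is available on the `PATH`.

```python
import subprocess


def run_git(args, cwd=None, runner=subprocess.run):
    """Run `git <args>` in `cwd` and return its standard output as text.

    `runner` is injectable so that the helper can be exercised without Git.
    """
    result = runner(
        ["git", *args],
        cwd=cwd,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

For example, `run_git(["log", "--format=%H", "--", "Gemfile.lock"], cwd=repo_path)` would list the hashes of the commits that touched that file, where `repo_path` is a placeholder for an existing clone.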

Repository owner moved this from Icebox to Done in Freshli Sep 8, 2022
@rcdailey

@mscottford Can you explain a bit further? Do you mean that you execute shell commands from C# code? Do you rely on Git being available on the system already?
