Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

maintenance: add prune-remote-refs task #1838

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pastelsky
Copy link

@pastelsky pastelsky commented Dec 17, 2024

As discussed previously on:
https://lore.kernel.org/git/[email protected]/T/#t

Remote-tracking refs can accumulate in local repositories even as branches
are deleted on remotes, impacting git performance negatively. Existing
alternatives to keep refs pruned have a few issues — 

  1. The fetch.prune config automatically cleans up remote ref on fetch,
    but also pulls in new ref from remote which is an undesirable side-effect.

2.git remote prune cleans up refs without adding to the existing list
but requires periodic user intervention.

This adds a new maintenance task 'prune-remote-refs' that runs
'git remote prune' for each configured remote daily. This provides an
automated way to clean up stale remote-tracking refs — especially when
users may not do a full fetch.

This task is disabled by default.

CC: Junio C Hamano [email protected], Patrick Steinhardt [email protected]

Copy link

gitgitgadget bot commented Dec 17, 2024

There are issues in commit 7076784:
maintenance: add prune-remote-refs task
Commit not signed off

@pastelsky pastelsky force-pushed the sk/add-remote-prune-maintenance branch 7 times, most recently from 849e20a to 1af8661 Compare December 18, 2024 03:38
Copy link

gitgitgadget bot commented Dec 18, 2024

There are issues in commit 1af8661:
maintenance: add prune-remote-refs task
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

@pastelsky pastelsky force-pushed the sk/add-remote-prune-maintenance branch from 1af8661 to b67f04c Compare December 23, 2024 09:04
Copy link

gitgitgadget bot commented Dec 23, 2024

There are issues in commit b67f04c:
maintenance: add prune-remote-refs task
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

@pastelsky pastelsky force-pushed the sk/add-remote-prune-maintenance branch 2 times, most recently from b66f827 to 72e27d3 Compare December 23, 2024 09:33
@pastelsky
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Dec 23, 2024

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1838/pastelsky/sk/add-remote-prune-maintenance-v1

To fetch this version to local tag pr-1838/pastelsky/sk/add-remote-prune-maintenance-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1838/pastelsky/sk/add-remote-prune-maintenance-v1

Copy link

gitgitgadget bot commented Dec 27, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Thanks for a patch.


"Shubham Kanodia via GitGitGadget" <[email protected]> writes:

You'd want to check your procedure to tell GGG about addresses; I am
seeing these

    From: "Shubham Kanodia via GitGitGadget" <[email protected]>
    To: [email protected]
    Cc: "mailto:[email protected]" <[[email protected]]>,
            "mailto:[email protected]" <[[email protected]]>,
            Shubham Kanodia <[email protected]>,
            Shubham Kanodia <[email protected]>

and Cc addresses in it would probably not work as-is (I've fixed
them up manually).

> From: Shubham Kanodia <[email protected]>
>
> Remote-tracking refs can accumulate in local repositories even as branches
> are deleted on remotes, impacting git performance negatively. Existing
> alternatives to keep refs pruned have a few issues:
>
> 1. The `fetch.prune` config automatically cleans up remote ref on fetch,
> but also pulls in new ref from remote which is an undesirable side-effect.

This makes it sound as if fetch.prune configuration makes new refs
pulled, but that is not what happens and that is not what you wanted
to hint.

	If you run "git fetch" with the "--prune" option (or with
	the fetch.prune configuration set to true) while having the
	default refspec "+refs/heads/*:refs/remotes/$name/*"
	configured in remote.$name.fetch, then ...

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 6e6651309d3..0c8f1e01ccd 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -158,6 +158,26 @@ pack-refs::
>  	need to iterate across many references. See linkgit:git-pack-refs[1]
>  	for more information.
>  
> +prune-remote-refs::
> +	The `prune-remote-refs` task runs `git remote prune` on each remote
> +	repository registered in the local repository. This task helps clean
> +	up deleted remote branches, improving the performance of operations
> +	that iterate through the refs. See linkgit:git-remote[1] for more
> +	information. This task is disabled by default.
> ++
> +NOTE: This task is opt-in to prevent unexpected removal of remote refs
> +for users of git-maintenance. For most users, configuring `fetch.prune=true`
> +is a acceptable solution, as it will automatically clean up stale remote-tracking
> +branches during normal fetch operations. However, this task can be useful in
> +specific scenarios:
> ++
> +--
> +* When using selective fetching (e.g., `git fetch origin +foo:refs/remotes/origin/foo`)
> +  where `fetch.prune` would not affect refs outside the fetched hierarchy

The word "hierarchy" hints that things under refs/remotes/origin/
(which is the hierarchy 'foo' is fetched into) that went away would
be pruned, but that is not what happens (otherwise you would not be
adding this feature).

> +* When third-party tools might perform unexpected full fetches, and you want
> +  periodic cleanup independently of fetch operations

You'd want a full-stop after these two sentences, by the way.

> diff --git a/builtin/gc.c b/builtin/gc.c
> index 4ae5196aedf..9acf1d29895 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -20,6 +20,7 @@
>  #include "lockfile.h"
>  #include "parse-options.h"
>  #include "run-command.h"
> +#include "remote.h"
>  #include "sigchain.h"
>  #include "strvec.h"
>  #include "commit.h"
> @@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct option *opt, const char *arg,
>  	return 0;
>  }
>  
> +static int collect_remote(struct remote *remote, void *cb_data)
> +{
> +	struct string_list *list = cb_data;
> +
> +	if (!remote->url.nr)
> +		return 0;
> +
> +	string_list_append(list, remote->name);
> +	return 0;
> +}
> +
> +static int maintenance_task_prune_remote(struct maintenance_run_opts *opts UNUSED,
> +					 struct gc_config *cfg UNUSED)
> +{
> +	struct string_list_item *item;
> +	struct string_list remotes_list = STRING_LIST_INIT_NODUP;
> +	struct child_process child = CHILD_PROCESS_INIT;
> +	int result = 0;
> +
> +	for_each_remote(collect_remote, &remotes_list);
> +
> +	for_each_string_list_item (item, &remotes_list) {
> +		const char *remote_name = item->string;
> +		child.git_cmd = 1;
> +		strvec_pushl(&child.args, "remote", "prune", remote_name, NULL);
> +
> +		if (run_command(&child))
> +			result = error(_("failed to prune '%s'"), remote_name);
> +	}

Hmph, is there a reason why you need two loops, instead of
for-each-remote calling a function that does the run_command()
thing?

"git grep for_each_string_list_item \*.c" tells me that we almost
never write SP between the macro name and the opening parenthesis.

This loop does not stop at the first error, but returns a non-zero
error after noticing even a single remote fail to run prune, which
sounds like a seneible design.  Would an error percolate up the same
way when two different tasks run and one of them fails in the
control folow in "git maintenance"?  Just want to see if we are
being consistent with the surrounding code.

Thanks.

@pastelsky pastelsky force-pushed the sk/add-remote-prune-maintenance branch from 72e27d3 to 547f14d Compare December 28, 2024 09:37
Remote-tracking refs can accumulate in local repositories even as branches
are deleted on remotes, impacting git performance negatively. Existing
alternatives to keep refs pruned have a few issues — 

1. Running `git fetch` with either `--prune` or `fetch.prune=true` set will
prune stale refs, but requires a manual operation and also pulls in new
refs from remote which can be an undesirable side-effect.

2.`git remote prune` cleans up refs without adding to the existing list
but requires periodic user intervention.

This adds a new maintenance task 'prune-remote-refs' that runs
'git remote prune' for each configured remote daily. This provides an
automated way to clean up stale remote-tracking refs — especially when
users may not do a full fetch.

This task is disabled by default.

Signed-off-by: Shubham Kanodia <[email protected]>
@pastelsky pastelsky force-pushed the sk/add-remote-prune-maintenance branch from 547f14d to 4d6c143 Compare December 28, 2024 09:56
Copy link

gitgitgadget bot commented Dec 28, 2024

On the Git mailing list, Shubham Kanodia wrote (reply to this):

On Fri, Dec 27, 2024 at 2:37 PM Junio C Hamano <[email protected]> wrote:
>
> Thanks for a patch.
>
>
> "Shubham Kanodia via GitGitGadget" <[email protected]> writes:
>
> You'd want to check your procedure to tell GGG about addresses; I am
> seeing these
>
>     From: "Shubham Kanodia via GitGitGadget" <[email protected]>
>     To: [email protected]
>     Cc: "mailto:[email protected]" <[[email protected]]>,
>             "mailto:[email protected]" <[[email protected]]>,
>             Shubham Kanodia <[email protected]>,
>             Shubham Kanodia <[email protected]>
>
> and Cc addresses in it would probably not work as-is (I've fixed
> them up manually).

I think the GGG comment had a few formatting errors. Thanks for fixing the cc.

> Hmph, is there a reason why you need two loops, instead of
> for-each-remote calling a function that does the run_command()
> thing?

It can be collapsed into one.

> This loop does not stop at the first error, but returns a non-zero
> error after noticing even a single remote fail to run prune, which
> sounds like a seneible design.  Would an error percolate up the same
> way when two different tasks run and one of them fails in the
> control folow in "git maintenance"?  Just want to see if we are
> being consistent with the surrounding code.

Fair point. I'll make the process flow identical to the prefetch refs
task that works similarly across remotes.
It returns as soon as the first remote fails (without necessarily
affecting other tasks).

@pastelsky
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Dec 28, 2024

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1838/pastelsky/sk/add-remote-prune-maintenance-v2

To fetch this version to local tag pr-1838/pastelsky/sk/add-remote-prune-maintenance-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1838/pastelsky/sk/add-remote-prune-maintenance-v2

Copy link

gitgitgadget bot commented Dec 28, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Shubham Kanodia <[email protected]> writes:

>> Hmph, is there a reason why you need two loops, instead of
>> for-each-remote calling a function that does the run_command()
>> thing?
>
> It can be collapsed into one.

Sorry, but that is not an answer, as my question was not a
suggestion to change anything.

It was a question asking you if there was a specific reason why the
code was structured the way it was written.  If there is another way
to write it, you need to answer why the alternative wasn't picked.

>> This loop does not stop at the first error, but returns a non-zero
>> error after noticing even a single remote fail to run prune, which
>> sounds like a seneible design.  Would an error percolate up the same
>> way when two different tasks run and one of them fails in the
>> control folow in "git maintenance"?  Just want to see if we are
>> being consistent with the surrounding code.
>
> Fair point. I'll make the process flow identical to the prefetch refs
> task that works similarly across remotes.
> It returns as soon as the first remote fails (without necessarily
> affecting other tasks).

... and the first failure signals the caller a failure?  That would
match what you did in your new feature, which is perfect.

Thanks.

Copy link

gitgitgadget bot commented Dec 28, 2024

On the Git mailing list, Shubham Kanodia wrote (reply to this):

On Sat, Dec 28, 2024 at 9:35 PM Junio C Hamano <[email protected]> wrote:
>
> Shubham Kanodia <[email protected]> writes:
>
> >> Hmph, is there a reason why you need two loops, instead of
> >> for-each-remote calling a function that does the run_command()
> >> thing?
> >
> > It can be collapsed into one.
>
> Sorry, but that is not an answer, as my question was not a
> suggestion to change anything.
>
> It was a question asking you if there was a specific reason why the
> code was structured the way it was written.  If there is another way
> to write it, you need to answer why the alternative wasn't picked.

There wasn't a good reason for doing it that way. I guess I was trying
to understand the second argument for `for_each_remote` would be best
used if the command was called directly (while avoiding a compilation
warning), but looking at a few other usages of `for_each_remote` I
realised that it could just be marked unused in this case (since we
aren't doing anything with it).

I should've probably looked deeper and learnt from existing patterns
(e.g. `maintenance_task_prefetch`) — which I have in my last patch.

> >> This loop does not stop at the first error, but returns a non-zero
> >> error after noticing even a single remote fail to run prune, which
> >> sounds like a seneible design.  Would an error percolate up the same
> >> way when two different tasks run and one of them fails in the
> >> control folow in "git maintenance"?  Just want to see if we are
> >> being consistent with the surrounding code.
> >
> > Fair point. I'll make the process flow identical to the prefetch refs
> > task that works similarly across remotes.
> > It returns as soon as the first remote fails (without necessarily
> > affecting other tasks).
>
> ... and the first failure signals the caller a failure?  That would
> match what you did in your new feature, which is perfect.

Exactly — the first failing remote will signal that the
`prune-remote-refs` task has failed via an immediate `return 1`.
The maintenance command uses this to register the exit code of the top
level command to 1, while continuing to execute all other tasks
anyway.

Thanks,
Shubham K

Copy link

gitgitgadget bot commented Dec 28, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Shubham Kanodia via GitGitGadget" <[email protected]> writes:

> From: Shubham Kanodia <[email protected]>
>
> Remote-tracking refs can accumulate in local repositories even as branches
> are deleted on remotes, impacting git performance negatively. Existing
> alternatives to keep refs pruned have a few issues — 
>
> 1. Running `git fetch` with either `--prune` or `fetch.prune=true` set will
> prune stale refs, but requires a manual operation and also pulls in new
> refs from remote which can be an undesirable side-effect.

It is only true if you cloned without any tweaks.  For example, if
you cloned with the single-branch option, you would not pull in new
refs, wouldn't you?  Also "requires a manual operation" is not quite
a good rationale, as you could have placed such a "git fetch"
instead of "git remote prune", in the maintenance schedule.

For this to become an issue, the condition we saw in earlier
discussion, i.e.

    while having the *default* refspec
    "+refs/heads/*:refs/remotes/$name/*" configured in
    remote.$name.fetch

is crucial.  Since that is the default refspec "git clone" gives
you, your "git fetch --prune" would give you full set of refs while
pruning, and the end result is that you have up-to-date set of
remote-tracking branches (which you may not want).

> 2.`git remote prune` cleans up refs without adding to the existing list
> but requires periodic user intervention.

You have a SP after "1." but not after "2.".

> This adds a new maintenance task 'prune-remote-refs' that runs
> 'git remote prune' for each configured remote daily. This provides an
> automated way to clean up stale remote-tracking refs — especially when
> users may not do a full fetch.

"This adds" -> "Add".

I'd strike the latter sentence.  Regardless of what users do or do
not do, the automated clean-up is performed.

> This task is disabled by default.
>
> Signed-off-by: Shubham Kanodia <[email protected]>
> ---

> +NOTE: This task is opt-in to prevent unexpected removal of remote refs
> +for users of git-maintenance. For most users, configuring `fetch.prune=true`
> +is a acceptable solution, as it will automatically clean up stale remote-tracking

"a acceptable" -> "an acceptable".

> +branches during normal fetch operations. However, this task can be useful in
> +specific scenarios:
> ++
> +--
> +* When using selective fetching (e.g., `git fetch origin +foo:refs/remotes/origin/foo`)
> +  where `fetch.prune` would only affect refs that are explicitly fetched.
> +* When third-party tools might perform unexpected full fetches, and you want
> +  periodic cleanup independently of fetch operations.
> +--

Very well written.

> @@ -913,6 +914,30 @@ static int maintenance_opt_schedule(const struct option *opt, const char *arg,
>  	return 0;
>  }
>  
> +static int prune_remote(struct remote *remote, void *cb_data UNUSED)
> +{
> +	struct child_process child = CHILD_PROCESS_INIT;
> +
> +	if (!remote->url.nr)
> +		return 0;
> +
> +	child.git_cmd = 1;
> +	strvec_pushl(&child.args, "remote", "prune", remote->name, NULL);
> +
> +	return !!run_command(&child);
> +}
> +
> +static int maintenance_task_prune_remote(struct maintenance_run_opts *opts,
> +					 struct gc_config *cfg UNUSED)
> +{
> +	if (for_each_remote(prune_remote, opts)) {
> +		error(_("failed to prune remotes"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}

Nice reuse of the program structure, which is very clean and easy to read.

Overall very well written.  Will queue, with attached range-diff.
Thanks.


--- >8 ---
1:  0ae9668b5c ! 1:  8a40f8b319 maintenance: add prune-remote-refs task
    @@ Commit message
     
         Remote-tracking refs can accumulate in local repositories even as branches
         are deleted on remotes, impacting git performance negatively. Existing
    -    alternatives to keep refs pruned have a few issues — 
    +    alternatives to keep refs pruned have a few issues:
     
    -    1. Running `git fetch` with either `--prune` or `fetch.prune=true` set will
    -    prune stale refs, but requires a manual operation and also pulls in new
    -    refs from remote which can be an undesirable side-effect.
    +      1. Running `git fetch` with either `--prune` or `fetch.prune=true`
    +         set, with the default refspec to copy all their branches into
    +         our remote-tracking branches, will prune stale refs, but also
    +         pulls in new branches from remote.  That is undesirable if the
    +         user wants to only work with a selected few remote branches.
     
    -    2.`git remote prune` cleans up refs without adding to the existing list
    -    but requires periodic user intervention.
    +      2. `git remote prune` cleans up refs without adding to the
    +         existing list but requires periodic user intervention.
     
    -    This adds a new maintenance task 'prune-remote-refs' that runs
    -    'git remote prune' for each configured remote daily. This provides an
    -    automated way to clean up stale remote-tracking refs — especially when
    -    users may not do a full fetch.
    -
    -    This task is disabled by default.
    +    Add a new maintenance task 'prune-remote-refs' that runs 'git remote
    +    prune' for each configured remote daily.  Leave the task disabled by
    +    default, as it may be unexpected to see their remote-tracking
    +    branches to disappear while they are not watching for unsuspecting
    +    users.
     
         Signed-off-by: Shubham Kanodia <[email protected]>
         Signed-off-by: Junio C Hamano <[email protected]>
    @@ Documentation/git-maintenance.txt: pack-refs::
     ++
     +NOTE: This task is opt-in to prevent unexpected removal of remote refs
     +for users of git-maintenance. For most users, configuring `fetch.prune=true`
    -+is a acceptable solution, as it will automatically clean up stale remote-tracking
    ++is an acceptable solution, as it will automatically clean up stale remote-tracking
     +branches during normal fetch operations. However, this task can be useful in
     +specific scenarios:
     ++

Copy link

gitgitgadget bot commented Dec 28, 2024

This patch series was integrated into seen via git@6898464.

@gitgitgadget gitgitgadget bot added the seen label Dec 28, 2024
Copy link

gitgitgadget bot commented Dec 28, 2024

This patch series was integrated into seen via git@c020efd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant