Cleaning up your git history

#git#fix#script#tutorial::alemi::2024-07-04 16:32

Introduction

As writing code becomes more and more accessible, the number of software developers is growing steadily and more of them start at younger ages. This is great!

But a side effect of starting to work on one's software portfolio earlier is that early choices are often... not very professional. Maybe an old email address you'd rather make disappear, maybe your full legal name thrown in, maybe vulgar commit messages: it has happened to everyone. What to do? Hide the project? Delete git history and re-commit as a single "initial commit"?? Sacrifice an NVIDIA GPU to Torvalds hoping to be blessed with git expertise in your dreams???

None of these was really acceptable to me (except the last one, but it didn't work, oof) so I decided to look into better solutions. Git is such an awesome tool, there must be a perfect solution, right????

The problem

Nope. Sorry, there will be compromises. Get used to disappointment.

Git repositories are chains of commits: each commit's hash covers its content, its metadata and the hash of its parent, so every new one builds cryptographically on top of the previous ones. This makes it impossible to just go and change something in the first commit: you need to redo all the commits after it too! While a few dozen commits are manageable, hundreds are really hard to fix and trying to fix thousands manually is plain madness.
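
To see the chain for yourself, you can dump a raw commit with git cat-file. This is only a minimal sketch in a throwaway repository: the temp directory, file name and demo identity are placeholders made up for the example.

```shell
# Scratch repo so nothing real is touched (all names here are placeholders)
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "demo"
git config user.email "demo@example.com"
git config commit.gpgsign false

echo one > file.txt && git add file.txt && git commit -q -m "first"
echo two >> file.txt && git commit -q -am "second"

first="$(git rev-parse HEAD~1)"

# The raw "second" commit names its parent by hash: this is the chain
git cat-file -p HEAD
parent="$(git cat-file -p HEAD | sed -n 's/^parent //p')"

# Amending "first" gives it a new hash, so "second" no longer sits on top of it
git checkout -q --detach "$first"
git commit -q --amend -m "first, reworded"
amended="$(git rev-parse HEAD)"
echo "old first: $first"
echo "new first: $amended"
```

The "parent" line in the dump is the old hash of "first", which is why "second" would have to be redone on top of the amended commit.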

Commits store both an "author" and a "committer". What you see is usually the "author", but both fields are there. Remade commits may get a new "committer", making your rewrite obvious: be wary!

Similarly, commits have an "author date" and a "commit date". We want to preserve the "author date", as it is displayed and shows our project timeline, but we should probably be fine updating the "commit date" to current time.
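
All four fields can be printed with git log format placeholders. A small sketch in a scratch repository, where the two identities and the date are invented for the demo:

```shell
# Scratch repo; the identities and the date below are placeholders
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "committer-name"
git config user.email "committer@example.com"
git config commit.gpgsign false

echo hi > file.txt && git add file.txt
# Author and author date are given explicitly,
# committer and commit date come from the config and the current time
git commit -q -m "demo" \
  --author="author-name <author@example.com>" \
  --date="2020-01-02 03:04:05 +0200"

# %an/%ae/%ai = author name/email/date, %cn/%ce/%ci = committer name/email/date
git log -n 1 --format='author:    %an <%ae> %ai%ncommitter: %cn <%ce> %ci'

author_line="$(git log -n 1 --format='%an <%ae> %ai')"
committer_line="$(git log -n 1 --format='%cn <%ce>')"
```

Running the same git log line on a real repository is a quick way to spot commits where the two identities or the two dates differ.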

So, without further ado, let's list our needs and dive into possible solutions.

Desiderata

  • replace authors: must be able to change commit authors
  • preserve original time: while the "commit date" can change, the "author date" shouldn't
  • fine-grained editing: allow changing only some commits
  • commits signature: keep commits signed, if signed
  • automatic: there should be no need for manual tweaking while rewriting
  • fast: ideally this shouldn't take hours or days to complete

Mailmap

The simplest and built-in way to do this is with git's mailmap. Just create a .mailmap file in your repository root directory, with one line for every identity to replace.

For example, this will show all commits by alemidev (with or without mail) as the full alemi with mail:

alemi <me@alemi.dev> alemidev <me@alemi.dev>
alemi <me@alemi.dev> alemidev <>

Once this file is in the repository root, git will use it to replace old authors in the output of commands such as git shortlog and git blame, and in recent versions git log too.

This method doesn't rewrite history; rather, it registers "an alias" that git applies when displaying commits. This is by far the cleanest and fastest method (as it doesn't really rewrite anything), but also the worst one for the same reason: the old name and email are still stored in every commit, and anyone can read them. If you're trying to hide sensitive information, this isn't useful for you, but if you just want to stop showing an old dead email, go for .mailmap.
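
The difference between what is displayed and what is stored is easy to check. A minimal sketch in a scratch repository, where old@example.com is a made-up stand-in for the address you want to stop showing:

```shell
# Scratch repo; old@example.com is a placeholder for the address to hide
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "alemidev"
git config user.email "old@example.com"
git config commit.gpgsign false

echo hi > file.txt && git add file.txt && git commit -q -m "demo"

# Map the old identity to the new one
printf '%s\n' 'alemi <me@alemi.dev> alemidev <old@example.com>' > .mailmap

# %aN/%aE respect .mailmap, %an/%ae show what is really stored in the commit
mapped="$(git log -n 1 --format='%aN <%aE>')"
stored="$(git log -n 1 --format='%an <%ae>')"
echo "displayed: $mapped"
echo "stored:    $stored"
```

The stored line still carries the old identity, which is exactly why mailmap doesn't help with sensitive information.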

summary

  • replace authors: only in what git displays, history isn't rewritten
  • preserve original time
  • fine-grained editing: literal matching via .mailmap file
  • commits signature: as commits are not rewritten, signatures are unchanged
  • automatic
  • fast

Rebase + Amend

This method is the simplest for very small repos and will give the best results, but as repository complexity increases it quickly becomes excessively hard to make work.

Basically this starts from the first commit and replays every commit, amending each one to reset the author to your current .gitconfig values and to set the dates back to the original author date (as they would otherwise get updated to now).

This works extremely well for small repos with a single uncomplicated branch where all commits are yours. If this isn't the case, avoid this method: it will rewrite other contributors' commits too, and it won't handle merges automatically, requiring manual intervention.

$ git rebase -r --root --exec 'env GIT_COMMITTER_DATE="$(git log -n 1 --format=%aD)" git commit --amend --no-edit --reset-author --no-verify --date="$(git log -n 1 --format=%aD)"'

Since this is a big command, let's break it down:

  • git rebase will start a new rebase
  • -r tells git to attempt to keep branching structure while rebasing
  • --root tells git to rebase from this branch root, allowing us to redo the first commit
  • --exec will run given command for each commit being rebased:
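    • env GIT_COMMITTER_DATE="$(git log -n 1 --format=%aD)" reads the author date (%aD) of the commit being amended from the log and sets it as the committer date through an env var [1], so the rewritten commit isn't stamped with the current time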
    • git commit --amend amends this commit, so we can change its metadata
    • --no-edit keeps the commit message as it is, without opening an editor
    • --reset-author we want to reset this commit's author to our configured values
    • --no-verify skips the pre-commit and commit-msg hooks, which don't need to run again for a metadata-only change (this flag has nothing to do with signatures: new ones are put in place only if your git is configured to sign)
    • --date="$(git log -n 1 --format=%aD)" we want to keep the author date as the original value (which we get from git log), otherwise it would get updated to current time

1: git commit has --date for the author date but, if I remember correctly, there is no command line argument to set the committer date, so we need to use an env var

This is taken and adapted from a Stack Overflow answer; you may be interested in reading the original context for more tips.
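
Before pointing this at a real project, it may be worth trying it on a throwaway repository and comparing the results. The sketch below assumes a scratch repo with two commits made under a made-up old identity (old-name, old@example.com) and placeholder dates, then runs the command above (with -q added to keep it quiet) and checks that the author dates survived:

```shell
# Scratch repo; the old identity and the dates are placeholders
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config commit.gpgsign false

# Two commits made under the old identity, with fixed author dates
git config user.name "old-name"
git config user.email "old@example.com"
echo one > file.txt && git add file.txt
git commit -q -m "first" --date="2020-01-02 03:04:05 +0200"
echo two >> file.txt
git commit -q -am "second" --date="2021-06-07 08:09:10 +0200"
dates_before="$(git log --format=%ai)"

# Switch to the identity we want, then replay everything
git config user.name "alemi"
git config user.email "me@alemi.dev"
git rebase -q -r --root --exec 'env GIT_COMMITTER_DATE="$(git log -n 1 --format=%aD)" git commit -q --amend --no-edit --reset-author --no-verify --date="$(git log -n 1 --format=%aD)"'

# Author dates should be unchanged, authors should all be the new identity
dates_after="$(git log --format=%ai)"
authors_after="$(git log --format='%an <%ae>' | LC_ALL=C sort -u)"
git log --format='%h %an <%ae> %ai %s'
```

If the two date listings match and only the new identity shows up, the same command should behave the same way on a simple single-branch repo.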

summary

  • replace authors
  • preserve original time
  • fine-grained editing: will overwrite everything unconditionally
  • ! commits signature: if your git is configured to sign, every commit will be signed again, including those that weren't signed originally, and the new signatures will differ from the original ones
  • automatic: merges are not handled automatically
  • fast: as every commit is actually redone, this takes longer than the mailmap method

Git Filter Repo

If neither of the above methods works for you, a dedicated tool exists for precise in-depth repository rewriting: git-filter-repo.

git filter-repo is likely in your distro's repositories, but if it isn't, it can be run as a standalone python script (more here)

This tool allows writing arbitrary python callbacks, which are run for each commit and can filter and rewrite it. For example, this command will rewrite all commits whose author name starts with alemi but whose mail isn't me@alemi.dev:

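$ git filter-repo --commit-callback '
  # names and emails are bytes objects, hence the b"..." literals
  if commit.author_name.startswith(b"alemi") and commit.author_email != b"me@alemi.dev":
    old_name = commit.author_name.decode(errors="replace")
    old_email = commit.author_email.decode(errors="replace")
    print(f"changing commit by {old_name} <{old_email}>")
    commit.author_email = b"me@alemi.dev"
    commit.author_name = b"alemi"
    # the committer is overwritten too, whoever it was
    commit.committer_email = b"me@alemi.dev"
    commit.committer_name = b"alemi"
'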

The sky is the limit! Be sure to check its man pages if you plan to do more complex rewriting.

Note that filter-repo can also work with mailmap files: just run git filter-repo --mailmap .mailmap to rewrite your repo history applying the mailmap changes.
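
Whichever rewrite you pick, it helps to list the identities stored in the repository before and after, to check that nothing was missed. A small sketch using only plain git log, demonstrated on a scratch repository where old@example.com is a made-up stand-in for the address you want gone:

```shell
# Scratch repo with one commit under an old identity and one under the new
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config commit.gpgsign false
git config user.name "alemidev"
git config user.email "old@example.com"   # placeholder address
echo one > file.txt && git add file.txt && git commit -q -m "first"
git config user.name "alemi"
git config user.email "me@alemi.dev"
echo two >> file.txt && git commit -q -am "second"

# Every distinct author and committer identity stored in any branch or tag
# (%an/%ae/%cn/%ce ignore .mailmap, so this shows the real stored values)
identities="$(git log --all --format='%an <%ae>%n%cn <%ce>' | LC_ALL=C sort -u)"
echo "$identities"

# Number of commits still carrying the old address as author or committer
leftovers="$(git log --all --format='%ae %ce' | grep -c 'old@example.com')"
echo "commits still using the old address: $leftovers"
```

On this demo the count is 1 because nothing has been rewritten yet; after a successful rewrite of a real repository it should drop to 0.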

summary

  • replace authors
  • preserve original time
  • fine-grained editing
  • commits signature: because every commit hash changes, the old signatures would no longer be valid, so filter-repo just removes commit and tag signatures
  • automatic
  • fast: filter-repo is made to be fast and efficient

Conclusions

Just include as little personally identifying information as possible in your git stuff, and honestly in your online presence in general. You'll thank me later! c:

More seriously, git is awesome, but we are often put off by its complexity. It's easy to get started with git, but when stuff won't merge it's easier to delete and clone again than to understand what happened. Gaining more insight into how git works under the hood allows you to use it more efficiently: git is your best friend!