Cleaning up your git history
alemi
2024-08-02 17:25
As writing code becomes more and more accessible, the number of software developers is growing steadily, and more of them start at a younger age. This is great!
But a side effect of starting to work on one's software portfolio earlier is that early choices are often... not very professional. Maybe an old email address you'd rather make disappear, maybe your full legal name thrown in, maybe vulgar commit messages: it has happened to everyone. What to do? Hide the project? Delete git history and re-commit as a single "initial commit"?? Sacrifice an NVIDIA GPU to Torvalds hoping to be blessed with git expertise in your dreams???
None of these was really acceptable to me (except the last one, but it didn't work, oof), so I decided to look into better solutions. Git is such an awesome tool, there must be a perfect solution, right????
The problem
Sorry, there will be compromises. Get used to disappointment.
Git repositories are chains of commits: every new one builds cryptographically on top of previous commits, since each commit's hash covers the hashes of its parents. This makes it impossible to just go and change something in the first commit: you need to redo them all! While a few dozen commits are manageable, hundreds are really hard to fix, and trying to fix thousands manually is plain madness.
Commits store both an "author" and a "committer". What you see is usually the "author", but both fields are there. Remade commits may get a new "committer", making your rewrite obvious: be wary!
Similarly, commits have an "author date" and a "commit date". We want to preserve the "author date", as it is displayed and shows our project timeline, but we should probably be fine updating the "commit date" to the current time.
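To see these fields for yourself, a minimal sketch in a throwaway repository can help. It makes one commit whose author and committer differ (all names, mails and dates are placeholders made up for the demo) and then prints the raw commit object and the same fields through log placeholders.

```shell
set -eu

# Throwaway repository, so nothing here touches a real project.
cd "$(mktemp -d)"
git init -q .

# One commit whose author and committer differ, as can happen after a rewrite.
# All names, mails and dates below are made up for the demo.
echo hello > file.txt
git add file.txt
GIT_AUTHOR_NAME="old-user" GIT_AUTHOR_EMAIL="old@email.com" \
GIT_AUTHOR_DATE="2020-01-01 12:00:00 +0000" \
GIT_COMMITTER_NAME="your-user" GIT_COMMITTER_EMAIL="your@email.com" \
GIT_COMMITTER_DATE="2024-08-02 17:25:00 +0000" \
git -c commit.gpgsign=false commit -q -m "first commit"

# Raw commit object: the "author" and "committer" lines each carry a date.
# A non-root commit would also list its "parent" hashes here.
git cat-file -p HEAD

# The same fields through log placeholders (%a* = author, %c* = committer).
author="$(git log -n 1 --format='%an <%ae>')"
committer="$(git log -n 1 --format='%cn <%ce>')"
author_ts="$(git log -n 1 --format='%at')"
committer_ts="$(git log -n 1 --format='%ct')"
echo "author:    $author ($author_ts)"
echo "committer: $committer ($committer_ts)"
```

Running the last few git log lines inside a real repository (without the setup part) shows which commits already carry a committer or commit date that differs from the author's.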
So, without further ado, let's list our needs and dive into possible solutions.
Desiderata
- replace authors: must be able to change commit authors
- preserve original time: while the "commit date" can change, the "author date" shouldn't
- fine-grained editing: allow changing only some commits
- commits signature: keep commits signed, if signed
- automatic: there should be no need for manual tweaking while rewriting
- fast: ideally this shouldn't take hours or days to complete
Our options
note that this is not a complete list, just what i am aware of! get in touch if you know more or have notes on these!
Mailmap
The simplest and built-in way to do this is with git's mailmap. Just create a .mailmap in your repository root directory with every occurrence to replace.
For example, this will map all commits by old-user (with or without mail) to your-user with mail:
your-user <your@email.com> old-user <old@email.com>
your-user <your@email.com> old-user <>
Once this file is in the repository root, git uses it to replace old authors in the output of commands that display them, such as git shortlog, git blame and (in recent versions) git log.
This method doesn't rewrite history, rather it registers "an alias" that git uses when displaying commits. This is by far the cleanest and fastest method (as it doesn't really rewrite anything), but also the worst method, as it doesn't really rewrite history: the old name and mail are still stored in every commit. If you're trying to hide sensitive information, this isn't useful for you, but if you just want to retire an old dead email, go for .mailmap.
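A quick way to check that a mapping does what you expect is to compare what is stored against what git displays. The sketch below uses a throwaway repository and the placeholder identities from the example above. It relies on the %an/%ae placeholders (stored values), the %aN/%aE placeholders (mailmap applied) and git check-mailmap.

```shell
set -eu

# Throwaway repository with one commit made under the old identity.
cd "$(mktemp -d)"
git init -q .
echo hello > file.txt
git add file.txt
GIT_AUTHOR_NAME="old-user" GIT_AUTHOR_EMAIL="old@email.com" \
GIT_COMMITTER_NAME="old-user" GIT_COMMITTER_EMAIL="old@email.com" \
git -c commit.gpgsign=false commit -q -m "an old commit"

# The mapping from the example above, written to the repository root.
printf '%s\n' 'your-user <your@email.com> old-user <old@email.com>' > .mailmap

raw="$(git log -n 1 --format='%an <%ae>')"     # what is stored in the commit
mapped="$(git log -n 1 --format='%aN <%aE>')"  # what git shows with mailmap applied
checked="$(git check-mailmap 'old-user <old@email.com>')"

echo "stored:  $raw"
echo "shown:   $mapped"
echo "lookup:  $checked"
```

The stored value still contains the old identity, which is exactly the limitation described above.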
summary
- replace authors: this doesn't rewrite history
- preserve original time
- fine-grained editing: literal matching via .mailmap file
- commits signature: as commits are not rewritten, signatures are unchanged
- automatic
- fast
Rebase + Amend
This method is the simplest for very small repos and will give the best results, but as repository complexity increases it quickly becomes excessively hard to get working.
Basically this starts from the first commit and rebases + amends all commits, resetting the author to your current .gitconfig values and setting the dates back to the original author date (as they would otherwise get updated to now).
This works extremely well for small repos with a single uncomplicated branch where all commits are yours. If this isn't the case, avoid this method: it will rewrite other contributors' commits too, and it won't handle merges automatically, requiring manual intervention.
$ git rebase -r --root --exec 'env GIT_COMMITTER_DATE="$(git log -n 1 --format=%aD)" git commit --amend --no-edit --reset-author --no-verify --date="$(git log -n 1 --format=%aD)"'
Since this is a big command, let's break it down:
- git rebase: will start a new rebase
- -r: tells git to attempt to keep the branching structure while rebasing
- --root: tells git to rebase from this branch's root, allowing us to redo the first commit
- --exec: will run the given command after each commit being rebased:
  - env GIT_COMMITTER_DATE="$(git log -n 1 --format=%aD)": reads the author date of the current commit from the log and sets it as the committer date via an env var [1]
  - git commit --amend: amends this commit, so we can change its metadata
  - --no-edit: we don't want to change the commit message
  - --reset-author: we want to reset this commit's author to our configured values
  - --no-verify: skips the pre-commit and commit-msg hooks, which there is no point in re-running on old commits (new signatures are put in place regardless, if git is configured to sign)
  - --date="$(git log -n 1 --format=%aD)": we want to keep the author date as the original value (which we get from git log), otherwise it would get updated to the current time
[1]: if I remember correctly, there is no command line argument to set the committer date, so we need to use an env var
this is taken and adapted from a stack overflow answer, you may be interested in reading the original context for more tips
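Before pointing this at a real project, it may be worth a dry run on a throwaway repository. The sketch below builds a two-commit history under a placeholder old identity, runs the command from this section unchanged, and then checks that the author dates survived and that every commit now carries the configured identity. Names, mails and dates are made up for the demo.

```shell
set -eu

cd "$(mktemp -d)"
git init -q .
# Identity that --reset-author will pick up (placeholder values).
git config user.name "your-user"
git config user.email "your@email.com"
git config commit.gpgsign false

# Two commits made under the old identity, with fixed dates.
for n in 1 2; do
  echo "$n" > "file$n.txt"
  git add "file$n.txt"
  GIT_AUTHOR_NAME="old-user" GIT_AUTHOR_EMAIL="old@email.com" \
  GIT_COMMITTER_NAME="old-user" GIT_COMMITTER_EMAIL="old@email.com" \
  GIT_AUTHOR_DATE="2020-01-0$n 12:00:00 +0000" \
  GIT_COMMITTER_DATE="2020-01-0$n 12:00:00 +0000" \
  git commit -q -m "commit $n"
done

# Author timestamps before the rewrite, newest first.
dates_before="$(git log --format='%at')"

# The command from this section, unchanged.
git rebase -r --root --exec 'env GIT_COMMITTER_DATE="$(git log -n 1 --format=%aD)" git commit --amend --no-edit --reset-author --no-verify --date="$(git log -n 1 --format=%aD)"'

# Author timestamps and distinct author/committer pairs after the rewrite.
dates_after="$(git log --format='%at')"
identities="$(git log --format='%an <%ae> / %cn <%ce>' | sort -u)"
echo "$identities"
```

If dates_before and dates_after match and only the new identity is listed, the rewrite behaved as described. A history with merges or several contributors would need more care, as noted above.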
summary
- replace authors
- preserve original time
- fine-grained editing: will overwrite everything unconditionally
- sign commits (with caveats): commits will all be signed again if your git is configured to do so, but signatures will be different from the original ones and all commits will be signed
- automatic: merges are not handled automatically
- fast: as every commit is actually redone, this takes longer than mailmap method
Git Filter Repo
If neither of the above methods works for you, a dedicated tool exists for precise in-depth repository rewriting: git-filter-repo.
git filter-repo is likely in your distro's repositories, but if not it can be run as a python script (see its install instructions).
This tool allows writing arbitrary python callbacks which will be run for each commit and can do filtering and rewriting. For example, this command will change all commits whose author name starts with your-user but whose mail isn't your@email.com:
$ git filter-repo --commit-callback '
# commit fields are bytes, hence the b"..." literals
if commit.author_name.startswith(b"your-user") and commit.author_email != b"your@email.com":
    print(f"changing commit by {commit.author_name} <{commit.author_email}>")
    # overwrite both author and committer with the new identity
    commit.author_email = b"your@email.com"
    commit.author_name = b"your-user"
    commit.committer_email = b"your@email.com"
    commit.committer_name = b"your-user"
'
sky is the limit! be sure to check its man pages if you plan to do more complex rewriting.
note that filter-repo can also work with mailmap files: just run git filter-repo --mailmap .mailmap to rewrite your repo history applying mailmap changes.
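Whichever rewriting tool you use, it can help to list every identity present in history before and after the rewrite. The sketch below does this with plain git log on a throwaway repository. The commit_as helper and the identities are hypothetical, only there to give the demo something to count.

```shell
set -eu

cd "$(mktemp -d)"
git init -q .

commit_as() {
  # $1 = name, $2 = mail; writes a file and commits it under that identity.
  echo "$1" > "$1.txt"
  git add "$1.txt"
  GIT_AUTHOR_NAME="$1" GIT_AUTHOR_EMAIL="$2" \
  GIT_COMMITTER_NAME="$1" GIT_COMMITTER_EMAIL="$2" \
  git -c commit.gpgsign=false commit -q -m "commit by $1"
}
commit_as old-user old@email.com
commit_as your-user your@email.com

# Commits per author identity, then per committer identity, across all refs.
git log --all --format='%an <%ae>' | sort | uniq -c | sort -rn
git log --all --format='%cn <%ce>' | sort | uniq -c | sort -rn

# Number of author/committer fields still mentioning the old mail.
leftovers="$(git log --all --format='%ae%n%ce' | grep -cF 'old@email.com' || true)"
echo "fields still using the old mail: $leftovers"
```

In a real repository only the three git log lines are needed. After a successful rewrite the leftovers count should drop to 0.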
summary
- replace authors
- preserve original time
- fine-grained editing
- sign commits: because all signatures change, filter-repo will just remove commit and tag signatures
- automatic
- fast: filter-repo is made to be fast and efficient
Bonus: BFG
This is another very convenient cleaning method, but it won't replace commit authors: instead it replaces content at each relevant revision. Say you committed something... angry, how do you get rid of it without changing that commit and completely rebasing on top of it? BFG repo cleaner comes to the rescue! Apart from clearing large objects from your repo (which is super useful), BFG allows removing sensitive data from previous revisions.
Be warned that BFG won't change anything present in the current revision! It's important to commit away whatever must be deleted before running BFG.
Using BFG for this purpose is super simple: first make a plaintext file with the words to be replaced, one per line. Then just:
$ java -jar bfg.jar --replace-text replacements.txt path/to/repo
and done! BFG is super fast and will show you detailed information on what it did.
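As a rough sketch of what such a replacements file can look like: to the best of my knowledge of the BFG documentation, a plain line is replaced with ***REMOVED***, a ==> separator sets a custom replacement, and a regex: prefix turns the line into a regular expression. Treat those details as assumptions and check the BFG docs before relying on them. The words below are placeholders.

```shell
set -eu
cd "$(mktemp -d)"

# One expression per line. Placeholder words, not real data.
# Line 1: literal match, assumed to become ***REMOVED***
# Line 2: literal match with a custom replacement after ==>
# Line 3: regular expression with a custom replacement
cat > replacements.txt <<'EOF'
some-angry-word
old@email.com==>your@email.com
regex:hunter[0-9]+==>password-removed
EOF

# Small sanity check on the file before handing it to BFG.
literal_lines="$(grep -cv '==>' replacements.txt)"
custom_lines="$(grep -c '==>' replacements.txt)"
echo "literal: $literal_lines, with custom replacement: $custom_lines"
```

After BFG has run, its documentation suggests git reflog expire --expire=now --all followed by git gc --prune=now --aggressive, so the old objects actually leave the repository.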
summary
- replace authors: doesn't touch commit authors
- preserve original time
- fine-grained editing
- sign commits: BFG strips all signatures
- automatic
- fast
Wrapping up
Just include as little personal identifying information as possible in your git stuff, honestly in your online presence in general. You'll thank me later! c:
More seriously, git is awesome, yet we are often put off by its complexity. It's easy to get started with git, but when stuff won't merge it's easier to delete and clone again than to understand what happened. Gaining more insight into how git works under the hood lets you use it more efficiently: git is your best friend!