Monday, March 21, 2016

Migrating to Git and dealing with large repositories

When migrating source code to Git from a legacy revision control system, the default conversion often results in a single enormous repository.  Faced with this large, messy repository, many argue that splitting it up is the only solution.  Yet there is a disconnect, because companies like Google and Facebook seem to have success with a single large repository.  Whether or not the repository is split, cleaning it up and removing the interdependencies is required.  Some companies keep this slimmed-down repository containing all projects and use partial checkouts, because a single repository provides other benefits around versioning.  It is worth evaluating those benefits before jumping straight into splitting a repository up.

Default behaviors

The default behavior and example usages of legacy systems such as Subversion and Perforce both show creating a single server and repository where all projects live.  By default users check out everything, and over time it is understandable how everything can become required to build.  Users have to actively police the boundaries or put guards in place to keep things separated, which rarely happens.  Also, because users only have the most recent copy of the repository, they pay no penalty for checking in binaries or large files.  It doesn't help that these systems provide no incentive to keep those binaries in specific locations.

These two behaviors result in a disaster when trying to migrate to Git, because all files and their history need to get pulled into a single repository.  The repository can easily be tens of gigabytes, and commands like git status can take many seconds to run.  This is why a common reaction is to say that the only way to make Git usable is to split the one large Git repository into many smaller repositories.

Bloat and lack of organization

[Image: messy bookstore.  Photo by Iván Santiesteban.]

The tree structure of a legacy repository is rarely clean.  Build scripts check binaries into the repository inside the src/Debug directory as their last step.  ISO images of compilers, editors, and other software that a developer might use are dumped anywhere and everywhere.  All sorts of generated files are sprinkled about.  While the organization might be divided into four sub-organizations, there might be a single src directory and only one build script.  This makes exporting the repository to Git an overly difficult task.

Like a small-town bookstore owner who didn't intend to create stacks of books everywhere, nobody strives for this state.  Cleaning it up and removing the bloat is simply something that should be done, whether the plan is moving to a single Git repository, going to many smaller repositories, or even staying on Subversion or Perforce.
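
As a starting point for that cleanup, it can help to list what is bloating the tree.  The following is a minimal sketch, not a complete tool: it scans a checked-out working tree (not the history) and reports files that are either large or have a suspicious extension.  The size limit, the suffix list and the function names are assumptions made up for this illustration.

```python
import os
import tempfile

# Hypothetical thresholds and suffixes, chosen only for illustration.
SIZE_LIMIT_BYTES = 5 * 1024 * 1024
SUSPECT_SUFFIXES = (".iso", ".exe", ".dll", ".o", ".obj", ".zip")


def find_bloat(root, size_limit=SIZE_LIMIT_BYTES, suffixes=SUSPECT_SUFFIXES):
    """Return (relative_path, size) pairs for files that look like bloat."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version control metadata directories.
        dirnames[:] = [d for d in dirnames if d not in (".git", ".svn")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size >= size_limit or name.lower().endswith(suffixes):
                hits.append((os.path.relpath(path, root), size))
    # Largest first, so the worst offenders are at the top.
    return sorted(hits, key=lambda item: item[1], reverse=True)


def demo():
    """Build a tiny throwaway tree and scan it."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "src", "Debug"))
        with open(os.path.join(root, "src", "main.c"), "w") as f:
            f.write("int main(void) { return 0; }\n")
        with open(os.path.join(root, "src", "Debug", "app.exe"), "wb") as f:
            f.write(b"\0" * 2048)
        return find_bloat(root, size_limit=1024)


if __name__ == "__main__":
    for rel_path, size in demo():
        print(f"{size:>10}  {rel_path}")
```

A report like this only covers the latest revision; files that were checked in and later deleted still live in the history and would need a separate pass during the conversion.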


While a legacy repository might require everything in order to build, that is usually historical and not desired.  Just a few of the problems that arise from interdependencies:
  • The "code, build, test" cycle grows from seconds to minutes, hours or even days.
  • Projects reach into each other in ways they really shouldn't, causing significant long-term technical debt.
  • Bad APIs.  Because the API between projects could be changed at any time, developers didn't put effort into making decent APIs to begin with, which results in higher API churn and more build breakage.
  • Common shared libraries are unowned and in horrible shape from everyone pushing in their required features and changing behavior without understanding the consequences.
  • With everyone committing to the main repository, no build bots, and everything required to build, developers might rarely update, resulting in more build breakage.  When the repository rarely builds, the overall health of a project is hard to determine and release planning becomes more difficult.
  • The entire repository and all of its projects can only be released together, at a pace that is slower and more complex than any smaller project would desire.
  • Anything that is shared, from libraries, scripts and tools to the release itself, might not get the care it needs for lack of stewardship.
Interdependencies are something to avoid, but once you have the problem there is no silver bullet to fix it.  It just takes a lot of code reading and manpower.  The end result is that the projects are split up and independent.

Announcing that the repository will be split up is one way to force the projects apart, because they can't depend on what isn't there.  Alternatively, all of the projects can still live in one repository while developers start using partial checkouts containing only the projects they need to work on.  This forces the projects apart without requiring multiple repositories.  Subversion, Perforce and Git all support partial checkouts.  Besides having one repository, what other value does this provide over splitting up the repository?
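
For Git specifically, a partial checkout can be done with the sparse-checkout command (available in Git 2.25 and later).  The sketch below only builds the command lines for one common recipe, so it can be read or printed without touching a real repository.  The repository URL, directory names and the "main" branch name are hypothetical, and the helper functions are made up for this illustration.

```python
import subprocess


def sparse_checkout_commands(repo_url, target_dir, project_dirs, branch="main"):
    """Build the Git commands for a checkout limited to project_dirs."""
    if not project_dirs:
        raise ValueError("at least one project directory is required")
    git_in_target = ["git", "-C", target_dir]
    return [
        # Clone the history but do not populate the working tree yet.
        ["git", "clone", "--no-checkout", repo_url, target_dir],
        # Cone mode selects the working tree contents by directory.
        git_in_target + ["sparse-checkout", "init", "--cone"],
        git_in_target + ["sparse-checkout", "set"] + list(project_dirs),
        # Populate the working tree using the sparse-checkout selection.
        git_in_target + ["checkout", branch],
    ]


def run(commands):
    """Execute the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical repository layout, printed rather than executed.
    commands = sparse_checkout_commands(
        "https://example.com/monorepo.git",
        "monorepo",
        ["projects/editor", "libs/common"],
    )
    for command in commands:
        print(" ".join(command))
```

Because a project can only see the directories it checked out, a build that reaches into another project fails right away, which is the forcing function described above.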

Build dependency and versioning

When a project is being built it needs to resolve its dependencies.  With multiple repositories it is hard to get by without specifying versions.  With a single repository, where all of the dependencies are checked out with the project, no version needs to be supplied.  This ability is both powerful and limiting.

First off, what is limiting about it?
  • If a library wants to change its API, it must update everyone that depends upon it or their builds would immediately break.
  • A project cannot depend upon old versions of its dependencies.
What is powerful about it?
  • An API change can be made atomically with the change that migrates all clients to the new API.
  • A project can only ever be built against one version of its dependencies, so problems such as a project with two dependencies that depend upon different versions of a third dependency disappear entirely.
  • When you have a problem you only have to report one revision, not one per project and dependency.
  • You can bisect the source across a project and its dependencies without worrying about which versions of the dependencies were built and deployed together.
  • When a bug is fixed in a static library, it is easy to find all deployed binaries older than the fix that need to be rebuilt.
  • All tools dealing with history, such as log and the already mentioned bisect and commit, work for the project and its dependencies together.
  • Projects by default immediately get improvements (and regressions) from upstream without having to pull them manually.
  • Regressions introduced by dependencies are caught much quicker.
The single repository using partial checkouts and no versioning has some appealing traits.  It isn't perfect and won't work for everyone, but it is almost the best of both worlds and worth thinking about before splitting up the repository.
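
To make the version conflict mentioned in the list above concrete, here is a small sketch of the check that a multi-repository setup has to perform and a single unversioned repository never needs.  The project names, versions and manifest format are all hypothetical, and the model is simplified: each project has one manifest regardless of its own version.

```python
from collections import defaultdict

# Hypothetical manifests for a multi-repository setup: each project pins
# the versions of the dependencies it was built against.
PINNED = {
    "app": {"libnet": "2.0", "libui": "1.4"},
    "libnet": {"libcore": "1.0"},
    "libui": {"libcore": "2.0"},
    "libcore": {},
}


def find_version_conflicts(root, manifests):
    """Map each dependency requested at several versions to its requesters.

    The result has the shape {dependency: {version: [requesting projects]}}.
    """
    requested = defaultdict(lambda: defaultdict(set))
    seen = set()
    stack = [root]
    while stack:
        project = stack.pop()
        if project in seen:
            continue
        seen.add(project)
        for dep, version in manifests.get(project, {}).items():
            requested[dep][version].add(project)
            stack.append(dep)
    # Keep only the dependencies that were requested at more than one version.
    return {
        dep: {version: sorted(who) for version, who in versions.items()}
        for dep, versions in requested.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    for dep, versions in find_version_conflicts("app", PINNED).items():
        for version, requesters in sorted(versions.items()):
            print(f"{dep} {version} requested by {', '.join(requesters)}")
```

In this made-up graph, app reaches libcore through two libraries that pin different versions of it.  With a single repository there is only the one revision that is checked out, so the check has nothing to find.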


When migrating to Git from a system like Subversion or Perforce, it is easy to conclude that the path forward is to create many repositories that each contain only part of the legacy repository.  The behaviors of the legacy systems have often resulted in repositories that are bloated, lacking in organization and full of intertwined code.  Solving these problems doesn't require that the repository be split up, and it is worth exploring the benefits that a single repository has to provide.  Whether you go with many small repositories or with one large one using partial checkouts, the first step is to clean the legacy repository up, because that has to be done before either outcome can be reached.
