The default behavior and example usage of legacy systems such as Subversion and Perforce both show a single server and repository where all projects live. By default users check out everything, and over time it is understandable how everything comes to be required to build. Users have to actively police the boundaries or put guards in place to keep things separated, which rarely happens. Also, because users only have the most recent copy of the repository, they pay no penalty for checking in binaries or large files. It doesn't help that these systems provide no incentive to keep those binaries in specific locations.
The above two behaviors result in a disaster when migrating to Git, because all files and their history have to be pulled into a single repository. The repository can easily be tens of gigabytes, and commands like git status can take many seconds to run. This is why a common reaction is to say that the way to make Git usable is to split the one large Git repository into many smaller repositories.
Bloat and lack of organization
Photo by Iván Santiesteban (license)
Like a small-town bookstore owner who never intended to create stacks of books everywhere, this is a state that nobody strives for. Cleaning it up and removing the bloat is simply something that should be done, whether the plan is to move to a single Git repository, to move to many smaller repositories, or even to stay on Subversion or Perforce.
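As a starting point for that cleanup, it helps to know where the bloat actually is. The following is a minimal sketch, not a definitive tool: it walks a checked-out working copy and lists the largest files. The function name `largest_files` and its thresholds are assumptions made for illustration, and it only inspects the current checkout, not the history that a migration to Git would also pull in.

```python
import os


def largest_files(root, limit=10, min_bytes=1_000_000):
    """Return up to `limit` (size, path) pairs for files of at least
    `min_bytes` under `root`, largest first."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata directories so only project
        # content is counted (pruning in place stops os.walk descending).
        dirnames[:] = [d for d in dirnames if d not in {".git", ".svn"}]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, including dangling ones
            size = os.path.getsize(path)
            if size >= min_bytes:
                found.append((size, path))
    found.sort(reverse=True)
    return found[:limit]
```

Calling `largest_files(".")` from the top of a checkout gives a quick list of candidates to move to a dedicated location or remove.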
While a legacy repository might require everything to build, that is usually historical and not desired. Just a few of the problems that arise from interdependencies:
- The "code, build, test" cycle grows from seconds to minutes, hours or even days.
- Projects reaching into each other in ways they really shouldn't, causing significant long-term technical debt.
- Bad APIs: because the API between projects could be changed at any time, developers put little effort into making decent APIs to begin with, which results in higher API churn and more build breakage.
- Common shared libraries that are unowned and in horrible shape from everyone pushing in their required features and changing behavior without understanding the consequences.
- With everyone committing to the main repository, no build bots, and everything required to build, developers might rarely update, resulting in more build breakage. When the repository rarely builds, the overall health of a project is hard to determine and release planning becomes more difficult.
- The entire repository and all of its projects can only be released together, at a pace that is slower and more complex than any smaller project would want.
- Anything that is shared, from libraries, scripts and tools to even the release itself, might not get the care it needs for lack of stewardship.
Announcing that the repository will be split up is one way to force the projects apart, because they can't depend on what isn't there. Alternatively, all of the projects can still live in one repository while developers start using partial checkouts containing only the projects they need to work on. This forces the projects apart without requiring multiple repositories. Subversion, Perforce and Git all support partial checkouts. Besides having one repository, what other value does this provide over splitting up the repository?
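To make the partial checkout idea concrete, here is a minimal sketch that works out which top-level directories a developer needs for one project. It assumes a hand-maintained dependency map; the `DEPS` table, the project names and the `checkout_set` function are all hypothetical. The result is printed as a `git sparse-checkout set` command, which is one way Git expresses a partial checkout.

```python
def checkout_set(project, deps):
    """Return the project plus everything it transitively depends on, sorted."""
    needed = set()
    stack = [project]
    while stack:
        current = stack.pop()
        if current in needed:
            continue  # already visited; also guards against cycles
        needed.add(current)
        stack.extend(deps.get(current, ()))
    return sorted(needed)


# Hypothetical layout: top-level directory -> directories it depends on.
DEPS = {
    "app": ["libnet", "libui"],
    "libnet": ["libcore"],
    "libui": ["libcore"],
    "tools": [],
    "libcore": [],
}

paths = checkout_set("app", DEPS)
print("git sparse-checkout set " + " ".join(paths))
# prints: git sparse-checkout set app libcore libnet libui
```

Anything outside the computed set, such as `tools` here, is simply absent from the working copy, so the project cannot quietly start depending on it.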
Build dependency and versioning
When a project is being built it needs to resolve its dependencies. With multiple repositories it is hard to get by without specifying versions. With a single repository, where all of the dependencies are checked out with the project, no version needs to be supplied. This ability is both powerful and limiting.
First off, what is limiting about it?
- If a library wants to change its API, it must update everyone that depends upon it, or their builds would immediately break.
- A project cannot depend upon old versions of its dependencies.
And what is powerful about it?
- An API change can be made atomically with the change that migrates all clients to the new API.
- A project can only ever be built against one version of its dependencies, so problems such as a project having two dependencies that depend upon different versions of a third dependency disappear entirely.
- When you have a problem you only have to report one revision, not one per project or dependency.
- You can bisect the source across projects and dependencies without worrying about which versions of the dependencies were built and deployed together.
- When a bug is fixed in a static library, it is easy to find all deployed binaries older than the fix that need to be rebuilt.
- All tools that deal with history, such as log and the already mentioned bisect and commit, work for the project and its dependencies together.
- Projects by default immediately get improvements (and regressions) from upstream without having to pull them manually.
- Regressions introduced by dependencies are caught much quicker.
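The problem of two dependencies wanting different versions of a third can be sketched as follows. This is an illustration under stated assumptions, not how any particular build tool works: the `PINS` table, the project names and the version strings are hypothetical, and stand in for the version pins that a multiple-repository setup would have to carry.

```python
from collections import defaultdict


def find_version_conflicts(pins):
    """Return {dependency: {version: [requesters]}} for every dependency
    that is requested at more than one version."""
    requested = defaultdict(dict)
    for requester, wanted in pins.items():
        for dependency, version in wanted.items():
            requested[dependency].setdefault(version, []).append(requester)
    return {dep: versions for dep, versions in requested.items()
            if len(versions) > 1}


# Hypothetical pins: each project names the version of each dependency it wants.
PINS = {
    "app": {"libnet": "2.1", "libui": "1.4"},
    "libnet": {"libcore": "3.0"},
    "libui": {"libcore": "2.7"},
}

conflicts = find_version_conflicts(PINS)
print(conflicts)
# prints: {'libcore': {'3.0': ['libnet'], '2.7': ['libui']}}
```

In a single repository there is no such table to check: every project builds against whatever is in the same revision, so this class of conflict has nothing to arise from.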
When migrating to Git from a system like Subversion or Perforce, it is easy to conclude that the path forward is to create many repositories that each contain only part of the legacy repository. The behaviors of the legacy systems have often resulted in repositories that are bloated, lacking in organization and full of intertwined code. Solving these problems doesn't require that the repository be split up, and it is worth exploring the benefits that a single repository has to offer. Whether you go with many small repositories or with one large one with partial checkouts, the first step is to clean up the legacy repository, because that has to be done before either outcome can be reached.