Deploying via Git

Git has arguably swept through web development like a sharknado, becoming the most popular VCS in every tech community I know of: PHP, Ruby, JavaScript (and Node.js), Python, Objective-C… like I said, pretty much every tech community.

Using git for deployment is therefore an obvious choice — and being able to deploy by fetching just the differences means it can be fast and easy.

When deploying in the cloud, one of the things you need to keep in mind is the ephemeral nature of the "machines" you are using. Whether it's degraded instances, or the need to scale up either vertically or horizontally, chances are you'll need to clone your repository more frequently than you might think.

The speed at which you can deploy is especially critical when you are trying to scale to handle unexpected spikes in traffic.

When cloning a git repository — by default — you clone the entire history of the repository. The size therefore depends greatly on the length of the history, and the amount of changes over the lifetime of the repository. Pushing around that much data can take some time.
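As a rough illustration of how history dominates, here is a sketch using a disposable local repository (every path, name, and commit count below is made up for the demo): the working tree stays tiny while .git grows with every commit.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email demo@example.com
git config user.name Demo
# Simulate a long history: 200 commits touching a single small file.
for i in $(seq 1 200); do
  echo "revision $i" > file.txt
  git add file.txt
  git commit -qm "commit $i"
done
du -sh .git      # the history: grows with every commit
du -sh file.txt  # the working tree: stays tiny
```

A full clone of this repository has to transfer all 200 revisions of the file, even though a deployment only needs the latest one.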

Note: The following benchmarks are very rough, and will obviously differ based on a number of factors. They were all run from GitHub to the same Amazon EC2 instance, however.

Project           Clone Time  Size
Zend Framework 2  25s         118MB
Symfony           15s         56MB
Drupal            27s         138MB
CakePHP           13s         48MB
Zend Framework 1  15s         202MB
Ruby on Rails     25s         129MB

These were obtained by running:

$ git clone https://github.com/project/repo.git

To help minimize this, git provides a --depth flag for git clone — this creates what is known as a shallow clone. The manual has this to say about --depth (emphasis mine).

--depth <depth>

Create a shallow clone with a history truncated to the specified number of revisions. A shallow repository has a _number of limitations_ (you cannot clone or fetch from it, nor push from or into it), but is adequate if you are only interested in the recent history of a large project with a long history, and would want to send in fixes as patches.

What we can do is combine this with the --branch flag to check out a specific branch with minimal history, minimizing the size. This however limits us: we can no longer check out tags or arbitrary revisions.

Note: Leaving out --branch will simply check out the tip of the default branch (typically master).
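To see --depth and --branch in action without touching the network, here is a sketch against a throwaway local repository (all names and paths are invented; the file:// URL matters, because git ignores --depth for plain local-path clones):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q origin-repo
cd origin-repo
git config user.email demo@example.com
git config user.name Demo
for i in 1 2 3 4 5; do
  echo "$i" > f.txt
  git add f.txt
  git commit -qm "commit $i"
done
branch=$(git symbolic-ref --short HEAD)  # works whether the default is master or main
cd ..
# file:// is required: plain local paths bypass the shallow machinery entirely.
git clone -q --depth 1 --branch "$branch" "file://$tmp/origin-repo" shallow
git -C shallow rev-list --count HEAD     # only the tip commit is present
```

The five-commit history on the origin side becomes a single reachable commit in the shallow clone.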

Doing this we see the following results:

Project           Clone Time  Difference  Size   Difference
Zend Framework 2  18s         -7s         76MB   -42MB
Symfony           5s          -10s        31MB   -25MB
Drupal            14s         -13s        91MB   -47MB
CakePHP           6s          -7s         21MB   -27MB
Zend Framework 1  10s         -5s         168MB  -34MB
Ruby on Rails     12s         -13s        56MB   -73MB

These were obtained by running:

$ git clone --depth 1 --branch <branch> https://github.com/project/repo.git

These are (mostly) fairly significant differences. So we're done, right? Deploys are faster, and we're all good. Yeah… no.

Shallow clones also bring their own pitfalls.

If you want to switch branches, you must create a fresh shallow clone. This would usually be handled by using symlinks to move between repositories, something like this:

/var/www
/d2340654-master
/a61f576e-feature-foo
/current -> /var/www/d2340654-master

Where /var/www/current is a symlink to the currently deployed code. When you change branches, re-clone and then:

$ cd /var/www
$ rm -f ./current && ln -s ./a61f576e-feature-foo ./current
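One caveat with remove-then-link: there is a brief window with no current symlink at all, during which requests can fail. A sketch of an atomic swap instead, using the rename-over trick (mv -T here is GNU coreutils; the directory names are the same illustrative ones as above):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir d2340654-master a61f576e-feature-foo
ln -s ./d2340654-master ./current
# Build the replacement link under a temporary name, then rename it over
# the old one. rename(2) is atomic, so readers never see a missing link.
ln -s ./a61f576e-feature-foo ./current.new
mv -Tf ./current.new ./current
readlink ./current   # -> ./a61f576e-feature-foo
```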

If you simply want to update the current branch, you would do:

$ git fetch origin
$ git reset --hard origin/<branch>
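End to end, with a throwaway local origin (every name and path below is illustrative), that fetch-and-reset update looks like this:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q origin-repo
git -C origin-repo config user.email demo@example.com
git -C origin-repo config user.name Demo
echo one > origin-repo/f.txt
git -C origin-repo add f.txt
git -C origin-repo commit -qm "first"
branch=$(git -C origin-repo symbolic-ref --short HEAD)
git clone -q --depth 1 "file://$tmp/origin-repo" deploy
# Upstream moves forward...
echo two > origin-repo/f.txt
git -C origin-repo commit -qam "second"
# ...and the deployed shallow clone catches up without re-cloning:
git -C deploy fetch -q origin
git -C deploy reset -q --hard "origin/$branch"
cat deploy/f.txt   # -> two
```

This works because the new upstream commit builds on the commit the shallow clone already has; the next paragraph covers what happens when it does not.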

However, if someone has rebased, this can break the shallow clone, as the only commit it knows about is the one currently checked out. Rebasing replaces that commit on the remote, meaning the local and remote repositories no longer share any common commit from which to compute the difference between the two.

This causes an error that looks like:

fatal: did not find object for shallow

When this happens, you must again create a new shallow clone. Depending on your workflow, this may happen more or less frequently for you.

If you rebase frequently, then I would recommend not using shallow clones: the quick updates will fail more often, causing you to re-clone (albeit shallowly) more often than if you had done a full clone initially. Alternatively, you could experiment with --depth to find a happy medium between how often rebases break the clone and the size of the cloned repo.
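A sketch of that trade-off against a local throwaway repository (the history length of 50 and depth of 10 are arbitrary): a deeper-but-still-shallow clone keeps a cushion of recent commits, so a rebase that rewrites only the last few upstream commits is less likely to orphan your clone.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q origin-repo
git -C origin-repo config user.email demo@example.com
git -C origin-repo config user.name Demo
for i in $(seq 1 50); do
  echo "$i" > origin-repo/f.txt
  git -C origin-repo add f.txt
  git -C origin-repo commit -qm "commit $i"
done
# Keep the last 10 commits instead of just the tip.
git clone -q --depth 10 "file://$tmp/origin-repo" deploy
git -C deploy rev-list --count HEAD   # -> 10
```

The clone is still a small fraction of the full 50-commit history, but any rebase touching only the most recent commits leaves a shared ancestor in place.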

Shallow clones aren't perfect, but they can provide a measurable, and noticeable improvement to deployment times during critical scaling situations.

About Davey Shafik

Davey Shafik is a full-time PHP developer with 12 years' experience in PHP and related technologies. A Community Engineer for Engine Yard, he has written three books (so far!), numerous articles, and spoken at conferences the globe over.

Davey is best known for his books, the Zend PHP 5 Certification Study Guide and PHP Master: Write Cutting Edge Code, and as the originator of PHP Archive (PHAR) for PHP 5.3.