Sunday, August 1, 2010

CVS vs Git: Local disk usage

I was a bit skeptical about the local disk usage with Git, as every Git clone is a full-fledged repository with complete history, and apparently Git stores entire snapshots and not the deltas.

Everyone seemed to be claiming that Git is quite efficient in terms of the storage space it requires, and I also found the following statistic on the web:
"The Mozilla CVS repository was 2.7GB, imported to Subversion it grew to 8.2GB. Under Git, it shrunk to 450MB. Given that a Mozilla checkout is around 350MB, its fairly nice to have the whole project history (from 1998) in only slightly more space."
Source: http://keithp.com/blog/Repository_Formats_Matter/

But I was still a bit skeptical... :)

So I downloaded some of the JDT source from the Eclipse Git repositories and compared the disk usage under Git with that under CVS. Here are the numbers. For these selected projects, Git takes on average less than three times the space required by CVS. In my opinion, this cost is negligible compared to the benefit of having the entire history locally.

6 comments:

  1. I have not looked into this, so here are just a few guesses:

    1) The files use CVS keywords ($Id$, $Date$, ...). They are free in CVS, but costly in Git.

    2) You have never run `git gc ; git repack -Ad`. This should compact the Git repository after the initial sync. (The Eclipse JGit client is less efficient than the C Git client.)

  2. I'd also be curious what the statistic looks like after a repack. I also think that 3x the space is a totally worthwhile tradeoff, but I'm curious how anybody could have claimed it does the Mozilla repository in only 450 MB given your numbers...

  3. I just cloned the repositories using the EGit client and did not bother to see what it does under the hood, so I do not know whether a repack was done or not...

    @Adrian org.eclipse.jdt.doc.user has quite a few images and takes 5x the space. Projects containing only source code, e.g. the *tests* projects, take about 2x the space. Maybe the Mozilla repository has mostly source code.

  4. Hi,

    I wondered what you are reporting: the size of the workspace including the repository, the workspace only, or the repository only? I cloned two of the projects to figure that out (by comparing my numbers to yours), but my measurements do not come close to yours:

    I separated overhead from project code with this rule. For git: only files in .git directory are repository overhead, everything else is project source code. For cvs: any directory named CVS and files under it are repository overhead, everything else is project source code.

    I used the following Linux commands to count bytes in the project code:
    for cvs: find . -type f | grep -v CVS | xargs wc -m | grep total
    for git: find . -type f | grep -v .git | xargs wc -m | grep total

    I measured TOTAL disk usage in bytes using du:
    du -sb


    I used linux diff to compare checkout/clone directories to find out if there were any project files in one but not the other.


    I also used a recent tag in each project to be certain that I checked out the same revision from each.

    project: org.eclipse.jdt.core.test.model
    tag: v_B01 (last commit: Jul 6, 2010)
    diffs: yes, but very few
    project code size (byte count): 19M
    CVS total bytes: 19.6M, overhead: 0.6M
    Git total bytes: 25.2M, overhead: 6.2M
    CVS total disk usage: 37M
    Git total disk usage: 38M
    total CVS directories: 2100

    project: org.eclipse.jdt.ui.tests.refactoring
    tag: v20100423-0800 (last commit: Apr 22, 2010)
    diffs: none
    project code size (byte count): 1.3M
    CVS total bytes: 4.2M, overhead: 2.9M
    Git total bytes: 7.0M, overhead: 5.7M
    CVS total disk usage: 44M
    Git total disk usage: 30M
    total CVS directories: 4854

    Note that disk usage includes the space consumed by directories as well as by files. One explanation for the inordinate disk usage of CVS may be the number of directories CVS creates on checkout, since every directory and every small file occupies at least one filesystem block.

    So, what sizes are you reporting?
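The separation rule described in this comment (anything under a `.git` or `CVS` directory is overhead, everything else is project code) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the commands the commenter ran: the function name `split_usage` and its `overhead_dirs` parameter are made up for this example. It counts bytes with `os.path.getsize`, whereas `wc -m` counts characters, and it matches whole directory names, whereas the pattern in `grep -v .git` also matches paths such as `legit.txt`.

```python
import os


def split_usage(root, overhead_dirs=(".git", "CVS")):
    """Return (source_bytes, overhead_bytes) for a checkout or clone at root.

    A file counts as overhead when any directory on its path below root is
    named exactly like one of overhead_dirs (hypothetical parameter).
    """
    source = 0
    overhead = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        in_overhead = any(part in overhead_dirs for part in parts)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so link targets are not counted
            size = os.path.getsize(path)
            if in_overhead:
                overhead += size
            else:
                source += size
    return source, overhead
```

Calling `split_usage(".")` inside a clone or checkout returns apparent file sizes, comparable to the byte counts above rather than to the block-based `du` totals.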

  5. Remember that you aren't comparing apples to apples, but apples to oranges. With CVS you don't have the full history locally, so you aren't comparing the two on equal terms.

    The other advantage of Git that you didn't mention is speed. It is BLAZINGLY fast with pushes, checkouts, and pulls. Branching is a breeze as well. Disk space is one point of comparison, but Git's other advantages outweigh some of the initial negatives.

  6. @John Thanks for looking at the numbers this closely!

    My goal was just to see how much total disk space was consumed for the projects I work on, hence I just measured the size of the entire project directory which includes the source files and the overhead for CVS/Git (the generated class files are not included).

    I guess the best thing to capture would be: (Git Overhead - CVS Overhead) / (Project Source Size). I mean, for a project of size 10 MB, it hardly matters whether the overhead is 10 KB or 100 KB.

    @David I just wanted to make sure that the disk usage does not increase by say 100x ;-)
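As a rough illustration of the ratio proposed in the last comment, here is a minimal Python sketch that plugs in the byte counts reported in comment 4 for the two projects measured there. The function name `relative_extra_overhead` is hypothetical, and the figures are the rounded megabyte values from that comment, so the results are approximate.

```python
def relative_extra_overhead(git_overhead_mb, cvs_overhead_mb, source_mb):
    """(Git overhead - CVS overhead) / project source size, all in MB."""
    return (git_overhead_mb - cvs_overhead_mb) / source_mb


# (Git overhead, CVS overhead, project code size) in MB, from comment 4.
projects = {
    "org.eclipse.jdt.core.test.model": (6.2, 0.6, 19.0),
    "org.eclipse.jdt.ui.tests.refactoring": (5.7, 2.9, 1.3),
}

for name, (git_oh, cvs_oh, src) in projects.items():
    ratio = relative_extra_overhead(git_oh, cvs_oh, src)
    print(f"{name}: {ratio:.2f}")
```

With these inputs the ratio is about 0.29 for the first project and about 2.15 for the second: Git's extra overhead is under a third of the source size in one case and roughly twice the source size in the other.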
