Revision Control with Arch: Introduction to Arch
Arch quickly is becoming one of the most powerful tools in the free software developer's collection. This is the first in a series of three articles that teaches basic use of Arch for distributed development, to manage shared archives and script automated systems around Arch projects.
This article shows you how to get code from a public Arch archive, contribute changesets upstream and make a local branch of a project for disconnected use. In addition, it provides techniques to improve performance of both local and remote archives.
Revision control is the business of change management within a project. The ability to examine work done on a project, compare development paths and replicate and undo changes is a fundamental part of free software development. With so many potential contributors and such rapid release of changes, the tools developers use to manipulate these changes have had to evolve quickly.
Early revision control was handled with tape backups. Old versions of a project would be dragged out of backup archives and compared line by line with the new copy. The process of restoring a backup from tape is not quick, so this is not an efficient method by any means.
To work around this lag, many developers kept old copies of files around for comparison, and this was soon integrated into early development tools. File-based revision control, such as that used by the Emacs editor, uses numbered backup files so you can compare foo.c~7~ with foo.c~8~ to see what changed. Versioned backup files even were integrated into the filesystem on some early proprietary operating systems.
For nearly two decades, the preferred format for third-party contributions to free software projects has been a patch file, sometimes called a diff. Given two files, the diff program generates a listing that highlights the differences between them. To apply the changes specified in the diff output, a user need only run it through the patch program.
In the 1990s, the Concurrent Versions System (CVS) became the default for managing the changes of a core group of developers. CVS stores a list of patches along with attribution information and a changelog. A primitive system of branching and merging allows users to experiment with various lines of development and then fold successful efforts back into the main project.
CVS has its limitations, and they are becoming a burden for many projects. First, it does not store any metadata changes, such as the permissions of a file or the renaming of a file. In addition, check-ins are not grouped together in a set, making it difficult to examine a change that spanned multiple files and directories. Finally, nearly all operations on a remote CVS repository require that a new connection be opened to the server, making it difficult for disconnected use.
Efforts such as the Subversion Project have come a long way toward fixing the flaws found in CVS. Subversion is effectively a CVS++, and it supports file metadata change logging and atomic check-ins. What it still requires is a centralized server on the network that all clients connect to for revision management operations.
A new generation of revision control systems has sprung up in the past few years, all operating on a distributed model. Distributed revision control systems do away with a single centralized repository in favor of a peer-to-peer architecture. Each developer keeps a repository, and the tools allow easy manipulation of changes between systems over the network.
Projects such as Monotone, DARCS and Arch are finding popularity in a world where free software development happens outside of well-connected universities, and laptops are much more common.
One of the most promising distributed systems today is GNU Arch. Arch handles disconnected use by encouraging users to create archives on their local machines, and it provides powerful tools for manipulating projects between archives. Arch lacks any sort of dedicated server process and uses a portable subset of filesystem operations to manipulate the archive. Archives are simply directories that can be made available over the network using your preferred remote filesystem protocol. In addition, Arch supports archive access over HTTP, FTP and SFTP.
One advantage to not having a dedicated dæmon is that no new code is given privilege on your server machine. Thus, your security concerns are with your SSH dæmon or Web server, which most system administrators already are keeping tabs on.
Another advantage is that for most tasks no root privilege is needed to make use of Arch. Developers can begin using it on their own machines and publish archives without even installing Arch on the Web server machine. This affects the pattern of adoption as well. Using CVS or Subversion is a top-down decision made for an entire project team, although Arch can be adopted by one or two developers at a time until everyone in the group is up to speed.
Arch was originally a set of shell scripts and wrappers around Tom Lord's hackerlabs libraries. The name of the program in those days was larch, and it was more than a little clumsy to use. The client now has been entirely rewritten in C and is called tla, which stands for Tom Lord's Arch. The interface is still not perfect, but it is good enough for regular use by a skilled developer. Packages of tla are available for most GNU/Linux distributions (see the on-line Resources).
Once you have tla installed, it's good to test it by checking out some code. Arch stores your data in a directory known as an archive. Within the archive, data is organized into nested categories: projects (the name of the work as a whole), branches (a particular thread of development or other descriptive term) and versions (a simple numerical indicator you can use to indicate how far a specific branch has progressed).
The first step to getting some code is to register a public archive so that Arch associates a name with the archive location:
$ tla register-archive http://www.lnx-bbc.org/arch
You should now see the lnx-bbc-devel@zork.net--gar archive listed when you run tla archives. If you're curious about what projects are stored in there, you can use the tla abrowse command to get a full list:
$ tla abrowse lnx-bbc-devel@zork.net--gar lnx-bbc-devel@zork.net--gar lnx-bbc lnx-bbc--research lnx-bbc--research--0.0 base-0 .. patch-10 lnx-bbc--stable lnx-bbc--stable--2.1 base-0 .. patch-29 scripts scripts--gargoyle-bin scripts--gargoyle-bin--1.0 base-0 .. patch-7
This listing tells us that the lnx-bbc-devel@zork.net--gar archive has two projects, lnx-bbc and scripts. The lnx-bbc project has two branches, research and stable. The lnx-bbc--research branch has only one version (0.0) and that version has had ten changes recorded in the archive. The lnx-bbc--stable branch has only one version (2.1) with 29 changesets.
Because you now have the LNX-BBC public archive registered in your local listing, you can check out a copy of the LNX-BBC stable branch:
$ tla get \ lnx-bbc-devel@zork.net--gar/lnx-bbc--stable lnxbbc
Once it finishes downloading and applying patchsets, you should have a directory named lnxbbc/ that is full of files. To simulate a change in the code, cd into lnxbbc/ and edit robots.txt to add a new comment somewhere.
Now that you have made a change, running tla what-changed should print M robots.txt to indicate that robots.txt has been modified. To get the details of the change, you can run tla what-changed --diffs, which should print out a diff file ready to be sent back to the project's development group:
--- orig/robots.txt +++ mod/robots.txt @@ -1,3 +1,5 @@ +# Welcome, robots! + User-agent: * Disallow: /garchive/ Disallow: /cgi-bin/
The drawback to this is that the diff does not indicate metadata changes. Moved files will not be listed, and new files will not be created when another developer runs this diff through patch. In order to submit a more complicated change to the project maintainers, you must generate a changeset.
In Arch, a changeset is represented as a directory tree full of bookkeeping files, patches, new files and removed files. The best contribution technique is to create a changeset directory and then tar it up for delivery:
$ tla changes -o ,,new-robot-comment $ tar czvf my-changes.tar.gz ,,new-robot-comment/
Arch ignores files beginning with two commas, an equal sign and a few other special characters. By using a ,, at the start of our changeset directory name, we avoid the annoyance of Arch complaining that our new directory doesn't exist in the archive. It is probably good practice to use your e-mail address or some other identifier in the tarball filename and changeset directory name.
Now and then you'll want to download the latest changes to the project. This is as simple as running tla update from inside the checked-out copy.
Arch first runs tla undo to set aside your local changes before applying new changesets. Once all the patches have been applied, it runs tla redo to re-apply your local changes.
All of the tla commands introduced above require a functioning network connection to the lnx-bbc.org system that hosts the archive. For disconnected use, you need to create a local archive and then make a branch within it.
Before you can begin working in a read-write archive, you must identify yourself to tla:
$ tla my-id "J. Random Hacker <jrh@zork.net>"
Once you have entered your e-mail address, it is time to create an archive for your projects. Arch lets you make many archives, but you can keep as many projects and branches as you like in the same archive.
Archive names have two parts, separated by two hyphens: the first is your e-mail address, and the second is some identifier. Many people like to use the four-digit year as the identifier and roll over to a new archive each year:
$ tla make-archive -l jrh@zork.net--2004 ~/ARCHIVE $ tla my-default-archive jrh@zork.net--2004
The my-default-archive command makes certain operations on the local archive easier to type.
Arch encourages developers to fork and merge projects using branches. Branches are the primary mechanism for moving code from one archive to another, even over a network. You can use a branch for a complete code fork to pursue an entirely new line of development, or you can use a branch to cache a copy of a project on your laptop so that you can work for a while in an environment that lacks network access.
Published branches are also the primary development communications mechanism for developers who use Arch. Instead of mailing large changeset tarballs or patch files around, a contributor most likely would set up a branch to make local changes and then invite the upstream developers to merge those changes back into the main project. This is where the decentralized and democratic nature of Arch's design shines. Any developer can join the development effort without needing special privilege in the core team's archive.
Before you can branch the lnx-bbc project, you have to set up a space for the project in your archive. The format for a project identifier is similar to that of the archive name: the category (or project name), two dashes, the branch name, two dashes and the version number. It is most likely Tom Lord's experience as a LISP hacker that informed his decision to use these dashes:
$ tla archive-setup lnx-bbc--robot-branch--0.0
This creates a category called lnx-bbc, a branch called robot-branch and a version called 0.0. You did not need to specify jrh@zork.net--2004/ in front of the project name because that is your default archive.
Finally, it is time to tag off the branch from the remote archive. This means the robot-branch begins as a tag pointing to a particular revision of a project in the lnx-bbc-devel@zork.net--gar archive, and all local changes start from that point:
$ tla tag \ lnx-bbc-devel@zork.net--gar/lnx-bbc--stable--2.1 \ lnx-bbc--robot-branch--0.0
At this point, running tla abrowse should show your default archive as follows:
jrh@zork.net--2004 lnx-bbc lnx-bbc--robot-branch lnx-bbc--robot-branch--0.0 base-0
You are now ready to check out a copy of your new branch:
$ tla get lnx-bbc--robot-branch robot-branch
At this point, you can go into the robot-branch directory and make some changes:
$ chmod 444 index.txt $ tla mv faq.txt robofaq.txt $ echo "ROBOT TIME" > robot-time $ tla add robot-time $ tla rm ports.txt
The tla mv command renames a file in such a way that Arch keeps track of the change. It is important to use this command in place of the standard mv. The tla add command prepares a new file to be inserted into the archive, and tla rm schedules removal of a file.
All of these changes can be checked in to your local branch now:
$ tla commit
Your preferred text editor (as specified in the $EDITOR environment variable) will be started up with a template for your check-in log. Once you have filled out the log entry, saving and exiting finalizes the commit.
Now running tla abrowse shows that you have two revisions of the robot branch in the archive, base-0 and patch-1:
jrh@zork.net--2004 lnx-bbc lnx-bbc--robot-branch lnx-bbc--robot-branch--0.0 base-0 .. patch-1
Of course, while you work on your branch, development may have continued on the original archive. Running tla update fetches changes only from your local branch and not the original project. To fold in changes from upstream, you need to star-merge:
$ tla star-merge \ lnx-bbc-devel@zork.net--gar/lnx-bbc--stable--2.1
In the event of conflicts (situations where both your branch and the upstream project have changes to the same lines of code), Arch uses the standard patch method of creating .orig and .rej files for each file that has conflicts. It is a good idea to use the find utility to seek out any rejects before committing your star-merge.
You may have noticed that revisions are named either base-0 or patch-#, where # is the number of patches to base-0 that must be applied. Arch uses a log-structured archive format, so that archive operations only ever add information to a project. This means that for big projects with many revisions, it can take a long time for certain tasks.
To speed up operations, you can make a snapshot of a given revision. Arch snapshots are simply a compressed tarball of a checked-out revision. When a checkout or other operation is performed, Arch looks for the highest-numbered snapshot and applies any necessary patches from there:
$ tla cacherev
Once this is finished, you can run tla cachedrevs to see what revisions have snapshots within your archive:
lnx-bbc--robot-branch--0.0--base-0 lnx-bbc--robot-branch--0.0--patch-1
Because you do not always have access to create snapshots in an archive, it can be useful to make a local cache to speed up file operations. Arch provides a second kind of cache, called a library, that stores copies of checked-out files from various revisions. This is especially helpful for remote archives, because it means you do not even need to download the base snapshot revision before applying changesets:
$ mkdir ~/LIBRARY $ tla my-revision-library ~/LIBRARY $ tla library-config --greedy ~/LIBRARY $ tla library-add \ lnx-bbc-devel@zork.net--gar/lnx-bbc--stable--2.1
This library is not small, with the example above comprising over 78MB as of June 2004. The advantage over a slow link, however, is well worth the trouble. In addition, laptops often have slow ATA hard drives, and involved archive operations can be a drag as the drivers use up plenty of CPU cycles. A greedy (auto-updating) Arch library can make your revision control operations quicker and more responsive, even for local archives.
In the next article in this series, you'll learn how to make publicly available mirrors so that upstream developers can star-merge back from your branches. In addition, you'll learn how to cherry-pick changesets from a busy branch and how to use GnuPG to sign your changesets cryptographically for security purposes.
The third and final installment of this series will describe centralized development techniques with Arch. You'll learn how to manage a shared access archive using OpenSSH's SFTP protocol and how to write scripts to perform automated tasks on your archives.
Resources for this article: /article/7752.
Nick Moffitt is a Linux professional living in the San Francisco Bay Area. He is the build engineer for the LNX-BBC Bootable Business Card distribution of GNU/Linux and the author of the GAR build system. When not hacking, he studies the history of urban public transportation.