Version control according to myself
===================================

* what do we want to record ?

All SCM systems are able to record the insertion and deletion of some
lines of code.  But most users of those systems have noticed that this
is not enough.

For example, keeping track of the move of a C function from one C file
to another, or even within a single file, cannot be accurately
described by just a deletion in one place and an insertion in another
place.  When a move is recorded that way, and we have to merge our
line of development with another one in which the implementation of
the same function was changed, we get a useless conflict, whose sole
cause is an implementation "detail".

We thus have a 3rd essential operation to take into account, and it
was enough for me to start thinking about the problem of
version-control.  But looking at darcs I realized it was just one more
special case: renaming a token (something that's supported by darcs)
is another one.  And we can think of many more.

I hereby start an incomplete list, to be extended.

- code insertion [2]
- code removal
- code movement
- scoped token renaming [1]
- reordering of a function's arguments
- code duplication [3]
- file splits [4]

[1] includes darcs' global token replacement, but also renaming of
local variables, OO methods, C statics, and the like, as well as file
renaming.  This means changing a specific (set of) declarations and
the uses of the declared item, restricting the substitution to correct
occurrences only, providing a way by which some otherwise-valid
occurrences may be left untouched by the transformation, and other
complicated stuff.  This may be complicated by generated tokens,
either generated at build-time using preprocessor concatenation, or at
runtime (think dynamic linking through dlopen/dlsym, and lisp-like and
other introspective languages).

[2] not only _lines_ of code !

[3] often enough cut'n'paste is used to start similar code using a
template

[4] that's mostly a combination of move (of semantic contents) and
copy (of headers inclusion and the like)
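To make the list above a bit more concrete, here is a minimal sketch
in Python of what recording a move as a first-class operation could
look like, next to insertion and removal.  All names (Insert, Remove,
Move, apply) are hypothetical and line-based only for brevity; they
are not taken from darcs or any other existing SCM.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Insert:
    at: int                  # index at which the new lines are inserted
    lines: Tuple[str, ...]


@dataclass(frozen=True)
class Remove:
    at: int                  # index of the first removed line
    count: int


@dataclass(frozen=True)
class Move:
    src: int                 # index of the first moved line
    count: int
    dst: int                 # index in the sequence once the lines are taken out


Change = Union[Insert, Remove, Move]


def apply(lines: List[str], change: Change) -> List[str]:
    """Return a new list of lines with one recorded change applied."""
    result = list(lines)
    if isinstance(change, Insert):
        result[change.at:change.at] = change.lines
    elif isinstance(change, Remove):
        del result[change.at:change.at + change.count]
    elif isinstance(change, Move):
        moved = result[change.src:change.src + change.count]
        del result[change.src:change.src + change.count]
        result[change.dst:change.dst] = moved
    else:
        raise TypeError("unknown change: %r" % (change,))
    return result


# The move is stored as one operation, not as a Remove plus an Insert.
print(apply(["a", "b", "c", "d"], Move(src=0, count=2, dst=2)))
```

Since the history holds a single Move instead of an unrelated
deletion and insertion, a merge tool would at least know that the
moved lines are the same code, which is the information lost today.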


* what's so special about files ?

Files have been taken as the unit of version-control by most SCM
systems.  Many of them actually use existing file-level tools
underneath, as exemplified by CVS' use of RCS.  This approach went
unquestioned for so long that the simple concept of renaming/moving a
version-controlled file has caused much trouble to many implementors,
when it was implemented at all - here again the venerable CVS comes to
mind, but many other SCM systems, including commercial ones, suffer
from this problem.

Still looking at how to solve the general issue of code moves, I
meditated on the way SCCS stores the history of a file.  Basically, on
first checkin the original file content appears as-is within the SCCS
history file, and this content is then chunked into blocks as the need
arises, making use of #if-like constructs to make it possible to
restore a given version of the file.
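The storage scheme described above can be sketched as follows.  This
is a deliberately simplified model, not the real SCCS file format: it
assumes a linear history, properly nested blocks, and hypothetical
("I", n) / ("D", n) / ("E", n) markers meaning "inserted by delta n",
"deleted by delta n" and "end of block".

```python
def extract(weave, version):
    """Rebuild the lines of `version` in one linear pass over the weave."""
    out = []
    open_blocks = []                     # stack of ("I" or "D", delta number)
    for kind, arg in weave:
        if kind in ("I", "D"):
            open_blocks.append((kind, arg))
        elif kind == "E":
            open_blocks.pop()            # assumes properly nested blocks
        else:                            # "T": a line of text
            inserted = all(n <= version for k, n in open_blocks if k == "I")
            deleted = any(n <= version for k, n in open_blocks if k == "D")
            if inserted and not deleted:
                out.append(arg)
    return out


# Delta 1 is the first checkin; delta 2 replaces one line.
weave = [
    ("I", 1),
    ("T", "int main(void)"),
    ("T", "{"),
    ("D", 2),
    ("T", "    return 1;"),
    ("E", 2),
    ("I", 2),
    ("T", "    return 0;"),
    ("E", 2),
    ("T", "}"),
    ("E", 1),
]

print(extract(weave, 1))
print(extract(weave, 2))
```

Note how the original content was chunked only where delta 2 needed
it, and how every version is rebuilt by reading the weave top to
bottom.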

Whereas this structure cannot handle moving a block of code
from one place to another (since the reading of an SCCS archive and
the resolution of those #if's is completely linear), the idea of
splitting a block on-demand into sub-blocks seems to provide a key to
the problem.  If we only store blocks of code, and make each revision
of a file be a list of the blocks that make it up, we've basically
solved the problem of moving a block _within_ a file.
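A minimal sketch of that idea, in Python, with a hypothetical flat
block store and made-up block names: each revision is only a list of
block identifiers, so moving code within the file is a reordering of
that list and the block contents are never touched.

```python
# Hypothetical block store: identifier -> literal code.
blocks = {
    "includes": "#include <stdio.h>\n",
    "helper": "static void helper(void) { puts(\"hi\"); }\n",
    "main": "int main(void) { helper(); return 0; }\n",
}

# Each revision of the file is just a list of block identifiers.
rev1 = ["includes", "helper", "main"]
rev2 = ["includes", "main", "helper"]    # helper() moved below main()


def render(revision):
    """Rebuild the file text for a revision from the stored blocks."""
    return "".join(blocks[name] for name in revision)


print(render(rev2))
```

Both revisions refer to the very same "helper" block, so a change made
to that block elsewhere would have an obvious target whatever its
position in the file.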

But since many sequences of blocks will be common to many revisions,
we'll have to handle a hierarchy of blocks.  A given function will be
seen internally as a block made of a sequence of blocks, each of which
is either literal code or a sequence of other blocks.
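Such a hierarchy, together with the on-demand splitting borrowed from
SCCS, might look like the sketch below.  The Store class and its
methods are hypothetical names chosen for illustration; a block is
either a string of literal code or a list of identifiers of other
blocks.

```python
class Store:
    """Hypothetical store of blocks, each literal code or a sequence."""

    def __init__(self):
        self.blocks = {}                 # identifier -> str or list of ids

    def add(self, content):
        block_id = len(self.blocks)
        self.blocks[block_id] = content
        return block_id

    def split(self, block_id, offset):
        """Turn a literal block into a sequence of two sub-blocks."""
        text = self.blocks[block_id]
        if not isinstance(text, str):
            raise TypeError("only literal blocks can be split")
        left = self.add(text[:offset])
        right = self.add(text[offset:])
        self.blocks[block_id] = [left, right]
        return left, right

    def render(self, block_id):
        """Flatten a block and everything below it back into text."""
        block = self.blocks[block_id]
        if isinstance(block, str):
            return block
        return "".join(self.render(child) for child in block)


store = Store()
body = store.add("int x = 1;\nint y = 2;\n")
func = store.add([store.add("void f(void)\n{\n"), body, store.add("}\n")])

before = store.render(func)
store.split(body, len("int x = 1;\n"))   # split only when the need arises
after = store.render(func)
print(before == after)
```

Splitting replaces the literal block by a sequence under the same
identifier, so everything that already referred to it still renders
to the same text.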

And then, what about the file ?  Well, it can surely be seen as just a
block in this hierarchy.  And then, seeing the directory as a block
made of (named) file blocks, and the project as a block containing
files and directories, will allow us to handle cross-files moves of
our code items.

As a goodie we can even note that moving a file inside the hierarchy
has become exactly the same as moving a code statement.  Renaming a
file is dependent on how a directory gets represented, but I guess
it's mostly similar to renaming a function.  Similarly, uses of the
file's name will have to be kept in sync.
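As a rough illustration, assuming directories are represented as
mappings from names to blocks and files as lists of block identifiers
(one possible representation among others, with hypothetical helper
names), a cross-file code move and a file move become two very
similar operations on the same tree:

```python
def lookup(tree, path):
    """Follow a tuple of names down the project tree."""
    node = tree
    for name in path:
        node = node[name]
    return node


def move_code(tree, src_file, dst_file, block_id, index):
    """Move a code block from one file block to another."""
    lookup(tree, src_file).remove(block_id)
    lookup(tree, dst_file).insert(index, block_id)


def move_file(tree, src_dir, name, dst_dir):
    """Move a named file block from one directory block to another."""
    lookup(tree, dst_dir)[name] = lookup(tree, src_dir).pop(name)


# The project is a block containing directories, which contain files.
project = {
    "src": {"a.c": ["f", "g"], "b.c": ["h"]},
    "lib": {},
}

move_code(project, ("src", "a.c"), ("src", "b.c"), "g", 0)
move_file(project, ("src",), "b.c", ("lib",))
print(project)
```

Keeping the uses of a moved file's name in sync is not covered here;
that part needs the language knowledge discussed in the next section.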


* necessary knowledge of the structure of what we deal with

Many of the features mentioned above suppose that the succession of
operations done on the code between 2 checkins gets tracked, probably
by the editor program; otherwise it has to be guessed, which is a
costly and not-completely-automatable process.
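To give an idea of what such guessing involves, here is a naive
heuristic built on Python's standard difflib module: it reports a
move whenever a run of deleted lines reappears verbatim as a run of
inserted lines.  The function name guess_moves is hypothetical, and
the heuristic misses any block that was edited while being moved,
which is precisely why tracking the operations is preferable.

```python
import difflib


def guess_moves(old, new):
    """Return (old_index, new_index, lines) for runs that look moved."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.append((i1, tuple(old[i1:i2])))
        if tag in ("insert", "replace"):
            inserted.append((j1, tuple(new[j1:j2])))
    # A deleted run that shows up unchanged among the inserted runs
    # is guessed to be a move.
    return [(i, j, list(run))
            for i, run in deleted
            for j, other in inserted
            if run == other]


old = ["a", "b", "c", "d", "e"]
new = ["c", "d", "e", "a", "b"]
print(guess_moves(old, new))
```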

Many of those features (mostly those in [1]) also need information
about the language(s) you're dealing with, or they'll not be able to
work correctly.

In short, this is really no piece of cake :)


Yann Dirson <ydirson@altern.org> - 2003-08-06