Version control according to myself
===================================

* what do we want to record ?

All SCM systems are able to record the insertion and deletion of lines of code. But most users of those systems have noticed that this is not enough. For example, moving a C function from one C file to another, or even within a single file, cannot be accurately described by just a deletion in one place and an insertion in another. When we do this and then have to merge our line of development with another one in which the implementation of the same function was changed, we get a useless conflict whose sole cause is an implementation "detail".

We thus have a 3rd essential operation to take into account, and that was enough for me to start thinking about the problem of version control. But looking at darcs I realized it was just one more special case: renaming a token (something that's supported by darcs) is another one. And we can think of many more. I hereby start an incomplete list, to be extended.

- code insertion [2]
- code removal
- code movement
- scoped token renaming [1]
- reordering of a function's arguments
- code duplication [3]
- file splits [4]

[1] includes darcs' global token replacement, but also renaming of local variables, OO methods, C statics, and the like, as well as file renaming. This means changing a specific (set of) declaration(s) and the uses of the declared item, restricting the substitution to correct occurrences only, providing a way by which some otherwise-valid occurrences may be left untouched by the transformation, and other complicated stuff. This may be complicated by generated tokens, either generated at build-time using preprocessor concatenation, or at runtime (think dynamic linking through dlopen/dlsym, and lisp-like and other introspective languages).

[2] not only _lines_ of code !

[3] often enough cut'n'paste is used to start similar code using a template

[4] that's mostly a combination of move (of semantic contents) and copy (of header inclusions and the like)

* what's so special about files ?

Files have been taken as the unit of version control by most SCM systems. Many of them actually use existing file-level tools underneath, as exemplified by CVS' use of RCS. This approach went so unquestioned that the simple concept of renaming/moving a version-controlled file has caused much trouble to many implementors, when it was implemented at all. Here again the venerable CVS comes to mind, but many other SCM systems, including commercial ones, suffer from this problem.

Still looking at how to solve the general issue of code moves, I meditated on the way SCCS stores the history of a file. Basically, on first checkin the original file content appears as-is within the SCCS history file, and this content is then chunked into blocks as the need arises, making use of #if-like constructs to make it possible to restore a given version of the file.

While this structure cannot handle moving a block of code from one place to another (since the reading of an SCCS archive and the resolution of those #if's is completely linear), the idea of splitting a block on demand into sub-blocks seems to provide a key to the problem.

If we only store blocks of code, and make each revision of a file a list of the blocks that make it up, we've basically solved the problem of moving a block _within_ a file. But since many sequences of blocks will be common to many revisions, we'll have to handle a hierarchy of blocks. A given function will be seen internally as a block made of a sequence of blocks, each of which is either literal code or a sequence of other blocks.

And then, what about the file ? Well, it can surely be seen as just a block in this hierarchy.

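To make the block hierarchy described above a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the names Literal, Sequence and render are hypothetical and do not come from SCCS or any existing SCM system, and nothing here addresses storage or merging.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Literal:
    """A leaf block holding literal code (hypothetical name)."""
    text: str


@dataclass(frozen=True)
class Sequence:
    """A block made of an ordered sequence of other blocks (hypothetical name)."""
    children: Tuple["Block", ...]


Block = Union[Literal, Sequence]


def render(block: Block) -> str:
    """Flatten a block hierarchy back into plain text."""
    if isinstance(block, Literal):
        return block.text
    return "".join(render(child) for child in block.children)


# A function is a block made of smaller blocks.
func_a = Sequence((
    Literal("int a(void)\n{\n"),
    Literal("    return 1;\n"),
    Literal("}\n"),
))
func_b = Literal("int b(void) { return 2; }\n")

# A file revision is just another block listing its parts.
file_rev1 = Sequence((func_a, func_b))

# Moving func_a below func_b reuses the very same blocks in a new order;
# no code is deleted or re-inserted.
file_rev2 = Sequence((func_b, func_a))
```

In this sketch the second revision refers to the same func_a object as the first, so a move is recorded as a reordering of references rather than as a deletion plus an insertion. Whether a real implementation would identify blocks by object identity, by hash, or by some other key is left open here.
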
And then, seeing the directory as a block made of (named) file blocks, and the project as a block containing files and directories, will allow us to handle cross-file moves of our code items.

As a goodie we can even note that moving a file inside the hierarchy has become exactly like moving a code statement. Renaming a file depends on how a directory gets represented, but I guess it's mostly similar to renaming a function. Similarly, uses of the file's name will have to be kept in sync.

* necessary knowledge of the structure of what we deal with

Many of the features mentioned above suppose that the succession of operations done on the code between 2 checkins gets tracked, probably by the editor program. Otherwise it has to be guessed, which is a costly and not-completely-automatable process.

Many of those features (mostly those in [1]) also need information about the language(s) you're dealing with, or they will not be able to work correctly.

In short, this is really no piece of cake :)

Yann Dirson - 2003-08-06