CVu Journal Vol 28, #5 - November 2016

Title: Commit Checklist

Author: Martin Moene

Date: 06 November 2016 17:01:15 +00:00

Summary: Chris Oldwood goes through the motions of version control.

Body: 

Once I’ve finished the inner development cycle for a new feature or bug fix – write tests, code, build and run – it’s time to think about committing those changes and therefore publishing them to a wider audience, i.e. my team. With the fun part out of the way, it’s quite easy to rush this last stage and publish a change that falls short in some way, either from a short- or longer-term perspective. At its most disruptive, the code could be broken through lack of proper integration, or a partial (or overzealous) change-set could exclude (or unexpectedly include) files or code that were never intended to be published. On the flip-side, the actual code might be good but the cohesiveness or surrounding documentation may be lacking, making future software archaeology [1] harder than it need be.

VCS tool differences

Although the mental checklist I follow when committing any change is pretty much the same, some significant differences have emerged depending on the version control product I might be using. For the purposes of this article, I’ve tried to talk about ‘publishing’ a change when I mean the act of making it visible to the outside world. In a classic centralised VCS like Subversion the acts of committing and publishing are one and the same (private branches notwithstanding), whereas for a distributed VCS like Git they are distinct steps. The latter (DVCS) brings a number of affordances over the former (CVCS) that are particularly useful at commit time.

Hence the general order in which these steps are done is somewhat dependent on the VCS tool in use. For example, with a CVCS like Subversion where you have (had) no ability to commit to an integration branch locally first, you need to consider how you’ll address the implicit merge that will occur as part of the integration step. In contrast, with Git you can focus on the steps one at a time – get your commit sorted, then integrate others’ work, and finally tidy up before publishing.

In this article I’ve ordered the steps more like the CVCS world (review, integrate, then commit & publish) rather than the DVCS style (review, commit, integrate, then publish). Either way, the checklist I’ll describe is essentially portable; it’s just a little easier when you have more time between getting your house in order and showing it to the world at large.

Preparing the change-set

The first thing to do after hacking around the codebase trying to get the bug fixed or the feature added is to ensure the set of changes that will comprise the eventual commit is ‘clean’. This means that what we eventually commit is what we intended to commit – no unexpected side-effects caused by transient workarounds, accidents or interference from the IDE and tools.

As a starter for ten, the code needs to compile and the tests need to pass. It’s all too easy to make localised changes and run just the handful of tests that you believe are required to verify the change is working. Before committing, we need to take a step back as we switch mind-set from creator to publisher, so, after a deep breath, I run the build script, which should clean the workspace, compile the code and run all the necessary tests to give me the confidence I need to promote the change. It’s important that the build script I use is the same one that my team-mates and the build server will use so that we all agree on what process is used to create the final deliverable.

I make an extra special effort to deep cleanse my workspace before doing the final build because too often in the past I’ve been caught out by stray artefacts left over from development that have somehow managed to taint the change and make it appear to work. The CI server should be working from a fresh workspace and therefore I try to do the same, albeit by cleaning up rather than creating a fresh one. With Git, it’s as easy as git clean -fdx, but with other VCS tools I manually write a script [2] to do it.
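As an illustration, here’s a throwaway-repository sketch of what a deep clean removes (the file names are hypothetical, and git clean -fdx is destructive, so run it somewhere disposable first):

```shell
# Deep-cleaning a Git workspace: everything untracked goes, even ignored files.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.name demo; git config user.email demo@example.com
echo 'int main(void) { return 0; }' > main.c
git add main.c; git commit -q -m "baseline"

mkdir -p build
echo "stray artefact" > build/main.o   # leftover from a local build
echo 'build/' > .gitignore             # untracked itself, so it goes too

git clean -fdx   # -f force, -d recurse into directories, -x include ignored files
ls               # only the tracked file survives
```

The -x flag is the ‘deep’ part: it deletes even the files your ignore list would normally hide from the VCS, which is exactly the detritus that can make a broken change appear to work.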

Reviewing the change-set

With a (hopefully) minimal set of edited files and folders in my working copy, the next step is to verify that only the code I intended to change exists there. During development or manual testing I may have commented out some code, tweaked a default value or adjusted the configuration to get things into a known state for such testing. Or perhaps a tool did something I didn’t realise and I need to back that out because it happened by accident.

Alternatively I may have added a new setting that I defaulted for the tests but which needs applying to other environments too. With the rise of automated refactoring tools and smarter IDEs many of the changes we make might now be done for us. However, such a tool will only go so far: it keeps the code intact after each small edit, but it’s common for a single small refactoring, such as the deletion of code, to unearth further similar refactorings. Static code analysis is usually limited to advising on the current snapshot instead of chasing the turtles all the way down, so we still have to do that bit manually.

In essence the review step may need to be done a number of times as we identify a problem with the change, make a correction, potentially re-build and re-run the tests, and then review the change once more in the context of the revised change-set.

Reviewing the code structure

My reviewing process probably follows that of most other people – initially I look at the set of files that have been added, modified or deleted. Depending on the type of change it might be mostly source code or project files / package configuration. Whilst scanning the list of files, I’m looking for anything that’s obviously out of place, such as a project file change when I haven’t added a new class, or a source file in a namespace that I wouldn’t expect to have changed that might indicate some unintended coupling. Design smells that come out of this structural review usually get logged in my notebook [3] for later consideration (the discipline of staying out of the bigger rabbit holes is a tricky one to master).

The converse is also true – there may be files I’d expect to change if I changed at least one of them. For example, adding a new setting to the DEV configuration probably means I need to do the same for the other environments. Another is the updating of project dependencies, as the version number change usually causes a ripple across a couple of files, such as the package configuration and project file. The same occurs at a higher level too – if a production code project file changes, the associated test ones should too.

One mistake that’s less of an issue with modern VCS tools is accidentally checking in temporary files, because they now tend to support an exclusion list (e.g. .gitignore). Before this feature, you needed to be careful you didn’t accidentally check in some binary artefacts or per-user tool configuration files, as it could cause very odd behaviour for your team-mates. My clean script would ensure they were deleted before the review, thereby side-stepping the issue, but tool upgrades and refactoring meant that the script still needed revisiting now-and-then and so I’d still be on the lookout at this stage for files which have escaped the ignore list.
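For illustration, an exclusion list for a typical project might look something like this (the entries are assumptions for a C-style/.NET codebase, not taken from the article):

```
# .gitignore – build output and per-user tool configuration
bin/
obj/
*.o
*.user
.vs/
```

Entries ending in a slash match whole directories, so the build output trees are excluded wholesale rather than file by file.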

Reviewing the code itself

Naturally the part we are probably most interested in is reviewing the actual code. First we need to ensure that what we think we changed was what we actually changed and that there are no hangovers from spikes, testing, etc. I may also have commented out some code with the intention of deleting it before publication, and so I need to make sure I actually did this, as checking in commented-out code is almost always a mistake.

In the pre-commit review, I’m not looking at the code changes from the perspective of ‘does it work, does the design make sense and does it meet the guidelines’, I’m looking at it to see if the diff makes sense to the future reader. When the software archaeologist is looking back over the revision history they are often looking at the evolution of change rather than each snapshot in isolation. This is why you often hear the advice about not checking in functional and formatting style changes in the same commit – it makes it hard to work out what the former is if it’s buried in the latter. That doesn’t mean I don’t care about the more traditional style of code review, it’s just that I would have done all that before I even get to the commit stage.

I tend to look at diffs in two different ways depending on the change and how much I want to see of the surrounding context. The classic patch format where you only get to see the actual change and the odd line of context either side is great for simple changes where I just expect some in-place editing to have occurred. In large files this avoids all the scrolling around or navigating you might need to do as it usually fits on a single screen. Sadly many diff tools show the patch format highlighted on a per-line basis and so simple character changes can be harder to spot, but they are great for quickly eyeballing change sets you do frequently, such as updating 3rd party package versions.
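Where the diff tool only highlights whole lines, Git’s built-in word-level diff can make a one-character edit jump out; a small throwaway-repository sketch (the file contents are hypothetical):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.name demo; git config user.email demo@example.com
echo "favourite colour" > notes.txt
git add notes.txt; git commit -q -m "baseline"

echo "favourite color" > notes.txt   # a single character removed
wd=$(git diff --word-diff)           # marks the changed word in-line
printf '%s\n' "$wd" | grep 'colour'  # shows: favourite [-colour-]{+color+}
```

The [-old-]{+new+} markers pinpoint the changed word, whereas a per-line view would show two nearly identical lines and leave you to spot the difference.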

The other diff view I use is the more fully featured 2-way (and 3-way for a merge) diff tool that shows the entire file with highlighting around the changes, e.g. KDiff3. I find these tend to make small intra-line changes and block movements a little more visible. When the change is a refactoring, I’m often more interested in seeing all the surrounding code because if a tool is involved it will limit its scope and seeing the change in situ often points out new inconsistencies that have resulted from the surgical precision of the tool.

Whilst I’m happy with the command line for many VCS tasks, diff’ing is one place where the GUI still does it for me. This includes both the structural and code reviews. I’ve found tools like TortoiseSvn (or TortoiseGit) and Git Extensions allow me to quickly execute different actions to clean up a change-set. For example I might need to revert a few mistakes, scan the commit log of others, do the odd blame here-and-there, and finally diff the lot.

Yes, you read that right, I always diff every change I commit. If that commit contains a lot of edits then it’s a lot of files to diff, but I’ve decided I’m not going to compromise on this principle just because it takes time. In fact, the more files in the change-set the more inclined I feel to review it thoroughly, exactly because there is a greater chance of something untoward creeping in. Instead, what it’s taught me is that it’s preferable to work on smaller units of change (which fits the modern development process nicely), but also to find an easier way to quickly review changes. The patch-style diff format presented as a single unified diff pays off handsomely here for boiler-plate changes, as you can very quickly eyeball the entire set and commit it if it turns out to be clean.

Splitting up changes

Even when you are doing continuous integration where you are committing incremental changes every few minutes or hours, it’s still possible that a part of your overall change ends up fixing something tangential to the task at hand. I often try and use different tools or techniques when I can to exercise the codebase and tools in different ways to find those less obvious non-functional bugs. For example, running a build or test script via a relative path often shows up poor assumptions about the relative path to any dependent scripts or tools. Whilst such bugs do not affect the product itself, they are jarring and interrupt your flow.

When these kinds of minor bugs show up, I like to fix them right away as long as they really are trivial. This means that the entire change set will probably include two different logical changes – one for the task I’m doing and the other for the bug I fixed whilst developing it. Logically they are distinct changes and therefore I’d look to commit them separately. However, this extends further than per-file changes: it may be two independent changes within the same source file. Historically you’d either have to manually split these commits out or commit them together and make it clear via the commit message that there were two different motives for the change. Luckily, modern tools like Git make it easy to split a single bunch of edits to one workspace into separate coherent commits.
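With Git, the simplest form of this split is staging and committing each logical change separately (git add -p extends the same idea down to individual hunks within one file); a sketch with hypothetical file names:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.name demo; git config user.email demo@example.com
echo "stub" > feature.txt; echo "stub" > build.sh
git add .; git commit -q -m "baseline"

# One workspace, two unrelated edits...
echo "the new feature" > feature.txt
echo "fixed path assumption" > build.sh

# ...published as two coherent commits, each with its own motive.
git add feature.txt
git commit -q -m "#PRJ-1234 Added the new feature"
git add build.sh
git commit -q -m "Fixed relative path assumption in the build script"

count=$(git rev-list --count HEAD)   # baseline plus the two logical changes
```

Because the staging area sits between the workspace and the history, neither commit drags the other change along with it.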

Although cherry picking changes across branches is generally frowned upon (mind you, so is branching itself these days), this has often been down to commits being overly large and failing to adhere to the Separation of Concerns. By ensuring unrelated changes are kept committed separately you make it much easier to surgically insert a fix from one branch into another should the need ever arise.

Integrating upstream changes

Once I’m happy that the changes I’ve made locally complete the task at hand I then need to see what’s changed in the world around me. Working directly on the main integration branch [4] in small increments ensures that whatever has changed is likely to be quite small and therefore the chance for conflict is minimal – any unexpected merge should virtually always be trivial and handled automatically by the VCS tool.

It’s imperative that I pull the latest upstream changes locally first, so that I can resolve any syntactic and semantic conflicts before pushing back. The former are fairly easy to spot when working with compiled languages, whereas the latter are somewhat trickier and rely on failing tests for the early warning siren. Whilst I could just push my changes and let a gated build inform me of any problems, I prefer to avoid context switching and so proactively integrate so that when I eventually manage to publish I’m pretty certain it won’t come back to me for remediation.
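The pull-resolve-push cycle can be sketched in Git, using a throwaway bare repository to stand in for the team’s central remote (the names and the trunk branch are hypothetical):

```shell
set -e
root=$(mktemp -d)
git init -q --bare "$root/origin.git"

# My clone: publish a baseline.
git clone -q "$root/origin.git" "$root/mine" 2>/dev/null
cd "$root/mine"
git config user.name me; git config user.email me@example.com
echo "base" > base.txt; git add base.txt; git commit -q -m "baseline"
git push -q origin HEAD:refs/heads/trunk

# A team-mate publishes a change while I'm working.
git clone -q -b trunk "$root/origin.git" "$root/theirs"
cd "$root/theirs"
git config user.name mate; git config user.email mate@example.com
echo "theirs" > theirs.txt; git add theirs.txt
git commit -q -m "team-mate's change"
git push -q origin HEAD:trunk

# Back in my clone: commit locally, then integrate upstream BEFORE pushing,
# so any syntactic or semantic conflicts are resolved here, not after publishing.
cd "$root/mine"
echo "mine" > mine.txt; git add mine.txt; git commit -q -m "my change"
git pull -q --rebase origin trunk
subject=$(git log -1 --format=%s)    # my change now sits on top
total=$(git rev-list --count HEAD)   # baseline + theirs + mine
git push -q origin HEAD:trunk        # now safe to publish
```

Rebasing rather than merging keeps my local change on top of the latest upstream state, so what I push is exactly what I just built and tested.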

Integration is one area where the centralised and distributed VCS products differ greatly. When using a centralised VCS, there was always the risk that pulling the latest changes to your workspace would invoke a complex merge which, if messed up, could cause you to corrupt your own changes. This reason alone convinced me to go back to single branch development and keep my commits small and focused. On the few occasions where the prospect of losing my work could have been expensive, I would use the ‘private branch on demand’ feature (e.g. branch from working copy in Subversion) to create a backup in the VCS. Then I’d update to the head and merge my private branch back in, safe in the knowledge that I could repeat it as many times as needed to resolve any conflicts. In the distributed world this kind of behaviour is inherent by the way the tool works (you commit to your private fork first) and so it’s safer by default.

Handling merges

This in turn means you have more flexibility about how you handle any merge conflicts directly on an integration branch. In the DVCS world, you can pull the latest changes and do the bare minimum to commit your changes on top of it. Then you can resolve any syntactic and semantic problems in your own time and commit them separately as fixes to the merge. At this point you then have the choice of whether to leave history intact and publish separate commits (merge + fixes) or squash them into a single change and publish it as if nothing went wrong. For me, the decision of which approach to take depends heavily on whether the break was significant or not and therefore may have significance in the future too.

For example, if someone just renamed a method in a refactoring then I squash as it’s simply a matter of poor timing, whereas if a design change has caused the rework it means my change was based on a different view of the world and so may hide further semantic problems. Ideally every commit to an integration branch should stand on its own two feet (i.e. the build and tests pass) but when you can publish the fix at the same time as the break the observable outcome is essentially the same, and so it feels acceptable to break the rules in the rare case that it happens.
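One Git mechanism for folding such a fix into the commit it repairs before publishing is a fixup commit plus an auto-squashing rebase; a non-interactive sketch (the file names are hypothetical, and this is only safe while the commits are still local):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.name demo; git config user.email demo@example.com
echo "a" > a.txt; git add a.txt; git commit -q -m "Earlier work"
echo "b" > b.txt; git add b.txt; git commit -q -m "Merged upstream changes"

# The merge needed a small correction: record it as a fixup of that commit.
echo "b, corrected" > b.txt
git add b.txt
git commit -q --fixup=HEAD

# Squash the fixup into its target; GIT_SEQUENCE_EDITOR=true accepts the
# generated todo list so no interactive editor appears.
GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash --quiet HEAD~2

count=$(git rev-list --count HEAD)   # the fixup has been folded in
```

Publishing after the squash makes it appear as if nothing went wrong; keeping the fixup as a separate commit instead preserves the history of the break.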

In the CVCS world you do not have this luxury as what you publish to an integration branch is immediately visible, hence I might do a ‘what if?’ merge first to test the water and if it looks dicey spin up a private branch for safety. This way I can guarantee that I’ll only be publishing a consistent change and the private branch contains the gory details for posterity.

Noise reduction

As I mentioned earlier, I like to keep my workspace as clean as possible and therefore before integrating I’ll definitely do a ‘deep clean’ to ensure that the subsequent build that follows the integration is as close to the build machine’s as possible. This means that I’ll ensure all transient files in the workspace are removed, e.g. cached NuGet and Node packages, as I want the best chance possible that any problem is strictly down to the integration itself and not some stray environmental issue caused by detritus.

In essence in a Git based project this boils down to the following one-liner:

call Clean --all && git pull --rebase && call Build

Documenting the changes

With the set of changes reviewed and integrated, the last big hurdle is to document them. When committing, I need to provide a message that tells my future readers something about the change that will enlighten them as to its purpose. The same is true of any comment: the writer needs to avoid saying what the code (and diff) already says and instead focus on the rationale – the why.

It’s not quite as black & white as that, though, because trawling through the revisions trying to find related changes is much better with some tools than others. For example the ‘annotate’ feature (or in its more witch-hunt friendly guise – blame) makes it easier to walk the history of a single file using the code itself for navigation. However, if you’re using a lesser tool or doing some archaeology around a wider change then the commit message can become more significant in tracking down ‘an interesting set of changes’.

Hence it helps if the message is broken into (at least) two sections – a short summary and then a more detailed explanation. The summary should be short, ideally fitting a single line of text, so that when you are scanning the list of recent commits that one line tells you in an instant what it’s about. Curiously there appears to be some disagreement about which tense you should write your message in but I’ve always found the past tense perfectly adequate. Fortunately this two-part format has been brought to prominence by the likes of Git and so it’s becoming the norm.

The start of any commit message I write nearly always contains some kind of reference to the feature I’m working on. For example, if the user stories are being tracked in an enterprise tool like JIRA then I’ll put the ticket number in, e.g. #PRJ-1234. Once again, somewhat fortuitously, this practice has been adopted by other tools like GitHub as a way to link an issue with a commit. Linking the commit in this way means that I probably have less to document because much of the commentary and rationale will already be in another tool.
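Putting those pieces together, a message following this shape might read as follows (the ticket number and the feature described are hypothetical):

```
#PRJ-1234 Added throttling to the bulk import service

Large imports were saturating the connection pool and starving the
interactive requests. Imports above the configured limit are now
queued; the limit lives in the service configuration so that each
environment can be tuned separately.
```

The first line stands alone in a commit listing, while the body carries the rationale that the diff itself cannot.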

Even if the project is not using a formal feature tracking tool, e.g. just a Trello board with simple task descriptions, I’ll still use the card number as a message prefix. [5] This is because committing a feature in small chunks makes the set of related commits harder to find. Sometimes the team will review a change after it’s already been committed and so the reference number makes it much easier to find all the related changes for the feature. If a change doesn’t have a reference then it should be something so trivial as to be easily discernible from the one-line summary alone.

Naturally the body of the commit message will contain further details about why the change was made. If it’s an informal bug fix, it might contain the basic problem; if a new feature then a bit about what support was added; and if it’s tidying up then why the code is now redundant. Sometimes I’ll include other little notes, such as if the commit isn’t standalone for some reason. Whilst it’s great if you can pick any arbitrary revision and expect it to be self-consistent, there are occasionally times when we have to check in partial changes or changes that may not strictly work due to some environmental concerns. In these cases, the extra note helps the reader know that there are dragons lurking and therefore they should look further afield to find where they were eventually slain.

Once again, I personally find a UI advantageous for this part of the task, not least because it normally contains a spell checker to ensure that the basic text has no obvious typos. Whilst it’s not an everyday occurrence searching the commit messages for a particular word, or phrase, it is useful and so it helps enormously if you do not have to compensate for spelling mistakes. A modern commit dialog also tends to be aware of the folders, files and even the source code that has changed and so can provide auto-completion of terms that might be relevant, such as the name of a method that was edited.

One thing I do miss about working with a CVCS is that you can often still edit a commit message after the fact. With a DVCS the message is part of the immutable chain of revisions and therefore attempting to fix it leads you into the kind of murky waters that you really want to avoid. If the changes only exist as local commits, you still have time to make the correction, but once it’s published you pretty much just have to live with it.
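For completeness, while a Git commit is still local its message can be corrected safely; a throwaway-repository sketch (the messages are hypothetical):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.name demo; git config user.email demo@example.com
echo "fix" > parser.c; git add parser.c
git commit -q -m "Fxied the null reference in the parser"

# Safe only because nothing has been pushed yet; after publishing,
# amending would rewrite history that others may already have pulled.
git commit -q --amend -m "Fixed the null reference in the parser"
subject=$(git log -1 --format=%s)
```

Amending replaces the tip commit with a new one, which is why the window for fixing a message closes the moment the commit is published.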

Publishing the change-set

Phew! After all those considerations it’s finally time to unleash my handiwork on the world at large. If I’m on Git then I’ll do the final push, whereas with Subversion or TFS it will have happened when I hit ‘OK’ on the commit dialog. Either way the change is done, so barring any unforeseen weirdness shown up by the build pipeline, it’s time to pick another task and go round the development loop once more.

References

[1] ‘In The Toolbox: Software Archaeology’, Chris Oldwood, C Vu 26-1, http://www.chrisoldwood.com/articles/in-the-toolbox-software-archaeology.html

[2] ‘Cleaning the Workspace’, Chris Oldwood, http://chrisoldwood.blogspot.co.uk/2014/01/cleaning-workspace.html

[3] ‘In The Toolbox: Pen & Paper’, Chris Oldwood, C Vu 25-4, http://www.chrisoldwood.com/articles/in-the-toolbox-pen-and-paper.html

[4] ‘Branching Strategies’, Chris Oldwood, Overload 121, http://www.chrisoldwood.com/articles/branching-strategies.html

[5] ‘In The Toolbox: Feature Tracking’, Chris Oldwood, C Vu 26-3, http://www.chrisoldwood.com/articles/in-the-toolbox-feature-tracking.html
