Journal Articles

CVu Journal Vol 27, #4 - September 2015 + Programming Topics

Title: Refactoring Guided by Duplo

Author: Martin Moene

Date: Fri, 11 September 2015 06:58:59 +01:00

Summary: Thaddaeus Frogley gets to grips with duplicated code.

Body: 

Reducing the amount of duplication within a code base can be a good proxy metric for improving the code base. Reducing duplication reduces the total amount of code, which in turn reduces executable size and compile time, and makes the code base easier to understand and easier to modify. Smaller code bases have been shown to have fewer bugs than larger code bases: defect counts go up in direct proportion to the number of lines of code.

Duplo is an open source implementation of the technique described in the paper ‘A Language Independent Approach for Detecting Duplicated Code’ [1]. It can be used to quickly identify code duplication, which can lead to refactoring opportunities that improve quality and reduce code size, resulting in an easier to maintain, and more efficient code base.

Getting Duplo

Duplo can be found on SourceForge: http://duplo.sourceforge.net

And Daniel Lidstrom maintains a version on GitHub: https://github.com/dlidstrom/Duplo.git

For the purposes of this article, we will be using my fork: https://github.com/codemonkey-uk/Duplo.git

To download & build (on a unix-like machine):

  git clone https://github.com/codemonkey-uk/Duplo.git
  cd Duplo/
  make

A project file for Microsoft Visual Studio is also included in the repository.

Generating a report

Duplo works from an explicit list of source files. For C++, on a unix-like system that could be generated like so:

  find . | grep -e '\.h$' -e '\.cpp$' > files.txt

For C# you might do it like this:

  find . -iname "*.cs" > files.txt

Or on a Windows based machine you could do it like so:

  dir /s /b /a-d *.cpp *.h > files.txt
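
On a unix-like system, the list can also be built with find alone, which makes it easy to leave out directories you never intend to refactor. The following is only a sketch: the function name list_sources and the default directory name third_party are made up for illustration and are not part of Duplo.

```shell
# list_sources DIR [EXCLUDED_SUBDIR]
# Prints every .h and .cpp file under DIR, one path per line, skipping
# anything under DIR/EXCLUDED_SUBDIR. The default "third_party" is a
# hypothetical directory name used only for illustration.
list_sources() {
  dir=$1
  skip=${2:-third_party}
  # -name patterns are quoted so the shell passes them to find untouched;
  # "! -path" drops every path below the excluded directory.
  find "$dir" -type f \( -name '*.h' -o -name '*.cpp' \) \
    ! -path "$dir/$skip/*" | sort
}
```

Calling list_sources . and redirecting the output to a file produces a list in the one-path-per-line form Duplo expects.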

Unless you have a codebase measured in millions of lines, you probably want to start by analysing your whole codebase. The algorithm used by Duplo scales fairly well (see Table 1).

Performance Measurements

  System                   Files   LOC       Time     Hardware
  3D Game Engine           275     12211     4 sec    3.4 GHz P4
  Quake 2                  266     102740    58 sec   3.4 GHz P4
  Computer Game            5639    754320    34 min   3.4 GHz P4
  Linux Kernel 2.6.11.10   17034   4184356   16 h     3.4 GHz P4

Table 1

Once you have a list of source files, however you generate it, you can run Duplo from the command line. Duplo produces two sets of output: a report containing all the duplicate blocks found, written to a file named on the command line, and a summary listing each file with a count of the duplicate blocks in it, written to stdout. For example:

  ./duplo files.txt report.txt

Since files with no duplication are listed in the summary as ‘nothing found’, and files containing duplications are listed as ‘found N block(s)’, the summary can be easily filtered:

  ./duplo files.txt report.txt | grep -E "[0-9]+ block"

And sorted:

  ./duplo files.txt report.txt | grep -E "[0-9]+ block" | sort -rnk3

But hold on! This sort falls over on file names with spaces. To get around that problem I introduced a colon following the ‘found’ in my fork of the project, so sorting the results of a project that has spaces in its file names or paths still works:

  ./duplo files.txt report.txt | sort -t':' -rnk2
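
To see why the colon helps, the filter and the sort can be combined and tried on a few sample lines. This is a sketch under an assumption: the exact summary format of the fork is taken to be ‘<path> found: N block(s)’, which is inferred from the description above rather than copied from real output. The function name rank_summary is made up for illustration.

```shell
# rank_summary
# Reads a Duplo summary on stdin, keeps only the files that contain
# duplicate blocks, and sorts them with the worst offenders first.
# Assumed line format (not verified against real output):
#   <path> found: <N> block(s)
rank_summary() {
  # Splitting on ':' puts the count at the start of field 2 however many
  # spaces the path contains; -n skips the blank before the number.
  grep -E 'found: [0-9]+ block' | sort -t':' -k2,2 -rn
}
```

Piping the output of ./duplo into rank_summary gives the ranked list directly. A path that itself contains a colon would shift the fields and defeat the sort.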

Examining the summary for files with a lot of duplication can provide some quick and easy wins. Refactoring a file with high internal duplication can be easier to reason about, but be warned: some algorithms, such as those found in lexer/parser code, can contain high levels of local duplication that is actually very hard to remove. Feel free to remove files that produce false positives from the source list and re-run the tool to update the reports.
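
One way to drop known false positives without editing the source list by hand is to keep them in a second file and filter against it. This is a minimal sketch; the function name filter_list and the idea of a separate exclusion file are assumptions made for illustration, not Duplo features.

```shell
# filter_list LIST EXCLUDE
# Prints the lines of LIST that do not appear verbatim in EXCLUDE.
#   -F  treat each exclusion as a fixed string, not a regular expression
#   -x  match whole lines only, so "a.cpp" does not also drop "data.cpp"
#   -v  invert the match: keep the lines that are NOT excluded
#   -f  read the exclusions from a file, one per line
filter_list() {
  grep -v -x -F -f "$2" "$1"
}
```

The filtered output can be redirected to a new list file and passed to Duplo in place of the original. Paths in the exclusion file must be spelled exactly as they appear in the source list.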

Another source of false positives (arguably) is the block of preprocessor include or import directives found at the top of source files. Duplo supports filtering them out automatically with the -ip command-line option:

  ./duplo -ip files.txt report.txt

Duplo also supports reducing the amount of duplication reported by increasing the minimum number of lines of similarity (-ml), and minimum number of characters on a line (-mc) for a line to be counted. Both can be useful in helping target the worst offending sections of the codebase.

The default minimum lines of similarity is 4. This can generate a lot of false positives on code bases where function arguments in both body and declaration are written out one per line. Raising the minimum, for example to 8, cuts down on these:

  ./duplo -ml 8 files.txt report.txt
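
When it is not obvious which threshold suits a code base, it can help to run Duplo at several -ml values and compare how many files still report duplicates. The following is a sketch only: the function name sweep_min_lines and the report-ml<N>.txt naming scheme are made up for illustration, and the summary is assumed to contain ‘N block(s)’ for each file with duplicates, as described earlier.

```shell
# sweep_min_lines DUPLO LIST ML...
# Runs DUPLO once per -ml value and prints "<ml> <files with duplicates>".
# Each run writes its full report to report-ml<N>.txt in the current
# directory (a naming scheme made up for this sketch).
sweep_min_lines() {
  duplo=$1
  list=$2
  shift 2
  for ml in "$@"; do
    # grep -c counts the matching summary lines; it exits non-zero when
    # the count is 0, hence the "|| true".
    count=$("$duplo" -ml "$ml" "$list" "report-ml$ml.txt" |
      grep -c -E '[0-9]+ block') || true
    printf '%s %s\n' "$ml" "$count"
  done
}
```

For example, sweep_min_lines ./duplo files.txt 4 6 8 prints one line per threshold, which shows where raising the minimum stops making much difference.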

Refactoring

One of the simplest forms of duplication to eliminate is whole unit duplication: when there are multiple copies of the same method or type existing in the codebase under different names. These can usually be safely eliminated by simply removing all but one, and updating the references to the others in dependent code.

As with any code change, take care especially of any static or global state the code being refactored touches – either directly or indirectly. Having the codebase under test is good advice here, but should be considered as an extra layer of security, not a replacement for actually thinking about the problem and fully understanding what the code does, and how it’s used.

When duplication is found within the body of multiple methods of the same class, Extract Method or Extract Class can be used. Where the duplicated functionality is pure (lacking in side effects, not modifying external state) this is trivial, but be careful when extracting methods that appear to do the same thing and are logically the same, but have side effects that modify different state.

When duplication like this is found across multiple classes, extracting the method to a common base class, or introducing a new class to the hierarchy, can be effective. But again, be wary of state modification.

Ultimately, however, each piece of duplication identified in a Duplo report has to be considered on a case-by-case basis, and each refactoring judged on its individual merits. Reducing duplication is a good rule of thumb, but it is always a proxy for some other metric of improvement, be it making the code easier to work with or reducing the size of the resulting executable. Always keep the end goal in mind: the creation of working software.

Reference

[1] http://www.iam.unibe.ch/~scg/Archive/Papers/Duca99bCodeDuplication.pdf
