Journal: Overload 151
Title: Do Repeat Yourself
Author: Bob Schmidt
Date: Wed, 05 June 2019 19:18:34 +01:00
Summary: Software developers are well aware of the ‘DRY Principle’. Lucian Radu Teodorescu investigates when this common wisdom does not hold.
Body:
If you are a software developer, chances are that you have heard about the DRY principle: “Don’t repeat yourself” [Hunt99]. Actually, chances are that you’ve heard it multiple times; probably many, many times. If you do a quick Internet search, you see that this phrase is repeated ad nauseam. But how come a mantra that preaches no repetition is repeated – ironically – so many times? Starting from this paradox, this article analyses why sometimes repetition is vital for people and also useful for software development.
The name of the game
Software development is a knowledge acquisition process [Henney19]. It’s not enough to write code for machines to understand; we also need people to be able to understand it and reason about it. It’s mostly a social activity. It’s not enough for your current co-workers to understand your code; future co-workers also need to understand it. Furthermore, even if you understand your code now, you may not be able to do so 6 months from now – that’s how volatile the understanding of code is.
Any fool can write code that a computer can understand. Good programmers write code that humans can understand
~ Martin Fowler
The main bottleneck of software development is the understanding capacity of programmers. If, following Kevlin Henney, we rename the term code into codified knowledge [Henney19], then the fundamental problem is arranging this knowledge in such a way that it allows easy acquisition by humans and easy reasoning on it.
There are many aspects of organizing this knowledge, but for the purpose of this article, we are concerned only about the use of repetition.
Other forms of knowledge representations
Let us take verbal communication as the primary form of interacting with knowledge. First, there is the actual verbal communication, then there is the non-verbal one. The non-verbal communication often repeats the verbal communication; it’s used most of the time to strengthen the message expressed through words.
Looking at the language itself, we find that it’s highly redundant. Some very common examples of redundancy in English include: plural and gender concordance, the third person singular -s, subject-predicate inversion (in the presence of an interrogative word), etc. It seems that humans are better equipped to understand messages with a lot of redundancy. If people find that processing natural language is easier in the presence of redundancy, why would we want to remove redundancy from the software that people are supposed to read?
Let’s go further in our analysis of repetition in discourse. Within rhetoric, repetition is an important strategy for producing emphasis, clarity, amplification or emotional effect. It can be of letters/syllables/sounds, words, clauses or ideas. For example, one can see a lot of repetition in the following speech:
We shall not flag or fail. We shall go on to the end. We shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills. We shall never surrender.
~ Winston Churchill
One can say that the text is just a big repetition of the same idea. How would a DRY fan ‘refactor’ this text? Probably something like the following:
There are a few things we shall do: go on to the end, fight – in France, on the seas, in the air (with growing confidence and growing strength), beaches, landing grounds, fields, streets, hills – go to the end, defend our island (whatever the cost may be); but never surrender, flag or fail.
~ DRY enthusiast
And, because refactoring is an iterative process, after a few shake-ups of the text, we would arrive at:
We shall fight and never surrender.
~ DRY enthusiast
For fun, and for the sake of repetition, let’s have 3 more examples:
O Romeo, Romeo! Wherefore art thou Romeo?
~ William Shakespeare
Becomes:
Hey, Romeo! Wherefore art thou?
And:
I am so happy. I got love, I got work, I got money, friends, and time. And you alive and be home soon.
~ Alice Walker
Becomes:
I’ve got happiness, love, work, money, friends, time, you alive; reaching home soon.
And finally:
Happy families are all alike; every unhappy family is unhappy in its own way.
~ Leo Tolstoy
Becomes:
Happy families are all alike; the others not.
And, because the last quote was from Anna Karenina, I would like to stress one more point. How would an author construct such a complex novel if it didn’t have repetition? How can one construct characters without repeating types of behaviors? How can one distinguish between main characters and other characters without repeating the names of the main characters more? Imagine every name in Anna Karenina written only once, every detail about the world that Tolstoy created appearing only once.
Ignoring its aesthetic value, we can always think of a novel as a knowledge source, and thus similar in some ways to our code. When writing a novel, the author typically wants the readers to understand and remember, if not all, at least some key aspects of the novel. That is exactly what software authors aim for.
In fiction, repetition is useful for:
- emphasising what’s important
- keeping certain aspects fresh in the reader’s memory
- simplifying reading.
The same benefits can apply to repetition in code. Religiously eliminating repetition would remove these benefits as well, making the code harder to read.
Memory and learning
Tell the audience what you’re going to say, say it; then tell them what you’ve said
~ Dale Carnegie
Repetition plays a key role in how our memory works, both in acquiring new memories and in using them. This is extremely important if we want to improve our knowledge acquisition process.
Psychologists and neuroscientists differentiate between long-term memory and short-term memory [Foster09]. For the purpose of this article, we could consider working memory to be a synonym for short-term memory. This long-term and short-term memory would correspond to external storage and CPU registers, in computing parlance.
Since ancient times, the best method to acquire knowledge in long-term memory has been rehearsal:
Repetition is the mother of all learning
~ Ancient proverb
Not only does repetition provide a means to remember facts, it also plays an important role in signalling what’s important and what’s not. This is how we teach our children: we repeat the facts the child needs to learn many times, and we repeat the more important facts more often.
Just like with CPU registers, the most important thing about working memory is that it is very limited. Early models claimed 7 different things (plus or minus 2); recent studies claim that without any grouping tricks, the memory is generally limited to 4 different things. [Mastin]
The other problem with short-term memory is that it easily decays (in the order of seconds). To keep things in short-term memory (e.g., in focus), we need to constantly repeat those things. [Mastin]
Let’s take an example. Let’s assume that we are exploring a new codebase and we have 100 functions of equal importance. We need to find 3 functions that match the given criteria. Without any form of grouping, or repeating what’s essential, we would iterate over the space of functions trying to capture the needed functions. The problem is that after visiting a few functions, our memory is filled with unimportant stuff. We constantly lose focus, and the search becomes hard. If the important information is repeated just enough, and/or if we have some sort of grouping, it would be much easier for us to find what we are looking for.
When repetition is preferable
The reader must have repeatedly seen the downsides of repetition, so repeating them here would not be beneficial (pun intended). Instead, we shall enumerate some of the benefits of repetition:
- Emphasizes important aspects of the code. Indeed, if the readers of the code see that a certain principle/pattern/design choice is applied several times, they can easily reach the conclusion that the principle/pattern/design choice is important. Conversely, if an important decision is not repeated at all, but there are other constructs/patterns repeated, then the importance of the decision can easily slip past the reader.
- Eases learning. Repetition is the mother of all learning.
- Creates coherence. If all items in a group have completely different characteristics, then the group is not coherent at all. To make a group coherent is to give all the elements in the group a certain characteristic. That is, to repeat the characteristic.
- Keeps abstractions at the same level. Refactoring techniques that aim to avoid repetition often make the new abstractions operate at different levels; this is typically bad for reading the code. If we want to keep the code at the same abstraction level, sometimes we need to duplicate some code.
- Efficiency. Sometimes, to achieve maximum efficiency, certain (low-level) code snippets need to be duplicated.
In the following subsections, we offer examples of when repetition applied in programming is good. However, as all design choices have both pros and cons, we also briefly indicate how not to apply the advice over-zealously.
Repetition and code documentation
Code documentation is essentially repetition. It repeats (to a certain degree) what the code is saying, but in a manner that is more understandable by people. We all agree that code documentation is good, therefore, a form of repetition is good.
Then, we have repetition inside the documentation itself. For example, if we have an important architectural decision that we want the readers of the documentation to keep in mind, we should repeat it each time it provides insight into why certain things are designed in a certain way.
People should use repetition inside code documentation to highlight what’s important.
However… don’t overdo it. Avoid documenting things that frequently change. Avoid repeating ad nauseam decisions that are not important.
Repetition in style
It’s often a good idea to have a consistent style. But a consistent style can only be produced by repeating the same stylistic elements, so repetition is essential to a consistent style.
Style can apply to a variety of things: from formatting the code, to the way architectural decisions are made. All of them are important, but I would argue that the latter part is more important than the first one. There are only a few things that can damage understandability more than having a set of incoherent decisions. To come back to the ‘codified knowledge’ interpretation, having inconsistent knowledge is very harmful.
However… don’t overdo it. I’ve seen a lot of time spent in minor formatting style debates. Stylistic unity is good, but that doesn’t mean that we have to burn a developer at the stake when they add spaces in the wrong place. Don’t be dogmatic on this; use tools like clang-format to take the burden off developers.
Repetition in naming
Let’s assume that one is writing code for a system based on the Model-View-Controller pattern. Naming all the model classes with the ‘Model’ suffix, all the view classes with the ‘View’ suffix and all the Controller classes with the ‘Controller’ suffix is generally a good idea. It provides coherence within the 3 groups of classes, and it makes it easier for readers to understand the code. Just by looking at the name of such a class, the reader can have a basic understanding of what the class does, without looking at the details.
Indeed, psychologists would label this naming repetition as a mnemonic system – a learning technique that aids information retention or retrieval in human memory.
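As a small illustration of the suffix convention, here is a minimal sketch. The account domain and every name in it (AccountModel, AccountView, AccountController, deposit, render) are hypothetical and chosen only for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes: the repeated suffix tells the reader each role
// before any of the bodies are read.
struct AccountModel {
  std::string owner;
  long balance_cents;
};

struct AccountView {
  // Renders the model as text; holds no state of its own.
  std::string render(const AccountModel& m) const {
    return m.owner + ": " + std::to_string(m.balance_cents) + " cents";
  }
};

struct AccountController {
  AccountModel& model;
  // Applies a user action to the model.
  void deposit(long cents) { model.balance_cents += cents; }
};
```

A reader who meets OrderModel, OrderView and OrderController later in the same codebase can guess their roles from the names alone, which is the mnemonic effect described above.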
However… don’t overdo it. If mnemonics are good, it doesn’t mean that we should heavily use identifier naming conventions all over the place. Form should never outlive content. For example, Hungarian notation is heavily criticized in modern software literature. [Martin09]
Don’t complicate algorithms to avoid repetition
At the function level, we often don’t encounter pure repetition. Two functions that look very similar can have slight differences. If two functions are 90% the same, we cannot avoid repetition by simply reusing the code. We have to carefully separate the commonalities from the differences.
The main problem is that the common part is too often interleaved with specifics of the two functions we want to collapse. How would we create a common function that can behave differently between the two cases? Often, we add parameters to the common function and pepper its body with if statements. And often the common function becomes more complicated than any of the original functions.
As I’m writing these lines, I can almost hear the Clean Code [Martin09] fans screaming in my ear: you should create new abstraction classes that implement different policies and pass them to your function. This may work in some cases, but my experience so far is that it is seldom a better choice. Two problems with this approach are that we increase the overall complexity of the code (each new abstraction increases complexity) and that it makes the functions hard to follow (the reader may have to jump between different abstractions). But most of the time, a bigger problem arises: to make it work properly, one needs to mix different abstraction levels (see the following subsection); this greatly increases the overall complexity.
Abstractions are best to be created as a result of the design process, not as a by-product of eliminating duplication.
There are a lot of cases in which two functions that are 90% identical should be kept separate. It’s just easier to understand them independently. If you really want people to read them together, you can add a comment explaining that they are linked, and they do almost the same thing.
Take for example the two functions from Listing 1; it’s a scoped down example, but it should be enough to prove our point. The only difference between the two functions is the else break; line. How would one unify the two functions without creating additional if clauses and without adding parameters that reflect implementation details? Would the code be more readable?
template <class II, class OI, class UOp, class P>
OI transform_if(II first1, II last1, OI result, UOp op, P pred) {
  while (first1 != last1) {
    if (pred(*first1)) {
      *result = op(*first1);
      ++result;
    }
    ++first1;
  }
  return result;
}

template <class II, class OI, class UOp, class P>
OI transform_while(II first1, II last1, OI result, UOp op, P pred) {
  while (first1 != last1) {
    if (pred(*first1)) {
      *result = op(*first1);
      ++result;
    }
    else break;
    ++first1;
  }
  return result;
}
Listing 1
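To make the trade-off concrete, here is a minimal sketch of what merging the two functions of Listing 1 might look like. The name transform_filtered, the stop_at_first_mismatch flag and the doubled_evens helper are hypothetical and do not come from the original listings; this is only one of several ways the merge could be done.

```cpp
#include <cassert>
#include <vector>

// Hypothetical merged version of transform_if and transform_while.
// The bool parameter exists only to select an implementation detail.
template <class II, class OI, class UOp, class P>
OI transform_filtered(II first, II last, OI result, UOp op, P pred,
                      bool stop_at_first_mismatch) {
  while (first != last) {
    if (pred(*first)) {
      *result = op(*first);
      ++result;
    } else if (stop_at_first_mismatch) {
      break;  // transform_while behaviour
    }
    ++first;
  }
  return result;
}

// Illustrative helper: doubles the even elements of `in`.
inline std::vector<int> doubled_evens(const std::vector<int>& in, bool stop) {
  std::vector<int> out(in.size());
  auto end = transform_filtered(in.begin(), in.end(), out.begin(),
                                [](int x) { return 2 * x; },
                                [](int x) { return x % 2 == 0; }, stop);
  out.erase(end, out.end());  // drop the unused tail
  return out;
}
```

The duplication is gone, but every call site now carries a true or false whose meaning the reader has to look up, and the loop body has one more branch than either original.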
Similar ideas can also be found (and better presented) in [tef18] and [Metz16]. I think this entire section can be reduced to the following two quotes:
The problem with always using an abstraction is that you’re pre-emptively guessing which parts of the codebase need to change together. “Don’t Repeat Yourself” will lead to a rigid, tightly coupled mess of code. Repeating yourself is the best way to discover which abstractions, if any, you actually need.
~ tef
Duplication is far cheaper than the wrong abstraction
~ Sandi Metz
However… don’t overdo it. Sometimes you can shift the abstractions in such a way in which you can eliminate the duplication; analyze each situation separately and don’t religiously decide to duplicate code or avoid duplication.
Avoid mixing different abstraction levels
Two functions that do the same thing should not be combined if they operate at different abstraction levels or they belong to unrelated modules. It adds a great burden on the developer who needs to keep changing the context to properly understand the code.
For example, summing numbers and summing bank accounts are two completely different things; one should not combine the functions that perform the summation.
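A minimal sketch of this distinction follows. The BankAccount type, its fields and the rule that closed accounts are skipped are assumptions made up for the illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical domain type for illustration.
struct BankAccount {
  long balance_cents;
  bool closed;
};

// Low-level arithmetic: knows nothing about banking.
inline long sum(const std::vector<long>& values) {
  return std::accumulate(values.begin(), values.end(), 0L);
}

// Domain level: the loop looks like a repeat of sum(), but the rule about
// closed accounts (an assumed business rule) belongs here, not in sum().
inline long total_balance(const std::vector<BankAccount>& accounts) {
  long total = 0;
  for (const BankAccount& a : accounts) {
    if (!a.closed) total += a.balance_cents;
  }
  return total;
}
```

The two loops look alike today, yet they are likely to change for different reasons: sum() changes with numeric concerns, total_balance() with banking rules.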
Let us take another example, one that generated a lot of heat in the last couple of months [Aras18] [Niebler18]. We aim to print the first N Pythagorean triples (computed in a naive way). A simple C-style solution to this problem is presented in Listing 2. It uses an imperative, plain C style with one abstraction level.
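#include <cstdio>

// Wrapper added so the fragment compiles; the name print_triples is illustrative.
void print_triples(int n) {
  int i = 0;
  for (int z = 1; ; ++z)
    for (int x = 1; x <= z; ++x)
      for (int y = x; y <= z; ++y)
        if (x*x + y*y == z*z) {
          printf("%d, %d, %d\n", x, y, z);
          if (++i == n)
            return;
        }
}
Listing 2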
With the C++20 ranges feature, Eric Niebler proposes the implementation from Listing 3 (comments stripped out), arguing for more genericity [Niebler18].
template<Semiregular T>
struct maybe_view : view_interface<maybe_view<T>> {
  maybe_view() = default;
  maybe_view(T t) : data_(std::move(t)) {
  }
  T const *begin() const noexcept {
    return data_ ? &*data_ : nullptr;
  }
  T const *end() const noexcept {
    return data_ ? &*data_ + 1 : nullptr;
  }
private:
  optional<T> data_{};
};

inline constexpr auto for_each =
  []<Range R,
     Iterator I = iterator_t<R>,
     IndirectUnaryInvocable<I> Fun>(R&& r, Fun fun)
        requires Range<indirect_result_t<Fun, I>> {
      return std::forward<R>(r)
        | view::transform(std::move(fun))
        | view::join;
  };

inline constexpr auto yield_if =
  []<Semiregular T>(bool b, T x) {
    return b ? maybe_view{std::move(x)}
             : maybe_view<T>{};
  };

using view::iota;
auto triples =
  for_each(iota(1), [](int z) {
    return for_each(iota(1, z+1), [=](int x) {
      return for_each(iota(x, z+1), [=](int y) {
        return yield_if(x*x + y*y == z*z,
          make_tuple(x, y, z));
      });
    });
  });

for(auto triple : triples | view::take(10)) {
  cout << '('
       << get<0>(triple) << ','
       << get<1>(triple) << ','
       << get<2>(triple) << ')' << '\n';
}
Listing 3
I believe that all readers would consider the latter code much harder to read. There are multiple reasons why this second version is more complex, but one of them is too much change in the abstraction level. Let’s analyze this.
The code in Listing 3 mixes imperative style (see return statements), with functional style (see piping operator), with more mathematical abstractions (Semiregular, iota), range-specific abstractions (transform, join, take), range building blocks abstractions (view_interface) and C++ in-depth abstractions (IndirectUnaryInvocable, concepts, move semantics). Too many abstraction levels. If you saw a view::transform, a view::join and a view::take in the same code, it would be fine, even if you type more: all the abstractions are at the same level; but don’t mix the levels too much.
A common side effect of using multiple abstraction levels in the same code is the need for more code to bridge between the abstractions. Having too much plumbing code is a good indication that there are multiple abstraction levels involved. And overall, this will make the understanding of the code much harder.
Besides understandability costs, the ranges solution also has pretty high compilation-time costs, as Aras points out [Aras18].
Related to this, overuse of generics in the name of eliminating duplicates can lead to major pain points. I have had the misfortune of seeing many cases in which templates were used in the name of genericity and of eliminating duplicates, but where the same code written without templates, with all the duplication, would have been far smaller than the templated version.
However… don’t overdo it. Taking the advice in this section too dogmatically would leave you with little or no abstraction. Of course, software without good abstraction is bad software.
Repeating the data
Repetition can happen at the code level, but also at the data level. There are cases in which repeating the data leads to a cleaner design and/or improved efficiency.
Such is the case with multithreaded code. Instead of having multiple threads accessing the same data source, with the possibility of data races and with mutexes (read bottleneck instead of mutex), it’s sometimes much simpler to duplicate the data. If each thread had a copy of the data, then there would be no race conditions when accessing the data, and no need to protect the data access. In this case, synchronizing the data between threads can be done by sending messages from one thread to another (which typically involves other data copies).
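The idea of giving each thread its own copy can be sketched as follows. The function names sum_of_squares and run_on_copies are hypothetical, and for brevity the sketch collects results by joining the threads rather than through a message queue.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Pure computation on whatever data it is handed.
inline long sum_of_squares(const std::vector<int>& data) {
  long total = 0;
  for (int x : data) total += static_cast<long>(x) * x;
  return total;
}

// Hypothetical driver: starts `workers` threads, each owning a private
// copy of the input, and collects one result per thread.
inline std::vector<long> run_on_copies(const std::vector<int>& shared,
                                       std::size_t workers) {
  std::vector<long> results(workers, 0);
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < workers; ++i) {
    // `copy = shared` duplicates the data into the lambda, so no mutex is
    // needed; each thread writes only to its own slot, results[i].
    threads.emplace_back([copy = shared, &results, i] {
      results[i] = sum_of_squares(copy);
    });
  }
  for (std::thread& t : threads) t.join();
  return results;
}
```

The cost is memory and copy time proportional to the number of workers, so whether this beats a shared, mutex-protected structure depends on the data size and should be measured.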
Another case in which data repetition is used is for pure performance reasons. Cache locality is typically important for performance-critical code, and cache locality often involves data copies. The classical example is improving reads from external memory: one can often cache the data in main memory, and then, based on the algorithm, cache it in L2, L1 and CPU registers. In each case, read ‘duplicate it’ for ‘cache it’.
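At the application level, the same idea can be sketched by duplicating one frequently read field into a dense array. The Particle type and the function names below are assumptions made for illustration, and whether the copy actually pays off on a given machine has to be measured.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record: the string member makes each element large, so the
// mass values are spread far apart in memory.
struct Particle {
  std::string name;
  double mass;
};

// Duplicate the one field the hot loop needs into a contiguous array.
inline std::vector<double> copy_masses(const std::vector<Particle>& ps) {
  std::vector<double> masses;
  masses.reserve(ps.size());
  for (const Particle& p : ps) masses.push_back(p.mass);
  return masses;
}

// The hot loop now walks contiguous doubles only.
inline double total_mass(const std::vector<double>& masses) {
  double total = 0.0;
  for (double m : masses) total += m;
  return total;
}
```

The duplicated array goes stale as soon as a Particle changes, so the copy has to be refreshed or the original treated as read-only while the copy is in use.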
However… don’t overdo it. Of course, both of the cases described here should not be applied blindly. One should typically have a good design/measurements before applying the techniques described here.
Conclusions
Hunt and Thomas justify the DRY principle mainly by the need to avoid maintenance work [Hunt99]. But we have argued that writing and maintaining code is not the most important part of a programmer’s job; instead, reading, understanding and reasoning about the code is far more important. And repetition can help with this. Therefore, the DRY principle is not as justified as one would believe. And, again, ironically, it should not be repeated as often.
The purpose of this article was not to convince the reader of how bad the DRY principle is; in general, this can be a good principle. The goal was to draw attention to the fact that applying the principle doctrinally can be harmful. The reader, who is or aspires to be a virtuous programmer, needs to balance the pros and cons when applying this principle. Therefore, it gives me great pleasure to end with Aristotle’s golden rule:
Virtue is the golden mean between two vices, the one of excess and the other of deficiency.
~ Aristotle
References
[Aras18] Aras Pranckevičius (2018), ‘Modern’ C++ Lamentations, http://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/
[Foster09] Jonathan K. Foster (2009), Memory: A Very Short Introduction, Oxford University Press
[Henney19] Kevlin Henney (2019) ‘What do you mean?’, ACCU Conference 2019, https://www.youtube.com/watch?v=ndnvOElnyUg
[Hunt99] Andrew Hunt and David Thomas (1999), The Pragmatic Programmer: From Journeyman to Master, Addison-Wesley Professional
[Martin09] Robert C. Martin ed. (2009), Clean Code: A Handbook of Agile Software Craftsmanship, Pearson Education
[Mastin] Luke Mastin, ‘Short-term (working) memory’, http://www.human-memory.net/types_short.html
[Metz16] Sandi Metz (2016), ‘The wrong abstraction’, https://www.sandimetz.com/blog/2016/1/20/the-wrong-abstraction
[Niebler18] Eric Niebler (2018), ‘Standard Ranges’, http://ericniebler.com/2018/12/05/standard-ranges/
[tef18] tef (2018), ‘Repeat yourself, do more than one thing, and rewrite everything’, https://programmingisterrible.com/post/176657481103/repeat-yourself-do-more-than-one-thing-and
Lucian Radu Teodorescu has a PhD in programming languages and is a Software Architect at Garmin. In his spare time, he is working on his own programming language and he is improving his Chuck Norris debugging skills: staring at the code until all the bugs flee in horror.