Title: Code Critique Competition 111
Author: Bob Schmidt
Date: Fri, 04 May 2018 17:23:55 +01:00
Summary: Set and collated by Roger Orr. A book prize is awarded for the best entry.
Please note that participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.
Note: If you would rather not have your critique visible online, please inform me. (Email addresses are not publicly visible.)
Last issue’s code
I’ve written a simple program to print the ten most common words in a text file supplied as the first argument to the program. I’ve tried to make it pretty fast by avoiding copying of strings. Please can you review the code for any problems or improvements.
What would you comment on and why?
Listing 1 contains the code. (Note: if you want to try compiling this on a pre-C++17 compiler you can replace string_view with string and most of the issues with the code remain unchanged.)
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string_view>
#include <unordered_map>
#include <vector>

int main(int argc, char **argv) {
  std::unordered_map<std::string_view, size_t> words;
  std::ifstream ifs{argv[1]};
  std::string ss{
    std::istreambuf_iterator<char>(ifs),
    std::istreambuf_iterator<char>()};
  auto *start = ss.data();
  bool inword{};
  for (auto &ch : ss) {
    bool letter = ('a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z');
    if (inword != letter) {
      if (inword) {
        std::string_view word(start, &ch - start);
        ++words[word];
      } else {
        start = &ch;
      }
      inword = !inword;
    }
  }
  std::map<size_t, std::string_view> m;
  for (auto &entry : words) {
    auto it = m.lower_bound(entry.second);
    if (it != m.begin() || m.empty()) {
      m.insert(it, {entry.second, entry.first});
      if (m.size() > 10) {
        m.erase(m.begin());
      }
    }
  }
  for (auto &entry : m) {
    std::cout << entry.first << ": " << entry.second << '\n';
  }
}

Listing 1
Critiques
Paul Floyd
There’s been some discussion on the mailing list about accessing C++17 compilers. This is indeed fairly difficult.
[Ed: I apologise for the trouble I caused by using C++17 features in this critique.]
For the most part I don’t use Windows for compiling so I haven’t tried any compilers on that platform.
On macOS, the current XCode (9.2, with Apple clang 9.0.0) was able to compile it using -std=c++1z.
On Linux, I build clang and GCC regularly, both the SVN head versions. Early on I gave up trying to build libc++, so I couldn’t compile this code with clang, though I admit I gave up quickly. GCC, given enough options, does the trick. I was also able to build the code on Solaris, again with GCC built from source.
For those of us that like to have more than just a second opinion, I was also able to build it on FreeBSD. It wouldn’t compile with the clang 4.0.0 that is packaged on FreeBSD 11.1, but it compiled OK with clang 7.0.0 built from source.
Wrapping up on the language versions: when compiled with std::string instead of std::string_view I see about a 10% performance degradation. That means string_view is quite a nice improvement – the code is essentially the same (no obscure optimisation tricks) for a non-negligible speed gain.
Getting back to the code.
clang complained about a lack of parentheses in the and-or-and expression for ASCII letters. GCC complained that argc is unused. Nothing too serious.
There is no check that more than one argument has been provided. The dynamic analysis tools that I tried didn’t complain, but this should be fixed for production code.
For my testing with a text file, I used a small file containing some random French and then the text of Tom Sawyer.
I have a big beef with the following code:
bool letter = ('a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z');
Is this an early word counting Brexit? This treats any letters outside the English ASCII range as separators, so “près” is considered as “pr” and “s”. std::isalpha would be better. I did try fiddling with LC_ALL and std::locale, but didn’t manage to get it to work. Unfortunately there is a price to pay for this (and a heavy price if internally all of the strings need to be converted to a Unicode representation). When I switched to using isalpha, the Valgrind callgrind instruction count went up from about 162 million to 181 million.
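[Ed: one caveat if you make that switch – the <cctype> functions have undefined behaviour for negative char values, so the argument is best converted through unsigned char. A minimal sketch:]

#include <cctype>

// Safe wrapper: plain char may be signed, and std::isalpha has
// undefined behaviour for negative values other than EOF.
bool is_letter(char ch) {
  return std::isalpha(static_cast<unsigned char>(ch)) != 0;
}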
There are two possible approaches for splitting the lines into words.
- Look for the letters and assume that everything else is whitespace or punctuation.
- Look for the whitespace and punctuation, and assume that everything else is a letter.
I prefer the second approach as there is a lot less punctuation and whitespace than there are letters. It also makes it easier to include things like apostrophes so that “fo’c’s’le” gets treated as one word and not 4.
I’m not sure if it is intentional, but the m reverse map from word counts to words is a plain map. This means that for any words that have the same word count, only the last one will be recorded.
Moving on to performance. I don’t think that it’s a good idea to just look at some code, then make changes and hope that it will run faster. What is the performance requirement? You need to know how big the problem is. Is the text going to be 1 kbyte, 1 Mbyte, 1 Gbyte or more? And how big is the expected vocabulary in the text? 1,000, 100,000, 10,000,000 words? Is the vocabulary fixed in advance? If it is fixed, then I would suggest using a perfect hash and a lookup table. Does the text have a special distribution (like with medical or legal vocabulary), or does it just have a ‘plain text’ distribution?
My expectation is that the bottleneck will be the time taken to read the input file. Callgrind confirms this: 80% of the time is spent in ostreambuf_iterator::__pad_and_output. That probably means that any code changes will be ‘premature optimisations’ and the most fruitful avenue would be to use something like asynchronous I/O. I would expect that the word count map ‘words’ will have an upper limit in the region of 10 to 20 thousand elements.
I think that it is a waste of time to limit the reverse map of counts to the top 10 words. Even if the input text is large, say 600,000 words (a doorstopper like War and Peace), the largest size of the reverse dictionary will only be around 1100 elements. (The worst case for the ‘words’ map is that it contains entries with counts of 1, 2, 3, … . The sum of these counts is that of an arithmetic progression, n(n+1)/2.) This means that the size of the reverse map is O(√(word count)), and consequently will never be very large.
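[Ed: as a quick check of that arithmetic, n(n+1)/2 = 600,000 gives n ≈ √1,200,000 ≈ 1095, which is indeed ‘around 1100’.]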
If performance really is crucial, then it will probably be faster to replace m with a table, e.g. a std::vector. The vector should be resized to 10, and contain the running top 10 words in descending order. For the map, it takes around 3 comparisons to determine whether to insert the word. With a table you just have to compare with the last element. If it is less than or equal, do nothing. If it is more, do a binary search and insert the new word. I expect that insertions will be very rare.
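[Ed: a minimal sketch of the table Paul describes – my reading of his words, with invented names:]

#include <algorithm>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Running top-10 table kept in descending order of count;
// unused slots simply hold a count of zero.
void update_top10(
    std::vector<std::pair<std::size_t, std::string_view>> &top10,
    std::size_t count, std::string_view word) {
  if (count <= top10.back().first)
    return;  // one comparison rejects most words
  // Binary search for the insertion point in the descending table.
  auto pos = std::lower_bound(
      top10.begin(), top10.end(), count,
      [](const auto &p, std::size_t c) { return p.first > c; });
  top10.insert(pos, {count, word});
  top10.pop_back();  // restore the fixed size of 10
}

// Usage: std::vector<std::pair<std::size_t, std::string_view>> top10(10);
// then call update_top10(top10, entry.second, entry.first) per map entry.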
It would be possible to replace the two maps with a single ‘bimap’. However, I would advise against doing that if performance is important. It may simplify the code, but my experience is that it doesn’t do the performance any favours.
James Holland
I am afraid that I don’t know the finer points of std::string_view and so I have replaced them with std::string. I hope others will be able to provide help and advice on this topic.
The main problem with the student’s code is that should more than one word occur the same number of times within the file, only the last one encountered will be recognised. This is because the map the student is using can only map the number of occurrences to a single word. When inserting a word into the map, any existing word with the same number of occurrences will be overwritten and, therefore, lost. Fortunately, this situation is easily remedied. All that needs to be done is to replace the std::map with an std::multimap. When this is done, all words will be added to the std::multimap and so will be counted. It is interesting to note that, with the amended program, of the words that occur equally often in tenth position, it is not possible to predict which will be printed if their number, plus the ones that occur more frequently, exceeds 10. This is because the order in which the words are stored within the std::unordered_map is not defined. This is the nature of an unordered map! This behaviour may or may not matter for the student’s application. A more sophisticated version of the code could display which words occur equally often.
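[Ed: a minimal sketch of that change applied to the original part 3, keeping the string_view keys of the original rather than James’s std::string:]

#include <cstddef>
#include <map>
#include <string_view>
#include <unordered_map>

// Because a multimap allows duplicate keys, words tied on count
// are no longer silently overwritten.
void top_ten(
    const std::unordered_map<std::string_view, std::size_t> &words,
    std::multimap<std::size_t, std::string_view> &m) {
  for (const auto &entry : words) {
    m.insert({entry.second, entry.first});
    if (m.size() > 10)
      m.erase(m.begin());  // drop the currently smallest count
  }
}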
The student’s code makes use of quite a complicated expression to determine whether a character is an upper or lower case letter. There is a function available to determine this, namely isalpha(). This function returns a non-zero value if the character is an alphabetic character and zero otherwise. It should be used in preference to the student’s code, as there is no guarantee that upper or lower case letters are contiguous.
The student has devised a clever way of separating the words within the file. It would be difficult to find a more efficient method, I suggest. The code is quite difficult to understand, however. I wonder if a simpler method exists, perhaps taking advantage of an existing library. From my brief investigations, it would appear that most libraries that extract words (or tokens) from strings are defined in terms of what constitutes the delimiters between words. In contrast, the student’s code defines what a word is. This makes selecting a suitable algorithm a little difficult. Regex is very flexible and so is a likely candidate. In this case, the separator is anything that is not a series of alphabetic characters. This can be realised in only four lines of code (as shown in the listing below), compared to the student’s 12 (depending on how they are counted). But what about efficiency? My measurements suggest that for my test file the student’s code takes about 6 ms to execute whereas the code using regex takes about 30 ms. I did say the student’s code would be hard to beat. I suspect regex was designed to crack tougher nuts than this and is not particularly fast for simple cases. Let’s now have a look at how the most frequent 10 words are obtained.
Again, the student has developed a bespoke algorithm to extract words from the unordered map and insert them into an std::map. This is a clever design that takes a little study to determine how it works. Some experimentation would be required to assess its efficiency and whether it could be improved. I offer a more direct approach that makes use of the standard library. It may not be as fast as the student’s code but is simpler to understand, I suggest. The idea is to copy the information in the unordered map to an std::vector and then to perform a partial sort of the vector to obtain the 10 most popular words, which are then printed on the screen. It is interesting to note that there are no explicit loops or if statements in the code I offer.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <unordered_map>
#include <vector>
#include <regex>

int main(int argc, char **argv) {
  std::ifstream ifs(argv[1]);
  std::string ss{std::istreambuf_iterator<char>(ifs),
                 std::istreambuf_iterator<char>()};
  std::unordered_map<std::string, size_t> words;
  std::regex separator("[^[:alpha:]]+");
  std::sregex_token_iterator word(ss.cbegin(), ss.cend(), separator, -1);
  std::sregex_token_iterator end;
  std::for_each(word, end, [&words](std::string w){ ++words[w]; });
  std::vector<std::pair<std::string, size_t>> v(words.begin(), words.end());
  const int top = 10;
  std::partial_sort(v.begin(), v.begin() + top, v.end(),
    [](auto a, auto b){ return a.second > b.second; });
  std::for_each(v.begin(), v.begin() + top,
    [](auto p){ std::cout << p.second << ": " << p.first << '\n'; });
}
[Ed: this is problematic with fewer than 10 distinct words.]
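[Ed: one possible guard – not part of James’s entry – is to clamp top to the number of distinct words before the partial sort and the final for_each:]

// 'v' is the vector of (word, count) pairs from James's listing.
const auto top = std::min<std::size_t>(10, v.size());
std::partial_sort(v.begin(), v.begin() + top, v.end(),
  [](auto a, auto b){ return a.second > b.second; });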
In the case where efficiency is important, further work could be carried out to find the most efficient method of performing the various functions of the code. Ensuring strings do not reallocate memory while characters are added to them would, I suspect, be something worth investigating. It is important to time the execution of the code as trying to guess its efficiency is unreliable.
Balog Pal
This is new, starting from code without an attached “it’s broken, does weird things” but instead supposedly working and looking for review. So let’s play this as an actual work-place review, not using the compiler, just our eyes and minds.
First I just look from afar: noticing that all we have is main and all the code is sitting there. I guess that would break most usual style guides and maybe shock some reviewers. Not me. Plenty of the previous CC entries had unnecessary fragmentation, nonsense functions, 1-shot headers, where all the task could be perfectly handled in just main or 1–2 functions tops. I postpone the verdict for later, but at a glance this one is okay to be like that: no repeated stuff that makes extraction mandatory, and the sub-parts do not really form abstractions. I mean they do, but not on the level that is needed to improve readability. We have four parts that are trivial to see. And even if they were not, a single line of comment would do a better job than a function, which would add noise and an extra need to pass the state. So it is not suggested now. It can still easily be done later if a reason emerges, like someone being willing to write a test case for some part.
The other thing I note at a glance is that we have a nice alphabetic list of standard includes but no using namespace std; or even some using declarations. In large projects, the directive is frowned upon for good reason, especially if the project is old and has parts written before the standard. For this one, I don’t see a single reason not to have it and get rid of those ugly std:: prefixes.
Still at a glance, I gather understanding of how the problem is solved and implemented: part 1 reads the file into memory, part 2 isolates the words and builds a map of word/word-count, part 3 selects the top 10 entries with the biggest count into another collection, and the final part outputs that collection. This looks like a sensible plan that is fit as a solution, so we can move on to the details. I just have to note a stray line: the first in main() belongs to part 2 and should be moved down to the next section.
Part 1, where we load the file content into memory. That raises my first concern. The preamble just mentions ‘text file’. On a live review, I start asking for the missing info on the intended input, the execution environment, and the consequences of being unable to provide an answer. The file may be large enough not to fit in the address space, let alone the available memory. And if it fits, but just barely, it may use enough that our further collections will not fit, or lead to poor performance. If we get the answer “yeah, we are on a 64-bit platform with 16G memory and the intended files are a few megs, 100M tops; if we run out of memory it’s operator error”, we can continue. Otherwise I’ll require a different approach where parts 1 and 2 are put together and words are extracted reading small chunks, using string instead of string_view, but the rest may fly. Review aborted and waiting for a second round.
Approving the idea of having the file in memory still leaves us with the lines implementing that. First, we use argv[1] before checking argc, giving UB if launched without an argument. We did not check whether the read was successful; that is rude. And we did not deal with possibly running out of memory. The latter throws an exception that in the current state would cause terminate(). That might be considered the expected behavior, but in my book playing nice means catching the problem and clearly informing the user. The easiest way is to just surround the code in a try block and report either just a generic failure or add what() from the exception too. For the argc and file error cases, we can make an early exit or throw an exception to be reported in the way just described. For the stream we can even turn on .exceptions() instead of the check, though I would stay with the check.
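[Ed: one way to arrange the checks Pal describes – a sketch only; the message texts are invented:]

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <stdexcept>

int main(int argc, char **argv) try {
  if (argc < 2)
    throw std::runtime_error("usage: topwords <file>");
  std::ifstream ifs{argv[1]};
  if (!ifs)
    throw std::runtime_error("cannot open input file");
  // ... the rest of the program as before ...
  return EXIT_SUCCESS;
} catch (const std::exception &e) {
  // also catches, e.g., std::bad_alloc from reading a huge file
  std::cerr << e.what() << '\n';
  return EXIT_FAILURE;
}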
That leaves just one remaining note: why do we use std::string here? I see no good reason, and suggest vector<char> instead, along the guidelines that have that as the default. The following code needs no change for that, just the type of ss.
And one extra question to the client, as I never use streams this way IRL: why did we use istreambuf_iterator instead of istream_iterator, which would be intuitive? Then make decisions upon the answer.
Finally, onward to part 2 (did we really use 400 words on the 2 lines of source in part 1?). My style is to initialize bool with = false instead of {}. start does not need an initialization value, as it holds meaning only when inword is true. I guess it was motivated by the urge to use auto. That is my favorite keyword in C++, but here char * is what we really want. letter should be const. And I will ask for the excuse for not using isalpha() from the standard (one that I assume is unlikely to pass).
The state machine is sound except for the termination. We use the address of ch, so double check that we run the loop with &. Flipping inword is up to style; for myself I’d use = true/= false in the 2 blocks, but that form also passes. At the map insertion I would not use a temporary, as the initialization should fit fine inside the [], but if we do, I’d add move(). It feels redundant while we use string_view – but it may work better with a different key type like string, and cannot hurt.
Thinking about the termination condition, we reveal a serious bug: if the file finishes in a letter, then we drop the last word. I guess the client tested it only with files having \n or whitespace at the end. To correct it we either need to check inword after the loop and process the last word, duplicating the inner code (avoidable by putting it in a lambda) – or dodge the problem by adding a non-letter character to ss before the loop (blindly or with a check).
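[Ed: the sentinel variant might look like this; the choice of '\n' is mine:]

// Append a non-letter so the loop always flushes the final word.
// Do this before taking any pointers into ss, as push_back may
// reallocate the buffer.
ss.push_back('\n');  // 'blindly'; or first check ss.back()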
Before moving on, it is worth pondering why unordered_map is good for us. And if it is, we still may need to tweak it by setting the bucket number or something. string_view as the key type is also worth questioning: if the expected files are like English text, they will have many short words. For those, a string with small-string optimization that can cover most cases may work better. The client did mention performance as a goal. It’s not hard to make the key type into a typedef and put the line with the map insert into a separate function, with variants for std::string_view and std::string and possibly others, then measure.
Part 3, where we need to fish out the 10 winners. I wonder if there isn’t an algorithm for that; need to check nth_element, partial_sort... At cppreference, next to the former I spot partial_sort_copy, which IMHO describes exactly what we’re after. It uses an Input Iterator for the source, which fits our map, and it can take a predicate to compare the count. Having that, the original code can be measured against it, and I would not bet on its success: IMO a simple input with a/2 b/1 would output just a, and if we see one cockroach... it’s probably wiser to move to another neighborhood. Especially if we have an offer at hand.
At part 4 I miss a const after auto (at least it is consistent with part 3 :). Otherwise it is fine for me if we change '\n' to endl; at least I’m confident what the latter does without looking it up. The original aim did not state the order it wanted the output in. For me it would be more intuitive to start with the most frequent and go in descending order. It’s easy to iterate the map by rbegin(). But if we use the algorithm suggested for part 3, the result will naturally look like that and the current form will be good.
I’m also fine with no return at the end of main. In conclusion, I must state that a live interview, with actual talk with the client and getting answers instead of guessing, is way more fun and less cumbersome. But this was still an interesting exercise.
Jason Spencer
This is a nice program in terms of design and container usage. There are just a few things I’d change for correct behaviour and one almost ‘free’ optimisation.
In terms of coding:
- #include <sstream>, #include <vector> and #include <algorithm> are not required. I can’t see any stand-alone algorithms called. There are calls to member function equivalents, e.g. m.lower_bound instead of std::lower_bound, but they’re covered by the other headers.
- #include <iterator> should be added for std::istreambuf_iterator.
- #include <string> should be added for std::string.
- argc must be checked for the correct number of arguments, and usage printed if no input file is specified.
- ifs must be checked for errors after it is created (and the file opened and read) – the file may not exist, may not be readable, etc.
- It’s a question of style, but I prefer explicitly stating the expected value in the declaration of bool inword; rather than using the default value.
- I’m not a fan of mixing auto *start = ss.data(); and the &ch in the range-for loop – I know it’ll work, but technically a range-for loop gets values by calling .begin() and .end() member functions, if they both exist, or std::begin(..) and std::end(..) otherwise. Yes, these all return type CharT *, so you can do the subtraction to supply the value for the second argument of the std::string_view ctor, but that’s coincidental. I’d consider using a for loop which isn’t a range-for loop, initialise the start variable from std::begin(..), and go from there.
- The auto & ch : ss should be const auto & ch : ss, as we’re not going to change the contents of ss. If this change is made, auto *start = ss.data(); should also become const auto *start = ss.data();.
- Consider using std::isalpha instead of the comparisons and boolean logic setting the letter variable. There’s minimal performance benefit to the logic and it’s less expressive. If the logic must stay, then put brackets around the <= clauses as well as the && grouping. Although the precedence is correct in the program, the brackets make the expectation explicit.
- Prefer for (const auto &entry : words) over for (auto &entry : words). We’re not changing the value.
- Prefer for (const auto &entry : m) over for (auto &entry : m). We’re not changing the value.
- Consider not re-using the variable name entry, as it can be a little confusing.
In terms of correctness:
- If the file ends on a word (and no non-alpha characters follow up until EOF) then that last word isn’t counted. The solution is to either append a non-alpha character to ss after it is read, or put this code right after the word search loop:

  if (inword) ++words[{ start, &ss[ss.size()] - start }];

  Personally, I’d implement the non-range-for loop described earlier in the ss.data() comment and incorporate an end-of-loop correction into that somehow.

- The std::map section of the code doesn’t necessarily find the top ten words. Since we’re using a std::map (and not a std::multimap) the key has to be unique – so if two words have the same frequency, only the first word is included in the map, as later insert calls do not overwrite it. Which word is the first word is difficult to say, as it’ll be the first one from the range-for traversal of the words std::unordered_map – and std::unordered_map does not maintain an order to the elements. In fact it’ll be a function of insertion order, the data structure used in std::unordered_map and the bucket count. There is a solution to this through the use of std::multimap instead – in fact it could almost be a drop-in replacement. However, because of the way the map is trimmed with if (m.size() > 10), if we have nine ties for first place and three for second place, only one of the second place entries will be listed (the first one in alphabetical order). Similarly, if there is a 3-way tie for 1st, 3-way for 2nd, 3-way for 3rd and 3-way for 4th, only one of the 4th place values will be printed.

- In the std::map section the test if (it != m.begin() || m.empty()) { is wrong – it should be if (it != m.begin() || (m.size()<10)) {, otherwise under some conditions words with frequencies that don’t go straight to the top of the leader-board will get lost. To print the words of the top ten highest frequencies (printing all ties) we could use a map of vectors:

  std::map<size_t, std::vector<std::string_view>> m;
  for (const auto &entry : words) {
    auto it = m.lower_bound(entry.second);
    if (it != m.begin() || (m.size() < 10)) {
      if (it->first == entry.second) {
        it->second.push_back(entry.first);
      } else {
        m.insert(it, { entry.second, { {entry.first} } });
        if (m.size() > 10) {
          m.erase(m.begin());
        }
      }
    }
  }
  for (const auto &entry : m) {
    std::cout << entry.first << ":";
    for (const auto & w : entry.second)
      std::cout << ' ' << w;
    std::cout << '\n';
  }
In terms of behaviour:

- The word matching is case sensitive. The matching can be made case-insensitive by converting all characters read to lower case before they’re put into ss – with an iterator decorator. Alternatively, in the instantiation of std::unordered_map the KeyEqual and Hash type template arguments (the third and fourth template arguments) could be replaced with types that ignore case (see the sketch after this list).
- Word hyphenation is ignored. Should every word of “state-of-the-art” be counted separately?
- Apostrophes are also ignored. “won’t” and “won” should not both increment the “won” count in my opinion. But if apostrophes are considered part of a word then “people’s” and “people” wouldn’t both increment “people”, and perhaps they should.
- We’re printing (or at least trying to) the ten most frequent words. Shouldn’t the ten be configurable at runtime?
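[Ed: a minimal sketch of that last alternative – the functor names are invented and only ASCII case folding is assumed:]

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <unordered_map>

struct CaseInsensitiveHash {
  std::size_t operator()(std::string_view s) const {
    std::size_t h = 0;
    for (unsigned char c : s)  // simple multiplicative hash
      h = h * 131 + static_cast<unsigned char>(std::tolower(c));
    return h;
  }
};

struct CaseInsensitiveEqual {
  bool operator()(std::string_view a, std::string_view b) const {
    return std::equal(a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y)
          { return std::tolower(x) == std::tolower(y); });
  }
};

// Third and fourth template arguments replaced as described above.
using WordCounts =
    std::unordered_map<std::string_view, std::size_t,
                       CaseInsensitiveHash, CaseInsensitiveEqual>;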
In terms of performance:
The ‘free’ optimisation is to use istream::read to read into a buffer, rather than using std::istreambuf_iterator instances to construct ss. Specifically, the file size must be found first and a string of that size constructed; then the contents of the dummy string are overwritten with the file contents:
size_t getFileSize(std::ifstream & f) {
  auto prev_pos = f.tellg();
  f.seekg(0, std::ios::end);
  auto file_size = f.tellg();
  f.seekg(prev_pos);
  return file_size;
}

void usage(const char * execname) {
  std::cerr << execname << " input_file_name" << '\n';
}

int main(..) {
  const char * execname = argv[0];
  if (argc != 2) {
    usage(execname);
    return 1;
  }
  const char * infilename = argv[1];
  std::ifstream ifs{argv[1]};
  if (!ifs) {
    std::cerr << "Could not open file " << infilename << '\n';
    return 2;
  }
  size_t filesize = getFileSize(ifs);
  std::string ss(filesize, ' ');
  ifs.read(ss.data(), filesize);
  auto * start = ss.data();
This is significantly faster than the previous approach to reading the file, as the runtime is probably reading in chunks from the file – the exact way is implementation dependent – it may be standard block reads or it could be that the file is mmapped. But we’re telling the runtime that we’re reading sequentially and in large chunks, so the OS can read ahead and have the next data ready for the program.
We’re also avoiding the string having to resize as data is read, since we’ve provided the size in the constructor. Further, some OSes, such as Linux, don’t actually allocate memory pages until they are touched – by initialising ss to filesize many space characters we’re sure the memory is allocated and available.
Running this on a file which consists of ten copies of War and Peace (a 42MB file with 5.7 million words) the execution time is 74% of the original implementation. It’s actually less than that, but once the file resides in the OS’s filesystem cache the original program sees significant performance benefits to repeated executions.
In terms of design:
While loading the whole file into memory is doable on today’s computers, I’d suggest reading in a chunk of text at a time, extracting the words, storing only unique words (and only once), updating the tally, and getting the next chunk. Drop ss altogether and extract words from the chunks into std::string, storing those directly in the std::unordered_map.
This is for two reasons – firstly, memory scalability (and a measurable performance improvement due to a lower cache load), and secondly to make the code more flexible by allowing incremental processing. The data may be coming from a network stream, and we won’t know all the input data at the start, but we may want to know the ten most common words so far. Storing only one copy of every word means we’re looking at most around 80000 words (a large English vocabulary size) held in memory, but never more than if we were storing the entire file in memory. In some cases, for example if we were looking for the ten most traded stocks in a ticker stream, we can know the entire vocabulary ahead of time.
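[Ed: a minimal sketch of such a chunked reader – chunk size and names are assumptions, error handling elided:]

#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, std::size_t>
count_words_chunked(std::ifstream &ifs) {
  std::unordered_map<std::string, std::size_t> words;
  std::vector<char> buf(1 << 16);  // 64 KiB chunks
  std::string word;  // carries a word across chunk boundaries
  // read() sets eofbit/failbit on a short final read; gcount()
  // still reports the bytes actually obtained.
  while (ifs.read(buf.data(), buf.size()), ifs.gcount() > 0) {
    for (std::streamsize i = 0; i != ifs.gcount(); ++i) {
      unsigned char c = static_cast<unsigned char>(buf[i]);
      if (std::isalpha(c)) {
        word.push_back(buf[i]);
      } else if (!word.empty()) {
        ++words[word];  // the map stores its own std::string copy
        word.clear();
      }
    }
  }
  if (!word.empty()) ++words[word];  // flush a word ending at EOF
  return words;
}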
A back-of-the-laptop implementation (using std::string instead of std::string_view in the unique-word implementation, henceforth locally known as the store-only-unique-words-and-read-in-blocks or SOUWARIB version) had an execution time of between 70% of the original program and almost as low as 50%, depending on the file size.
With regards to performance, the elephant in the room is the STL implementation, I believe. Clang, GCC and Visual Studio STL implementations can vary internally quite a lot.
Doing some microbenchmarking on the SOUWARIB version, the bottleneck (for GCC 7.3.0 and libstdc++ 6.3.0) is by far the lookups in std::unordered_map. [1] has an excellent discussion of the different implementations each compiler uses for unordered associative maps.
The thing about the off-the-shelf implementations is that they have to be flexible and behave well enough in most circumstances. But we have a smaller problem space – for example, we have an upper limit to the number of unique words (i.e. the size of the OED, or thereabouts, or the number of symbols on a stock exchange), so we can allocate a fixed number of buckets and forgo the re-sizing and re-bucketing, and we can skip the NVI pattern [2] that an STL implementation might use to decrease compile times. We also know that we won’t be removing any elements from our map, and so on.
Now for the parental guidance bit – this code is highly experimental, and somewhat of a hack, but it’s a proof of concept. It is a drop-in replacement for std::unordered_map, but only insofar as the features used in the code for this critique – it’s missing most of the features of a general purpose implementation (and I’ve also edited out a few things for brevity).
#include <functional>  // std::hash
#include <vector>
#include <utility>     // std::pair
#include <iterator>    // std::next
#include <boost/iterator/transform_iterator.hpp>
#include <boost/iterator/filter_iterator.hpp>

template <typename KeyT, typename T,
          class Hash = std::hash<KeyT>,
          class KeyEqual = std::equal_to<KeyT> >
class my_hashmap {
protected:
  typedef std::pair<const KeyT, T> value_type;
  typedef std::size_t hash_type;

  class nodeT {
  protected:
    alignas(alignof(value_type)) uint8_t data[sizeof(value_type)];
    hash_type key_hash;
    // MSB is set flag, rest is next offset
    size_t next_and_set = 0;
    constexpr static decltype(next_and_set) SET_MASK =
      (1ULL << ((sizeof(next_and_set)<<3)-1));
    constexpr static decltype(next_and_set) NEXT_MASK = ~SET_MASK;
    void set(bool newval) noexcept {
      next_and_set = (next_and_set & NEXT_MASK) | (newval?SET_MASK:0);
    }
  public:
    bool is_set() const noexcept { return next_and_set & SET_MASK; }
    bool isSameKey(const KeyT & other,
                   const hash_type & other_hash) const {
      return ( is_set() && (key_hash==other_hash)
               && KeyEqual()(get_cvalue().first, other) );
    }
    value_type & get_value() noexcept {
      return *reinterpret_cast<value_type*>(&data);
    }
    const value_type & get_cvalue() const noexcept {
      return *reinterpret_cast<const value_type*>(&data);
    }
    size_t get_next_index() const noexcept {
      return next_and_set & NEXT_MASK;
    }
    void set_next_index(size_t next) noexcept {
      next_and_set = (next_and_set&SET_MASK) | (next & NEXT_MASK);
    }
    nodeT() = default;
    nodeT(const KeyT & key, const hash_type new_hash, const T & val)
      : key_hash(new_hash) {
      new (&data) value_type{key,val};
      set(true);
    }
    nodeT(const nodeT & o)
      : key_hash(o.key_hash), next_and_set(o.next_and_set) {
      if(o.is_set()) new (&data) value_type(o.get_cvalue());
    }
    // assign (copy|move) op, move ctor omitted for brevity
    ~nodeT() {
      if(is_set())
        reinterpret_cast<value_type*>(&data)->~value_type();
    }
  };

  const size_t n_buckets;
  std::vector<nodeT> storage;
  typedef typename decltype(storage)::iterator storage_iter;

public:
  my_hashmap(size_t log2_num_buckets)
    : n_buckets(1<<log2_num_buckets), storage(n_buckets) { }

  T & operator[](const KeyT & key) {  // no exception safety
    hash_type keyhash = Hash()(key);
    size_t bucket_index = keyhash & (n_buckets-1);
    size_t last_index = bucket_index;
    nodeT * node = & storage[bucket_index];
    if ( node->is_set() ) {
      bool found = false;
      while ( (!(found = node->isSameKey(key, keyhash)))
              && node->get_next_index() )
        node = & storage[(last_index = node->get_next_index())];
      if(!found) {
        storage.emplace_back(key, keyhash, T{});
        node = & storage[last_index];
        node->set_next_index(storage.size()-1);
        node = &storage[node->get_next_index()];
      }
    } else {  // bucket not set yet
      node->~nodeT();
      new (node) nodeT(key, keyhash, T{});
    }
    return node->get_value().second;
  }

  constexpr static auto get_value_only =
    [](nodeT & n) -> value_type & { return n.get_value(); };
  constexpr static auto is_set_pred =
    [](nodeT & n) -> bool { return n.is_set(); };
  typedef boost::filter_iterator<decltype(is_set_pred), storage_iter>
    filtered_it_t;
  typedef boost::transform_iterator<decltype(get_value_only),
                                    filtered_it_t> value_iterator;
  value_iterator begin() {
    return value_iterator(
      filtered_it_t(is_set_pred, std::begin(storage), std::end(storage)),
      get_value_only);
  }
  value_iterator end() {
    return value_iterator(
      filtered_it_t(is_set_pred, std::end(storage), std::end(storage)),
      get_value_only);
  }
};
There’s much wrong with this – from the lack of exception safety and the risk of memory leaks, to the use of a power of two for the number of buckets (often the number of buckets is prime, to limit bucket collisions). It really is just a back-of-the-laptop proof of concept. The point is that by dropping this in to SOUWARIB, the execution time is now around 70% of the SOUWARIB with std::unordered_map, and around 50% of the original program.
With some help from Hardware Performance Counters [3] (link is for Intel, but AMD and ARM have equivalents) it can be shown that the performance jump in SOUWARIB+my_hashmap over SOUWARIB+std::unordered_map is due to significantly fewer memory accesses. Microbenchmarking with [4] shows that the cache hit ratio remains about the same in the word extraction loop but the number of accesses approximately halves.
For those not wanting to re-implement unordered_map or other STL functionality, have a look at EASTL [5], as there is some mildly non-standard, but otherwise very fast, STL-equivalent functionality in there.
References
[1] http://bannalia.blogspot.co.uk/2013/10/implementation-of-c-unordered.html
[2] https://en.wikipedia.org/wiki/Non-virtual_interface_pattern
[3] https://software.intel.com/en-us/articles/intel-performance-counter-monitor
[4] https://github.com/jasonspencer/CPP_LPE_wrap
[5] https://github.com/electronicarts/EASTL
Commentary
There’s a lot in this, seemingly quite simple, critique. The first design issue is that of choosing to use string_view. For those not familiar with this feature, new in C++17, it is a non-owning view over a sequence of characters (it’s loosely a wrapper for pointer + length). The advantage, from a performance point of view, is that there is no need to copy the data making up each word; it can be used in place. However, this has two disadvantages here. The first is that the original data must not be disposed of before the last use of any of the string_views referring to it. In this case it means we must leave the whole source file in memory, as we don’t know which portions of it are in use by the unordered_map. The choice is predicated on the decision that the program can only deal with data that can be read into memory in one go.
The second disadvantage here is that the keys in the map refer to memory with little cache locality, and this has a negative effect on performance on most modern hardware. As Jason demonstrates towards the end of his critique, reducing memory access cost is often the most significant change you can make to improve the performance of a program limited by compute time.
The subsidiary design problems mostly concern boundary conditions:
- what should be done about punctuation and capitalisation
- in some languages you may also have to consider accents: Paul in particular hit this problem
- what to do with a word ‘in progress’ at the end of the file
- what to do with ‘ties’ for counts
The entries between them covered most of these issues – for example, all suggested std::isalpha() rather than hand-crafted comparisons.
Pal expressed a concern over using a temporary for the call to insert. Here, for example, emplace_hint could be used instead (as Scott Meyers writes in Effective Modern C++: ‘Consider emplacement instead of insertion’).
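In the original part 3 that substitution would look something like this (a sketch):

// instead of: m.insert(it, {entry.second, entry.first});
m.emplace_hint(it, entry.second, entry.first);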
As Jason states, the STL containers are good general purpose containers; the implementors work hard to ensure that they deliver the best overall performance for a range of possible use cases. In any specific program, especially where you the programmer may have additional knowledge about the data sets being processed, it may well be possible to craft a tailored collection or algorithm that outperforms the one in the standard library.
However, it is also likely that any replacement you provide will contain its own bugs and it’s also important to verify that your proposed solution is in fact faster than the standard library component as these have in general had a lot of design put into them over many years!
I personally find myself a little torn over the style of using {} for initialising variables. One advantage is consistency: you can use the same syntax for a variety of different data types without having to read and parse the various possible initialisation values. One disadvantage is that it relies on the reader knowing what the value is for each type – bool inword{false} makes this explicit. Alternatively, the syntax bool inword = false has the advantage that it is more like the syntax that would be used in some other programming languages. I suspect disagreements about the best style will continue, as there does not seem to be an obvious reason to prefer one style.
Lack of error handling is a continual problem with the code critique challenges – as it is in quite a lot of other code. Anything that relies on external inputs, such as the presence of command line arguments and the existence of files, should have some sort of check as these are the sort of things that are very likely to go wrong and can also waste a surprisingly large amount of time in resolving.
The Winner of CC 111
The four critiques all did a good job of identifying a number of things to improve in the submitted code. James’s suggestion of using regex to do the parsing does help to produce more robust and flexible code (although unfortunately this seems to come at a price in terms of performance). Paul described some of the problems with handling French (other languages may have similar issues) and suggested switching to a whitespace/punctuation based approach.
Pal’s intuition that there might be a standard algorithm already written for use in the final stage was sound; he found this in partial_sort_copy(). Few people know the whole set of standard algorithms, so it is well worth having a quick look when you are doing something relatively straightforward, to avoid having to re-invent the wheel!
However, I think that overall Jason did the best job of identifying both the small nits and the larger design issues, and of pointing towards the sorts of things that need considering when you’ve ‘tried to make it pretty fast’. It will depend on the specific use case whether the performance needs in this case would justify the extra development cost of writing a bespoke hash map, and one of the problems with the code critique format – as Pal said – is that there is no mechanism for asking those sorts of questions. So, congratulations to Jason and thanks to the other entrants.
I note that all of these entrants have submitted several times before; can I encourage those of you who haven’t quite got round to sending in an entry to give it a go?
Code Critique 112
(Submissions to scc@accu.org by Jun 1st)
I’m playing around with simple increasing sequences of small numbers and seeing what various different generators produce. I had problems that sometimes the result was wrong – for example if I produce a number that overflows the integer size I’m using – and so I added a postcondition check that the output sequence is increasing. However, I’ve found the check doesn’t always ‘fire’ – it works with gcc with or without optimisation but fails with MSVC unless I turn on optimisation. I’ve produced a really simple example that still shows the same problem, can you help me see what’s wrong?
Expected output, showing the postcondition firing:
1,2,3,4,5,6,7,8,9,10,
1,3,6,10,15,21,28,36,45,55,
1,1,2,3,5,8,13,21,34,55,
1,5,14,30,55,91,140,204,285,385,
Postcondition failed!
1,5,14,30,55,91,-116,-52,29,-127,
Can you help (and perhaps identify one or two other issues en route?) The code is in Listing 2.
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

class postcondition {
public:
  postcondition(std::function<bool()> check)
    : check_(check) {}
  ~postcondition() {
    if (!check_())
      std::cerr << "Postcondition failed!\n";
  }
private:
  std::function<bool()> check_;
};

template <typename T>
std::vector<T> get_values(int n, std::function<T(T)> generator) {
  std::vector<T> v;
  auto is_increasing = [&v]() {
    return is_sorted(v.begin(), v.end());
  };
  postcondition _(is_increasing);
  T j = 0;
  for (int i = 0; i != n; ++i) {
    j = generator(j);
    v.push_back(j);
  }
  return v;
}

using out = std::ostream_iterator<int>;

template <typename T>
void sequence() {
  auto const & result = get_values<T>(10,
    [](T i) { return i + 1; });
  std::copy(result.begin(), result.end(), out(std::cout, ","));
  std::cout << std::endl;
}

template <typename T>
void triangular() {
  T i{};
  auto const & result = get_values<T>(10,
    [&i](T j) { return j + ++i; });
  std::copy(result.begin(), result.end(), out(std::cout, ","));
  std::cout << std::endl;
}

template <typename T>
void fibonacci() {
  // [Ed: the line defining 'result' was garbled in this copy; it is
  // reconstructed here to produce 1,1,2,3,5,... as in the expected output.]
  T i{1};
  auto const & result = get_values<T>(10,
    [&i](T j) { T t = i + j; i = j; return t; });
  std::copy(result.begin(), result.end(), out(std::cout, ","));
  std::cout << std::endl;
}

template <typename T>
void sum_squares() {
  T i{1};
  auto const & result = get_values<T>(10,
    [&i](T j) { return j + i * i++; });
  std::copy(result.begin(), result.end(), out(std::cout, ","));
  std::cout << std::endl;
}

int main() {
  sequence<int>();
  triangular<int>();
  fibonacci<int>();
  sum_squares<int>();
  sum_squares<char>(); // overflow expected
}

Listing 2
[Web Editor's Note: The print and PDF copies of CVu 30.2 contain several places where the number of this and the next Code Critique are incorrect. They have (hopefully) been corrected here.]