Journal Articles
Browse in : |
All
> Journals
> CVu
> 316
(11)
All > Journal Columns > Code Critique (70) Any of these categories - All of these categories |
Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.
Title: Code Critique Competition 121
Author: Bob Schmidt
Date: 03 January 2020 17:13:00 +00:00 or Fri, 03 January 2020 17:13:00 +00:00
Summary: Set and collated by Roger Orr. A book prize is awarded for the best entry.
Body:
Please note that participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.
Note: if you would rather not have your critique visible online please inform me. (Email addresses are not publicly visible.)
Last issue’s code
I’m doing some simple text processing to extract sentences from a text document and produce each one on a separate line. I’m getting a space at the start of the lines and wonder what’s the best way to change the regexes to remove it?"
$ cat input.txt This is the first sentence. And the second. This is the last one. $ sentences input.txt This is the first sentence. And the second. This is the last one.
Listing 1 contains the code. As always, there are a number of other things you might want to draw to the author’s attention as well as the presenting problem.
#include <fstream> #include <iostream> #include <iterator> #include <sstream> #include <string> #include <regex> using namespace std; int main(int argc, char** argv) { std::ifstream ifs(argv[1]); std::stringstream ss; char ch; while (ifs >> std::noskipws >> ch) ss << ch; string s = ss.str(); smatch m; while (regex_search(s, m, regex("[\\s\\S]*?\\."))) { std::string sentence(m[0]); std::regex whitepace("[\\s]+"); std::regex_replace( std::ostream_iterator<char>(std::cout), sentence.begin(), sentence.end(), whitepace, " "); std::cout << std::endl; s = m.suffix().str(); } } |
Listing 1 |
Critiques
Pete Cordell <accu@codalogic.com>
Being one of the few people in the world who actually likes regular expressions, I thought I’d have a play with this Code Critique.
In the order that thoughts arose in the file, I noticed that the code used the common using namespace std;
line but then prefixed many functions and classes with std::
anyway. I’d prefer the code to either rely on the directive or not, and it should commit to one way or the other.
I opted for not using the directive. That said, these days I feel that in a modern language strings should be treated as first class types similar to bool
, int
and float
. I therefore changed using namespace std;
to using std::string;
.
This does make the primary loop where the majority of the regex work is done somewhat noisy. As much by way of experiment I added statements such as using std::smatch;
at the beginning of the function to see how it looked. I actually quite like this as it gives the reader some foresight that these are the sorts of toys this function is going to be dealing with.
Going back to the function declaration for main()
, g++ 9.2.0 on Wandbox complained that argc
wasn’t used. The simple way to fix this is to do use int main(int, char** argv)
.
The reading of the input into a string can actually be done as a one liner:
string s((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>())
This looks like some weird incantation to me but I guess there are only so many ways that an input stream can be used to initialise a string, so a reader maybe able to guess what was going on even if they weren’t sure.
The core regex
in the example is "[\\s\\S]*?\\."
. This relies on the non-default non-greedy matching specification "*?"
. To me, this is an advanced regex
feature and best avoided if possible. (If nothing else, the syntax looks scary.) The desired goal of the regex
is to match anything up to and including the first encountered full-stop. I think this is better expressed using a negative character class match such as "[^.]*\\."
.
The "[^.]*\\."
expression also captures leading whitespace. My simple solution to this is to change the expression to "\\s*([^.]*\\.)"
. This skips over the leading whitespace and creates another capture group for what we want. So where in the loop we previously had m[0]
for what the whole regular expression captured, I changed this to m[1]
.
The "[\\s]+"
regex
for whitespace is over complicated and can simply be done as "\\s+"
.
The last line of the original loop was s = m.suffix().str();
. Creating a new string for the subsequent iteration feels like the wrong way to do things. My solution to this was to set up a string
iterator of the form string::const_iterator is( s.cbegin() )
and then use that in the loop condition as regex_search(is, s.cend(), m, …
. At the end of the loopb the iterator is updated using is += m.length()
. Whether this is the idiomatic way to do this I don’t know. I look forward to finding out. [Editor: you could use is = m.suffix().first;
]
My modified code is:
#include <fstream> #include <iostream> #include <iterator> #include <sstream> #include <string> #include <regex> using std::string; int main(int , char** argv) { using std::smatch; using std::regex_search; using std::regex; using std::regex_replace; std::ifstream ifs( argv[1] ); string s((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>()); smatch m; string::const_iterator is( s.cbegin() ); while (regex_search(is, s.cend(), m, regex("\\s*([^.]*\\.)"))) { string sentence(m[1]); regex whitepace("\\s+"); regex_replace( std::ostream_iterator<char>(std::cout), sentence.begin(), sentence.end(), whitepace, " "); std::cout << std::endl; is += m.length(); } }
James Holland <James.Holland@babcockinternational.com>
In order to make corrections, it is necessary to understand how the existing program works. Regular expressions can be a little difficult to fathom and so I will need to take things one step at a time. The first statement of interest is the regex
parameter within the control section of the while
loop. Enclosed within the square brackets are two shortcut characters. The first is \\s
and denotes a space character. The second is \\S
and denotes any character that is not a space. Either of these shortcuts are accepted as a match because they are enclosed in square brackets. This has the effect of matching any character including spaces.
The next regex
character is *
. This has the effect of including any number of characters in a single match. Sentences consist of many characters and so this required. The next regex
character is ?
, which means stop trying to include characters in the match as soon as a match is found. Lastly, \.
signifies that a match is to end in a full stop.
The problem with this regular expression is that it will match sentences with leading spaces as well as sentences without leading spaces. This can be remedied by adding \S
to the beginning of the expression. This requires a match not to start with a space. The software now provides the desired result.
Incidentally, (.|\n)
could be used in place of [\\s\\S]
. The parentheses are required to obtain the correct association of operators but introduces a capture group. To tell the regex
that a capture group is not required, ?:
is inserted after the (
character giving (?:.|\n)
. It is possible to make the value of reg
slightly simpler by using raw literals. This removes the need for the ‘escape’ characters and becomes R"(\S[\s\S]*?\.)"
and is probably the shortest representation.
It should be noted that in the provided program regex_replace()
removes all occurrences of one or more spaces from the sentence and replaces each occurrence with a single space.
There are still some oddities with the program, however.
There is no need to prefix standard library components, such as cout
and ifstream
, with the namespace std
as the using
directive has been used to make the contents of the std
namespace available. Also, there is no need to read the contents of the file into a stringstream
and then copy the stringstream
to a string
. The file can be read into a string
directly. I show how this can be done in my version of the program.
The regex
object within the control section of the while
loop is created for every iteration of the loop and wastes significant time. The creation of the regex
object should be brought outside the while
loop and so created only once. The same applies to the whitespace
regex
.
Using regex_search
repeatedly within a loop is not good practice. Although it appears to work in this application, there are regex
iterators available that are designed to find every match within a source string
. I have adopted this approach in my version of the program.
To make a more self-documenting program, const
could be applied to identifiers that do not change their state after initialisation. Such objects include whitespace
and sentence
. It is not necessary to flush cout
after writing every sentence and so a simple '\n'
would suffice instead of endl
.
It is also possible to add regex_constants::optimize
to the regex
constructors to optimise for speed but I did not find any improvement in this particular application. Incidentally, the original (corrected) code took about 6.2 seconds to extract 4800 sentences from a test file (but not to display them). The version I provide took 0.1 seconds. An impressive improvement.
#include <fstream> #include <iostream> #include <iterator> #include <regex> #include <sstream> #include <string> using namespace std; int main(int arg, char** argv) { ifstream ifs(argv[1]); const string s((istreambuf_iterator(ifs)), {}); const regex reg(R"(\S[\s\S]*?\.)"); const sregex_iterator end; const regex whitespace("[\\s]+"); for (sregex_iterator pos(s.cbegin(), s.cend(), reg); pos != end; ++pos) { const string sentence(pos->str()); regex_replace( ostream_iterator<char>(cout), sentence.begin(), sentence.end(), whitespace, " "); cout << '\n'; } }
Hans Vredeveld <accu@closingbrace.nl>
The last character of a sentence is a period. After this period, we first get some whitespace before the next sentence starts. The regular expression in the regex_search
statement treats this whitespace as part of this next sentence. An easy solution is to consume whitespace at the beginning of a sentence separately and not output it:
In the regex_search
, change the regular expression to R"([\s]*([\s\S]*?\.))"
.
In the body of the while
-loop, initialize sentence with m[1]
, instead of m[0]
.
By the way, I also changed the string literal to a raw string literal for readability. In regular expressions, the backslash is used a lot and this avoids having to double it in the string literal.
Now that we have solved the main issue, let’s see what else we can improve. The first thing is the inconsistent use of functions and types from the std
namespace. Most of the time they are used by prefixing them with std::
, but on some occasions (declaration of s
and m
, use of the unnamed regex
object in the while
-statement) they are found because of the using-directive at line 7. I have a strong preference to not use using
-directives and would remove it. The types and functions will have to be prefixed with std::
. Note that, because of ADL, regex_search
still will be found in namespace std
without prefixing it with the namespace.
Leaving the code that reads the file as is for now, we see a while
-loop that loops over the string s
that is defined before the loop and updated in the last statement of the loop body. We can replace this by a for
-loop with an sregex_iterator
as loop variable:
std::string s = ss.str(); std::regex re(R"([\s]*([\s\S]*?\.))"); for (std::sregex_iterator it(s.begin(), s.end(), re); it != std::sregex_iterator(); ++it)
The initialization of sentence has to be updated to use *it
instead of m
(that is no longer used).
This change also improves performance. A crude performance measurement with a large input file (the original test input concatenated 1,000 times) shows that execution time was reduced from roughly 1.5 seconds to 0.6 seconds (on a Linux laptop with a 4-core Intel i7@2.20GHz and 16GB).
In the loop bodyb a sequence of whitespace characters in the sentence is replaced by a single space for each sentence separately. It is more efficient to do this once on the entire input before entering the loop:
std::regex whitespace(R"([\s]+)"); std::string s = std::regex_replace(ss.str(), whitespace, " ");
The loop body is now reduced to simply streaming to std::cout
(also std::endl
is replaced by '\n'
to avoid flushing the buffer after each line):
std::cout << (*it)[1] << '\n';
As for the execution time, we are now down to an execution time of about 0.08 seconds on the same input as above.
If we include the reading from file, we now fill the string s
by reading the file one character at a time, including whitespaces, and then replace each sequence of whitespaces by a single space. We can combine these actions by reading one word (non-whitespace characters delimited by whitespace characters) at a time, while skipping whitespace and precede the words with a space when putting them in the stringstream
. To keep the scope of the variable containing the words local to the loop, we change the while
-loop to a for
-loop and declare the variable in the init
expression of the for
-statement. The execution time with the large input file now goes down to about 0.03 seconds.
After all this refactoring, the entire program looks as follows:
#include <fstream> #include <iostream> #include <sstream> #include <string> #include <regex> int main(int argc, char** argv) { std::ifstream ifs(argv[1]); std::stringstream ss; for (std::string word; ifs >> word;) ss << ' ' << word; std::string s = ss.str(); std::regex re(R"([\s]*([\s\S]*?\.))"); for (std::sregex_iterator it(s.begin(), s.end(), re); it != std::sregex_iterator(); ++it) std::cout << (*it)[1] << '\n'; }
There are still some opportunities for change/improvement left. I will just remark on them here.
We now have a program with only one regular expression in it. We could replace this by using std::find
, std::string::find
or friends, but then we would have to manually skip over the spaces between sentences. That would make our program more complex, and I like the simplicity that we have now with the regular expression. Also, it would be interesting to see what compile-time regular expressions would do for this code.
With a very small change to the regular expression, the program can also handle sentences ending in an exclamation point or question mark. It will be more difficult to handle the decimal point in numbers with a fractional part properly.
When reading the file, the program does not do any error handling of its own. Most of the errors that need to be handled are already handled by std::ifstream
, but that doesn’t give any feedback to the user. The one case that is not handled and that leads to undefined behaviour, is when argc == 0
. In that case, argv[1]
is undefined. (If argc == 1
, argv[1]
is a null pointer.)
Commentary
This critique was in part generated by some discussion about the inefficiency of the C++ regex
library. While I believe that there are genuine issues with the specification of the library, this critique demonstrates that at least some of the problems are due to inappropriate use of the library.
One key item to improve the use of regex
is the understanding that there are two parts to the process – firstly generating the regular expression and second applying it to the target. In this critique, the creation of the regex
objects each time round the loop is particularly troubling. There was thought given, when the regex
library was written, to supporting regular expressions in an idiomatic C++ manner, and some of the solutions provided above demonstrate this.
The other part of the commentary is to recommend using raw string literals (available since C++11) for strings where the number of escape characters would otherwise make the string unreadable. This is very common with regex
expressions as use of the backslash is common, and each use must be escaped when using a standard string literal.
Regular expressions were the first motivating example in the original proposal, N2146, including a line from a C++ program which read:
"('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"
While Hans noted that the code invoked undefined behaviour if argc
was 0, no-one pointed out that the program crashes (typically) if argc
was 1 (as the argument is expected to point to a null terminated character string). I would like to have seen this case handled, for example by writing a usage message, to improve the usability of the program.
The winner of CC 120
All the critiques resolved the presenting issue of the leading spaces – although in slightly different fashions. I’m not sure there is necessarily a ‘best way’ – I think it depends what fits in best with the direction of the solution.
The solutions also improved the performance of the original code; I did some basic performance measurement of the three solutions and Pete and James’s solutions seem of similar speed while Hans’s is noticeably faster (a result of the change he made to perform whitespace handling on input)
Pete made a good call to remove the use of the non-greedy matching specification ("*?"
) in the original regex
; I agree with him that this can make the regular expression harder to understand for the reader.
I think this critique is a good place to use sregex_iterator
and I was pleased to see two solutions did so. I think this better expresses the intent of the example; it may also be more performant (but this depends upon the standard library implementation being used.)
James’s solution nicely illustrates the use of CTAD (Class Template Argument Deduction) in the initialisation of the string s
using istream_iterator
– you may need to enable experimental C++ support in your compiler to get this to compile pending the C++20 standard.
Hans also pointed out a few problems with the code – such as in handling sentences ending with punctuation and the lack of error handling and so I am awarding him the prize for this issue’s critique.
Code Critique 121
(Submissions to scc@accu.org by February 1st)
I wanted to offload logging to the console from my main processing threads, so I thought I’d try to write a simple asynchronous logger class to help me do that (and hopefully learn something about threading while I’m doing it). Unfortunately it only prints the first line (or sometimes the first couple) and I’m not really sure how best to debug this as the program doesn’t seem to behave quite the same way in the debugger.
$ queue.exe main thread Argument 0 = queue.exe
or sometimes only:
$ queue.exe main thread
The coding is in three listings:
- Listing 2 contains async logger.h
- Listing 3 contains async logger.cpp
- Listing 4 contains queue.cpp.
#pragma once #include <iostream> #include <queue> #include <string> #include <thread> using std::thread; class async_logger { bool active_; std::queue<std::string> queue_; thread thread { [this]() { run(); } }; void run(); public: async_logger(); ~async_logger(); void log(const std::string &str); }; |
Listing 2 |
#include "async_logger.h" #include <iostream> using std::thread; // This runs in a dedicated thread void async_logger::run() { while (active_) { if (!queue_.empty()) { std::cout << queue_.front() << '\n'; queue_.pop(); } } } async_logger::async_logger() { active_ = true; thread_.detach(); } async_logger::~async_logger() { active_ = false; } // queue for processing on the other thread void async_logger::log(const std::string &str) { queue_.emplace(str); } |
Listing 3 |
#include "async_logger.h" int main(int argc, char **argv) { async_logger logger; logger.log("main thread"); thread test1([&logger]() { logger.log("testing thread 1"); }); for (int idx = 0; idx < argc; ++idx) { logger.log("Argument " + std::to_string(idx) + " = " + argv[idx]); } logger.log("main ending"); test1.join(); } |
Listing 4 |
You can also get the current problem from the accu-general mail list (next entry is posted around the last issue’s deadline) or from the ACCU website (http://accu.org/index.php/journal). This particularly helps overseas members who typically get the magazine much later than members in the UK and Europe.
Roger has been programming for over 20 years, most recently in C++ and Java for various investment banks in Canary Wharf and the City. He joined ACCU in 1999 and the BSI C++ panel in 2002.
Notes:
More fields may be available via dynamicdata ..