Title: An Introduction to OpenMP
Author: Martin Moene
Date: Tue, 06 September 2016 16:16:39 +01:00
Summary: Silas S. Brown dabbles in multiprocessing to speed up his calculations.
Body:
If you use a CPU that was manufactured during the last few years, then the chances are it has more than one core, most likely two or four. Multi-core programming can be difficult (I would certainly recommend putting in a little effort to make sure you’re using a fast-enough algorithm on one core first), but it has been made easier by GCC’s support for the OpenMP (Open Multi-Processing) standard since version 4.2 (2007). If you use a recent version of GCC, you might have OpenMP without knowing it. Try:
gcc my-program.c -fopenmp
and see whether it complains about an unknown option. (I do this in a script to decide which compilation options to use on a deployment machine, as sketched below.)
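For instance, a shell-script test along these lines can make that decision automatically (a sketch of mine; the CFLAGS handling is only illustrative):

# Try a trivial compile with -fopenmp; only add the flag if it is accepted.
if echo 'int main(void){return 0;}' | gcc -fopenmp -x c - -o /dev/null 2>/dev/null
then
    CFLAGS="$CFLAGS -fopenmp"
fi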
Adding OpenMP directives to a program can be surprisingly simple. Consider a for loop:
for (int i=0; i < nItems; i++) process_item(i);
If process_item looks only at item i and nothing else (no memory conflicts) then all you need to add before the for is:
#pragma omp parallel for
and, by default, the OpenMP library will find out at runtime how many cores are available on the CPU, split into that number of threads, divide nItems by the number of threads, let each thread process its ‘chunk’ of the items, and wait for them to finish. It will also add code to let the user override the number of threads at runtime by setting an environment variable (OMP_NUM_THREADS). This is all rather powerful just for one #pragma. Of course, if that code is compiled without OpenMP support, the pragma will be ignored and the code will run sequentially. But some compilers warn about unknown pragmas, so to suppress this warning you could wrap the pragma in an ifdef:
#ifdef _OPENMP
#pragma omp parallel for
#endif
which you can even extend to let you use macros to control exactly which parts of your program are parallelised:
#define Parallelise_The_XYZ_Loop 1
...
#if defined(_OPENMP) && Parallelise_The_XYZ_Loop
#pragma omp parallel for
#endif
Since the extra work of creating and managing the threads has an overhead, you should only use parallel for if you’re sure the benefits will be worth it. For very short loops, you might actually slow the program down. Always measure to check you are actually getting a speed increase.
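Putting the pieces together, a complete toy program might look something like the following sketch of mine (process_item here just does some arbitrary arithmetic so there is something worth timing; compile with and without -fopenmp and compare):

#include <stdio.h>

#define N_ITEMS 1000000
static double results[N_ITEMS];

static void process_item(int i) {
  double x = i;
  for (int j = 0; j < 1000; j++) x = x * 1.0000001 + 1.0;  /* busy work */
  results[i] = x;  /* each iteration writes a different element */
}

int main(void) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = 0; i < N_ITEMS; i++) process_item(i);
  printf("%f\n", results[N_ITEMS - 1]);
  return 0;
}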
By default, the loop counter and any variable you declare inside the loop will be private to each thread, but other variables will be shared, so if you want to change them you had better write a critical section to ensure only one thread at a time can get in:
#pragma omp critical
update_a_shared_variable();
critical is not needed if all you’re doing is writing to an array where the element number you write to is the item number you’re processing, as the other threads won’t be writing to the same element. But it is often needed in other shared-variable circumstances; you are going to have to think.
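As a concrete sketch of the difference (the names compute, results and total are mine, not from the article): a per-iteration array write needs no protection, but accumulating into a shared total does:

double total = 0.0;
#pragma omp parallel for
for (int i = 0; i < nItems; i++) {
  double r = compute(i);   /* hypothetical per-item work */
  results[i] = r;          /* distinct element per iteration: safe without critical */
  #pragma omp critical
  total += r;              /* shared variable: one thread at a time */
}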
One pattern that is often seen in OpenMP programming is to check if a shared variable needs updating, then enter a critical section and repeat the check:
if (shared_variable_needs_updating()) {
  #pragma omp critical
  {
    if (shared_variable_needs_updating())
      update_a_shared_variable();
  }
}
The second check is there in case another thread beats us to it with updating the shared variable. For example, this might be used if the shared variable is ‘best solution found so far’: just because we found a better solution outside the critical section doesn’t mean nobody else posted an even better one just before we entered it. We could save the extra comparison by entering the critical section unconditionally and THEN making the comparison, but that would be inefficient because it would hold up other threads unnecessarily.
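A sketch of that ‘best solution found so far’ case (the variable names are mine; note the outer braces so the pragma is not the direct body of the if):

if (score > best_score) {            /* cheap check outside the critical section */
  #pragma omp critical
  {
    if (score > best_score) {        /* re-check: another thread may have updated it */
      best_score = score;
      best_index = i;
    }
  }
}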
One trick that might be useful during debugging is to add default(none) to the end of the parallel for pragma. That tells OpenMP to refrain from its default behaviour of making variables within the loop private to each thread and other variables shared, and forces you to declare the shared/private status of each variable explicitly. If you haven’t done so, you get some handy error messages pointing out each variable referred to from the parallel section. This can be very useful indeed when retro-fitting OpenMP to existing code and the loop is too large for you to be sure you’ve noticed everything.
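A sketch of what that looks like (the clause list here is only illustrative):

/* With default(none), every variable used inside the loop must be listed
   explicitly; forget one and the compiler names it in an error message. */
#pragma omp parallel for default(none) shared(nItems)
for (int i = 0; i < nItems; i++) process_item(i);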
parallel for can take only normal for loops that count items as they go; trying to be more ‘clever’ with the for statement will not work with OpenMP. You may use the continue statement in a parallel for, but not break (unless it’s inside another loop etc. that’s nested inside the parallel one), and not return. This is for obvious reasons: there would be no way for the OpenMP libraries to make sure that break or return stops other iterations of the loop if some other thread is already running away with them.
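For instance, skipping unwanted items with continue compiles fine, whereas a break out of the parallel loop itself is rejected (item_is_wanted is a hypothetical helper of mine):

#pragma omp parallel for
for (int i = 0; i < nItems; i++) {
  if (!item_is_wanted(i)) continue;  /* allowed: just ends this iteration */
  process_item(i);
  /* break; would not compile here: other threads may already be
     running later iterations, so OpenMP cannot honour it */
}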
By default, parallel for assumes that each loop iteration will take roughly the same amount of time, and so it splits the required iterations evenly among the threads. You could instead add schedule(dynamic) to the pragma to take the alternative approach of sending just one iteration at a time to each thread (so, for example, if there are four cores, the first four iterations will be started on immediately, and as soon as one of the cores finishes its iteration it will be given the fifth iteration to do), but that tends to work well only if each iteration is quite long; if iterations are short then the overheads of managing the dynamic schedule slow things down.
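The clause goes on the same pragma, e.g. (a minimal sketch, reusing the earlier loop):

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < nItems; i++) process_item(i);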
You can however do your own scheduling: instead of using parallel for, just say:
#pragma omp parallel
some_function_or_block()
which will run N identical copies of some_function_or_block(); these copies will then need to work out amongst themselves which one does which section of work. To help with this, omp.h defines the functions omp_get_thread_num() and omp_get_num_threads(): the thread number will be between 0 and threads-1 inclusive. Since I like to make sure my programs can still compile even if OpenMP is not present, I do this:
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_num_threads() 1
#define omp_get_thread_num() 0
#endif
You have to be careful, when dividing your work units by the number of threads, to make sure no work is left out due to the division result being rounded down. If your units are fairly even then it’s probably best just to use OMP’s own parallel for, which does all the work for you.
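If you do split the work yourself, one way to avoid losing the remainder of the division is to hand the first few threads one extra item each (a sketch of mine; the variable names are placeholders):

#pragma omp parallel
{
  int nThreads = omp_get_num_threads();
  int me       = omp_get_thread_num();
  int base  = nItems / nThreads;     /* items every thread gets */
  int extra = nItems % nThreads;     /* left over after rounding down */
  int start = me * base + (me < extra ? me : extra);
  int end   = start + base + (me < extra ? 1 : 0);
  for (int i = start; i < end; i++) process_item(i);
}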
Signals are usually sent to an arbitrary thread, so the best thing to do in a signal handler is probably just to set a flag which all threads regularly check.
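A sketch of that idea (the flag and handler names are mine; volatile sig_atomic_t is the traditional choice for a flag set from a signal handler):

#include <signal.h>

void process_item(int i);   /* the per-item work from earlier */

static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int sig) { (void)sig; stop_requested = 1; }

void process_all(int nItems) {
  signal(SIGINT, on_signal);
  #pragma omp parallel for
  for (int i = 0; i < nItems; i++) {
    if (stop_requested) continue;   /* can't break out of a parallel for */
    process_item(i);
  }
}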
OpenMP works in C++ as well, but if you are using a lot of objects then you might need to be even more careful of where you put your critical sections.
Besides GCC, other compilers that support OpenMP include Visual C++ (from its 2005 version onward) and the Intel compiler, but I haven’t tried these. Clang 3.7 supports it, but some older Macs (e.g. OS X 10.7) have both Clang and GCC installed where the GCC supports OpenMP but the Clang does not. OpenMP implementations are generally limited to multicore CPUs with shared memory, as in a modern multicore desktop; more advanced approaches are needed if you’re running on a supercomputer or cluster that does not share its memory between all the cores, or if you want to run your processing on graphics cards (GPUs).
On slightly older Apple computers, there’s some strange bug that means you can’t call memcpy() from inside a function that uses OpenMP: you have to wrap that memcpy() into another function of your own and call that (a sketch of such a wrapper follows below). But the function you wrap it in can be ‘inline’ so you don’t actually lose anything. If you get other problems on Apple, try:
#define _FORTIFY_SOURCE 0
as a workaround.
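A minimal sketch of the wrapper idea (the name my_memcpy is mine, not from the article):

#include <string.h>

/* Wrapping memcpy() in an inline function of our own sidesteps the
   problem on affected Apple systems, at no runtime cost. */
static inline void *my_memcpy(void *dest, const void *src, size_t n) {
  return memcpy(dest, src, n);
}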
Finally, if you are cross-compiling for Windows using MinGW, you might want to use the -static flag to make sure the .exe file doesn’t depend on OpenMP and threading DLLs. Windows .exe files are easier to distribute if they don’t need DLLs.
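For example, something along these lines (the exact cross-compiler name depends on your MinGW installation, so treat this as an assumption):

i686-w64-mingw32-gcc my-program.c -fopenmp -static -o my-program.exe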