Title: An Introduction to OpenMP
Author: Martin Moene
Date: Tue, 06 September 2016 16:16:39 +01:00
Summary: Silas S. Brown dabbles in multiprocessing to speed up his calculations.
Body:
If you use a CPU that was manufactured during the last few years, then the chances are it has more than one core, most likely two or four. Multi-core programming can be difficult (I would certainly recommend putting in a little effort to make sure you’re using a fast-enough algorithm on one core first), but it has been made easier by GCC’s support for the OpenMP (Open Multi-Processing) standard since version 4.2 (2007). If you use a recent version of GCC, you might have OpenMP without knowing it. Try:
gcc my-program.c -fopenmp
and see whether it complains about an unknown option. (I do this in a script to decide which compilation options to use on a deployment machine, as sketched below.)
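For instance, a shell-script test along these lines can make that decision automatically (a sketch of mine; the CFLAGS handling is only illustrative):

# Try a trivial compile with -fopenmp; only add the flag if it is accepted.
if echo 'int main(void){return 0;}' | gcc -fopenmp -x c - -o /dev/null 2>/dev/null
then
    CFLAGS="$CFLAGS -fopenmp"
fi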
Adding OpenMP directives to a program can be surprisingly simple. Consider a for loop:
for (int i=0; i < nItems; i++) process_item(i);
If process_item looks only at item i and nothing else (no memory conflicts) then all you need to add before the for is:
#pragma omp parallel for
and, by default, the OpenMP library will find out at runtime how many cores are available on the CPU, split into that number of threads, divide nItems by the number of threads, let each thread process its ‘chunk’ of the items, and wait for them to finish. It will also add code to let the user override the number of threads at runtime by setting an environment variable (OMP_NUM_THREADS). This is all rather powerful just for one #pragma. Of course, if that code is compiled without OpenMP support, the pragma will be ignored and the code will run sequentially. But some compilers warn about unknown pragmas, so to suppress this warning you could wrap the pragma in an ifdef:
#ifdef _OPENMP
#pragma omp parallel for
#endif
which you can even extend to let you use macros to control exactly which parts of your program are parallelised:
#define Parallelise_The_XYZ_Loop 1
...
#if defined(_OPENMP) && Parallelise_The_XYZ_Loop
#pragma omp parallel for
#endif
Since the extra work of creating and managing the threads has an overhead, you should only use parallel for if you’re sure the benefits will be worth it. For very short loops, you might actually slow the program down. Always measure to check you are actually getting a speed increase.
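Putting the pieces together, a complete toy program might look something like the following sketch of mine (process_item here just does some arbitrary arithmetic so there is something worth timing; compile with and without -fopenmp and compare):

#include <stdio.h>

#define N_ITEMS 1000000
static double results[N_ITEMS];

static void process_item(int i) {
  double x = i;
  for (int j = 0; j < 1000; j++) x = x * 1.0000001 + 1.0;  /* busy work */
  results[i] = x;  /* each iteration writes a different element */
}

int main(void) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = 0; i < N_ITEMS; i++) process_item(i);
  printf("%f\n", results[N_ITEMS - 1]);
  return 0;
}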
By default, the loop counter and any variable you declare inside the loop will be private to each thread, but other variables will be shared, so if you want to change them you had better write a critical section to ensure only one thread at a time can get in:
#pragma omp critical
update_a_shared_variable();
critical is not needed if all you’re doing is writing to an array where the element number you write to is the item number you’re processing, as the other threads won’t be writing to the same element. But it is often needed in other shared-variable circumstances; you are going to have to think.
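As a concrete sketch of the difference (the names compute, results and total are mine, not from the article): a per-iteration array write needs no protection, but accumulating into a shared total does:

double total = 0.0;
#pragma omp parallel for
for (int i = 0; i < nItems; i++) {
  double r = compute(i);   /* hypothetical per-item work */
  results[i] = r;          /* distinct element per iteration: safe without critical */
  #pragma omp critical
  total += r;              /* shared variable: one thread at a time */
}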
One pattern that is often seen in OpenMP programming is to check if a shared variable needs updating, then enter a critical section and repeat the check:
if (shared_variable_needs_updating()) {
  #pragma omp critical
  {
    if (shared_variable_needs_updating())
      update_a_shared_variable();
  }
}
The second check is there in case another thread beats us to it with updating the shared variable. For example, this might be used if the shared variable is ‘best solution found so far’: just because we found a better solution outside the critical section doesn’t mean nobody else posted an even better one just before we entered it. We could save the extra comparison by entering the critical section unconditionally and THEN making the comparison, but that would be inefficient because it would hold up other threads unnecessarily.
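A sketch of that ‘best solution found so far’ case (the variable names are mine; note the outer braces so the pragma is not the direct body of the if):

if (score > best_score) {            /* cheap check outside the critical section */
  #pragma omp critical
  {
    if (score > best_score) {        /* re-check: another thread may have updated it */
      best_score = score;
      best_index = i;
    }
  }
}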
One trick that might be useful during debugging is to add default(none) to the end of the parallel for pragma. That tells OpenMP to refrain from its default behaviour of making variables within the loop private to each thread and other variables shared, and forces you to declare the shared/private status of each variable explicitly. If you haven’t done so, you get some handy error messages pointing out each variable referred to from the parallel section. This can be very useful indeed when retro-fitting OpenMP to existing code and the loop is too large for you to be sure you’ve noticed everything.
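A sketch of what that looks like (the clause list here is only illustrative):

/* With default(none), every variable used inside the loop must be listed
   explicitly; forget one and the compiler names it in an error message. */
#pragma omp parallel for default(none) shared(nItems)
for (int i = 0; i < nItems; i++) process_item(i);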
parallel for can take only normal for loops that count items as they go; trying to be more ‘clever’ with the for statement will not work with OpenMP. You may use the continue statement in a parallel for, but not break (unless it’s inside another loop etc. that’s nested inside the parallel one), and not return. This is for obvious reasons: there would be no way for the OpenMP libraries to make sure that break or return stops other iterations of the loop if some other thread is already running away with them.
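For instance, skipping unwanted items with continue compiles fine, whereas a break out of the parallel loop itself is rejected (item_is_wanted is a hypothetical helper of mine):

#pragma omp parallel for
for (int i = 0; i < nItems; i++) {
  if (!item_is_wanted(i)) continue;  /* allowed: just ends this iteration */
  process_item(i);
  /* break; would not compile here: other threads may already be
     running later iterations, so OpenMP cannot honour it */
}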
By default, parallel for assumes that each loop iteration will take roughly the same amount of time, and so it splits the required iterations evenly among the threads. You could instead add schedule(dynamic) to the pragma to take the alternative approach of sending just one iteration at a time to each thread (so, for example, if there are four cores, the first four iterations will be started on immediately, and as soon as one of the cores finishes its iteration it will be given the fifth iteration to do), but that tends to work well only if each iteration is quite long; if iterations are short then the overheads of managing the dynamic schedule slow things down.
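The clause goes on the same pragma, e.g. (a minimal sketch, reusing the earlier loop):

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < nItems; i++) process_item(i);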
You can however do your own scheduling: instead of using parallel for, just say:
#pragma omp parallel
some_function_or_block()
which will run N identical copies of some_function_or_block(); these copies will then need to work out amongst themselves which one does which section of work. To help with this, omp.h defines the functions omp_get_thread_num() and omp_get_num_threads(): the thread number will be between 0 and threads-1 inclusive. Since I like to make sure my programs can still compile even if OpenMP is not present, I do this:
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_num_threads() 1
#define omp_get_thread_num() 0
#endif
You have to be careful, when dividing your work units by the number of threads, to make sure no work is left out due to the division result being rounded down. If your units are fairly even then it’s probably best just to use OMP’s own parallel for, which does all the work for you.
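If you do split the work yourself, one way to avoid losing the remainder of the division is to hand the first few threads one extra item each (a sketch of mine; the variable names are placeholders):

#pragma omp parallel
{
  int nThreads = omp_get_num_threads();
  int me       = omp_get_thread_num();
  int base  = nItems / nThreads;     /* items every thread gets */
  int extra = nItems % nThreads;     /* left over after rounding down */
  int start = me * base + (me < extra ? me : extra);
  int end   = start + base + (me < extra ? 1 : 0);
  for (int i = start; i < end; i++) process_item(i);
}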
Signals are usually sent to an arbitrary thread, so the best thing to do in a signal handler is probably just to set a flag which all threads regularly check.
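A sketch of that idea (the flag and handler names are mine; volatile sig_atomic_t is the traditional choice for a flag set from a signal handler):

#include <signal.h>

void process_item(int i);   /* the per-item work from earlier */

static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int sig) { (void)sig; stop_requested = 1; }

void process_all(int nItems) {
  signal(SIGINT, on_signal);
  #pragma omp parallel for
  for (int i = 0; i < nItems; i++) {
    if (stop_requested) continue;   /* can't break out of a parallel for */
    process_item(i);
  }
}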
OpenMP works in C++ as well, but if you are using a lot of objects then you might need to be even more careful of where you put your critical sections.
Besides GCC, other compilers that support OpenMP include Visual C++ (from its 2005 version onward) and the Intel compiler, but I haven’t tried these. Clang 3.7 supports it, but some older Macs (e.g. OS X 10.7) have both Clang and GCC installed where the GCC supports OpenMP but the Clang does not. OpenMP implementations are generally limited to multicore CPUs with shared memory, as in a modern multicore desktop; more advanced approaches are needed if you’re running on a supercomputer or cluster that does not share its memory between all the cores, or if you want to run your processing on graphics cards (GPUs).
On slightly older Apple computers, there’s some strange bug that means you can’t call memcpy() from inside a function that uses OpenMP: you have to wrap that memcpy() into another function of your own and call that (a sketch of such a wrapper follows below). But the function you wrap it in can be ‘inline’ so you don’t actually lose anything. If you get other problems on Apple, try:
#define _FORTIFY_SOURCE 0
as a workaround.
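A minimal sketch of the wrapper idea (the name my_memcpy is mine, not from the article):

#include <string.h>

/* Wrapping memcpy() in an inline function of our own sidesteps the
   problem on affected Apple systems, at no runtime cost. */
static inline void *my_memcpy(void *dest, const void *src, size_t n) {
  return memcpy(dest, src, n);
}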
Finally, if you are cross-compiling for Windows using MinGW, you might want to use the -static flag to make sure the .exe file doesn’t depend on OpenMP and threading DLLs. Windows .exe files are easier to distribute if they don’t need DLLs.
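For example, something along these lines (the exact cross-compiler name depends on your MinGW installation, so treat this as an assumption):

i686-w64-mingw32-gcc my-program.c -fopenmp -static -o my-program.exe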