Selective Parallelization using C++ Template Metaprogramming

Navin Mohan
4 min readApr 20, 2019


This is the continuation of my previous post and I recommend reading it first.

Photo by Tine Ivanič on Unsplash

In the last post, we stopped after parallelizing the for-loop which was the performance bottleneck. We also managed to get a decent boost in performance thanks to OpenMP.

With all that good stuff, we might have overlooked a few potential issues. Firstly, the fact that we’re using a vector for our benchmarking. If we swap it out for say a std::list, the performance starts to drop.

Let’s take a closer look at the parallel implementation.

// map function 
omp parallel for schedule(static)
for(size_t i = 0; i < len; i++){
auto new_ele = func(*(it+i));
*(new_it+i) = new_ele;
// filter function
#pragma omp parallel for schedule(static)
for(size_t i = 0; i < len; i++){
#pragma omp critical
{new_arr.insert(new_arr.end(), *(it+i));}
// reduce function
#pragma omp parallel for schedule(static) reduction(+:initial)
for(size_t i = 0; i < len; i++){
initial += func(*(it+i));

In all the three implementations, we’re relying on random-access of the elements inside the container.

It is clearly stated in the documentation that a std::list is not meant for random-access. That’s the reason why we aren’t getting the expected performance from our parallel loops.

We have two options here. Either make our code more generic or let the compiler decide on whether or not to use the parallel implementation depending on the container.

For now, let’s focus on the latter.

Selectively Picking a Parallel Implementation

Before we jump in, let’s define the behavior that we want.

  • We want the compiler to pick our parallel implementation whenever a supported container is used and it should fall back to normal implementation with any other container.
  • There should not be any run-time overhead.

The required behavior seems pretty familiar, isn’t it?

It’s just conditional branching. We could simply use the typeid operator from the standard typeinfo header to construct an if-else block.

#include <iostream>
#include <typeinfo>
template <typename T>
void fun(T a){
if(typeid(a) == typeid(int)) std::cout << "int\n";
else std::cout << typeid(a).name() << std::endl;
int main(){
int k = 10;
float p = 1.2;
return 0;
/* Output:
** int
** f

The above example demonstrates how we can use typeid operator to perform the type checking.

We could come up with something like this.

if(typeid(container) == typeid(supported_container)){
// do parallel
// do normal

Any optimizing compiler should probably eliminate the condition as one of them will be dead code.

But can we do better?

Yes, welcome to the world of template metaprogramming!

Checking Type Equality at Compile Time

Going by the rules of template metaprogramming, we should let the compiler generate the required source code during compile time. In our case, we want the compiler to make that decision based on the container used. So we need some kind of a type checking mechanism.

It’ll be convenient if we could just give in two types and get a boolean as the result. Let’s see how we do just that in plain C++.

In simple terms, we need to run the type checking expression but this time during compilation. I know that you’re thinking about constexpr.

template <typename T, typename U>
constexpr bool type_eq(){
return typeid(T) == typeid(U);
type_eq<int,int>(); // evaluates to true
type_eq<int,float>(); // evaluates to false

And that should work fine.

We could also do the same without a constexpr, using partial template specialization.

template <typename T, typename U>
struct is_same{
static const bool result = false;
template <typename T>
struct is_same<T,T>{
static const bool result = true;
is_same<int,int>::result // true
is_same<int,float>::result // false

With that our type checking part is sorted.

Partial Template Specialization

As we have seen earlier, partial template specialization lets us customize class templates for a new set of template arguments. We will be using this technique together with the type checking to get the defined behavior.

For it to work we will be needing both the parallel and serial implementations to coexist as two partial specializations.

Since C++ does not support partial specialization for function templates, we can wrap them in a struct.

Let’s take a look at the implementation of the map function.

template <typename type, typename container, bool parallel>
struct map_impl{}; // we won't be using this one
// parallel version where parallel = true
template <typename type, typename container>
struct map_impl<type,container,true>{
static container map(container& arr, type func(type)){
// the parallel code
// serial version where parallel = false
template <typename type, typename container>
struct map_impl<type,container,false>{
static container map(container& arr, type func(type)){
// the serial code

We now have both the implementations available as partial specializations.

Before we update the main API we need one more utility function.

// checks if the container is one of the supported types
template <typename T1,typename T2>
constexpr bool is_parallel_supported_type(){
return type_eq<std::vector<T1>,T2>(); // other types can be added here if need

It’s time to update the main API of the library.

// namespace transform
template <typename type, typename container>
container map(container& arr, type func(type)){
return map_impl::map<type,container,is_parallel_supported_type<type,container>()>::map(arr,func);

As we can see the main API just became a wrapper function. Don’t worry about the additional function call, the optimizer should inline this one.

Wrapping up

Now we have both parallel and serial code that the compiler is able to select appropriately depending on our choice of container. All of this without any additional run-time overhead.

The compiler used: GCC 7.3

The code is available here.



Navin Mohan