Saturday, June 3, 2017

constexpr quicksort in C++17

So I've written about compile time quicksort twice before (2011 and 2015), but now that C++17 support is becoming available, I thought I'd try it again.

I'll be taking advantage of std::integral_constant<type, value> a lot. It has the advantage that it encodes the value directly in the type, while still allowing arithmetic as if it were a constant of the underlying type. Unfortunately, it's rather a lot to write, so to save typing, a convenient variable template is introduced:

template <auto N>
constexpr std::integral_constant<decltype(N), N> c{};

This brings a convenient shorthand notation: c<3U> is an lvalue of type std::integral_constant<unsigned int, 3U>. The template <auto> parameter is a very handy new C++17 feature for non-type template parameters, where the type is deduced. You can see example usage in the Compiler Explorer (thanks @mattgodbolt.)

Before we can go on to sorting, we need a way to express a range to sort, and to operate on those ranges. I've used a simple type_list template:

template <typename ... T>
struct type_list
{
    constexpr type_list(T...) {}
};

The constructor is there to take advantage of another new C++17 feature: class template argument deduction. It's possible to write type_list{c<3>, c<8>} to construct an instance of type_list<std::integral_constant<int, 3>, std::integral_constant<int, 8>>. Note that the constructor doesn't store the values; it's the types alone that are interesting. The values are just used to guide the compiler to deduce the types correctly. The same technique can be used in more conventional programming: for example, std::pair{3, 'c'} constructs a std::pair<int, char> holding the values 3 and 'c'.

Now we need a few functions to operate on the type lists:

template <typename T, typename ... Ts>
constexpr auto head(type_list<T, Ts...>)
{
    return T{};
}

template <typename T, typename ... Ts>
constexpr auto tail(type_list<T, Ts...>)
{
    return type_list{Ts{}...};
}

template <typename ... Ts, typename ... Us>
constexpr auto operator|(type_list<Ts...>, type_list<Us...>)
{
    return type_list{Ts{}..., Us{}...};
}

Both tail() and the concatenating operator|() use class template argument deduction to construct the returned type_list. Here's a compiler explorer to play around a bit with them.

Now comes the matter of partitioning a type_list into two lists based on comparison with a pivot element. The easiest way to do this is a classic recursion:

template <typename Compare, typename P, typename ... Ts>
constexpr auto partition(Compare compare, P pivot, type_list<Ts...> tl)
{
    if constexpr (sizeof...(Ts) == 0)
    {
        return std::pair(type_list{}, type_list{});
    }
    else
    {
        constexpr auto h = head(tl);
        constexpr auto r = partition(compare, pivot, tail(tl));
        if constexpr (compare(h, pivot))
        {
            return std::pair(type_list{h} | r.first, r.second);
        }
        else
        {
            return std::pair(r.first, type_list{h} | r.second);
        }
    }
}

if constexpr is a new C++17 construct (often referred to as constexpr-if) that is a major blessing when doing template programming. Unlike an ordinary if statement, if constexpr only generates code for the branch selected by the constant expression condition.

Above, the else branch doesn't compile for an empty type_list<> tl, so an ordinary if statement would give a compilation error. In C++14 and earlier, it would be necessary to write two separate partition functions, one that matches the empty list, and one for the general case.

So, given a compare function, a pivot value, and a type_list<Ts...>, the function partition returns a pair of type lists: the first contains the Ts for which compare(t, pivot) is true, and the second contains the rest. The compare function can be an ordinary lambda; in C++17, lambdas are constexpr (when possible.)

Check out this Compiler Explorer to play with it.

Now all bits necessary for doing quick sort are in place, and it's an almost embarrassingly simple textbook-style recursive solution:


template <typename Compare, typename ... T>
constexpr auto sort(type_list<T...> tl, Compare compare)
{
    if constexpr (sizeof...(T) == 0)
    {
        return type_list{};
    }
    else
    {
        constexpr auto pivot = head(tl);
        constexpr auto r = partition(compare, pivot, tail(tl));
        return sort(r.first, compare) | type_list{pivot} | sort(r.second, compare);
    }
}

Here's a compiler explorer that converts the sorted type_list<> into a std::array<>, just to visualise the data generated. You may notice that the optimisation level chosen is -O0, and yet no code is generated to produce the sorted array.

As before, the usefulness of this is very limited, but it is kind of cool, isn't it?

Sunday, January 8, 2017

Higher order functions as an enabler for lazy evaluation

Yesterday's post, Generate lambdas for clarity and performance, showed how to make use of higher order functions to improve clarity while giving the optimiser a chance to improve performance, but the example retained the original inflexible design.

Here's a short recap. Given a struct Employee, there is a function that filters out developers based on their salary. The original code looked like this:

struct Employee
{
  using Number = unsigned;
  Number id;
  unsigned salary;
  std::string title;
  std::string name;
};

using staff = std::vector<Employee>;

auto devs_who_make_at_least(const staff& s, unsigned floor)
{
  std::vector<staff::value_type const*> rv;
  for (auto& e : s)
  {
    if (e.salary >= floor && e.title == "Developer")
    {
      rv.push_back(&e);
    }
  }
  return rv;
}

The code is by no means difficult to understand, but it lacks flexibility, and loops like this, spread over many places in the code base, become a maintenance liability.

Enter a filter function template that, like the devs_who_make_at_least() function, returns a vector of pointers to the elements given, but uses a provided predicate instead of a hard-coded condition:

template <typename Container, typename Pred>
auto elements_matching(Container const& c, Pred&& p)
{
  std::vector<typename Container::value_type const*> rv;
  for (auto& e : c)
  {
    if (p(e))
    {
      rv.push_back(&e);
    }
  }
  return rv;
}

This is more generic. It works on any STL-like container and accepts any predicate that works on the value type. A few sample higher order functions were introduced to select elements just as in the original example:

auto earns_at_least(unsigned floor)
{
  return [floor](const Employee& e) { return e.salary >= floor; };
}

auto has_title(const std::string& s)
{
  return [s](const Employee& e) { return e.title == s; };
}

template <typename T, typename U>
auto pred_and(T&& t, U&& u)
{
  return [t,u](auto&& x) { return t(x) && u(x); };
}

Here pred_and() is used to create a new predicate from two others, returning true only if both contained predicates return true. Creating a vector of pointers to all developers that earn 1M or more simply becomes:

auto megadevs = elements_matching(employees,
                                  pred_and(earns_at_least(1000000), 
                                           has_title("Developer")));

This was also shown to give performance gains over the original code.

So far so good, but what if the desired result is not a vector of pointers? What if we just want to calculate the accumulated salary of those selected? It doesn't make sense to pay the cost of populating a vector for that. What if not all developers are needed, but you just want to browse through the selected ones until some other criterion is met? Then it is unnecessary to populate the vector with all of them.

Enter lazy evaluation. The idea is to create yet another level of indirection, and defer calling the predicate until the actual iteration is done.

What is wanted is something like this:
auto megadevs = filter(employees,
                       pred_and(earns_at_least(1000000), 
                                has_title("Developer")));

// No iteration over employees yet
// No vector created.

uint64_t sum = 0;
for (auto i = megadevs.begin();// calls pred and increments until true or end
     i != megadevs.end();      // simple comparison
     ++i)                      // calls pred on incremented until true or end
{
  sum += i->salary;            // refers to element in employees
}

From this a design emerges. The type of megadevs, returned by the call to filter(), must have member functions begin() and end(). The iterators returned must have access to the iterators into employees, and must have access to the predicate.

Here's an outline for the type returned by filter():

template <typename Container, typename Pred>
class filter_t
{
  using CI = typename Container::const_iterator;
 public:
  class iterator;

  template <typename P>
  filter_t(Container& c_, P&& p_)
   : c(c_), p(std::forward<P>(p_))  { }

  iterator begin() {
    auto i = c.begin();
    auto e = c.end();
    while (i != e && !p(*i))
      ++i;
    return { i, e, p };
  }

  iterator end() { return { c.end(), c.end(), p }; }
private:
  const Container& c;
  Pred             p;
};

Each instance must refer to the original container, and must access the predicate, so these types are template parameters. The alias CI is just there for typographical purposes, making blog code shorter. The class iterator is deferred for a while. I'll get back to it soon. The interesting part is the member function begin(). We want it to create an iterator that refers to the first element in the container that fulfils the predicate, or end if there is no such element. It does this by checking the predicate and incrementing until true. The returned filter iterator must have the found iterator, the end, and the predicate, otherwise it cannot implement its increment operators.
The natural implementation would be to use std::find_if() instead of incrementing and checking manually, but this was unexpectedly slow.
Now we can look at the iterator class:

template <typename Container, typename Pred>
class filter_t<Container, Pred>::iterator
{
public:
  using value_type = typename std::iterator_traits<CI>::value_type;
  using reference = typename std::iterator_traits<CI>::reference;
  using pointer = typename std::iterator_traits<CI>::pointer;
  using difference_type = typename std::iterator_traits<CI>::difference_type;
  using iterator_category = std::forward_iterator_tag;

  iterator(CI b, CI e, Pred& p_) : iter(b), iend(e), p(p_) { }

  reference operator*() const { return *iter; }

  iterator& operator++()
  {
    ++iter;
    while (iter != iend && !p(*iter))
      ++iter;
    return *this;
  }

  iterator operator++(int)
  {
    iterator rv(*this);
    ++*this;
    return rv;
  }

  bool operator==(const iterator& rh) const
  {
    return iter == rh.iter;
  }

  bool operator!=(const iterator& rh) const
  {
    return iter != rh.iter;
  }
private:
  CI       iter;
  CI const iend;
  Pred&    p;
};

It's quite a code block, but most of it is obvious. The type aliases at the top are there to make the iterator cooperate with algorithms in the standard library. See std::iterator_traits<> for details.

The other interesting part is operator++(). Just as when an iterator is created from the begin() member function of filter_t<>, it advances through the referenced container until the predicate matches or the end is reached. Here, too, a hand written loop is used, after performance measurements showed an unexpected need.

With this in place, the filter() function template becomes nearly trivial:

template <typename Container, typename Predicate>
inline filter_t<Container, std::remove_reference_t<Predicate>>
filter(Container& c, Predicate&& p)
{
  return { c, std::forward<Predicate>(p) };
}

Here's a good time to warn that this is proof of concept code, intended to show the idea. For real world use you would, for example, need variants with mutable access, and variants that take ownership of the container if it is an rvalue; otherwise you cannot safely chain operations. These variants change nothing in performance, but they do make the implementation rather more complex.

Now is a good time to see if this did any good. A slight modification is made to yesterday's sample program. The program populates a vector of 5000 employees, 1/4 of them with title = "Developer", and an unrealistically uniform random salary distribution in the range 50000 - 250000. With those, the program did 500000 loops, each filtering out the developers that make 150000 or more and accumulating the salary of the first 500.

auto devs = filter(employees,
                   pred_and(earns_at_least(150000), 
                            has_title("Developer")));

int count = 500;
uint64_t sum = 0;
for (auto & d : devs)
{
  sum += d.salary;
  if (--count == 0) break;
}

Using g++ 6.2 and -O3, perf yields this result:

 Performance counter stats for './a.out':

   10 814 982 186      branch-instructions                                         
      752 594 101      branch-misses             #    6,96% of all branches         
   30 224 548 004      cpu-cycles                                                  
   29 445 937 781      instructions              #    0,97  insn per cycle         

     10,177892229 seconds time elapsed

Slightly rewriting the test program to use the vector version from yesterday:

auto devs = elements_matching(employees,
                              pred_and(earns_at_least(150000), 
                                       has_title("Developer")));

int count = 500;
uint64_t sum = 0;
for (auto & d : devs)
{
  sum += d->salary;
  if (--count == 0) break;
}

gives the result:

 Performance counter stats for './a.out':

   14 881 795 600      branch-instructions                                         
    1 061 800 168      branch-misses             #    7,13% of all branches         
   45 275 693 140      cpu-cycles                                                  
   47 380 071 935      instructions              #    1,05  insn per cycle         

     15,215398962 seconds time elapsed

Here the advantage of lazy evaluation becomes obvious. We're only interested in the first 500 matches, so not populating a vector with the rest is a gain. However, in all fairness, it should be said that there is a cost to this generality. The filter iterator shows much worse branch prediction results, and for large data sets this can be noticeable.

There it is. Higher order functions are by no means a necessity for lazy evaluation, but when you have them, it's not a huge job to implement, and using it becomes natural. Lazy evaluation can give large performance gains, but there is an added complexity which may backfire.

Expanding further on this sooner or later leads to a range library, but that is another story.

Saturday, January 7, 2017

Generate lambdas for clarity and performance

Higher order functions, functions that operate on other functions or return functions, are familiar to those who have some experience with functional programming, but they often seem magical to those who do not. Some of those with experience of using higher order functions have a gut feeling that they are expensive to use, and prefer to avoid them.

[ edited 2017-01-08: added performance numbers for gcc at the bottom of the post ]
[ edited 2017-01-14: Perfect forwarding of functions in higher order function ]

Take this somewhat silly traditional C++ code:

struct Employee
{
  using Number = unsigned;
  Number id;
  unsigned salary;
  std::string title;
  std::string name;
};

using staff = std::vector<Employee>;

auto devs_who_make_at_least(const staff& s, unsigned floor)
{
  std::vector<staff::value_type const*> rv;
  for (auto& e : s)
  {
    if (e.salary >= floor && e.title == "Developer")
    {
      rv.push_back(&e);
    }
  }
  return rv;
}

It's not very complex, and most C++ developers understand what it does in a few seconds. If this is the only place in the entire code base where elements are selected from a vector of Employees, then all is well. But what if there are several places? The code becomes cluttered with almost identical copies of the same loop. This screams for an algorithm. None in the standard library is a perfect fit, though. Let's hand roll a simple one:

template <typename Container, typename Pred>
auto elements_matching(Container const& c, Pred&& p)
{
  std::vector<typename Container::value_type const*> rv;
  for (auto& e : c)
  {
    if (p(e))
    {
      rv.push_back(&e);
    }
  }
  return rv;
}


It's not ideal, but it works with any container type and returns a vector of pointers to the elements in the container that match the predicate.

With this, the devs_who_make_at_least() function becomes simple:


auto devs_who_make_at_least(const staff& s, unsigned floor)
{
  auto pred = [floor](const Employee& e) {
    return e.salary >= floor && e.title == "Developer";
  };
  return elements_matching(s,pred);
}

This is cool, but what if there's more than one place in the code where you want to select employees based on salary or title?

C++14 introduced return type deduction for functions, which is great for writing functions that generate lambdas.

Take for example:

auto earns_at_least(unsigned floor)
{
  return [floor](const Employee& e) { return e.salary >= floor; };
}

auto has_title(const std::string& s)
{
  return [s](const Employee& e) { return e.title == s; };
}

Each of the above functions returns a lambda that has captured the function parameter. These can be quite expressively used with the elements_matching algorithm introduced earlier:

staff employees;

auto devs = elements_matching(employees, has_title("Developer"));
auto megaearners = elements_matching(employees, earns_at_least(1000000));

This is powerful in its own right. The code is easy to read. Most C++ developers, even very inexperienced ones, grasp the intent directly, even if the mechanism may be unclear.

But what if we want more than one criterion, like selecting developers with a certain salary? Let's introduce an "and" higher order function, that returns true if two contained predicates both return true.

template <typename T, typename U>
auto pred_and(T&& t, U&& u)
{
  return [t = std::forward<T>(t),u = std::forward<U>(u)](auto&& x)
         { return t(x) && u(x); };
}

This function template, pred_and(), creates a new predicate from two predicates. It is true if and only if both t(x) and u(x) evaluate to true. It naturally short-circuits the logic, so that if t(x) is false, u(x) is never evaluated.

Now finding the super well paid developers who make more than 1M becomes so easy it doesn't even need a separate function anymore.

auto megadevs = elements_matching(employees,
                                  pred_and(earns_at_least(1000000), 
                                           has_title("Developer")));

So what about performance, then? Surely this magic has a cost in confusing the optimiser?

I created a small test program that populates a vector of 5000 employees, 1/4 of them with title = "Developer", and an unrealistically uniform random salary distribution in the range 50000 - 250000. With those, the program did 500000 loops, each filtering out the developers that make 150000 or more. This was built with clang++ 3.9.

The output from the Linux tool perf follows,

first from the hand made loop at the top of this post:

 Performance counter stats for './a.out':

   31 078 321 937      branch-instructions                                  
    2 055 192 607      branch-misses             #    6,61% of all branches         
   97 906 147 917      cpu-cycles                                           
  111 202 137 490      instructions              #    1,14  insn per cycle  

     32,974801101 seconds time elapsed

Then from using the algorithm and higher order function:

 Performance counter stats for './a.out':

   14 258 216 181      branch-instructions                                  
    1 513 180 390      branch-misses             #   10,61% of all branches         
   54 898 012 739      cpu-cycles                                           
   42 665 314 382      instructions              #    0,78  insn per cycle  

     18,747075752 seconds time elapsed

I must admit I was puzzled by this huge performance advantage of using higher order functions, especially when there has been no attempt to optimise them at all. It turns out there's a small mistake in the hand written loop: it compares the member .title, which is a std::string, with the string literal "Developer". This means it must make a character by character comparison. The higher order function has_title() captures a std::string, and equality comparison of std::strings begins with checking their lengths; if the lengths differ, there's no reason to look at the characters. That is the only explanation I've found for the huge gain.

Changing the hand written loop to compare title with a std::string gives this result:

 Performance counter stats for './a.out':

   13 971 299 475      branch-instructions                                  
    1 449 744 816      branch-misses             #   10,38% of all branches         
   51 964 400 535      cpu-cycles                                           
   39 992 503 929      instructions              #    0,77  insn per cycle  

     17,448599597 seconds time elapsed

So, it is better performing. Not by much, but it is. However, the hand written loop gets copied all over the code base; the algorithm and higher order functions do not. The performance bug could have been made in the has_title() function as well, but then it would be one place to fix, not many. Likewise, the elements_matching() algorithm could be optimised, for example with partial loop unrolling to do several comparisons per iteration. That too would be an effort spent once, in one place, and not all over the code.

That was the results with clang++-3.9. Let's see how g++ 6.2 fares in comparison. First the higher order function version:

 Performance counter stats for './a.out':

   14 006 122 557      branch-instructions                                      
    1 027 464 337      branch-misses             #    7,34% of all branches         
   43 066 722 628      cpu-cycles                                               
   44 394 114 000      instructions              #    1,03  insn per cycle      

     14,505986613 seconds time elapsed

And then the hand written loop, with the fix to compare with std::string:

 Performance counter stats for './a.out':

   14 039 057 224      branch-instructions                                       
    1 852 376 974      branch-misses             #   13,19% of all branches         
   58 797 998 124      cpu-cycles                                                
   40 858 411 471      instructions              #    0,69  insn per cycle       

     19,743206041 seconds time elapsed

Here it's clear that gcc does a slightly worse job than clang at optimising the hand written loop, but a substantially better job at optimising the higher order function version.

As can be seen, the exact result does vary between compilers, but the chances of losing any performance to speak of when using higher order functions are small, and there is the potential for very noticeable performance gains.

So, use algorithms and use higher order functions, for code clarity and for performance.

Thursday, December 29, 2016

Serializing structs with C++17 structured bindings

Serializing data in C++ is a surprisingly difficult problem. There are many libraries for it, with varying degrees of finesse, power and ease of use. C++17 offers an unexpected simplification with structured bindings. The simplification does not lead to a universal solution, but its applicability is wide nonetheless.

First, what is this serialization problem? It may be the need to transfer data over a network or store it on disk, or a test framework that wants to display unexpected values in a human readable form, and to do all this for structured types with unknown contents. This post focuses on visualizing data for a user, but the underlying problem is the same for networking or storage, so there is no fundamental reason the technique shouldn't be applicable in other domains.

A common technique in many unit testing frameworks is to use compile time reflection to find out if a type offers an output stream operator, call it when available, and print a hex dump of the memory that the object occupies otherwise. This works quite well for simple data, but if a member of a struct is, say, a std::string, you don't get much useful information.

Antony Polukhin, in his talk "C++14 Reflections Without Macros, Markup Nor External Tooling: Metaprogramming Tricks For POD Types" at CppCon 2016, showed impressive tricks for creating tuples matching a struct to grab the data members. At the end of the presentation he hinted at alternative C++17 solutions, and this post follows up on that.

Stating the problem

What I want to do, is to create an operator<<(std::ostream&, T&) for all types T that do not already have such an operator, and for which I can enumerate the data members and call the stream insertion operator on them, one by one. I want this to be applicable recursively. For readability, I want the members to be printed in order as a comma separated list, surrounded by {}. Unlike Antony Polukhin, I do not intend to limit this to POD types, but the data members must be accessible. For example, given these data types:

struct Inner
{
  int n;
  std::string s;
};
struct Data
{
  int n;
  Inner inner;
  char c;
};

I want std::cout << Data{3, { -2, "foo" }, '.'} to print the string "{ 3, { -2, foo }, . }".

To avoid clashes with other templated stream insertion operators, this one is placed in namespace print_struct.

C++17 structured bindings

Provided the number of data members in a struct is known by the programmer, and they are accessible, they are easy to get hold of using structured bindings. For example:

void print(std::ostream& os, const Inner& inner)
{
  auto& [m1, m2] = inner;
  // m1 is inner.n
  // m2 is inner.s

  os << "{ " << m1 << ", " << m2 << " }";
}
void print(std::ostream& os, const Data& data)
{
  auto& [m1, m2, m3] = data;
  // m1 is data.n
  // m2 is data.inner
  // m3 is data.c

  os << "{ " << m1 << ", ";
  print(os, m2);
  os << ", " << m3 << " }";
}

The declaration auto& [m1, m2] = inner declares the identifiers m1 and m2 as references (auto&) to the members of inner, in declaration order.

You can read more in Steve Lorrimer's blog post about structured bindings, or watch Jason Turner's introduction in episode 24 of C++ Weekly on YouTube.

For this to be useful in generic function templates, the number of members each type has must be figured out, and for each member we need to know whether it can be printed directly or needs to be unwrapped recursively.

Let's tackle one problem at a time.

Find the arity of a struct

This is probably the trickiest part of this post. I don't think there is a fully generic solution to finding the arity, i.e. the number of data members, of a struct, but we can get far and cover most situations.

My take builds on the observation that we can limit ourselves to types that do not have constructors, but rely on brace initialization of the data members. For example, the type Inner, above, can be constructed as Inner{value1, value2}, provided that value1 is convertible to int and value2 is convertible to std::string, but it cannot be constructed as Inner{value1, value2, value3}, regardless of the types of the values, nor can it be constructed as Inner(value1, value2). It can, however, be constructed as Inner{value1} (any sane compiler gives a warning, but that is just visual information to the programmer, not anything we can base program logic on.)

So, a struct without constructors can be constructed using Type{values...}, but not using Type(values...). A constructor can take a std::initializer_list<T>, which can be called using braces, but such a list has no upper limit to its size. If a type can be brace constructed from N values but not from N+1 values, and cannot be constructed using a parenthesized constructor with N values, it is very likely to be a struct (or an array.)

Fortunately, we can test the ability to create objects using SFINAE techniques.

The standard library provides std::is_constructible<T, V...>, which has a member constant value that is true if T(V...) creates a T, and false otherwise. There is no version in the standard library that tests for brace initialization, so here's one way of doing it:

template <typename T, typename ... V>
inline constexpr auto is_brace_constructible_(T*)
  -> decltype(T{std::declval<V>()...},std::true_type{})
{
  return {};
}

template <typename T, typename ... V>
inline constexpr std::false_type is_brace_constructible_(...)
{
  return {};
}

template <typename T, typename ... V>
inline constexpr auto is_brace_constructible()
  -> decltype(is_brace_constructible_<T, V...>(nullptr))
{
  return {};
}

With this we can test static_assert(is_brace_constructible<Inner, int, std::string>()).

Most of the magic above is in the trailing return type of the first overload. The return type is the type (decltype) of the expression T{std::declval<V>()...}, std::true_type{}. std::declval<V>() is a way to get a value of type V for use in a compile time check, without knowing how V can be constructed. So the expression T{std::declval<V>()...} becomes a T, if T can be brace constructed from instances of V.... If that compiles, the comma operator comes into play, and the type of the full expression becomes std::true_type. If it fails to compile, because T cannot be brace constructed from instances of V..., then SFINAE kicks in, and the ellipsis overload is used instead, which returns std::false_type.

The third function is a convenience that removes the need to pass nullptr at the call site.

We now have the tools to test whether a type can be constructed using a parenthesized constructor or brace initialization, provided we know the types to construct it from.

For a generic type T, we don't know that, however.

Enter the compile time wildcard.

struct wildcard
{
  template <typename T>
  operator T&&() const;

  template <typename T>
  operator T&() const;
};

The conversion operators can obviously not be implemented, but fortunately that isn't needed. They're used in SFINAE contexts only, to test whether a conversion can succeed or not.

With this we can test constructibility of almost any type:

static_assert(is_brace_constructible<std::string, wildcard>());
static_assert(is_brace_constructible<int, wildcard>());
static_assert(std::is_constructible<int, wildcard>());
static_assert(std::is_constructible<std::string, wildcard>());
static_assert(is_brace_constructible<Inner, int, std::string>());

It's not free from problems, though.

Imagine a type:

struct uncopyable
{
  uncopyable(int) {}
  uncopyable(uncopyable&&) = default;
};

This will fail because there's an ambiguity between the two constructors. The situation can be helped by improving the wildcard type a bit:

struct wildcard
{
  template <typename T,
            typename = std::enable_if_t<!std::is_lvalue_reference<T>::value>>
  operator T&&() const;

  template <typename T,
            typename = std::enable_if_t<std::is_copy_constructible<T>::value>>
  operator T&() const;
};

This covers the uncopyable (but movable) type. There's one kind that still doesn't work, and for which I do not have a solution (suggestions, anyone?): an immobile type:

struct immobile
{
  immobile(int) {}
  immobile(const immobile&) = delete;
};

Even though the copy constructor is deleted, it takes part in overload resolution for the constructor call from wildcard. It's a wart, but I can live with this limitation.

Now comes the problem of finding the largest N for which the type T can be brace constructed from N wildcard instances, but for which std::is_constructible<> says false. That number N is the arity of the type T.

To get there, let's add a convenient variable template for wildcard:

template <size_t N = 0>
static constexpr const wildcard _{};

This doesn't look like much, but it allows a valuable change to is_brace_constructible<>, where the number of wildcard instances is a parameter, rather than repeating them over and over.

template <typename T, size_t ... I>
inline constexpr auto
is_brace_constructible_(std::index_sequence<I...>, T *)
  -> decltype(T{_<I>...}, std::true_type{})
{
  return {};
}

template <size_t ... I>
inline constexpr std::false_type
is_brace_constructible_(std::index_sequence<I...>, ...)
{
  return {};
}


template <typename T, size_t N>
constexpr auto
is_brace_constructible()
  -> decltype(is_brace_constructible_(std::make_index_sequence<N>{},
                                      static_cast<T *>(nullptr)))
{
  return {};
}

In the decltype expression, _<I> is an instance of wildcard. The index I itself isn't used for anything; it's just that the standard library makes it very easy to get a sequence of N index values, and thereby N instances of wildcard. For this to work, the order of the parameters to the helper functions had to be changed, but that is unimportant, since it's all hidden under the final convenience function.

A similar convenience predicate is made for std::is_constructible<>, which checks for constructor calls with parentheses instead of braces:


template <typename T, typename U>
struct is_paren_constructible_;

template <typename T, size_t ... I>
struct is_paren_constructible_<T, std::index_sequence<I...>>
       : std::is_constructible<T, decltype(_<I>)...>
{
};

template <typename T, size_t N>
constexpr auto
is_paren_constructible()
  -> is_paren_constructible_<T, std::make_index_sequence<N>>
{
  return {};
}

The forward declaration just introduces the name as a template. The partial specialization on <T, std::index_sequence<I...>> does the real job, by inheriting from std::is_constructible<> with one wildcard type per index.

The convenience function template makes is_paren_constructible<> usable with the same syntax as is_brace_constructible<>.

These are the tools needed to check the arity of a struct, like:

static_assert(!is_paren_constructible<Inner, 2>());
static_assert(is_brace_constructible<Inner, 2>());
static_assert(!is_brace_constructible<Inner, 3>());
static_assert(!is_paren_constructible<Data, 3>());
static_assert(is_brace_constructible<Data, 3>());
static_assert(!is_brace_constructible<Data, 4>());

With these tests in place, we can make a bunch of function templates that either fail substitution or report the arity. In the following, count stands for a literal number:

template <typename T,
          typename = std::enable_if_t<
                         is_brace_constructible<T, count>{} &&
                         !is_brace_constructible<T, count+1>{} &&
                         !is_paren_constructible<T, count>{}
                       >
           >
inline constexpr std::integral_constant<size_t, count> arity()
{
  return {};
}

std::integral_constant<size_t, count> is going to be used a lot, so a convenience alias is introduced to help code readability a bit:

template <size_t N>
using size = std::integral_constant<size_t, N>;

Now just repeat the arity<>() function template for any value of count desired. As much as I dislike the preprocessor, I prefer a macro for this kind of code repetition:

#define MAKE_ARITY_FUNC(count)                                          \
  template <typename T,                                                 \
            typename = std::enable_if_t<                                \
                         is_brace_constructible<T, count>{} &&          \
                         !is_brace_constructible<T, count+1>{} &&       \
                         !is_paren_constructible<T, count>{}            \
                       >                                                \
           >                                                            \
  constexpr size<count> arity()                                         \
  {                                                                     \
    return {};                                                          \
  }

  MAKE_ARITY_FUNC(1)
  MAKE_ARITY_FUNC(2)
  MAKE_ARITY_FUNC(3)
  MAKE_ARITY_FUNC(4)
  MAKE_ARITY_FUNC(5)
  MAKE_ARITY_FUNC(6)
  MAKE_ARITY_FUNC(7)
  MAKE_ARITY_FUNC(8)
  MAKE_ARITY_FUNC(9)
  MAKE_ARITY_FUNC(10)
  MAKE_ARITY_FUNC(11)
  MAKE_ARITY_FUNC(12)
  MAKE_ARITY_FUNC(13)
  MAKE_ARITY_FUNC(14)
  MAKE_ARITY_FUNC(15)

A couple of asserts will show the correctness.

static_assert(arity<Inner>() == 2);
static_assert(arity<Data>() == 3);

The way this works is that arity<Inner>() fails substitution for all versions except the one where count is 2. Since Substitution Failure Is Not An Error, this is OK, and since exactly one version succeeds substitution, a uniquely defined value is returned.

The only thing missing is the case of the empty struct, which the above cannot handle because brace construction from zero elements is indistinguishable from default construction. This is taken care of with a special version of arity<>():


template <typename T,
          typename = std::enable_if_t<
                       std::is_class<T>{} &&
                       std::is_empty<T>{}
                     >
          >
constexpr size<0> arity()
{
  return {};
}

This will succeed substitution for empty structs.

Now, the problem of finding the arity of struct types is taken care of.

Making a print function for structs

If we, for the moment, boldly assume that there is a working stream insertion operator for all types, including for nested structs, making a print function is quite easy, if somewhat tedious.

template <typename T>
std::ostream& print_struct(std::ostream& os, const T& t)
{
  return print_struct(os, t, arity<T>());
}

This function template just dispatches to another function template that accepts the arity as the 3rd parameter. Since this is a compile time constant encoded in the type as size<count>, we can make one function template for each arity. There will be a lot of repetition.

template <typename T>
std::ostream& print_struct(std::ostream& os, const T&, size<0>)
{
  return os << "{ }";
}

template <typename T>
std::ostream& print_struct(std::ostream& os, const T& t, size<1>)
{
  auto& [p1] = t;
  return os << "{ " << p1 << " }";
}

template <typename T>
std::ostream& print_struct(std::ostream& os, const T& t, size<2>)
{
  auto& [p1, p2] = t;
  return os << "{ " << p1 << ", " << p2 << " }";
}

template <typename T>
std::ostream& print_struct(std::ostream& os, const T& t, size<3>)
{
  auto& [p1, p2, p3] = t;
  return os << "{ " << p1 << ", " << p2 << ", " << p3 << " }";
}

template <typename T>
std::ostream& print_struct(std::ostream& os, const T& t, size<4>)
{
  auto& [p1, p2, p3, p4] = t;
  return os << "{ " << p1 << ", " << p2 << ", " << p3 << ", " << p4 << " }";
}

etc., for as many members as we want to support. It's not pretty.

Structured bindings work here, because each function template only accepts calls for types with the correct arity.

However, while it appears impossible to completely get rid of the repetitive tedium, the likelihood of mistakes can be reduced slightly by making a print function for std::tuple<>, which prints its members comma separated and enclosed in {}.


template <typename T, size_t ... I>
std::ostream& print_tuple(std::ostream& os,
                          const T& t,
                          std::index_sequence<I...>)
{
  os << "{ ";
  return ((os << ((I ? os << ", ":os),std::get<I>(t))),...) << " }";
}

template <typename ... T>
std::ostream& print_tuple(std::ostream& os, const std::tuple<T...>& t)
{
  return print_tuple(os, t, std::make_index_sequence<sizeof...(T)>{});
}

The variadic function template taking the stream and the tuple just dispatches the call to the overload that also takes an index sequence. The latter has an index for each member of the tuple to play with.

The abomination in its return statement is a fold expression abusing the comma operator in two ways.

First, ((os << ((I ? os << ", ":os),std::get<I>(t))),...) is the fold expression. It can be modeled as (expr(I) op ...), and in this case op is ",". What this means is that expr(I) is repeated, separated by op (i.e. ",") for each value of I in the variadic template parameter pack. expr(I) is (os << ((I ? os << ", ":os),std::get<I>(t))), which again abuses the comma operator. The right hand side evaluates to std::get<I>(t), which is passed to the stream insertion operator, but before that, the string ", " is inserted into os if, and only if, I != 0.

This absurdity means that the print_struct function templates can be ever so slightly simplified to:

template <typename T>
std::ostream &print(std::ostream &os, const T &, size<0>)
{
  return os << "{ }";
}

template <typename T>
std::ostream &print(std::ostream &os, const T &t, size<1>)
{
  auto [p1] = t;
  return print_tuple(os, std::forward_as_tuple(p1));
}

template <typename T>
std::ostream &print(std::ostream &os, const T &t, size<2>)
{
  auto [p1,p2] = t;
  return print_tuple(os, std::forward_as_tuple(p1, p2));
}

template <typename T>
std::ostream &print(std::ostream &os, const T &t, size<3>)
{
  auto [p1,p2,p3] = t;
  return print_tuple(os, std::forward_as_tuple(p1, p2, p3));
}

template <typename T>
std::ostream &print(std::ostream &os, const T &t, size<4>)
{
  auto[p1,p2,p3,p4] = t;
  return print_tuple(os, std::forward_as_tuple(p1, p2, p3, p4));
}

The gain is not great, but it's something.

If you're a very sharp reader, you may have noticed that the structured bindings do not capture by reference here. They should, but due to bugs in the work-in-progress headers of both gcc and clang, header <tuple> interferes with the ability to capture elements of structs by reference. I assume the bugs will be fixed by the time the official support for C++17 is released.

A generic stream insertion operator

The idea is to write a templated operator<<(std::ostream&, const T&) that only instantiates for types T that do not already have a stream insertion operator, and that are structs that we can print using the functions shown earlier.

Checking whether a type can be inserted into an ostream is not very difficult, compared to what's already shown above. It can be done in many ways. Here's my take on it:

namespace stream_check
{
  template <typename T>
  constexpr auto is_output_streamable(const T* t)
  -> decltype((std::declval<std::ostream&>() << *t), std::true_type{})
  {
    return {};
  }

  constexpr std::false_type is_output_streamable(...)
  {
    return {};
  }

  template <typename T>
  constexpr auto is_output_streamable()
    -> decltype(is_output_streamable(static_cast<T*>(nullptr)))
  {
    return {};
  }
}

It's more or less the same technique as in the test for the ability to construct an object. The overload taking a const T* is preferred over the one taking the ellipsis. If the stream insertion compiles, the comma expression makes sure that the return type of the const T* overload is std::true_type. If it fails to compile, the fallback is the less desirable ellipsis overload, which has a return type of std::false_type.

The parameterless function template is the one to use. Its return type is whatever the overload that compiles gives.

This one is placed in a separate namespace from all the previously listed stuff, so as to not interfere with the stream insertion operator that will be implemented next.

That stream insertion operator then becomes surprisingly simple:

template <typename T>
auto operator<<(std::ostream& os, const T& t)
  -> std::enable_if_t<!stream_check::is_output_streamable<T>(),
                      decltype(print_struct(os, t, arity<T>()))>
{
  return print_struct(os, t, arity<T>());
}

This function template will only successfully be instantiated if the type T doesn't already have a stream insertion operator, and arity<T>() can be instantiated, and print_struct<>() can be instantiated with that arity. If the print_tuple<>() function template shown earlier can see this templated stream insertion operator, the stream insertion operators used for each member will find this one when needed, and thus drill down recursively, just as desired.

So what about performance?

Yes, yes, we're C++ developers. We count cycles in our sleep, so performance matters. Here's a very small test program:

#include "print_struct.hpp"
#include <iostream>
struct inner
{
  const char *p;
  int n;
};
struct outer
{
  int m;
  inner i;
};
int main()
{
  outer o{3, {"foo", 88}};
  using namespace print_struct;
  std::cout << o << '\n';
}

Compiling it with a home-built gcc (not yet version 7) with -O3 gives this assembly language output:


.LC0:
        .string "{ "
.LC1:
        .string ", "
.LC2:
        .string "foo"
.LC3:
        .string " }"
        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB2124:
        .cfi_startproc
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        movl    $.LC0, %esi
        movl    std::cout, %edi
        subq    $16, %rsp
        .cfi_def_cfa_offset 32
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
        movl    $3, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
        movl    $.LC1, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
        movl    $.LC0, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
        movl    $.LC2, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
        movl    $.LC1, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
        movl    $88, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
        movl    $2, %edx
        movq    %rax, %rbx
        movl    $.LC3, %esi
        movq    %rax, %rdi
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        movq    %rbx, %rdi
        movl    $2, %edx
        movl    $.LC3, %esi
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        leaq    15(%rsp), %rsi
        movq    %rbx, %rdi
        movl    $1, %edx
        movb    $10, 15(%rsp)
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        addq    $16, %rsp
        .cfi_def_cfa_offset 16
        xorl    %eax, %eax
        popq    %rbx
        .cfi_def_cfa_offset 8
        ret

This is quite good. All obvious overhead is removed. The one difference that stands out, compared to a hand-tooled version, is that with knowledge of the flattened structure you could remove a few calls by concatenating strings like ", " and "{ " into ", { ". Compared to hand written stream insertion operators for each type, called recursively, I think this is identical to the best you can get.

Wrap up

It is possible to do generic serialization of structs, provided they don't have constructors, and provided that each member can be tested for constructibility using a single value. This is very close to the goal stated at the very beginning of this post. Writing it wasn't very easy, but it's write once, use often, compared to writing stream insertion operators for each and every struct.

Implementing read logic is left as an exercise for the interested reader.