#### introduction.

as mentioned in part one, testing expression templates isn’t that easy. one often has to go down to assembler level to see what is really going on. to make developing expression templates easier, and to debug my (fictional) `myvec<T>`

expression templates a bit more in detail, i created a second test type: `TestType2`

. again equiped with expression templates, its aim is to break down the expressions into *three-address assembler* commands, using temporaries to achieve this. for example, `a -= (b + c) * d;`

should evaluate into something like

1 TestType2 t1, t2; // without calling constructors or destructors!
2 TT2_create(t1);
3 TT2_add(t1, b, c);
4 TT2_create(t2);
5 TT2_mul(t2, t1, d);
6 TT2_sub(a, a, t2);
7 TT2_destroy(t2);
8 TT2_destroy(t1);

by defining the functions

`add()`

,

`mul()`

, etc. in a different translation unit, one can analyse the assembler output of this translation unit to see what exactly came out. by searching for terms like

`TT2_add`

one can quickly find the corresponding assembler commands.

#### the implementation.

we begin by declaring the class `TestType2`

and by declaring the functions to operate on it. again, we leave out multiplication (and divison and modulo) to shorten code.

1 class TestType2;
2
3 void TT2_create(TestType2 & r);
4 void TT2_destroy(TestType2 & r);
5 void TT2_createcopy(TestType2 & r, const TestType2 & a);
6 void TT2_copy(TestType2 & r, const TestType2 & a);
7 void setZero(TestType2 & r);
8 void setOne(TestType2 & r);
9 void TT2_add(TestType2 & r, const TestType2 & a, const TestType2 & b);
10 void TT2_sub(TestType2 & r, const TestType2 & a, const TestType2 & b);
11 void TT2_neg(TestType2 & r, const TestType2 & a);

the implementation of

`TestType2`

is straightforward. we add a pointer so that the class actually stores something.

1 class TestType2
2 {
3 private:
4 void * d_data;
5
6 public:
7 inline TestType2()
8 {
9 TT2_create(*this);
10 }
11
12 inline TestType2(const TestType2 & src)
13 {
14 TT2_createcopy(*this, src);
15 }
16
17 inline ~TestType2()
18 {
19 TT2_destroy(*this);
20 }
21
22 inline TestType2 & operator = (const TestType2 & src)
23 {
24 TT2_copy(*this, src);
25 return *this;
26 }
27
28 inline TestType2 & operator += (const TestType2 & b)
29 {
30 TT2_add(*this, *this, b);
31 return *this;
32 }
33
34 inline TestType2 & operator -= (const TestType2 & b)
35 {
36 TT2_sub(*this, *this, b);
37 return *this;
38 }
39 };

again, as in

part one, we have templates

`TestExpression2<O, D>`

and

`TestWrapper2`

:

1 template<class Op, class Data>
2 class TestExpression2
3 {
4 private:
5 Op d_op;
6 Data d_data;
7
8 public:
9 inline TestExpression2(const Op & op, const Data & data)
10 : d_op(op), d_data(data)
11 {
12 }
13
14 operator TestType2 () const
15 {
16 TestType2 res;
17 evalTo(res);
18 return res;
19 }
20
21 inline TestType2 evaluate() const
22 {
23 TestType2 res;
24 evalTo(res);
25 return res;
26 }
27
28 inline void evalTo(TestType2 & dest) const
29 {
30 d_op.evalTo(dest, d_data);
31 }
32 };
33
34 class TestWrapper2
35 {
36 private:
37 const TestType2 & d_val;
38
39 public:
40 inline TestWrapper2(const TestType2 & val)
41 : d_val(val)
42 {
43 }
44
45 inline const TestType2 & evaluate() const
46 {
47 return d_val;
48 }
49
50 inline void evalTo(TestType2 & dest)
51 {
52 TT2_copy(dest, d_val);
53 }
54 };

this time, we have both an

`evaluate()`

and a

`evalTo()`

member function. the first allows to just evaluate the expression, generating temporaries for subexpressions, and the second one is used by callers like

`operator=()`

of

`TestType2`

to evaluate the result of an expression into an object of type

`TestType2`

. the expression machinery is added to

`TestType2`

by the following functions:

1 template<class O, class D>
2 inline TestType2(const TestExpression2<O, D> & src)
3 {
4 TT2_createcopy(*this, src.evaluate());
5 }
6
7 template<class O, class D>
8 inline TestType2 & operator = (const TestExpression2<O, D> & e)
9 {
10 e.evalTo(*this);
11 return *this;
12 }
13
14 template<class O, class D>
15 inline TestType2 & operator += (const TestExpression2<O, D> & e)
16 {
17 TT2_add(*this, *this, e.evaluate());
18 return *this;
19 }
20
21 template<class O, class D>
22 inline TestType2 & operator -= (const TestExpression2<O, D> & e)
23 {
24 TT2_sub(*this, *this, e.evaluate());
25 return *this;
26 }

the operators are defined as follows. there is not much to do: they just provide a

`evalTo()`

template function which calls the corresponding three-address operation:

1 class AddOp2
2 {
3 public:
4 template<class A, class B>
5 inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
6 {
7 TT2_add(dest, data.first.evaluate(), data.second.evaluate());
8 }
9 };
10
11 class SubOp2
12 {
13 public:
14 template<class A, class B>
15 inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
16 {
17 TT2_sub(dest, data.first.evaluate(), data.second.evaluate());
18 }
19 };
20
21 class NegOp2
22 {
23 public:
24 template<class A>
25 inline void evalTo(TestType2 & dest, const A & data) const
26 {
27 TT2_neg(dest, data.evaluate());
28 }
29 };

again, what is left is the tedious integration, by defining a ton of

`operator`

s:

1 inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> > operator + (const TestType2 & a, const TestType2 & b)
2 { return TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> >(AddOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }
3 inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> > operator - (const TestType2 & a, const TestType2 & b)
4 { return TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> >(SubOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }
5
6 template<class O2, class D2>
7 inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator + (const TestType2 & a, const TestExpression2<O2, D2> & b)
8 { return TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(TestWrapper2(a), b)); }
9 template<class O2, class D2>
10 inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator - (const TestType2 & a, const TestExpression2<O2, D2> & b)
11 { return TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(TestWrapper2(a), b)); }
12
13 template<class O1, class D1>
14 inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator + (const TestExpression2<O1, D1> & a, const TestType2 & b)
15 { return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(AddOp2(), std::make_pair(a, TestWrapper2(b))); }
16 template<class O1, class D1>
17 inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator - (const TestExpression2<O1, D1> & a, const TestType2 & b)
18 { return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(SubOp2(), std::make_pair(a, TestWrapper2(b))); }
19
20 template<class O1, class D1, class O2, class D2>
21 inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator + (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b)
22 { return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(a, b)); }
23 template<class O1, class D1, class O2, class D2>
24 inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator - (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b)
25 { return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(a, b)); }
26
27 inline TestExpression2<NegOp2, TestWrapper2> operator - (const TestType2 & a)
28 { return TestExpression2<NegOp2, TestWrapper2>(NegOp2(), TestWrapper2(a)); }
29
30 template<class O1, class D1>
31 inline TestExpression2<NegOp2, TestExpression2<O1, D1> > operator - (const TestExpression2<O1, D1> & a)
32 { return TestExpression2<NegOp2, TestExpression2<O1, D1> >(NegOp2(), a); }

note that all the above code contains a lot of

`inline`

s, as opposed to

part one. in part one, the aim was not to generate most efficient code, but to make visible the expressions evaluated by the expression templates. in this second part, we want to observe the assembler output, and thus want the code to be as optimized as possible.

#### a real-life example.

in this section, i want to look at three examples. first, let us consider the following example:

1 TestType2 s, t, u, x, y, z;
2 s /= t + (x - y) * z - u;

the resulting assembler code (generated with

`g++ -S -O3`

) looks as follows:

1 leaq 1024(%rsp), %rcx
2 ....
3 movq %rbx, 56(%rsp)
4 call _Z10TT2_createR9TestType2
5 leaq 960(%rsp), %r12
6 movq %r12, %rdi
7 call _Z10TT2_createR9TestType2
8 leaq 928(%rsp), %rbp
9 movq %rbp, %rdi
10 call _Z10TT2_createR9TestType2
11 leaq 944(%rsp), %rbx
12 movq %rbx, %rdi
13 call _Z10TT2_createR9TestType2
14 leaq 1024(%rsp), %rdx
15 leaq 1040(%rsp), %rsi
16 movq %rbx, %rdi
17 call _Z7TT2_subR9TestType2RKS_S2_
18 leaq 1008(%rsp), %rdx
19 movq %rbx, %rsi
20 movq %rbp, %rdi
21 call _Z7TT2_mulR9TestType2RKS_S2_
22 movq %rbx, %rdi
23 call _Z11TT2_destroyR9TestType2
24 leaq 1072(%rsp), %rsi
25 movq %rbp, %rdx
26 movq %r12, %rdi
27 call _Z7TT2_addR9TestType2RKS_S2_
28 movq %rbp, %rdi
29 call _Z11TT2_destroyR9TestType2
30 leaq 1056(%rsp), %rdx
31 movq %r12, %rsi
32 movq %r13, %rdi
33 call _Z7TT2_subR9TestType2RKS_S2_
34 movq %r12, %rdi
35 call _Z11TT2_destroyR9TestType2
36 leaq 1088(%rsp), %rsi
37 movq %r13, %rdx
38 movq %rsi, %rdi
39 call _Z7TT2_divR9TestType2RKS_S2_
40 movq %r13, %rdi
41 call _Z11TT2_destroyR9TestType2

(i removed a lot of register/memory arithmetic in the beginning, which prepares all addresses.) removing all cludder, we are left with the following function calls:

1 TT2_create()
2 TT2_create()
3 TT2_create()
4 TT2_create()
5 TT2_sub()
6 TT2_mul()
7 TT2_destroy()
8 TT2_add()
9 TT2_destroy()
10 TT2_sub()
11 TT2_destroy()
12 TT2_div()
13 TT2_destroy()

four temporaries

`t1, t2, t3, t4`

are created. then, the subtraction

`x - y`

is computed into the temporary

`t4`

, and the result is multiplied by

`z`

into the temporary

`t3`

. the temporary

`t4`

used to compute

`x - y`

is then destroyed, and

`t`

is added to

`t3`

, with the result being stored in the temporary

`t2`

. the temporary

`t3`

is destroyed, and

`t2 - u`

is evaluated into

`t1`

. finally,

`s`

is divided by

`t1`

and

`t1`

is destroyed. this shows that the compiler generated an equivalent to

1 TestType2 t1, t2, t3, t4; // without calling constructors or destructors!
2 TT2_create(t1);
3 TT2_create(t2);
4 TT2_create(t3);
5 TT2_create(t4);
6 TT2_sub(t4, x, y);
7 TT2_mul(t3, t4, z);
8 TT2_destroy(t4);
9 TT2_add(t2, t, t3);
10 TT2_destroy(t3);
11 TT2_sub(t1, t2, u);
12 TT2_destroy(t2);
13 TT2_div(s, s, t1);
14 TT2_destroy(t1);

this is pretty much optimal: if one would have done this translation by hand in a straightforward way, one would have reached the same solution.

now let us re-consider our example from

part one: the

`myvec<T>`

template. assume we have two

`myvec<TestType2>`

vectors

`v`

and

`w`

, and we write

`v += v + w;`

and

`v += w + v;`

. the first command should generate something like

1 for (unsigned i = 0; i < v.size(); ++i)
2 {
3 add(v[i], v[i], v[i]);
4 add(v[i], v[i], w[i]);
5 }

while the second command should generate something like

1 TestType2 t; // without calling constructors or destructors!
2 create(t);
3 for (unsigned i = 0; i < v.size(); ++i)
4 {
5 add(t, w[i], v[i]);
6 add(v[i], v[i], t);
7 }
8 destroy(t);

or, if the expression templates for

`myvec<T>`

are not that good, at least something like

1 for (unsigned i = 0; i < v.size(); ++i)
2 {
3 TestType2 t; // without calling constructors or destructors!
4 create(t);
5 add(t, w[i], v[i]);
6 add(v[i], v[i], t);
7 destroy(t);
8 }

first, consider

`v += v + w;`

. the assembler code generated is the following:

1 movl 320(%rsp), %r9d
2 testl %r9d, %r9d
3 je .L128
4 xorl %ebp, %ebp
5 .L129:
6 mov %ebp, %r12d
7 salq $3, %r12
8 movq %r12, %rbx
9 addq 328(%rsp), %rbx
10 movq %rbx, %rdx
11 movq %rbx, %rsi
12 movq %rbx, %rdi
13 call _Z7TT2_addR9TestType2RKS_S2_
14 movq %r12, %rdx
15 addq 312(%rsp), %rdx
16 movq %rbx, %rsi
17 movq %rbx, %rdi
18 call _Z7TT2_addR9TestType2RKS_S2_
19 addl $1, %ebp
20 cmpl 320(%rsp), %ebp
21 jb .L129
22 .L128:

clearly, first the code tests whether the loop has to be run through at least once; if not, one jumps to label

`.L128`

. then, per loop iteration, exactly two calls to

`TT2_add()`

are made. this shows that the code is essentially like

1 for (unsigned i = 0; i < v.size(); ++i)
2 {
3 add(v[i], v[i], v[i]);
4 add(v[i], v[i], w[i]);
5 }

as we were hoping. thus, the expression templates worked well in this case. now let us look at

`v += w + v;`

. this time, a temporary has to be created, since otherwise

`v[i]`

is modified before being used in the expression. the generated assembler code is the following:

1 movl 320(%rsp), %eax
2 cmpl 304(%rsp), %eax
3 jne .L283
4 leaq 416(%rsp), %rbx
5 movq %rbx, %rdi
6 call _Z10TT2_createR9TestType2
7 movl 320(%rsp), %r8d
8 testl %r8d, %r8d
9 je .L146
10 xorl %r12d, %r12d
11 .p2align 4,,10
12 .p2align 3
13 .L147:
14 mov %r12d, %ebp
15 movq %rbx, %rdi
16 salq $3, %rbp
17 movq %rbp, %rsi
18 addq 312(%rsp), %rsi
19 call _Z8TT2_copyR9TestType2RKS_
20 movq %rbp, %rdx
21 addq 328(%rsp), %rdx
22 movq %rbx, %rsi
23 movq %rbx, %rdi
24 call _Z7TT2_addR9TestType2RKS_S2_
25 movq %rbp, %rdi
26 addq 328(%rsp), %rdi
27 movq %rbx, %rdx
28 movq %rdi, %rsi
29 call _Z7TT2_addR9TestType2RKS_S2_
30 addl $1, %r12d
31 cmpl 320(%rsp), %r12d
32 jb .L147
33 .L146:
34 movq %rbx, %rdi
35 call _Z11TT2_destroyR9TestType2

this is as good as we were hoping: before the loop, a temporary is created, and destroyed after the loop. in the loop, we have three calls:

`TT2_copy()`

,

`TT2_add()`

and a second time

`TT2_add()`

. thus, the resulting code is

1 TestType2 t; // without calling constructors or destructors!
2 create(t);
3 for (unsigned i = 0; i < v.size(); ++i)
4 {
5 copy(t, w[i]);
6 add(t, t, v[i]);
7 add(v[i], v[i], t);
8 }
9 destroy(t);

this is not as optimal as it could be: the best solution would be to optimize

1 copy(t, w[i]);
2 add(t, t, v[i]);

to one call:

`add(t, w[i], v[i]);`

. but this is still much better than

1 for (unsigned i = 0; i < v.size(); ++i)
2 {
3 TestType2 t; // without calling constructors or destructors!
4 create(t);
5 add(t, w[i], v[i]);
6 add(v[i], v[i], t);
7 destroy(t);
8 }

where the temporary is created and destroyed every iteration. when trying to add this optimization, it is easier to use

`TestType`

from

part one to see what is happening. once the output looks fine, one switches back to

`TestType2`

to make sure the generated assembler code is fine.

#### the code.

you can download the source code of the `TestType2`

class here, and dummy implementations (to be put into another translation unit) of the `TT_*`

functions here.