introduction.
as mentioned in part one, testing expression templates isn’t that easy. one often has to go down to assembler level to see what is really going on. to make developing expression templates easier, and to debug my (fictional) myvec<T>
expression templates a bit more in detail, i created a second test type: TestType2
. again equiped with expression templates, its aim is to break down the expressions into three-address assembler commands, using temporaries to achieve this. for example, a -= (b + c) * d;
should evaluate into something like
1TestType2 t1, t2; // without calling constructors or destructors! 2TT2_create(t1); 3TT2_add(t1, b, c); 4TT2_create(t2); 5TT2_mul(t2, t1, d); 6TT2_sub(a, a, t2); 7TT2_destroy(t2); 8TT2_destroy(t1);
by defining the functions
add()
, mul()
, etc. in a different translation unit, one can analyse the assembler output of this translation unit to see what exactly came out. by searching for terms like TT2_add
one can quickly find the corresponding assembler commands.
the implementation.
we begin by declaring the class TestType2
and by declaring the functions to operate on it. again, we leave out multiplication (and divison and modulo) to shorten code.
1class TestType2; 2 3void TT2_create(TestType2 & r); 4void TT2_destroy(TestType2 & r); 5void TT2_createcopy(TestType2 & r, const TestType2 & a); 6void TT2_copy(TestType2 & r, const TestType2 & a); 7void setZero(TestType2 & r); 8void setOne(TestType2 & r); 9void TT2_add(TestType2 & r, const TestType2 & a, const TestType2 & b); 10void TT2_sub(TestType2 & r, const TestType2 & a, const TestType2 & b); 11void TT2_neg(TestType2 & r, const TestType2 & a);
the implementation of
TestType2
is straightforward. we add a pointer so that the class actually stores something.1class TestType2 2{ 3private: 4 void * d_data; 5 6public: 7 inline TestType2() 8 { 9 TT2_create(*this); 10 } 11 12 inline TestType2(const TestType2 & src) 13 { 14 TT2_createcopy(*this, src); 15 } 16 17 inline ~TestType2() 18 { 19 TT2_destroy(*this); 20 } 21 22 inline TestType2 & operator = (const TestType2 & src) 23 { 24 TT2_copy(*this, src); 25 return *this; 26 } 27 28 inline TestType2 & operator += (const TestType2 & b) 29 { 30 TT2_add(*this, *this, b); 31 return *this; 32 } 33 34 inline TestType2 & operator -= (const TestType2 & b) 35 { 36 TT2_sub(*this, *this, b); 37 return *this; 38 } 39};
again, as in part one, we have templates
TestExpression2<O, D>
and TestWrapper2
:1template<class Op, class Data> 2class TestExpression2 3{ 4private: 5 Op d_op; 6 Data d_data; 7 8public: 9 inline TestExpression2(const Op & op, const Data & data) 10 : d_op(op), d_data(data) 11 { 12 } 13 14 operator TestType2 () const 15 { 16 TestType2 res; 17 evalTo(res); 18 return res; 19 } 20 21 inline TestType2 evaluate() const 22 { 23 TestType2 res; 24 evalTo(res); 25 return res; 26 } 27 28 inline void evalTo(TestType2 & dest) const 29 { 30 d_op.evalTo(dest, d_data); 31 } 32}; 33 34class TestWrapper2 35{ 36private: 37 const TestType2 & d_val; 38 39public: 40 inline TestWrapper2(const TestType2 & val) 41 : d_val(val) 42 { 43 } 44 45 inline const TestType2 & evaluate() const 46 { 47 return d_val; 48 } 49 50 inline void evalTo(TestType2 & dest) 51 { 52 TT2_copy(dest, d_val); 53 } 54};
this time, we have both an
evaluate()
and a evalTo()
member function. the first allows to just evaluate the expression, generating temporaries for subexpressions, and the second one is used by callers like operator=()
of TestType2
to evaluate the result of an expression into an object of type TestType2
. the expression machinery is added to TestType2
by the following functions:1 template<class O, class D> 2 inline TestType2(const TestExpression2<O, D> & src) 3 { 4 TT2_createcopy(*this, src.evaluate()); 5 } 6 7 template<class O, class D> 8 inline TestType2 & operator = (const TestExpression2<O, D> & e) 9 { 10 e.evalTo(*this); 11 return *this; 12 } 13 14 template<class O, class D> 15 inline TestType2 & operator += (const TestExpression2<O, D> & e) 16 { 17 TT2_add(*this, *this, e.evaluate()); 18 return *this; 19 } 20 21 template<class O, class D> 22 inline TestType2 & operator -= (const TestExpression2<O, D> & e) 23 { 24 TT2_sub(*this, *this, e.evaluate()); 25 return *this; 26 }
the operators are defined as follows. there is not much to do: they just provide a
evalTo()
template function which calls the corresponding three-address operation:1class AddOp2 2{ 3public: 4 template<class A, class B> 5 inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const 6 { 7 TT2_add(dest, data.first.evaluate(), data.second.evaluate()); 8 } 9}; 10 11class SubOp2 12{ 13public: 14 template<class A, class B> 15 inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const 16 { 17 TT2_sub(dest, data.first.evaluate(), data.second.evaluate()); 18 } 19}; 20 21class NegOp2 22{ 23public: 24 template<class A> 25 inline void evalTo(TestType2 & dest, const A & data) const 26 { 27 TT2_neg(dest, data.evaluate()); 28 } 29};
again, what is left is the tedious integration, by defining a ton of
operator
s:1inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> > operator + (const TestType2 & a, const TestType2 & b) 2{ return TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> >(AddOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); } 3inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> > operator - (const TestType2 & a, const TestType2 & b) 4{ return TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> >(SubOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); } 5 6template<class O2, class D2> 7inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator + (const TestType2 & a, const TestExpression2<O2, D2> & b) 8{ return TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(TestWrapper2(a), b)); } 9template<class O2, class D2> 10inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator - (const TestType2 & a, const TestExpression2<O2, D2> & b) 11{ return TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(TestWrapper2(a), b)); } 12 13template<class O1, class D1> 14inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator + (const TestExpression2<O1, D1> & a, const TestType2 & b) 15{ return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(AddOp2(), std::make_pair(a, TestWrapper2(b))); } 16template<class O1, class D1> 17inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator - (const TestExpression2<O1, D1> & a, const TestType2 & b) 18{ return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(SubOp2(), std::make_pair(a, TestWrapper2(b))); } 19 20template<class O1, class D1, class O2, class D2> 21inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator + (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b) 22{ return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(a, b)); } 23template<class O1, class D1, class O2, class D2> 24inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator - (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b) 25{ return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(a, b)); } 26 27inline TestExpression2<NegOp2, TestWrapper2> operator - (const TestType2 & a) 28{ return TestExpression2<NegOp2, TestWrapper2>(NegOp2(), TestWrapper2(a)); } 29 30template<class O1, class D1> 31inline TestExpression2<NegOp2, TestExpression2<O1, D1> > operator - (const TestExpression2<O1, D1> & a) 32{ return TestExpression2<NegOp2, TestExpression2<O1, D1> >(NegOp2(), a); }
note that all the above code contains a lot of
inline
s, as opposed to part one. in part one, the aim was not to generate most efficient code, but to make visible the expressions evaluated by the expression templates. in this second part, we want to observe the assembler output, and thus want the code to be as optimized as possible.
a real-life example.
in this section, i want to look at three examples. first, let us consider the following example:
1TestType2 s, t, u, x, y, z; 2s /= t + (x - y) * z - u;
the resulting assembler code (generated with
g++ -S -O3
) looks as follows:1 leaq 1024(%rsp), %rcx 2 .... 3 movq %rbx, 56(%rsp) 4 call _Z10TT2_createR9TestType2 5 leaq 960(%rsp), %r12 6 movq %r12, %rdi 7 call _Z10TT2_createR9TestType2 8 leaq 928(%rsp), %rbp 9 movq %rbp, %rdi 10 call _Z10TT2_createR9TestType2 11 leaq 944(%rsp), %rbx 12 movq %rbx, %rdi 13 call _Z10TT2_createR9TestType2 14 leaq 1024(%rsp), %rdx 15 leaq 1040(%rsp), %rsi 16 movq %rbx, %rdi 17 call _Z7TT2_subR9TestType2RKS_S2_ 18 leaq 1008(%rsp), %rdx 19 movq %rbx, %rsi 20 movq %rbp, %rdi 21 call _Z7TT2_mulR9TestType2RKS_S2_ 22 movq %rbx, %rdi 23 call _Z11TT2_destroyR9TestType2 24 leaq 1072(%rsp), %rsi 25 movq %rbp, %rdx 26 movq %r12, %rdi 27 call _Z7TT2_addR9TestType2RKS_S2_ 28 movq %rbp, %rdi 29 call _Z11TT2_destroyR9TestType2 30 leaq 1056(%rsp), %rdx 31 movq %r12, %rsi 32 movq %r13, %rdi 33 call _Z7TT2_subR9TestType2RKS_S2_ 34 movq %r12, %rdi 35 call _Z11TT2_destroyR9TestType2 36 leaq 1088(%rsp), %rsi 37 movq %r13, %rdx 38 movq %rsi, %rdi 39 call _Z7TT2_divR9TestType2RKS_S2_ 40 movq %r13, %rdi 41 call _Z11TT2_destroyR9TestType2
(i removed a lot of register/memory arithmetic in the beginning, which prepares all addresses.) removing all cludder, we are left with the following function calls:
1TT2_create() 2TT2_create() 3TT2_create() 4TT2_create() 5TT2_sub() 6TT2_mul() 7TT2_destroy() 8TT2_add() 9TT2_destroy() 10TT2_sub() 11TT2_destroy() 12TT2_div() 13TT2_destroy()
four temporaries
t1, t2, t3, t4
are created. then, the subtraction x - y
is computed into the temporary t4
, and the result is multiplied by z
into the temporary t3
. the temporary t4
used to compute x - y
is then destroyed, and t
is added to t3
, with the result being stored in the temporary t2
. the temporary t3
is destroyed, and t2 - u
is evaluated into t1
. finally, s
is divided by t1
and t1
is destroyed. this shows that the compiler generated an equivalent to1TestType2 t1, t2, t3, t4; // without calling constructors or destructors! 2TT2_create(t1); 3TT2_create(t2); 4TT2_create(t3); 5TT2_create(t4); 6TT2_sub(t4, x, y); 7TT2_mul(t3, t4, z); 8TT2_destroy(t4); 9TT2_add(t2, t, t3); 10TT2_destroy(t3); 11TT2_sub(t1, t2, u); 12TT2_destroy(t2); 13TT2_div(s, s, t1); 14TT2_destroy(t1);
this is pretty much optimal: if one would have done this translation by hand in a straightforward way, one would have reached the same solution.
now let us re-consider our example from part one: the
myvec<T>
template. assume we have two myvec<TestType2>
vectors v
and w
, and we write v += v + w;
and v += w + v;
. the first command should generate something like1for (unsigned i = 0; i < v.size(); ++i) 2{ 3 add(v[i], v[i], v[i]); 4 add(v[i], v[i], w[i]); 5}
while the second command should generate something like
1TestType2 t; // without calling constructors or destructors! 2create(t); 3for (unsigned i = 0; i < v.size(); ++i) 4{ 5 add(t, w[i], v[i]); 6 add(v[i], v[i], t); 7} 8destroy(t);
or, if the expression templates for
myvec<T>
are not that good, at least something like1for (unsigned i = 0; i < v.size(); ++i) 2{ 3 TestType2 t; // without calling constructors or destructors! 4 create(t); 5 add(t, w[i], v[i]); 6 add(v[i], v[i], t); 7 destroy(t); 8}
first, consider
v += v + w;
. the assembler code generated is the following:1 movl 320(%rsp), %r9d 2 testl %r9d, %r9d 3 je .L128 4 xorl %ebp, %ebp 5.L129: 6 mov %ebp, %r12d 7 salq $3, %r12 8 movq %r12, %rbx 9 addq 328(%rsp), %rbx 10 movq %rbx, %rdx 11 movq %rbx, %rsi 12 movq %rbx, %rdi 13 call _Z7TT2_addR9TestType2RKS_S2_ 14 movq %r12, %rdx 15 addq 312(%rsp), %rdx 16 movq %rbx, %rsi 17 movq %rbx, %rdi 18 call _Z7TT2_addR9TestType2RKS_S2_ 19 addl $1, %ebp 20 cmpl 320(%rsp), %ebp 21 jb .L129 22.L128:
clearly, first the code tests whether the loop has to be run through at least once; if not, one jumps to label
.L128
. then, per loop iteration, exactly two calls to TT2_add()
are made. this shows that the code is essentially like1for (unsigned i = 0; i < v.size(); ++i) 2{ 3 add(v[i], v[i], v[i]); 4 add(v[i], v[i], w[i]); 5}
as we were hoping. thus, the expression templates worked well in this case. now let us look at
v += w + v;
. this time, a temporary has to be created, since otherwise v[i]
is modified before being used in the expression. the generated assembler code is the following:1 movl 320(%rsp), %eax 2 cmpl 304(%rsp), %eax 3 jne .L283 4 leaq 416(%rsp), %rbx 5 movq %rbx, %rdi 6 call _Z10TT2_createR9TestType2 7 movl 320(%rsp), %r8d 8 testl %r8d, %r8d 9 je .L146 10 xorl %r12d, %r12d 11 .p2align 4,,10 12 .p2align 3 13.L147: 14 mov %r12d, %ebp 15 movq %rbx, %rdi 16 salq $3, %rbp 17 movq %rbp, %rsi 18 addq 312(%rsp), %rsi 19 call _Z8TT2_copyR9TestType2RKS_ 20 movq %rbp, %rdx 21 addq 328(%rsp), %rdx 22 movq %rbx, %rsi 23 movq %rbx, %rdi 24 call _Z7TT2_addR9TestType2RKS_S2_ 25 movq %rbp, %rdi 26 addq 328(%rsp), %rdi 27 movq %rbx, %rdx 28 movq %rdi, %rsi 29 call _Z7TT2_addR9TestType2RKS_S2_ 30 addl $1, %r12d 31 cmpl 320(%rsp), %r12d 32 jb .L147 33.L146: 34 movq %rbx, %rdi 35 call _Z11TT2_destroyR9TestType2
this is as good as we were hoping: before the loop, a temporary is created, and destroyed after the loop. in the loop, we have three calls:
TT2_copy()
, TT2_add()
and a second time TT2_add()
. thus, the resulting code is1TestType2 t; // without calling constructors or destructors! 2create(t); 3for (unsigned i = 0; i < v.size(); ++i) 4{ 5 copy(t, w[i]); 6 add(t, t, v[i]); 7 add(v[i], v[i], t); 8} 9destroy(t);
this is not as optimal as it could be: the best solution would be to optimize
1 copy(t, w[i]); 2 add(t, t, v[i]);
to one call:
add(t, w[i], v[i]);
. but this is still much better than1for (unsigned i = 0; i < v.size(); ++i) 2{ 3 TestType2 t; // without calling constructors or destructors! 4 create(t); 5 add(t, w[i], v[i]); 6 add(v[i], v[i], t); 7 destroy(t); 8}
where the temporary is created and destroyed every iteration. when trying to add this optimization, it is easier to use
TestType
from part one to see what is happening. once the output looks fine, one switches back to TestType2
to make sure the generated assembler code is fine.
the code.
you can download the source code of the TestType2
class here, and dummy implementations (to be put into another translation unit) of the TT_*
functions here.
comments.