
posts about programming.

after already revamping math and musikwiese earlier this month, it’s time to do spielwiese as well. and here we go! spielwiese now adjusts to all window widths, including mobile phones. and the gallery looks good again, too. (i had to screw with it when adjusting spielwiese for mobile phones the first time.)

the only things missing are a proper popup gallery for mobile phones, and as in the other blogs: comments. so sorry, there’s no way to add a comment at the moment, except by writing me an email.


over the last weeks, i’ve started converting my blogs to nikola, a static blog generator named after nikola tesla. so far, i’ve finished my math blog and musikwiese. i also got rid of the fixed-width layout. spielwiese itself will take a bit more work, mostly because of the wordpress plugins i wrote and heavily rely on throughout this blog.

to convert my blogs, i had to make some changes to nikola. most of the core changes are now contained in nikola itself or are close to being integrated (except xhtml support). i also worked a lot on improving its wordpress import, which so far wasn’t very helpful from my point of view. for that, i created a wordpress page compiler for nikola which essentially does the same to posts as wordpress’ formatter does. that part is still not completely done, and since it is a pure plugin, i haven’t released it yet. (that will happen later, hopefully still this year.)

the main disadvantage of the whole conversion is that it is currently no longer possible to add comments. all comments written so far are preserved, but there’s no support for adding new ones yet. i’ll try to figure out a good way to do that later. the “standard way”, i.e. using something like disqus, is not acceptable for me, and even though there are free alternatives for self-hosting, i don’t like to rely on javascript for such things. for the moment, you have to send me an email if you want to add a comment to musikwiese or my math blog.

i’m very happy to announce that the lattice reduction library plll has finally been released as open source under the MIT license by the university of zurich. during my recent years at the university of zurich, i’ve mainly been working on this c++ library. it supports a wide range of lattice reduction algorithms and svp solvers, makes heavy use of c++ templates, and supports c++11’s move operations.

in 2011, i began implementing it since i wasn’t happy with some of the behavior of ntl’s lattice reduction algorithms (mainly: in case of fatal numerical instability, they just terminate the program, and the library cannot be used in more than one thread at the same time). back then, ntl’s main competitor fplll didn’t support bkz reduction, so i decided to try things out myself. after some time (years), my initial experiments grew into a full library supporting not only the more common lll and bkz algorithms as well as svp solving by enumeration, but also several other algorithms for lattice reduction and svp solving which are described in the literature but for which it is sometimes quite hard to find a working implementation. though the implementations of these algorithms are still more on the experimental side, the basic algorithms, such as lll and bkz, potentially combined with deep insertions, and enumeration, have been well-tested over the years. (in fact, some large-scale lattice reduction experiments i did with these algorithms yielded some results in the svp challenge’s hall of fame.)

if you’re interested in this library, feel free to play around with it! in case you have any questions, encounter problems, or want to give feedback, just contact me by email.

today was my first day at my new job as a software engineer at dybuster, a small zürich-based software company producing learning software for treating dyslexia and dyscalculia.

quite a different job than before, but still a very nice and friendly environment, and a very noble objective in my opinion :-)

when adding a thread-specific allocator to a program of mine, to avoid a terrible performance loss when using gmp and/or mpfr for arbitrary precision integer and floating point arithmetic, i stumbled upon a problem which seems to be fixed in newer solaris versions. in case anyone experiences a similar problem and cannot just update to a new enough solaris version, here’s some information on a quick’n’dirty fix for the problem.
more precisely, i wanted to combine boost::thread_specific_ptr (a portable implementation of thread specific storage) with dlmalloc, to obtain an allocator which won’t block when used from different threads at once. if you use arbitrary precision arithmetic on a machine with many cores/cpus (say, 30 to 60), having a single allocator which blocks (via a mutex) totally kills performance. for example, on our ultrasparc/solaris machine, running 29 threads (on 30 cpus) in parallel, only 20% of the system’s resources were used effectively. if the machine had only had 6 cpus, the program would have run at the same speed. quite a waste, isn’t it?
anyway, combining thread-local storage and a memory allocator solves this problem. in theory, at least. when i put the two things together and ran my program with 30 threads, still only 60% of the 30 cpus’ processing power was used – the other 40% of the cycles were still spent waiting. (solaris has some excellent profiling tools on board. that’s why i like to use our slow old outdated solaris machine for profiling, instead of our blazing fast newer big linux machine. in case anyone cares.) interestingly, on our linux machine, with 64 threads (running on 64 cores), the problem wasn’t there: 100% of the cycles went into computing, and essentially none into waiting.
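the combination itself looks roughly like this minimal sketch (not my actual code; it assumes dlmalloc is compiled with MSPACES=1 so that every thread can get its own private heap, and it requires that each piece of memory is freed by the thread which allocated it):

#include <cstddef>
#include <boost/thread/tss.hpp>
#include "dlmalloc.h" // header name depends on how dlmalloc is built

namespace
{
    // every thread lazily gets its own mspace, destroyed on thread exit
    struct thread_heap
    {
        mspace space;
        thread_heap() : space(create_mspace(0, 0)) { } // default capacity, no locking
        ~thread_heap() { destroy_mspace(space); }
    };

    boost::thread_specific_ptr<thread_heap> heap;

    thread_heap & local_heap()
    {
        if (heap.get() == NULL)
            heap.reset(new thread_heap); // first allocation in this thread
        return *heap; // every dereference does a thread-specific storage lookup!
    }
}

void * thread_local_alloc(std::size_t n)
{ return mspace_malloc(local_heap().space, n); }

void thread_local_free(void * p)
{ mspace_free(local_heap().space, p); }

the thread-specific storage lookup in local_heap() is exactly where the trouble described next comes from.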
inspecting the problem closer with the sun studio analyzer, it turns out that the 40% waiting cycles are caused by pthread_once, which is called by the internal boost method boost::detail::find_tss_data. that method is called every time a boost::thread_specific_ptr<> is dereferenced, which in my program happens every time the thread-local allocator is fired up to allocate, reallocate or free a piece of memory. (more precisely, boost::detail::find_tss_data calls boost::detail::get_current_thread_data, which uses boost::call_once, which in turn uses pthread_once in the pthread implementation of boost::thread, which is the implementation used on unixoid systems such as solaris and linux.)
in theory, pthread_once uses a double-checked locking mechanism to make sure that the specified function is run exactly once during the execution of the whole program. while searching online, i found the source of the pthread implementation of a newer opensolaris from 2008 here; it uses double-checked locking with a memory barrier, which should (at least in theory) make it a working solution (multi-threaded programming is far from simple; both the compiler and the cpu can screw up your code by rearranging instructions in a deadly way).
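conceptually, such an implementation looks roughly like the following sketch (heavily simplified; a real implementation must handle errors and concurrent initializers, and membar_producer()/membar_consumer() are solaris’ memory barrier primitives from <atomic.h>):

#include <pthread.h>
#include <atomic.h>

static pthread_mutex_t once_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int once_done = 0;

void my_once(void (*init_routine)(void))
{
    if (!once_done) // fast path: no mutex involved
    {
        pthread_mutex_lock(&once_lock);
        if (!once_done) // re-check under the lock
        {
            init_routine();
            membar_producer(); // commit the initialization writes first
            once_done = 1;
        }
        pthread_mutex_unlock(&once_lock);
    }
    membar_consumer(); // do not read the initialized data too early
}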
anyway, it seems that the pthread_once implementation of the solaris installation on the machine i’m using just locks a mutex every time it is called. when you massively call the function from 30 threads at once, all running perfectly parallel on a machine with enough cpus, this creates a natural bottleneck. to make sure it is pthread_once which causes the problem, i wrote the following test program:

#include <pthread.h>
#include <iostream>

static pthread_once_t onceControl = PTHREAD_ONCE_INIT;
static int nocalls = 0;

extern "C" void onceRoutine(void)
{
    std::cout << "onceRoutine()\n";
    nocalls++;
}

extern "C" void * thethread(void * x)
{
    for (unsigned i = 0; i < 10000000; ++i)
        pthread_once(&onceControl, onceRoutine);
    return NULL;
}

int main()
{
    const int nothreads = 30;
    pthread_t threads[nothreads];

    for (int i=0; i < nothreads; ++i)
        pthread_create(&threads[i], NULL, thethread, NULL);

    for (int i=0; i < nothreads; ++i)
    {
        void * status;
        pthread_join(threads[i], &status);
    }

    if (nocalls != 1)
        std::cout << "pthread_once() screwed up totally!\n";
    else
        std::cout << "pthread_once() seems to be doing what it promises\n";
    return 0;
}

i compiled the program with CC -m64 -fast -xarch=native64 -xchip=native -xcache=native -mt -lpthread oncetest.cpp -o oncetest and ran it with time. the result:
real    16m9.541s
user    201m1.476s
sys     0m18.499s

compiling the same program under linux and running it there (with enough cores in the machine) yielded
real    0m0.243s
user    0m1.640s
sys     0m0.060s

quite a difference, isn’t it? the solaris machine is slower, so a few seconds total time would be ok, but 16 minutes?! inspecting the running program on solaris with prstat -Lmp <pid> shows the amount of waiting involved…
to solve this problem, at least for me, with this old solaris version running, i took the code of pthread_once from the above link – namely the includes
#include <atomic.h>
#include <thread.h>
#include <errno.h>

copied lines 38 to 46 and lines 157 to 179 from the link into boost_directory/libs/thread/src/pthread/once.cpp, renamed pthread_once to my_pthread_once both in the copied code and in the boost source file, and re-compiled boost. then i re-ran my program, and suddenly there was no more waiting (at least, not for mutexes :-) ). and the oncetest from above, rewritten using boost::call_once, yielded:
real    0m0.928s
user    0m20.181s
sys     0m0.036s

perfect!

introduction.

as mentioned in part one, testing expression templates isn’t that easy. one often has to go down to assembler level to see what is really going on. to make developing expression templates easier, and to debug my (fictional) myvec<T> expression templates in a bit more detail, i created a second test type: TestType2. again equipped with expression templates, its aim is to break down the expressions into three-address assembler commands, using temporaries to achieve this. for example, a -= (b + c) * d; should evaluate into something like

TestType2 t1, t2; // without calling constructors or destructors!
TT2_create(t1);
TT2_add(t1, b, c);
TT2_create(t2);
TT2_mul(t2, t1, d);
TT2_sub(a, a, t2);
TT2_destroy(t2);
TT2_destroy(t1);

by defining the functions TT2_add(), TT2_mul(), etc. in a different translation unit, the compiler cannot inline them, and one can analyse the assembler output of the code using TestType2 to see what exactly came out. by searching for terms like TT2_add, one can quickly find the corresponding assembler commands.
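the dummy translation unit can be as simple as the following sketch (the downloadable version at the end of this post may differ in details); as long as no link-time optimization is used, the compiler cannot see these bodies and thus has to emit a real call for every operation:

class TestType2;

// empty bodies suffice: they only exist so the calls cannot be inlined
void TT2_create(TestType2 & r) { }
void TT2_destroy(TestType2 & r) { }
void TT2_createcopy(TestType2 & r, const TestType2 & a) { }
void TT2_copy(TestType2 & r, const TestType2 & a) { }
void TT2_add(TestType2 & r, const TestType2 & a, const TestType2 & b) { }
void TT2_sub(TestType2 & r, const TestType2 & a, const TestType2 & b) { }
void TT2_mul(TestType2 & r, const TestType2 & a, const TestType2 & b) { }
void TT2_div(TestType2 & r, const TestType2 & a, const TestType2 & b) { }
void TT2_neg(TestType2 & r, const TestType2 & a) { }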

the implementation.

we begin by declaring the class TestType2 and the functions that operate on it. again, we leave out multiplication (and division and modulo) to shorten the code.

class TestType2;

void TT2_create(TestType2 & r);
void TT2_destroy(TestType2 & r);
void TT2_createcopy(TestType2 & r, const TestType2 & a);
void TT2_copy(TestType2 & r, const TestType2 & a);
void setZero(TestType2 & r);
void setOne(TestType2 & r);
void TT2_add(TestType2 & r, const TestType2 & a, const TestType2 & b);
void TT2_sub(TestType2 & r, const TestType2 & a, const TestType2 & b);
void TT2_neg(TestType2 & r, const TestType2 & a);

the implementation of TestType2 is straightforward. we add a pointer so that the class actually stores something.
class TestType2
{
private:
    void * d_data;
    
public:
    inline TestType2()
    {
        TT2_create(*this);
    }

    inline TestType2(const TestType2 & src)
    {
        TT2_createcopy(*this, src);
    }

    inline ~TestType2()
    {
        TT2_destroy(*this);
    }

    inline TestType2 & operator = (const TestType2 & src)
    {
        TT2_copy(*this, src);
        return *this;
    }
    
    inline TestType2 & operator += (const TestType2 & b)
    {
        TT2_add(*this, *this, b);
        return *this;
    }
    
    inline TestType2 & operator -= (const TestType2 & b)
    {
        TT2_sub(*this, *this, b);
        return *this;
    }
};

again, as in part one, we have templates TestExpression2<O, D> and TestWrapper2:
template<class Op, class Data>
class TestExpression2
{
private:
    Op d_op;
    Data d_data;
    
public:
    inline TestExpression2(const Op & op, const Data & data)
        : d_op(op), d_data(data)
    {
    }
    
    operator TestType2 () const
    {
        TestType2 res;
        evalTo(res);
        return res;
    }
    
    inline TestType2 evaluate() const
    {
        TestType2 res;
        evalTo(res);
        return res;
    }
    
    inline void evalTo(TestType2 & dest) const
    {
        d_op.evalTo(dest, d_data);
    }
};

class TestWrapper2
{
private:
    const TestType2 & d_val;
    
public:
    inline TestWrapper2(const TestType2 & val)
        : d_val(val)
    {
    }
    
    inline const TestType2 & evaluate() const
    {
        return d_val;
    }
    
    inline void evalTo(TestType2 & dest) const
    {
        TT2_copy(dest, d_val);
    }
};

this time, we have both an evaluate() and an evalTo() member function. the first allows one to simply evaluate the expression, generating temporaries for subexpressions; the second is used by callers like operator=() of TestType2 to evaluate the result of an expression into an existing object of type TestType2. the expression machinery is added to TestType2 by the following functions:
    template<class O, class D>
    inline TestType2(const TestExpression2<O, D> & src)
    {
        TT2_createcopy(*this, src.evaluate());
    }
    
    template<class O, class D>
    inline TestType2 & operator = (const TestExpression2<O, D> & e)
    {
        e.evalTo(*this);
        return *this;
    }
    
    template<class O, class D>
    inline TestType2 & operator += (const TestExpression2<O, D> & e)
    {
        TT2_add(*this, *this, e.evaluate());
        return *this;
    }
    
    template<class O, class D>
    inline TestType2 & operator -= (const TestExpression2<O, D> & e)
    {
        TT2_sub(*this, *this, e.evaluate());
        return *this;
    }

the operators are defined as follows. there is not much to do: they just provide an evalTo() template function which calls the corresponding three-address operation:
class AddOp2
{
public:
    template<class A, class B>
    inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
    {
        TT2_add(dest, data.first.evaluate(), data.second.evaluate());
    }
};

class SubOp2
{
public:
    template<class A, class B>
    inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
    {
        TT2_sub(dest, data.first.evaluate(), data.second.evaluate());
    }
};

class NegOp2
{
public:
    template<class A>
    inline void evalTo(TestType2 & dest, const A & data) const
    {
        TT2_neg(dest, data.evaluate());
    }
};

again, what is left is the tedious integration, by defining a ton of operators:
inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> > operator + (const TestType2 & a, const TestType2 & b)
{ return TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> >(AddOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }
inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> > operator - (const TestType2 & a, const TestType2 & b)
{ return TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> >(SubOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }

template<class O2, class D2>
inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator + (const TestType2 & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(TestWrapper2(a), b)); }
template<class O2, class D2>
inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator - (const TestType2 & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(TestWrapper2(a), b)); }

template<class O1, class D1>
inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator + (const TestExpression2<O1, D1> & a, const TestType2 & b)
{ return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(AddOp2(), std::make_pair(a, TestWrapper2(b))); }
template<class O1, class D1>
inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator - (const TestExpression2<O1, D1> & a, const TestType2 & b)
{ return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(SubOp2(), std::make_pair(a, TestWrapper2(b))); }

template<class O1, class D1, class O2, class D2>
inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator + (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(a, b)); }
template<class O1, class D1, class O2, class D2>
inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator - (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(a, b)); }

inline TestExpression2<NegOp2, TestWrapper2> operator - (const TestType2 & a)
{ return TestExpression2<NegOp2, TestWrapper2>(NegOp2(), TestWrapper2(a)); }

template<class O1, class D1>
inline TestExpression2<NegOp2, TestExpression2<O1, D1> > operator - (const TestExpression2<O1, D1> & a)
{ return TestExpression2<NegOp2, TestExpression2<O1, D1> >(NegOp2(), a); }

note that all the above code contains a lot of inlines, as opposed to part one. in part one, the aim was not to generate the most efficient code, but to make the expressions evaluated by the expression templates visible. in this second part, we want to inspect the assembler output, and thus want the code to be as optimized as possible.

a real-life example.

in this section, i want to look at three examples. first, let us consider the following:

TestType2 s, t, u, x, y, z;
s /= t + (x - y) * z - u;

the resulting assembler code (generated with g++ -S -O3) looks as follows:
        leaq    1024(%rsp), %rcx
        ....
        movq    %rbx, 56(%rsp)
        call    _Z10TT2_createR9TestType2
        leaq    960(%rsp), %r12
        movq    %r12, %rdi
        call    _Z10TT2_createR9TestType2
        leaq    928(%rsp), %rbp
        movq    %rbp, %rdi
        call    _Z10TT2_createR9TestType2
        leaq    944(%rsp), %rbx
        movq    %rbx, %rdi
        call    _Z10TT2_createR9TestType2
        leaq    1024(%rsp), %rdx
        leaq    1040(%rsp), %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_subR9TestType2RKS_S2_
        leaq    1008(%rsp), %rdx
        movq    %rbx, %rsi
        movq    %rbp, %rdi
        call    _Z7TT2_mulR9TestType2RKS_S2_
        movq    %rbx, %rdi
        call    _Z11TT2_destroyR9TestType2
        leaq    1072(%rsp), %rsi
        movq    %rbp, %rdx
        movq    %r12, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        movq    %rbp, %rdi
        call    _Z11TT2_destroyR9TestType2
        leaq    1056(%rsp), %rdx
        movq    %r12, %rsi
        movq    %r13, %rdi
        call    _Z7TT2_subR9TestType2RKS_S2_
        movq    %r12, %rdi
        call    _Z11TT2_destroyR9TestType2
        leaq    1088(%rsp), %rsi
        movq    %r13, %rdx
        movq    %rsi, %rdi
        call    _Z7TT2_divR9TestType2RKS_S2_
        movq    %r13, %rdi
        call    _Z11TT2_destroyR9TestType2

(i removed a lot of register/memory arithmetic in the beginning, which prepares all addresses.) removing all clutter, we are left with the following function calls:
TT2_create()
TT2_create()
TT2_create()
TT2_create()
TT2_sub()
TT2_mul()
TT2_destroy()
TT2_add()
TT2_destroy()
TT2_sub()
TT2_destroy()
TT2_div()
TT2_destroy()

four temporaries t1, t2, t3, t4 are created. then, the subtraction x - y is computed into the temporary t4, and the result is multiplied by z into the temporary t3. the temporary t4 used to compute x - y is then destroyed, and t is added to t3, with the result being stored in the temporary t2. the temporary t3 is destroyed, and t2 - u is evaluated into t1. finally, s is divided by t1 and t1 is destroyed. this shows that the compiler generated an equivalent to
TestType2 t1, t2, t3, t4; // without calling constructors or destructors!
TT2_create(t1);
TT2_create(t2);
TT2_create(t3);
TT2_create(t4);
TT2_sub(t4, x, y);
TT2_mul(t3, t4, z);
TT2_destroy(t4);
TT2_add(t2, t, t3);
TT2_destroy(t3);
TT2_sub(t1, t2, u);
TT2_destroy(t2);
TT2_div(s, s, t1);
TT2_destroy(t1);

this is pretty much optimal: if one had done this translation by hand in a straightforward way, one would have arrived at the same solution.
now let us re-consider our example from part one: the myvec<T> template. assume we have two myvec<TestType2> vectors v and w, and we write v += v + w; and v += w + v;. the first command should generate something like
for (unsigned i = 0; i < v.size(); ++i)
{
    add(v[i], v[i], v[i]);
    add(v[i], v[i], w[i]);
}

while the second command should generate something like
TestType2 t; // without calling constructors or destructors!
create(t);
for (unsigned i = 0; i < v.size(); ++i)
{
    add(t, w[i], v[i]);
    add(v[i], v[i], t);
}
destroy(t);

or, if the expression templates for myvec<T> are not that good, at least something like
for (unsigned i = 0; i < v.size(); ++i)
{
    TestType2 t; // without calling constructors or destructors!
    create(t);
    add(t, w[i], v[i]);
    add(v[i], v[i], t);
    destroy(t);
}

first, consider v += v + w;. the assembler code generated is the following:
        movl    320(%rsp), %r9d
        testl   %r9d, %r9d
        je      .L128
        xorl    %ebp, %ebp
.L129:
        mov     %ebp, %r12d
        salq    $3, %r12
        movq    %r12, %rbx
        addq    328(%rsp), %rbx
        movq    %rbx, %rdx
        movq    %rbx, %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        movq    %r12, %rdx
        addq    312(%rsp), %rdx
        movq    %rbx, %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        addl    $1, %ebp
        cmpl    320(%rsp), %ebp
        jb      .L129
.L128:

clearly, the code first tests whether the loop has to be run at least once; if not, it jumps to label .L128. then, per loop iteration, exactly two calls to TT2_add() are made. this shows that the code is essentially like
for (unsigned i = 0; i < v.size(); ++i)
{
    add(v[i], v[i], v[i]);
    add(v[i], v[i], w[i]);
}

as we were hoping. thus, the expression templates worked well in this case. now let us look at v += w + v;. this time, a temporary has to be created, since otherwise v[i] is modified before being used in the expression. the generated assembler code is the following:
        movl    320(%rsp), %eax
        cmpl    304(%rsp), %eax
        jne     .L283
        leaq    416(%rsp), %rbx
        movq    %rbx, %rdi
        call    _Z10TT2_createR9TestType2
        movl    320(%rsp), %r8d
        testl   %r8d, %r8d
        je      .L146
        xorl    %r12d, %r12d
        .p2align 4,,10
        .p2align 3
.L147:
        mov     %r12d, %ebp
        movq    %rbx, %rdi
        salq    $3, %rbp
        movq    %rbp, %rsi
        addq    312(%rsp), %rsi
        call    _Z8TT2_copyR9TestType2RKS_
        movq    %rbp, %rdx
        addq    328(%rsp), %rdx
        movq    %rbx, %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        movq    %rbp, %rdi
        addq    328(%rsp), %rdi
        movq    %rbx, %rdx
        movq    %rdi, %rsi
        call    _Z7TT2_addR9TestType2RKS_S2_
        addl    $1, %r12d
        cmpl    320(%rsp), %r12d
        jb      .L147
.L146:
        movq    %rbx, %rdi
        call    _Z11TT2_destroyR9TestType2

this is as good as we were hoping: a temporary is created before the loop and destroyed after it. in the loop, we have three calls: TT2_copy(), TT2_add() and TT2_add() again. thus, the resulting code is
TestType2 t; // without calling constructors or destructors!
create(t);
for (unsigned i = 0; i < v.size(); ++i)
{
    copy(t, w[i]);
    add(t, t, v[i]);
    add(v[i], v[i], t);
}
destroy(t);

this is not as optimal as it could be: the best solution would be to optimize
    copy(t, w[i]);
    add(t, t, v[i]);

to one call: add(t, w[i], v[i]);. but this is still much better than
for (unsigned i = 0; i < v.size(); ++i)
{
    TestType2 t; // without calling constructors or destructors!
    create(t);
    add(t, w[i], v[i]);
    add(v[i], v[i], t);
    destroy(t);
}

where the temporary is created and destroyed every iteration. when trying to add this optimization, it is easier to use TestType from part one to see what is happening. once the output looks fine, one switches back to TestType2 to make sure the generated assembler code is fine.

the code.

you can download the source code of the TestType2 class here, and dummy implementations (to be put into another translation unit) of the TT_* functions here.

introduction.

when developing expression templates, say for a vector class allowing efficient arithmetic, one often wonders whether the long, complicated pages of template code one wrote actually do what they are supposed to do. to be sure, one has to ask the compiler to generate assembler code and wade through that code to see what is happening. this is tiresome, though essential.
most of the time, it is enough to check if the operations done with the supplied type – in the above example of a vector class, the scalar type – are what we want them to be. for example, if v, w, x, y, z are variables of type myvec<int>, say all of the same dimension, one wants to know whether v = 2 * w + (x - (y - z)) is evaluated to something like

for (unsigned i = 0; i < v.size(); ++i)
    v[i] = 2 * w[i] + (x[i] - (y[i] - z[i]));

or (more likely)
for (unsigned i = 0; i < v.size(); ++i)
{
    v[i] = 2 * w[i];
    v[i] += x[i];
    v[i] -= y[i];
    v[i] += z[i];
}

(note that this code is not equivalent to the one above for certain more complex types than int, but for my purposes, such a transformation is fine. but anyway, that is not what we want to discuss here.) an alternative would be that temporaries are created, like in
myvec<int> t1, t2, t3, t4;
t1 = 2 * w;
t2 = y - z;
t3 = x - t2;
t4 = t1 + t3;
v = t4;

this involves allocating and releasing memory and copying around data, which is significantly slower than the direct version
for (unsigned i = 0; i < v.size(); ++i)
    v[i] = 2 * w[i] + (x[i] - (y[i] - z[i]));

the aim of expression templates is to generate such more optimal code automatically. but what if your expression templates don’t do what you want? maybe they still generate temporaries, as in the following version:
for (unsigned i = 0; i < v.size(); ++i)
{
    int t1 = 2 * w[i];
    int t2 = y[i] - z[i];
    int t3 = x[i] - t2;
    v[i] = t1 + t3;
}

the compiler will still optimize this to the same as above, but in case the type is not int but, say, myArbitraryPrecisionInteger (which uses expression templates itself and can deal very efficiently with expressions like v[i] = 2 * w[i] + (x[i] - (y[i] - z[i]));), this code is suboptimal, and no expression templates provided by that type can make it better.
therefore, one would like to have a dummy type, a type one can just plug in instead of int in the above example, which somehow outputs what exactly is done: which temporaries are created and destroyed, and which operations and expressions are created.

a test type.

the test type has to use expression templates itself to gather the expressions which are generated by the other expression templates, like the ones of myvec<T>. it should then print these expressions when assignments etc. occur, say to std::cout. one can then run a test program and just read the output to see what was going on. this is in general much easier than looking at the generated assembler code.
note that we only care about additive operations in the following, to make the code more readable.
the basic version of the type looks as follows:

class TestType
{
public:
    TestType()
    {
        std::cout << "create " << this << "\n";
    }

    TestType(const TestType & src)
    {
        std::cout << "create " << this << " from " << &src << "\n";
    }

    ~TestType()
    {
        std::cout << "destroy " << this << "\n";
    }

    TestType & operator = (const TestType & src)
    {
        std::cout << "copy " << this << " from " << &src << "\n";
        return *this;
    }
    
    TestType & operator += (const TestType & b)
    {
        std::cout << this << " += " << &b << "\n";
        return *this;
    }
    
    TestType & operator -= (const TestType & b)
    {
        std::cout << this << " -= " << &b << "\n";
        return *this;
    }
};

this already allows us to see when TestType objects and temporaries are created and assigned, and when basic arithmetic is done. but so far, no real arithmetic can be done. to allow arithmetic, we introduce a TestExpression<O, D> template. here, the O class describes the operation, and the D class describes the argument(s). it is defined as follows:
template<class Op, class Data>
class TestExpression
{
private:
    Op d_op;
    Data d_data;
    
public:
    inline TestExpression(const Op & op, const Data & data)
        : d_op(op), d_data(data)
    {
    }
    
    operator TestType () const
    {
        std::cout << "casting TestExpression[";
        d_op.print(d_data);
        std::cout << "] to TestType()\n";
        return TestType();
    }
    
    void print() const
    {
        d_op.print(d_data);
    }
};

to add support for expressions to the TestType class, one adds the following methods to it:
    template<class O, class D>
    TestType(const TestExpression<O, D> & src)
    {
        std::cout << "create " << this << " from ";
        src.print();
        std::cout << "\n";
    }

    template<class O, class D>
    TestType & operator = (const TestExpression<O, D> & e)
    {
        std::cout << this << " = ";
        e.print();
        std::cout << "\n";
        return *this;
    }

    template<class O, class D>
    TestType & operator += (const TestExpression<O, D> & e)
    {
        std::cout << this << " += ";
        e.print();
        std::cout << "\n";
        return *this;
    }
    
    template<class O, class D>
    TestType & operator -= (const TestExpression<O, D> & e)
    {
        std::cout << this << " -= ";
        e.print();
        std::cout << "\n";
        return *this;
    }

then, if we assign a TestExpression<O, D> to a TestType object, or add it to it, or subtract it from it, etc., the corresponding messages are printed. now, let us discuss how the operations are implemented. these are simple classes with template members, which do not store data:
class AddOp
{
public:
    template<class A, class B>
    void print(const std::pair<A, B> & data) const
    {
        std::cout << "(";
        data.first.print();
        std::cout << " + ";
        data.second.print();
        std::cout << ")";
    }
};

class SubOp
{
public:
    template<class A, class B>
    void print(const std::pair<A, B> & data) const
    {
        std::cout << "(";
        data.first.print();
        std::cout << " - ";
        data.second.print();
        std::cout << ")";
    }
};

class NegOp
{
public:
    template<class A>
    void print(const A & data) const
    {
        std::cout << "-";
        data.print();
    }
};

note that all operations but the unary NegOp are binary; the data object is in that case a std::pair<T1, T2> object. now one main piece is missing which puts everything together: the overloaded operators which generate the expression templates. let us begin with the universal ones: the ones taking two expressions and combining them by an operator.
template<class O1, class D1, class O2, class D2>
TestExpression<AddOp, std::pair<TestExpression<O1, D1>, TestExpression<O2, D2> > > operator + (const TestExpression<O1, D1> & a, const TestExpression<O2, D2> & b)
{ return TestExpression<AddOp, std::pair<TestExpression<O1, D1>, TestExpression<O2, D2> > >(AddOp(), std::make_pair(a, b)); }
template<class O1, class D1, class O2, class D2>
TestExpression<SubOp, std::pair<TestExpression<O1, D1>, TestExpression<O2, D2> > > operator - (const TestExpression<O1, D1> & a, const TestExpression<O2, D2> & b)
{ return TestExpression<SubOp, std::pair<TestExpression<O1, D1>, TestExpression<O2, D2> > >(SubOp(), std::make_pair(a, b)); }

this code is not exactly readable, but does its job: it takes two expressions, TestExpression<O1, D1> and TestExpression<O2, D2>, and combines them into a new expression TestExpression<NewOp, std::pair<TestExpression<O1, D1>, TestExpression<O2, D2> > >. the operator for negation looks similar, but simpler:
template<class O1, class D1>
TestExpression<NegOp, TestExpression<O1, D1> > operator - (const TestExpression<O1, D1> & a)
{ return TestExpression<NegOp, TestExpression<O1, D1> >(NegOp(), a); }

but this whole thing only works when we already have expressions. so far, we have no code which actually creates an expression in the first place. this can be done by more operator overloading, and by introducing a TestWrapper which encapsulates an object of type TestType and behaves like an expression on its own. let us first show the operator definition in the unary case:
TestExpression<NegOp, TestWrapper> operator - (const TestType & a)
{ return TestExpression<NegOp, TestWrapper>(NegOp(), TestWrapper(a)); }

the class TestWrapper encapsulates a TestType object. the definition looks as follows:
class TestWrapper
{
private:
    const TestType & d_val;
    
public:
    inline TestWrapper(const TestType & val)
        : d_val(val)
    {
    }
    
    void print() const
    {
        std::cout << &d_val;
    }
};

compare this to the definition of TestExpression above; note that no casting operator is needed as a TestWrapper object should never show up directly to the user.
now, we are left to implement the binary operators for + and -. we have to go through all combinations of TestType and TestExpression<O, D> (except two TestExpression<O, D>‘s, which we already covered). this looks as follows:
TestExpression<AddOp, std::pair<TestWrapper, TestWrapper> > operator + (const TestType & a, const TestType & b)
{ return TestExpression<AddOp, std::pair<TestWrapper, TestWrapper> >(AddOp(), std::make_pair(TestWrapper(a), TestWrapper(b))); }
TestExpression<SubOp, std::pair<TestWrapper, TestWrapper> > operator - (const TestType & a, const TestType & b)
{ return TestExpression<SubOp, std::pair<TestWrapper, TestWrapper> >(SubOp(), std::make_pair(TestWrapper(a), TestWrapper(b))); }

template<class O2, class D2>
TestExpression<AddOp, std::pair<TestWrapper, TestExpression<O2, D2> > > operator + (const TestType & a, const TestExpression<O2, D2> & b)
{ return TestExpression<AddOp, std::pair<TestWrapper, TestExpression<O2, D2> > >(AddOp(), std::make_pair(TestWrapper(a), b)); }
template<class O2, class D2>
TestExpression<SubOp, std::pair<TestWrapper, TestExpression<O2, D2> > > operator - (const TestType & a, const TestExpression<O2, D2> & b)
{ return TestExpression<SubOp, std::pair<TestWrapper, TestExpression<O2, D2> > >(SubOp(), std::make_pair(TestWrapper(a), b)); }

template<class O1, class D1>
TestExpression<AddOp, std::pair<TestExpression<O1, D1>, TestWrapper> > operator + (const TestExpression<O1, D1> & a, const TestType & b)
{ return TestExpression<AddOp, std::pair<TestExpression<O1, D1>, TestWrapper> >(AddOp(), std::make_pair(a, TestWrapper(b))); }
template<class O1, class D1>
TestExpression<SubOp, std::pair<TestExpression<O1, D1>, TestWrapper> > operator - (const TestExpression<O1, D1> & a, const TestType & b)
{ return TestExpression<SubOp, std::pair<TestExpression<O1, D1>, TestWrapper> >(SubOp(), std::make_pair(a, TestWrapper(b))); }

this is as annoying to write as it looks, but it is necessary. but only once.
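before plugging TestType into a container like myvec<T>, a small (made-up) test program already shows the kind of output to expect:

#include <iostream>
// ... TestType, TestExpression, TestWrapper and the operators from above ...

int main()
{
    TestType a, b, c;
    a += b + (-c); // prints something like: 0x... += (0x... + -0x...)
    a = b - c;     // prints something like: 0x... = (0x... - 0x...)
    return 0;
}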

testing the result.

now assume that v and w are two objects of type myvec<TestType>, each having six elements. the object s is of type TestType itself. assume that i write the following: v += v + w. then the compiled version will output:

0x1b780d0 += 0x1b780d0
0x1b780d0 += 0x1b780f0
0x1b780d1 += 0x1b780d1
0x1b780d1 += 0x1b780f1
0x1b780d2 += 0x1b780d2
0x1b780d2 += 0x1b780f2
0x1b780d3 += 0x1b780d3
0x1b780d3 += 0x1b780f3
0x1b780d4 += 0x1b780d4
0x1b780d4 += 0x1b780f4
0x1b780d5 += 0x1b780d5
0x1b780d5 += 0x1b780f5

this shows that the command v += v + w is replaced by something like:
for (unsigned i = 0; i < v.size(); ++i)
{
    v[i] += v[i];
    v[i] += w[i];
}

now let us look at something more complicated. if i write v += w + v, this cannot be translated to
for (unsigned i = 0; i < v.size(); ++i)
{
    v[i] += w[i];
    v[i] += v[i];
}

anymore, as v[i] changes its value in between. i added code to my myvec<T> implementation to detect and try to avoid such problems. in this case, it should translate the code to something like
for (unsigned i = 0; i < v.size(); ++i)
{
    TestType t = w[i] + v[i];
    v[i] += t;
}

the output is:
create 0x7fffc4d6b5df
copy 0x7fffc4d6b5df from 0x17860f0
0x7fffc4d6b5df += 0x17860d0
0x17860d0 += 0x7fffc4d6b5df
copy 0x7fffc4d6b5df from 0x17860f1
0x7fffc4d6b5df += 0x17860d1
0x17860d1 += 0x7fffc4d6b5df
copy 0x7fffc4d6b5df from 0x17860f2
0x7fffc4d6b5df += 0x17860d2
0x17860d2 += 0x7fffc4d6b5df
copy 0x7fffc4d6b5df from 0x17860f3
0x7fffc4d6b5df += 0x17860d3
0x17860d3 += 0x7fffc4d6b5df
copy 0x7fffc4d6b5df from 0x17860f4
0x7fffc4d6b5df += 0x17860d4
0x17860d4 += 0x7fffc4d6b5df
copy 0x7fffc4d6b5df from 0x17860f5
0x7fffc4d6b5df += 0x17860d5
0x17860d5 += 0x7fffc4d6b5df
destroy 0x7fffc4d6b5df

this shows that in fact, the generated code is more like this:
for (unsigned i = 0; i < v.size(); ++i)
{
    TestType t = w[i];
    t += v[i];
    v[i] += t;
}

so without looking at the generated assembler code, we already have a good idea what the template expressions of myvec<T> are doing. now assume we write something like v += s * v + w;. the output is
0x1786116 += (0x1786116 * 0x7fffc4d6b7af)
0x1786116 += 0x1786110
0x1786117 += (0x1786117 * 0x7fffc4d6b7af)
0x1786117 += 0x1786111
0x1786118 += (0x1786118 * 0x7fffc4d6b7af)
0x1786118 += 0x1786112
0x1786119 += (0x1786119 * 0x7fffc4d6b7af)
0x1786119 += 0x1786113
0x178611a += (0x178611a * 0x7fffc4d6b7af)
0x178611a += 0x1786114
0x178611b += (0x178611b * 0x7fffc4d6b7af)
0x178611b += 0x1786115

which shows that the code was translated to something like
for (unsigned i = 0; i < v.size(); ++i)
{
    v[i] += v[i] * s;
    v[i] += w[i];
}

(this of course assumes that we also implemented operator * for TestType.)
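the multiplicative pieces could look like the following sketch, which simply mirrors AddOp and SubOp (the actual code may differ):

class MulOp
{
public:
    template<class A, class B>
    void print(const std::pair<A, B> & data) const
    {
        std::cout << "(";
        data.first.print();
        std::cout << " * ";
        data.second.print();
        std::cout << ")";
    }
};

TestExpression<MulOp, std::pair<TestWrapper, TestWrapper> > operator * (const TestType & a, const TestType & b)
{ return TestExpression<MulOp, std::pair<TestWrapper, TestWrapper> >(MulOp(), std::make_pair(TestWrapper(a), TestWrapper(b))); }
// ... plus the mixed TestType/TestExpression overloads, as for + and - ...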
to really check what is going on, one still has to inspect the generated assembler code. for example, in the test above, which used a temporary when writing v += w + v and no temporary when writing v += v + w, it is unclear whether the compiler already made the decision which case to use (which it could, since it knows the addresses of v and w, or at least knows whether they are equal or not), or whether both cases are compiled into the program and the running program has to figure out which block of code to run. this cannot be detected using TestType.
note that for checking the assembler code, one is better off using a different test type: one which translates everything to “three-address assembler”-like commands, which are declared extern (and not defined in this translation unit). then one can search for these function calls, and the whole scenario is more realistic than with complex operators (as above, where std::cout is used all over the place), which the compiler will often outsource into separate functions.

the code.

you can download the source code of the TestType class here.

here is a table of contents for the implementing multivariate polynomials in c++ using templates mini-series:

the source code can be downloaded here (8 kb). it is licensed under the simplified bsd license. if you need the code licensed under another open source license, feel free to contact me.

welcome to the fifth (also last and easiest) part of the multivariate polynomials in c++ using templates series. this part will explain how i implemented long division and the euclidean algorithm.

long division.

long division in R[X] is defined as follows: given polynomials f, g \in R[X], find polynomials q, r \in R[X] such that f = q \cdot g + r and \deg r < \deg g. provided that the leading term of g is a unit in R (or at least divides most things), q and r exist and are unique (in fact, i’m implicitly assuming that R is commutative and unitary). this division makes R[X] a euclidean ring in case R is a field (i.e. all non-zero elements are invertible). our T is in general not a field, but we can still try to do long division, and let the user worry whether this actually makes sense. for example, if T == int, one can divide two elements (assuming the second is not zero), but this is not the division a mathematician would expect. hence, if g is not monic in the case of T == int, our implementation will output something which does not necessarily make sense.

anyway. my implementation is rather straightforward, which is not very surprising:

template<class T>
void eucliddiv(const poly<1, T> & f, const poly<1, T> & g, poly<1, T> & q, poly<1, T> & r)
{
    if (f.degree() < g.degree())
    {
        // Ignore the trivial case
        q = poly<1, T>();
        r = f;
        return;
    }
    q = poly<1, T>(true, f.degree() - g.degree() + 1);
    r = f;
    for (int i = q.degree(); i >= 0; --i)
    {
        q[i] = r[i + g.degree()] / g.leading();
        for (int j = g.degree(); j >= 0; --j)
            r[i + j] -= q[i] * g[j];
    }
    r.normalize();
    // Note that the degree of q is already correct.
}

here, we assume that f.leading() / g.leading() is non-zero; otherwise, q.normalize() would have to be called, but in that case the whole result will probably be broken anyway. we create the polynomial q using the special constructor mentioned here.
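as a small usage sketch (assuming coefficients are set via the non-const operator[], which resizes the polynomial as needed), dividing f = x^2 - 1 by g = x - 1 gives q = x + 1 and r = 0:

poly<1, double> f, g, q, r;
f[2] = 1;  f[0] = -1;  // f = x^2 - 1
g[1] = 1;  g[0] = -1;  // g = x - 1
eucliddiv(f, g, q, r); // now q = x + 1 and r = 0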

euclidean algorithm.

the implementation of the euclidean algorithm no longer depends on the internals of the polynomial class (long division only made use of the special constructor). first, we take care of the special cases (one of the two arguments is zero), and then proceed with the main loop. we normalize the remainder in every iteration of the loop, and also the result, to (hopefully) increase numerical stability a bit. hence, for this to work, we really need T to be a field, or otherwise it will only work in very special cases.

template<class T>
poly<1, T> GCD(const poly<1, T> & f, const poly<1, T> & g)
{
    if (f.isZero())
    {
        if (g.isZero())
            return f; // both are zero
        else
            return g / g.leading(); // make g monic
    }
    poly<1, T> d1(f / f.leading()), d2(g / g.leading()), q, r;
    while (!d2.isZero())
    {
        eucliddiv(d1, d2, q, r);
        d1.swap(d2);
        d2 = r;
        if (!r.isZero())
            d2 /= r.leading(); // make r monic (avoid dividing by the leading coefficient of zero)
    }
    return d1;
}

i don't think there's much to write about this, which isn't already written somewhere else. this just shows how to implement a generic polynomial based algorithm.

welcome to the fourth part of the multivariate polynomials in c++ using templates series. in this part, i want to explain how to implement efficient polynomial evaluation using a collection of interacting templates. the output will be quite optimized code (assuming the compiler has a decent enough optimizer).

evaluating multivariate polynomials is more complicated. i started with a naive implementation of operator() in poly<n, T>:

    template<class S>
    poly<n - 1, S> operator() (const S & x) const
    {
        poly<n - 1, S> res;
        S xx = (S)1;
        for (unsigned i = 0; i < d_value.size(); ++i)
        {
            res += d_value[i] * xx;
            xx *= x;
        }
        return res;
    }

if f is of type poly<2, int>, then f(4) will be of type poly<1, int>, whence f(4)(2) will call poly<1, int>::operator()<int>(const int &) to yield a polynomial of type poly<0, int>, which automatically casts to an int. unfortunately, in this process, first f(4) will be created, which requires arithmetic on polynomials of type poly<1, int>, and then the resulting polynomial will be evaluated again. in particular, if n is larger, this is far from optimal. hence, this solution is fine if evaluations are done seldom, but if they are done more often, it is too slow.

to stay with our example of f of type poly<2, int>: one could evaluate f(4, 2), written as f(4)(2), directly as follows:

int result = 0, xx1 = 1;
for (int i = 0; i <= f.degree(); ++i)
{
    int result2 = 0, xx2 = 1;
    for (int j = 0; j <= f[i].degree(); ++j)
    {
        result2 += f[i][j] * xx2;
        xx2 *= 2;
    }
    result += result2 * xx1;
    xx1 *= 4;
}

clearly, this is tiresome (and prone to typos) to write every time. moreover, it is also not optimal, as our operator[] does more than just return an element. in particular, if f is not const in this context, the compiler will insert code at every operation f[i] and f[i][j] to check whether the index is out of range (to resize d_value in that case). writing more directly

int result = 0, xx1 = 1;
for (int i = 0; i < f.d_value.size(); ++i)
{
    int result2 = 0, xx2 = 1;
    for (int j = 0; j < f.d_value[i].d_value.size(); ++j)
    {
        result2 += f.d_value[i].d_value[j] * xx2;
        xx2 *= 2;
    }
    result += result2 * xx1;
    xx1 *= 4;
}

would result in faster code, if it compiled – but d_value is private. for this reason, i came up with two templates:

template<int n, class T, class S = T>
class poly_evaluator;
template<int n, class T, class HL, class S>
class poly_evaluator_impl;

the template poly_evaluator<n, T, S> is instantiated in poly<n, T>::operator()<S>(const S &), which is defined as follows:

    template<class S>
    inline poly_evaluator<n, T, S> operator() (const S & x) const
    {
        return poly_evaluator<n, T, S>(*this, x);
    }

the idea is that if operator() of the resulting poly_evaluator<n, T, S> object is called, it will spawn an object of type poly_evaluator_impl<n-1, T, poly_evaluator<n, T, S>, S>. if operator() of this new object is called, it will create an object of type poly_evaluator_impl<n-2, T, poly_evaluator_impl<n-1, T, poly_evaluator<n, T, S>, S>, S>, and so on. the purpose of carrying the type of the caller around is that the outer loop of the evaluation function is the loop for the inner-most object (of type poly_evaluator<n, T, S>). if operator() were right-associative instead of left-associative (as it is), this trouble would not be necessary.
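in (simplified) code, the spawning looks roughly like the following sketch; unlike the real implementation, it keeps the same evaluation type S on every level:

template<int n, class T, class HL, class S>
class poly_evaluator_impl
{
private:
    const HL & d_owner; // the evaluator one level up
    S d_evalpoint;      // the evaluation point added at this level

public:
    inline poly_evaluator_impl(const HL & owner, const S & x)
        : d_owner(owner), d_evalpoint(x)
    {
    }

    // applying the next evaluation point descends one level and remembers
    // the current object as the new owner
    inline poly_evaluator_impl<n - 1, T, poly_evaluator_impl<n, T, HL, S>, S> operator() (const S & x) const
    {
        return poly_evaluator_impl<n - 1, T, poly_evaluator_impl<n, T, HL, S>, S>(*this, x);
    }

    // ... evaluate<SS, Fun>() and, on the last level, a cast operator ...
};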

both templates poly_evaluator<n, T, S> and poly_evaluator_impl<n, T, HL, S> provide an internal function template<class SS, class Fun> evaluate(SS & res, Fun evalfun). in the case of poly_evaluator_impl<n, T, HL, S>, it calls the corresponding evaluate function of its owner (with evalfun replaced); in the case of poly_evaluator<n, T, S>, it implements the outermost loop, which runs over the coefficients of X_0 and uses evalfun to evaluate the coefficients. hence, the evalfun handed upwards must be extended in every step to include the next variable.

assume that we have a polynomial f of type poly<3, int>, and we want to evaluate f(10)(20)(30). this is an object of type poly_evaluator_impl<1, int, poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int>, int>, whose cast operator operator int() will start the evaluation using the nested class poly_evaluator_impl<1, int, poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int>, int>::eval_fun, defined as follows:

    class eval_fun
    {
        const poly_evaluator_impl<1, int, poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int>, int> & d_owner;
        
    public:
        inline eval_fun(const poly_evaluator_impl<1, int, poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int>, int> & owner)
            : d_owner(owner)
        {
        }
        
        inline int operator() (const poly<1, int> & p) const
        {
            int res = 0;
            int xx = 1;
            for (int i = 0; i < (int)p.d_value.size(); ++i)
            {
                res += p.d_value[i] * xx;
                xx = xx * d_owner.d_evalpoint;
            }
            return res;
        }
    };

an object of it is created by operator int() as follows:

    inline operator int() const
    {
        int res = 0;
        d_owner.evaluate(res, eval_fun(*this));
        return res;
    }

the object eval_fun(*this) can now be used to evaluate polynomials of type poly<1, int> at the specified evaluation point d_owner.d_evalpoint, which equals 30 in this case.

the operator int() calls poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int>::evaluate<int, Fun>(int &, Fun), which in turn is implemented as follows:

    template<class Fun>
    inline void evaluate(int & res, const Fun & evalfun) const
    {
        d_owner.evaluate(res, eval_fun<int, Fun>(*this, evalfun));
    }

the newly created object of type eval_fun<int, Fun> will now evaluate polynomials of type poly<2, int> by using the provided functor evalfun, which is an object of the class poly_evaluator_impl<1, int, poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int>, int>::eval_fun defined above. the class eval_fun in poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int> is defined as follows:

    template<class Fun>
    class eval_fun
    {
        const poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int> & d_owner;
        const Fun & d_evalfun;
        
    public:
        inline eval_fun(const poly_evaluator_impl<2, int, poly_evaluator<3, int, int>, int> & owner, const Fun & evalfun)
            : d_owner(owner), d_evalfun(evalfun)
        {
        }
        
        inline int operator() (const poly<2, int> & p) const
        {
            int res = 0;
            int xx = 1;
            for (int i = 0; i < (int)p.d_value.size(); ++i)
            {
                res += d_evalfun(p.d_value[i]) * xx;
                xx = xx * d_owner.d_evalpoint;
            }
            return res;
        }
    };

note that d_owner.d_evalpoint is 20 in this case. finally, we are at the top level, namely in the function poly_evaluator<3, int, int>::evaluate<int, Fun>(), defined as follows:

    template<class Fun>
    inline void evaluate(int & res, const Fun & evalfun) const
    {
        int xx = 1;
        for (int i = 0; i < (int)d_poly.d_value.size(); ++i)
        {
            res += evalfun(d_poly.d_value[i]) * xx;
            xx = xx * d_evalpoint;
        }
    }

(here, d_evalpoint is 10.) this is the outer loop. the middle loop is inserted when calling evalfun(d_poly.d_value[i]), and into that the innermost loop is inserted when it calls d_evalfun(p.d_value[i]). this all essentially boils down to the following combination:

    {
        int res = 0;
        int xx = 1;
        for (int i = 0; i < (int)d_poly.d_value.size(); ++i)
        {
            int tmp;
            const poly<2, int> & p = d_poly.d_value[i];
            {
                int res2 = 0;
                int xx2 = 1;
                for (int i2 = 0; i2 < (int)p.d_value.size(); ++i2)
                {
                    int tmp2;
                    const poly<1, int> & p2 = p.d_value[i2];
                    {
                        int res3 = 0;
                        int xx3 = 1;
                        for (int i3 = 0; i3 < (int)p2.d_value.size(); ++i3)
                        {
                            res3 += p2.d_value[i3] * xx3;
                            xx3 = xx3 * 30; // the innermost evaluation point
                        }
                        tmp2 = res3;
                    }
                    res2 += tmp2 * xx2;
                    xx2 = xx2 * 20; // the middle evaluation point
                }
                tmp = res2;
            }
            res += tmp * xx;
            xx = xx * 10; // the outermost evaluation point
        }
        return res;
    }

of course, this is just an intermediate step (to highlight the similarities to all the code snippets from above); the compiler will optimize this into something more compact.

this “example” already presented most of the concept. the implementation itself is more complicated, since different types can be used for evaluation (at different stages, even), and since the implementation has support for allocators.

an important point is that both templates poly_evaluator<n, T, S> and poly_evaluator_impl<n, T, HL, S> are specialized for n == 1, as in that case there is no lower level. hence, these specializations provide no evaluate template and do not try to pass evaluation further down, but right away pass it higher up to their owner, or, in the case of poly_evaluator<1, T, S>, just do the evaluation.
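for illustration, the bottom level of the first template could look like the following sketch (ignoring allocator support and mixed evaluation types):

template<class T, class S>
class poly_evaluator<1, T, S>
{
private:
    const poly<1, T> & d_poly;
    S d_evalpoint;

public:
    inline poly_evaluator(const poly<1, T> & p, const S & x)
        : d_poly(p), d_evalpoint(x)
    {
    }

    // no lower level exists, so the cast operator evaluates directly
    inline operator S () const
    {
        S res = S(), xx = S(1);
        for (int i = 0; i < (int)d_poly.d_value.size(); ++i)
        {
            res += d_poly.d_value[i] * xx;
            xx = xx * d_evalpoint;
        }
        return res;
    }
};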

this concludes the fourth part. the last part will concentrate on something less complicated: implementing long division and the euclidean algorithm.