
posts about computer. (page 6.)

from today on, i’m enforcing https for (almost) all my web pages. i’ve added an automatic redirect which redirects all http:// pages to their corresponding https:// pages.

despite the tons of problems ssl/tls has – essentially, everything older than TLS 1.2 is unsafe, but only very few browsers actually support TLS 1.2 even though it was already standardized in 2008 – it is better than using no encryption at all.

and yes, i know that “just” having a self-signed certificate is only partially helpful. but i don’t have a better solution at the moment, as i don’t want to dump tons of money into CAs which i don’t really trust anyway. (maybe i’ll change my mind eventually. but not right now.) so for the moment, you have to accept my self-signed certificate (whose sha-1 fingerprint is 69:02:33:1D:F7:E3:9C:DA:D2:7D:9E:1D:4A:C6:40:99:A3:F8:B2:58, and whose md5 fingerprint is E5:DA:7D:4E:11:34:20:BD:7C:9E:3B:CD:E1:C9:6A:1B. you can compare them in firefox, for example, by clicking the padlock and then clicking “more information…” and then “view certificate”, and in chromium/chrome by clicking the padlock and then “certificate information”).

posted in: computer

after recently installing arch linux on my laptop (a thinkpad x230), i was first quite happy. but after some time, i noticed some flaws. first of all, having to do so many things by hand is somewhat annoying. if it were just about installing software: no big deal (for me). but it is also about configuring stuff, like deciding between networkmanager and the arch-specific command line wireless setup, which is installed by default. switching to networkmanager was quite annoying, and in the end didn’t work very well (one anecdote: at some point, i had to reboot to get plain eth0 working again – reconfiguring by hand might have worked, but you don’t always have time to do that). power management was not so good either: after trying some things, i finally had a system which, coming back from suspend, waited a few seconds (usually enough to enter my password and unlock the computer) and then went back to suspend. after the next unsuspend, there was no password protection left…
the final kick came when i tried to install hugin: it simply didn’t work. at all. pacman always gave up without an understandable error message. great, eh? at that point i decided to try linux mint another time.
last weekend, i first tried to install linux mint debian edition (lmde) on my laptop. it has the advantage of being a rolling release distribution. well, the installer doesn’t support harddisk encryption, but it allows you to do that by yourself. after having managed that with arch linux, i tried it. basically, at two points during the installation process, the installer lets you do some stuff – set up and mount partitions in the first stop, and installing packages/modules and setting up stuff for the first boot in the second stop – and waits for you to press the “forward” button. unfortunately, during the second stop, the “forward” button was grayed out. i hoped that maybe the installer enables it when time comes, but after doing everything (hopefully) and waiting, nothing happened. great, eh? well, i searched around the net, but found nothing. the only thing i found was a blog entry announcing lmde 201303 (which i was trying to install) with the note “please use this blog to report bugs”, which is nice, but not when you notice that comments are disabled. at that point, i gave up and downloaded a linux mint 15 image instead…
installing that one went quite smoothly. of course, again, the installer didn’t support using my encrypted setup (seems to be implemented nowhere, except in the old ubuntu alternate installer, which is discontinued. yay, the good old times when stuff just worked out of the box!). after mounting stuff before starting the installer (i also had to install the lvm2 package), the install went well. before rebooting, though, i had to do some new tricks. after trying around unsuccessfully for some time, i finally found a question on askubuntu whose accepted answer provided the solution for me: it explains how to set up /etc/crypttab, initramfs and grub to ask for a password on boot-up and unlock the encrypted disks (see also below in this post). with these steps, i was able to boot the newly installed linux mint 15, and from that point on, everything went well.
most stuff worked out of the box, and all packages i wanted to install actually existed (arch linux doesn’t have mmv by default, for example), and both wine and hugin did work out of the box. the only very annoying part was that linux mint screwed up my firefox profile. it created a new profile and changed the .mozilla/firefox/profiles.ini to only use the new profile. after modifying that file, i had my old profile back. after that, i was happy, and after a couple of days with wlan/vpn field test (i never even got so far to try vpn on arch linux), i’m opting to keep linux mint 15 for some while. i guess i’ll also install it on my desktop (replacing ubuntu 12.04 lts).
(actually, for desktop machines, arch linux will function much better, since there you don’t need fancy stuff like wireless setup, power saving etc. nonetheless, after the experience i had i won’t try it again for some time…)

quick conclusion: how to set up luks/lvm encryption manually on ubuntu/mint.

before i forget how this was done, or maybe askubuntu gets rid of the question and answer, i’ll document the necessary steps i had to do here (all paths are relative to the installed system’s root):

  1. create /etc/crypttab with a line like this:
    sda2_crypt UUID=... none luks
    to find out the correct uuid, try ls -la /dev/disk/by-uuid/. then you can see which uuid is mapped to which device. another (somewhat unrelated) useful tool is lsblk, which shows your current device and filesystem topology.
  2. create /etc/initramfs-tools/conf.d/cryptroot containing a similar line:
    again, use the correct uuid instead of the “…”.
  3. mount /dev into the new environment by running
    mount -o bind /dev /target/dev
    (replace target with the path to the new system’s root directory.)
    then chroot the environment, and run the following commands:
    mount -t proc proc /proc
    mount -t sysfs sys /sys
    mount -t devpts devpts /dev/pts
    locale-gen --purge --no-archive
    update-initramfs -k all -c

    this will set up the ram disk correctly so that it will deal with the encrypted root partition. (note that it usually will complain about an “invalid line” in /etc/crypttab. you can usually ignore this.)
  4. change GRUB_CMDLINE_LINUX in /etc/default/grub to something like
    again, think of replacing sda2_crypt if necessary and filling in the correct uuid.
  5. in the chroot environment, run update-grub.

after this, it should work. maybe you also have to install cryptsetup and/or lvm2 in the chroot environment, if it wasn’t already done by the installer.
anyway, i’m really looking forward to the moment when most distribution installers know how to (again!) deal with existing luks/lvm installations. i hope it won’t take as long as it took for basic hdd encryption to find its way into the graphical installers in the first place. (that was, like, forever! and without an initiative of the eff, it might really have taken forever.)

today, i finally got around to trying arch linux with xfce4 on my laptop. and considering how it looks, i will also install it on my desktop computer on the next reinstall. (currently, it still has ubuntu with xfce4 installed. and in case you wonder why i decided to try out a new system on my laptop: i’ve been using linux mint 14 the last couple of months, and was pretty unhappy both during install – setting up full disk encryption was somewhat annoying – and finally when trying to install wine recently, which simply didn’t work.)

i followed the beginner’s guide, which essentially told me what to enter on the console to set up arch linux. (note that arch linux does not come with a graphical install, you have to type a lot of commands in yourself. but apart from that, it actually works like a charm. so if you’re not scared by using the command line, it’s worth a try.)

there’s also an arch wiki entry about encrypting an lvm setup, which is what i was doing and wanted to continue doing – for example, so that i don’t have to start over by copying all my data to the machine, but can simply re-use the encrypted partition layout set up before. for the way i (and ubuntu) was doing it, that wiki entry pointed to a blog post by simon dittlmann, which explains how to set up a huge encrypted partition, which will contain a lvm (logical volume manager) group with root, home and swap partition. unfortunately, the blog post is somewhat older, and apparently the whole installation procedure of arch linux changed somewhat, so i had to improvise.

to provide up-to-date documentation on how to install arch linux with full disk encryption, i wrote this post. it discusses both how to create such a setup and how to install arch linux into an already existing one.

beginning installation: creating the encrypted partition.

first, follow the beginner’s guide up to the step “prepare the storage drive”. at this step, you have to do something else.

(in case you already have a working set-up, skip the next steps until the mark.)

follow the steps described in the beginner’s guide, create a small boot partition – this one will not be encrypted. i assume that it will be /dev/sda1. it should be a simple ext3/ext4 partition. (i usually give it 256 or 512 megabytes.)

then, create another partition (i assume it will be /dev/sda2), which consumes the whole left-over space on the hard disk. first, you should clear everything on that partition, preferably with random bits. you can for example do:
dd if=/dev/urandom of=/dev/sda2
this will take quite some time, though. alternatively, you can skip this step, and later, after encrypting the partition, overwrite the encrypted partition with zeros. (look down below for that.) afterwards, set up encryption on /dev/sda2:

modprobe dm-crypt
cryptsetup --verbose --cipher aes-xts-plain64 --key-size 512 --verify-passphrase luksFormat /dev/sda2

you will have to enter a passphrase (twice), which you will need later on every boot to unlock the disk. (note that you can later on change the passphrase as you like; look at the section passphrase management in an older blog-post by me.)

(edit: since there is now a successful attack on the aes-cbc-essiv encryption mentioned here earlier, i changed it to aes-xts-plain64, using a different approach.)

(mark: skip until here if you already have a working set-up.)

now you can unlock the encrypted disk:
cryptsetup luksOpen /dev/sda2 lvm

setting up the logical volumes.

(skip almost everything of this section if you already have a working set-up. the only thing you should not skip is the mounting below and enabling swap with swapon.)

after unlocking the encrypted volume, you have to create a volume group and logical volumes inside it. first, begin by creating a physical volume, which will contain the logical volumes. for that, we use the encrypted partition /dev/sda2, whose contents can be accessed by /dev/mapper/lvm. do the following:

lvm pvcreate /dev/mapper/lvm
lvm vgcreate vgroup /dev/mapper/lvm

you can replace vgroup with any name you want. i replaced it with the (future) hostname of my laptop. now you can use the following commands to create logical volumes. there should be at least one volume for root (/) and swap. i recommend to also create a volume for /home, so that your personal files are separated from the operating system and you can simply wipe out the operating system when you want to install a new one by formatting root, but not home. for such a setting, the commands are as follows:
lvm lvcreate -L 16GB -n root vgroup
lvm lvcreate -L 16GB -n swap vgroup
lvm lvcreate -l 100%FREE -n home vgroup

(my machine has 16 gigabyte ram, whence i created a 16 gigabyte swap partition.)
don’t forget to replace vgroup if you used a different name above. you can also choose different names after -n. the next step is to format the data partitions as in the beginner’s guide:
mkfs.ext4 /dev/mapper/vgroup-root
mkfs.ext4 /dev/mapper/vgroup-home

to set up the swap, proceed as follows:
mkswap /dev/mapper/vgroup-swap
swapon /dev/mapper/vgroup-swap

finally, let us mount the partitions to install arch linux on them:

mount /dev/mapper/vgroup-root /mnt
mkdir -p /mnt/home /mnt/boot
mount /dev/mapper/vgroup-home /mnt/home
mount /dev/sda1 /mnt/boot

(you only need the mkdir if you created a new set-up. also, in case you created more logical volumes, you have to adjust the commands above.)

continue arch linux installation.

from this point on, you can follow the beginner’s guide to install arch linux. continue until the point of creating an initial ramdisk environment. there, you must edit /etc/mkinitcpio.conf and modify the HOOKS statement from
HOOKS="base udev autodetect modconf block filesystems keyboard fsck"
(or something similar) to
HOOKS="base udev autodetect modconf block encrypt lvm2 filesystems keyboard fsck"
note that you must insert encrypt lvm2 in precisely this order somewhere before filesystems. afterwards, continue with running mkinitcpio -p linux (or continue editing the config file if necessary).
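
the ordering rule is easy to get wrong when editing by hand. as a minimal sketch (this is not part of mkinitcpio; the function names splitHooks and hooksOrderOk are made up for illustration), the following c++ snippet checks whether the text between the quotes of a HOOKS line has encrypt before lvm2 and both before filesystems:

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Split a whitespace-separated HOOKS value into its hook names.
inline std::vector<std::string> splitHooks(const std::string & hooks)
{
    std::istringstream in(hooks);
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}

// True if "encrypt" appears before "lvm2" and "lvm2" before "filesystems".
// (Hypothetical helper; it only looks at the order of these three names.)
inline bool hooksOrderOk(const std::string & hooks)
{
    const std::vector<std::string> names = splitHooks(hooks);
    // each search starts where the previous one stopped, which enforces the order
    const auto encrypt = std::find(names.begin(), names.end(), "encrypt");
    const auto lvm2 = std::find(encrypt, names.end(), "lvm2");
    const auto filesystems = std::find(lvm2, names.end(), "filesystems");
    return encrypt != names.end() && lvm2 != names.end() && filesystems != names.end();
}
```

the check says nothing about whether the other hooks are sensible; it only encodes the one constraint mentioned above.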

now you can continue with setting the root password.

the next step where you have to pay attention is the step where you set up the boot loader. i chose grub here. set it (or syslinux) up as described in the beginner’s guide. in the case of syslinux, you have to modify /boot/syslinux/syslinux.cfg, and in the case of grub, you have to modify /boot/grub/grub.cfg. in both cases, there should be two entries (regular system and fallback), each containing a line like
APPEND root=/dev/mapper/vgroup-root ro
for syslinux and
linux /vmlinuz-linux root=/dev/mapper/vgroup-root ro quiet
for grub, or something similar. for all such entries, insert cryptdevice=/dev/sda2:vgroup between root=… and ro; that is, the entries should look like
APPEND root=/dev/mapper/vgroup-root cryptdevice=/dev/sda2:vgroup ro
for syslinux and
linux /vmlinuz-linux root=/dev/mapper/vgroup-root cryptdevice=/dev/sda2:vgroup ro quiet
for grub.
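
to make the edit explicit, here is a small sketch of the same transformation in c++. it is only an illustration of where the argument goes, not a tool i actually used; the function name addCryptdevice and its parameters are made up, and it assumes the line contains a single root=… argument separated by spaces:

```cpp
#include <string>

// Insert "cryptdevice=<device>:<name>" right after the root=... argument.
// The line is returned unchanged if it has no root= argument or if it
// already contains a cryptdevice= argument.
inline std::string addCryptdevice(const std::string & line,
                                  const std::string & device,
                                  const std::string & name)
{
    if (line.find("cryptdevice=") != std::string::npos)
        return line;
    const std::string::size_type root = line.find("root=");
    if (root == std::string::npos)
        return line;
    // the root= argument ends at the next space (or at the end of the line)
    std::string::size_type end = line.find(' ', root);
    if (end == std::string::npos)
        end = line.size();
    std::string result = line;
    result.insert(end, " cryptdevice=" + device + ":" + name);
    return result;
}
```

calling addCryptdevice(line, "/dev/sda2", "vgroup") on the two original lines above gives the two modified lines; calling it a second time changes nothing.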

change (2014/04/13): in case you want to use grub, it is better to proceed as follows. edit the line GRUB_CMDLINE_LINUX in /etc/default/grub and add cryptdevice=/dev/sda2:vgroup there. then, run grub-mkconfig -o /boot/grub/grub.cfg as described in the beginner’s guide. this automatically adds this to all entries in grub.cfg. end of change.

afterwards, continue with the beginner’s guide. after the next reboot, you should be asked for a password to unlock the volumes. after entering it correctly, the system should boot up as normal.

nowadays, there are quite some fair trade products customers can choose when buying stuff. there’s fair chocolate, fair bananas, fair t-shirts, etc. one common denominator of these products is that they consist of not too many things, that they are not too complex. for essentially all kinds of products which are too complex – think of electronics – virtually no fair products exist. and in fact, producing a 100% fair electronic device is essentially impossible without a huge amount of resources available. there are just too many different things to ensure.
but fortunately, there are some projects which at least try. most notably, there are two projects i want to write about today. first, there’s the german faire maus, a (somewhat) fair mouse. the precise list of pieces needed to assemble one can be found here, together with information on what problems can arise in their production, which problems are (essentially) solved for the fair mouse, and which are still unsolved. so, while not 100% fair, at least the process is very transparent and it is possible to identify points where the process is still not exactly fair.
another project is the fairphone, a project from the netherlands trying to produce a fairer smartphone. compared to a simple mouse, a smartphone is way more complex, and depends on a much larger range of different parts. well, as a consequence, it is also much harder to make it fair. the fairphone project still tries hard. besides being fair, they also try to be very transparent about where everything is from and under which conditions it was obtained/created. for example, there’s conflict-free tin from a congolese mine involved.
the fairphone project is currently trying to get enough advance orders to produce the first batch of fairphones. they need 5000 orders, and so far, they just got around 1640. the number is increasing now and then, but i’m wondering if it will reach the required 5000 early enough. in september, the fairphone team wants to inform about a possible delivery date, which will hopefully be in october. so if you’re planning to get a (new) smartphone sometime in the near future, you should think about supporting that project. the price of 325 euros is quite in range, and you’re supporting a good cause. (and if it doesn’t work out, you’ll get your money back sometime in fall.)
actually, i just ordered one fairphone last week. (well, and also two fair mouses.) not that i suddenly like the idea of having a smartphone (i still don’t), but then, i can still install linux on it – after all, i will be allowed to do that, as opposed to most other smartphones which you don’t really own when buying them. (isn’t that another reason?)

today i discovered why sometimes, some of my latex output contains tildes (~) in the dvi/pdf version. usually, if you use a tilde in a tex file, it is interpreted as a non-breakable space (except in special circumstances, such as verbatim environments or in \url{…}). but thanks to a “bugfix” to texi2dvi/texi2pdf, which is a wonderful tool as it runs (pdf)latex often enough together with bibtex, makeindex etc., tildes appearing in tex files are now shown as tildes in the dvi/pdf output. which is absolutely unacceptable behaviour.
it seems that this already was reported (see here, here, here), but it is still around. i don’t really know what to think of this – is nobody responsible for working on texi2dvi/texi2pdf? or did people stop using it as it is broken?
anyway, i fixed my locally installed version (/usr/bin/texi2dvi) by changing the line catcode_special=true to catcode_special=false. a more sophisticated fix would be nice, which only changes catcode_special for tex files (and not for texinfo files), but i don’t have time for that now.

last week i was in leiden, attending a workshop on post-quantum cryptography and quantum algorithms at the lorentz center. it was a collection of talks and working in smaller groups, where we discussed certain topics, such as quantum attacks on ideal lattices, in more detail, trying to find a way to use quantum computers to speed up attacks against primitives of post-quantum cryptography. as this is somewhat close to my research – i work on analyzing a quantum algorithm with pawel wocjan and am working on lattices and lattice reductions – i was very happy to attend this meeting. especially since there were not just some mathematicians, but also a lot of experts on various aspects of quantum computers and quantum algorithms. now i also know a lot more about quantum computing, both from the theoretical side – like having been explained grover’s algorithm, which is another of the fundamental quantum algorithms next to the period-finding algorithm family starting with shor’s algorithm – and the practical side – what the current technology with regard to building quantum computers is, and how people writing compilers for quantum computers can suffer and how complicated it can be to turn a “simple” algorithm into a circuit. i think this was one of the most productive workshops i’ve ever attended.
unfortunately, i neither took my camera with me (the little one i had with me last week, since it is somewhat broken: the sd card slot won’t keep the card anymore, similar to what is described here), nor did i really have time to take any pictures, as i spent most waking hours doing mathematics. i took a few shots with my mobile phone on the excursion/conference dinner on wednesday, which happened to be on a boat going through the grachten around leiden and the city, and also happened to be a very delicious and very spicy asian food buffet. and later, there was a great dessert buffet. one of the best and most original conference dinners i had for quite some time :)

over the weekend, i almost became a victim of a man-in-the-middle attack. while staying in castro urdiales, i changed to a different hotel on friday. the hotel had free (password protected) wireless internet, as many hotels do. what was different from other hotels was what happened when i wanted to check my emails via imaps, i.e. imap over ssl, which means that my communication with my email servers is encrypted. my email program informed me that the certificates had changed and asked me whether i wanted to store the new certificate. this happened for both imap servers i’m using, namely my own and the one of my institute. i was somewhat surprised – sometimes the institute’s server gets a new certificate, but my own server? without me, the admin, doing anything? i checked the certificates, to find out that essentially nothing changed, except the (rsa) public key, and that the signer changed to “FortiGate CA”. fortigate is the flagship product of an american company called fortinet; this is apparently (related to) secure wireless (wlan) equipment.
part of their software/hardware seems to be something which tries to scan email. scanning unencrypted email communication (via smtp, pop3, imap, …) is easy – as there is no encryption. but they also try to check encrypted communications with email servers. for that, they have to break or circumvent encryption. the easiest way to do that is a man-in-the-middle attack: send traffic through a proxy, and in case someone wants to connect to an imaps server (identified by access to port number 993, i assume), act as if you are the imaps server, send a faked certificate including your own public key, so that the user is transmitting their data to you. then connect to the “real” imaps server and forward the data to it. works pretty well. except that suddenly, you, the client, get presented with a different public key. if you (more precisely: your email software) stored the certificate (which includes the public key) and compare the certificate obtained from the proxy (thinking it is the email server) with the stored one, you will note that something changed. and the software will hopefully complain to you, the user, and ask you what to do. ask you whether the new certificate is acceptable or not.
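
the check a client performs at this point boils down to comparing what it stored with what it is presented now. as a rough sketch (not how any particular email program does it; real clients compare the certificate data handed over by their tls library, and the names normalizeFingerprint and certificateUnchanged are made up), comparing two fingerprints written in the usual colon-separated hex notation could look like this:

```cpp
#include <cctype>
#include <string>

// Reduce a fingerprint such as "69:02:33:1D" to bare upper-case hex digits,
// so that separators and letter case do not matter for the comparison.
inline std::string normalizeFingerprint(const std::string & text)
{
    std::string digits;
    for (unsigned char ch : text)
        if (std::isxdigit(ch))
            digits += static_cast<char>(std::toupper(ch));
    return digits;
}

// True if the fingerprint presented now matches the one stored earlier.
// An empty stored fingerprint never matches: there is nothing to trust yet.
inline bool certificateUnchanged(const std::string & storedFingerprint,
                                 const std::string & presentedFingerprint)
{
    const std::string stored = normalizeFingerprint(storedFingerprint);
    return !stored.empty() && stored == normalizeFingerprint(presentedFingerprint);
}
```

if certificateUnchanged returns false, the right reaction is the one described above: stop and ask the user, instead of silently accepting the new certificate.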
if you accept the new certificate, or the software you’re using does it for you (guess that’s “user friendly”, so you don’t have to care about stuff like certificates), the proxy server will read all your communication with the imaps server. some combinations of servers and clients will even send your password in plaintext (relying on the ssl encryption between client and server), which means that in this case, the proxy server knows your email password. in case the proxy server is used in a malicious way, someone can steal or abuse your data (especially as passwords are often used for different accounts). more importantly, once the connection is open and you authenticated to the email server, the proxy server can use the open connection to do anything it likes with your email account – it can access all emails stored on your account on that server.
i assume (hope?) that the fortigate software/hardware is not doing this. but anyway, such behavior is very, very bad. (and it raises the question what else is analyzed by the proxy. maybe all your http connections? https seems to be unaffected, as the certificates there were still fine, at least in my case.) luckily, i was able to circumvent this by using the vpn of my university. but not everyone has access to a vpn, or knows how to use that. (and another thing which made me worry is that i don’t have a certificate for my vpn server. so in theory the proxy could also try a man-in-the-middle attack here, and circumvent my use of the vpn. but apparently they don’t, or at least not so easily: when i used the vpn, the email servers returned their “correct” certificates.)
well. so what’s the moral of this story? certificates are important! be vigilant when using unknown networks, such as hotel wlans, and use vpn in case you don’t trust them. and use software which correctly checks certificates! and in case you get a warning that a certificate changed, be alert! don’t just click on “accept new certificate” to make your life easier!

monday and tuesday i was in basel, attending the speedup workshop, including a tutorial on intel threading building blocks and cilk plus. something i have to try out, by writing some lattice enumeration code with it, to see how it copes with it. especially the tutorial was definitely worth the time spent there.
but besides business, i walked around a bit in basel during monday’s lunch break, this time with my camera. this resulted in some nice photos, and some of these are here:

when adding a thread-specific allocator to a program of mine, to avoid terrible performance loss while using gmp and/or mpfr to do arbitrary precision integer and floating point arithmetic, respectively, i stumbled upon a problem which seems to be fixed in newer solaris versions. in case anyone experiences a similar problem and cannot just update to a new enough solaris version, here’s some information on a quick’n’dirty fix for the problem.
more precisely, i wanted to combine boost::thread_specific_pointer (a portable implementation of thread specific storage) with dlmalloc, to obtain an allocator which won’t block when used from different threads at once. if you use arbitrary precision arithmetic on a machine with many cores/cpus (say, 30 to 60), having a single blocking (via a mutex) allocator totally kills performance. for example, on our ultrasparc/solaris machine, running 29 threads (on 30 cpus) in parallel, only 20% of the system’s resources were used effectively. if the machine had only had 6 cpus, the program would have run at the same speed. quite a waste, isn’t it?
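
to illustrate the idea (and only the idea: this is not my actual allocator, it uses the c++11 keyword thread_local as a stand-in for boost::thread_specific_pointer and plain malloc instead of dlmalloc, and all names as well as the fixed block size are made up), a per-thread cache of memory blocks could be sketched as follows:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// All blocks handed out by this sketch have the same (assumed) size.
constexpr std::size_t kBlockSize = 64;

// Blocks released by one thread, kept around for re-use by the same thread.
struct BlockCache
{
    std::vector<void *> blocks;

    ~BlockCache()
    {
        // runs when the owning thread exits: hand the cached blocks back
        for (void * p : blocks)
            std::free(p);
    }
};

// Every thread gets its own cache, so touching it needs no mutex.
inline BlockCache & localCache()
{
    thread_local BlockCache cache;
    return cache;
}

inline void * cachedAlloc()
{
    BlockCache & cache = localCache();
    if (!cache.blocks.empty())
    {
        // fast path: re-use a block this thread released earlier
        void * p = cache.blocks.back();
        cache.blocks.pop_back();
        return p;
    }
    // slow path: fall back to the (possibly locking) global allocator
    return std::malloc(kBlockSize);
}

inline void cachedFree(void * p)
{
    if (p != nullptr)
        localCache().blocks.push_back(p);
}
```

the point of the design is that the fast path never touches shared state. note that every call still has to look up the thread-local object, so the whole scheme is only as fast as that lookup.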
anyway, combining thread local storage and a memory allocator solves this problem. in theory, at least. when i put the two things together, and ran my program with 30 threads, still only 60% of the 30 cpus’ processing power was used – the other 40% of the cycles were still spent waiting. (solaris has some excellent profiling tools on board. that’s why i like to use our slow old outdated solaris machine to profile, instead of our blazing fast newer big linux machine. in case anyone cares.) interestingly, on our linux machine, with 64 threads (running on 64 cores), the problem wasn’t there: 100% of the cycles went into computing, and essentially none into waiting.
inspecting the problem closer with the sun studio analyzer, it turns out that the 40% waiting cycles are caused by pthread_once, which is called by the internal boost method boost::detail::find_tss_data. that method is called every time a boost::thread_specific_pointer<> is dereferenced. which in my program happens every time when the thread local allocator is fired up to allocate, reallocate or free a piece of memory. (more precisely, boost::detail::find_tss_data calls boost::detail::get_current_thread_data, which uses boost::call_once, which in turn uses pthread_once in the pthread implementation of boost::thread, which is the implementation used on unixoid systems, such as solaris and linux.)
in theory, pthread_once uses a double-checked locking mechanism to make sure that the function specified is run exactly once during the execution of the whole program. while searching online, i found the source of the pthread implementation of a newer opensolaris from 2008 here; it uses a double-checked locking with a memory barrier, which should (at least in theory) turn it into a working solution (multi-threaded programming is far from being simple, both the compiler and the cpu can screw up your code by rearranging instructions in a deadly way).
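
for readers who haven’t seen the pattern: the following is a minimal sketch of double-checked locking. it is not the opensolaris code; it uses c++11 atomics with acquire/release ordering as a stand-in for the explicit memory barrier, and the names OnceFlag and runOnce are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// State for one "run exactly once" routine (hypothetical type).
struct OnceFlag
{
    std::atomic<bool> done{false};
    std::mutex lock;
};

inline void runOnce(OnceFlag & flag, void (*routine)())
{
    assert(routine != nullptr);
    // first check, without locking: the acquire load pairs with the
    // release store below, so everything routine() wrote is visible
    if (flag.done.load(std::memory_order_acquire))
        return;
    std::lock_guard<std::mutex> guard(flag.lock);
    // second check, under the lock: another thread may have been faster
    if (!flag.done.load(std::memory_order_relaxed))
    {
        routine();
        flag.done.store(true, std::memory_order_release);
    }
}
```

once the routine has run, every later call takes only the first branch: one atomic load and no mutex, which is what makes the pattern cheap when it is called millions of times.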
anyway, it seems that the pthread_once implementation on the solaris installation on the machine i’m using just locks a mutex every time it is called. when you massively call the function from 30 threads at once, all running perfectly parallel on a machine with enough cpus, this gives a natural bottleneck. to make sure it is pthread_once which causes the problem, i wrote the following test program:

#include <pthread.h>
#include <iostream>

static pthread_once_t onceControl = PTHREAD_ONCE_INIT;
static int nocalls = 0;

extern "C" void onceRoutine(void)
{
    std::cout << "onceRoutine()\n";
    nocalls++;
}

extern "C" void * thethread(void * x)
{
    for (unsigned i = 0; i < 10000000; ++i)
        pthread_once(&onceControl, onceRoutine);
    return NULL;
}

int main()
{
    const int nothreads = 30;
    pthread_t threads[nothreads];

    for (int i=0; i < nothreads; ++i)
        pthread_create(&threads[i], NULL, thethread, NULL);

    for (int i=0; i < nothreads; ++i)
    {
        void * status;
        pthread_join(threads[i], &status);
    }

    if (nocalls != 1)
        std::cout << "pthread_once() screwed up totally!\n";
    else
        std::cout << "pthread_once() seems to be doing what it promises\n";
    return 0;
}

i compiled the program with CC -m64 -fast -xarch=native64 -xchip=native -xcache=native -mt -lpthread oncetest.cpp -o oncetest and ran it with time. the result:
real    16m9.541s
user    201m1.476s
sys     0m18.499s

compiling the same program under linux and running it there (with enough cores in the machine) yielded
real    0m0.243s
user    0m1.640s
sys     0m0.060s

quite a difference, isn’t it? the solaris machine is slower, so a few seconds total time would be ok, but 16 minutes?! inspecting the running program on solaris with prstat -Lmp <pid> shows the amount of waiting involved…
to solve this problem, at least for me, with this old solaris version running, i took the code of pthread_once from the above link – namely the includes
#include <atomic.h>
#include <thread.h>
#include <errno.h>

copied the lines 38 to 46 from the link, and the lines 157 to 179 from the link into boost_directory/libs/thread/src/pthread/once.cpp, renamed pthread_once to my_pthread_once in the code i copied and in the boost source file i added the lines to, and re-compiled boost. then, i re-ran my program, and suddenly, there was no more waiting (at least, not for mutexes :-) ). and the oncetest from above, rewritten using boost::call_once, yielded:
real    0m0.928s
user    0m20.181s
sys     0m0.036s



as mentioned in part one, testing expression templates isn’t that easy. one often has to go down to assembler level to see what is really going on. to make developing expression templates easier, and to debug my (fictional) myvec<T> expression templates a bit more in detail, i created a second test type: TestType2. again equipped with expression templates, its aim is to break down the expressions into three-address assembler commands, using temporaries to achieve this. for example, a -= (b + c) * d; should evaluate into something like

TestType2 t1, t2; // without calling constructors or destructors!
TT2_create(t1);
TT2_add(t1, b, c);
TT2_create(t2);
TT2_mul(t2, t1, d);
TT2_sub(a, a, t2);
TT2_destroy(t2);
TT2_destroy(t1);

by defining the functions TT2_add(), TT2_mul(), etc. in a different translation unit, one can analyse the assembler output of the translation unit using them to see what exactly came out. by searching for terms like TT2_add one can quickly find the corresponding assembler commands.

the implementation.

we begin by declaring the class TestType2 and by declaring the functions to operate on it. again, we leave out multiplication (and division and modulo) to shorten the code.

class TestType2;

void TT2_create(TestType2 & r);
void TT2_destroy(TestType2 & r);
void TT2_createcopy(TestType2 & r, const TestType2 & a);
void TT2_copy(TestType2 & r, const TestType2 & a);
void setZero(TestType2 & r);
void setOne(TestType2 & r);
void TT2_add(TestType2 & r, const TestType2 & a, const TestType2 & b);
void TT2_sub(TestType2 & r, const TestType2 & a, const TestType2 & b);
void TT2_neg(TestType2 & r, const TestType2 & a);

the implementation of TestType2 is straightforward. we add a pointer so that the class actually stores something.
class TestType2
{
private:
    void * d_data;

public:
    inline TestType2()
    {
        TT2_create(*this);
    }

    inline TestType2(const TestType2 & src)
    {
        TT2_createcopy(*this, src);
    }

    inline ~TestType2()
    {
        TT2_destroy(*this);
    }

    inline TestType2 & operator = (const TestType2 & src)
    {
        TT2_copy(*this, src);
        return *this;
    }

    inline TestType2 & operator += (const TestType2 & b)
    {
        TT2_add(*this, *this, b);
        return *this;
    }

    inline TestType2 & operator -= (const TestType2 & b)
    {
        TT2_sub(*this, *this, b);
        return *this;
    }
};

again, as in part one, we have templates TestExpression2<O, D> and TestWrapper2:
template<class Op, class Data>
class TestExpression2
{
private:
    Op d_op;
    Data d_data;

public:
    inline TestExpression2(const Op & op, const Data & data)
        : d_op(op), d_data(data)
    {
    }

    operator TestType2 () const
    {
        TestType2 res;
        evalTo(res);
        return res;
    }

    inline TestType2 evaluate() const
    {
        TestType2 res;
        evalTo(res);
        return res;
    }

    inline void evalTo(TestType2 & dest) const
    {
        d_op.evalTo(dest, d_data);
    }
};

class TestWrapper2
{
private:
    const TestType2 & d_val;

public:
    inline TestWrapper2(const TestType2 & val)
        : d_val(val)
    {
    }

    inline const TestType2 & evaluate() const
    {
        return d_val;
    }

    inline void evalTo(TestType2 & dest)
    {
        TT2_copy(dest, d_val);
    }
};

this time, we have both an evaluate() and an evalTo() member function. the first simply evaluates the expression and returns the result, generating temporaries for subexpressions; the second is used by callers like operator=() of TestType2 to evaluate the result of an expression directly into an existing object of type TestType2. the expression machinery is added to TestType2 by the following member functions:
    template<class O, class D>
    inline TestType2(const TestExpression2<O, D> & src)
    {
        TT2_createcopy(*this, src.evaluate());
    }

    template<class O, class D>
    inline TestType2 & operator = (const TestExpression2<O, D> & e)
    {
        e.evalTo(*this);
        return *this;
    }

    template<class O, class D>
    inline TestType2 & operator += (const TestExpression2<O, D> & e)
    {
        TT2_add(*this, *this, e.evaluate());
        return *this;
    }

    template<class O, class D>
    inline TestType2 & operator -= (const TestExpression2<O, D> & e)
    {
        TT2_sub(*this, *this, e.evaluate());
        return *this;
    }

the operators are defined as follows. there is not much to do: they just provide an evalTo() template function which calls the corresponding three-address operation:
class AddOp2
{
public:
    template<class A, class B>
    inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
    {
        TT2_add(dest, data.first.evaluate(), data.second.evaluate());
    }
};

class SubOp2
{
public:
    template<class A, class B>
    inline void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
    {
        TT2_sub(dest, data.first.evaluate(), data.second.evaluate());
    }
};

class NegOp2
{
public:
    template<class A>
    inline void evalTo(TestType2 & dest, const A & data) const
    {
        TT2_neg(dest, data.evaluate());
    }
};

again, what is left is the tedious integration, by defining a ton of operators:
inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> > operator + (const TestType2 & a, const TestType2 & b)
{ return TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> >(AddOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }
inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> > operator - (const TestType2 & a, const TestType2 & b)
{ return TestExpression2<SubOp2, std::pair<TestWrapper2, TestWrapper2> >(SubOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }

template<class O2, class D2>
inline TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator + (const TestType2 & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<AddOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(TestWrapper2(a), b)); }
template<class O2, class D2>
inline TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > > operator - (const TestType2 & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<SubOp2, std::pair<TestWrapper2, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(TestWrapper2(a), b)); }

template<class O1, class D1>
inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator + (const TestExpression2<O1, D1> & a, const TestType2 & b)
{ return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(AddOp2(), std::make_pair(a, TestWrapper2(b))); }
template<class O1, class D1>
inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> > operator - (const TestExpression2<O1, D1> & a, const TestType2 & b)
{ return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestWrapper2> >(SubOp2(), std::make_pair(a, TestWrapper2(b))); }

template<class O1, class D1, class O2, class D2>
inline TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator + (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<AddOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(AddOp2(), std::make_pair(a, b)); }
template<class O1, class D1, class O2, class D2>
inline TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > > operator - (const TestExpression2<O1, D1> & a, const TestExpression2<O2, D2> & b)
{ return TestExpression2<SubOp2, std::pair<TestExpression2<O1, D1>, TestExpression2<O2, D2> > >(SubOp2(), std::make_pair(a, b)); }

inline TestExpression2<NegOp2, TestWrapper2> operator - (const TestType2 & a)
{ return TestExpression2<NegOp2, TestWrapper2>(NegOp2(), TestWrapper2(a)); }

template<class O1, class D1>
inline TestExpression2<NegOp2, TestExpression2<O1, D1> > operator - (const TestExpression2<O1, D1> & a)
{ return TestExpression2<NegOp2, TestExpression2<O1, D1> >(NegOp2(), a); }

note that all the above code contains a lot of inlines, as opposed to part one. in part one, the aim was not to generate the most efficient code, but to make visible the expressions evaluated by the expression templates. in this second part, we want to observe the assembler output, and thus want the code to be as optimized as possible.
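the difference between evaluate() and evalTo() can also be checked at run time, without reading any assembler. the following is a heavily reduced sketch of the machinery above, not the downloadable code: it keeps only addition of two plain operands, drops the data pointer, and replaces the external TT2_* functions by stubs that write into a hypothetical log g_calls. the helpers trace() and countCalls() and the typedef Sum2 are likewise made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

class TestType2;

// hypothetical logging stubs standing in for the separate translation unit.
static std::vector<std::string> g_calls;
void TT2_create(TestType2 &) { g_calls.push_back("TT2_create"); }
void TT2_destroy(TestType2 &) { g_calls.push_back("TT2_destroy"); }
void TT2_createcopy(TestType2 &, const TestType2 &) { g_calls.push_back("TT2_createcopy"); }
void TT2_add(TestType2 &, const TestType2 &, const TestType2 &) { g_calls.push_back("TT2_add"); }

template<class Op, class Data> class TestExpression2;

// reduced TestType2: only what the two statements in trace() need.
class TestType2
{
public:
    TestType2() { TT2_create(*this); }
    TestType2(const TestType2 & src) { TT2_createcopy(*this, src); }
    ~TestType2() { TT2_destroy(*this); }

    template<class O, class D>
    TestType2 & operator = (const TestExpression2<O, D> & e)
    { e.evalTo(*this); return *this; }

    template<class O, class D>
    TestType2 & operator += (const TestExpression2<O, D> & e)
    { TT2_add(*this, *this, e.evaluate()); return *this; }
};

class TestWrapper2
{
    const TestType2 & d_val;
public:
    TestWrapper2(const TestType2 & val) : d_val(val) {}
    const TestType2 & evaluate() const { return d_val; }
};

template<class Op, class Data>
class TestExpression2
{
    Op d_op;
    Data d_data;
public:
    TestExpression2(const Op & op, const Data & data) : d_op(op), d_data(data) {}
    TestType2 evaluate() const { TestType2 res; evalTo(res); return res; }
    void evalTo(TestType2 & dest) const { d_op.evalTo(dest, d_data); }
};

class AddOp2
{
public:
    template<class A, class B>
    void evalTo(TestType2 & dest, const std::pair<A, B> & data) const
    { TT2_add(dest, data.first.evaluate(), data.second.evaluate()); }
};

typedef TestExpression2<AddOp2, std::pair<TestWrapper2, TestWrapper2> > Sum2;
inline Sum2 operator + (const TestType2 & a, const TestType2 & b)
{ return Sum2(AddOp2(), std::make_pair(TestWrapper2(a), TestWrapper2(b))); }

// records the calls made by one statement; compound selects += over =.
std::vector<std::string> trace(bool compound)
{
    TestType2 a, b, r;
    g_calls.clear();   // ignore the three constructor calls above
    if (compound)
        r += a + b;    // goes through evaluate(), which needs a temporary
    else
        r = a + b;     // goes through evalTo(), straight into r
    return g_calls;    // copied before a, b, r are destroyed
}

// counts how often name occurs in a recorded call list.
long countCalls(const std::vector<std::string> & calls, const std::string & name)
{ return std::count(calls.begin(), calls.end(), name); }
```

with these stubs, trace(false) records a single TT2_add, while trace(true) records two TT2_add calls plus the creation and destruction of the temporary holding a + b. whether an additional TT2_createcopy shows up depends on whether the compiler elides the copy when evaluate() returns.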

a real-life example.

in this section, i want to look at three examples. first, let us consider the following example:

TestType2 s, t, u, x, y, z;
s /= t + (x - y) * z - u;

the resulting assembler code (generated with g++ -S -O3) looks as follows:
        leaq    1024(%rsp), %rcx
        ....
        movq    %rbx, 56(%rsp)
        call    _Z10TT2_createR9TestType2
        leaq    960(%rsp), %r12
        movq    %r12, %rdi
        call    _Z10TT2_createR9TestType2
        leaq    928(%rsp), %rbp
        movq    %rbp, %rdi
        call    _Z10TT2_createR9TestType2
        leaq    944(%rsp), %rbx
        movq    %rbx, %rdi
        call    _Z10TT2_createR9TestType2
        leaq    1024(%rsp), %rdx
        leaq    1040(%rsp), %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_subR9TestType2RKS_S2_
        leaq    1008(%rsp), %rdx
        movq    %rbx, %rsi
        movq    %rbp, %rdi
        call    _Z7TT2_mulR9TestType2RKS_S2_
        movq    %rbx, %rdi
        call    _Z11TT2_destroyR9TestType2
        leaq    1072(%rsp), %rsi
        movq    %rbp, %rdx
        movq    %r12, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        movq    %rbp, %rdi
        call    _Z11TT2_destroyR9TestType2
        leaq    1056(%rsp), %rdx
        movq    %r12, %rsi
        movq    %r13, %rdi
        call    _Z7TT2_subR9TestType2RKS_S2_
        movq    %r12, %rdi
        call    _Z11TT2_destroyR9TestType2
        leaq    1088(%rsp), %rsi
        movq    %r13, %rdx
        movq    %rsi, %rdi
        call    _Z7TT2_divR9TestType2RKS_S2_
        movq    %r13, %rdi
        call    _Z11TT2_destroyR9TestType2

(i removed a lot of register/memory arithmetic in the beginning, which prepares all addresses.) removing all clutter, we are left with the following function calls:
TT2_create()
TT2_create()
TT2_create()
TT2_create()
TT2_sub()
TT2_mul()
TT2_destroy()
TT2_add()
TT2_destroy()
TT2_sub()
TT2_destroy()
TT2_div()
TT2_destroy()

four temporaries t1, t2, t3, t4 are created. then, the subtraction x - y is computed into the temporary t4, and the result is multiplied by z into the temporary t3. the temporary t4 used to compute x - y is then destroyed, and t is added to t3, with the result being stored in the temporary t2. the temporary t3 is destroyed, and t2 - u is evaluated into t1. finally, s is divided by t1 and t1 is destroyed. this shows that the compiler generated an equivalent to
TestType2 t1, t2, t3, t4; // without calling constructors or destructors!
TT2_create(t1);
TT2_create(t2);
TT2_create(t3);
TT2_create(t4);
TT2_sub(t4, x, y);
TT2_mul(t3, t4, z);
TT2_destroy(t4);
TT2_add(t2, t, t3);
TT2_destroy(t3);
TT2_sub(t1, t2, u);
TT2_destroy(t2);
TT2_div(s, s, t1);
TT2_destroy(t1);

this is pretty much optimal: if one had done this translation by hand in a straightforward way, one would have arrived at the same solution.
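the call targets in the assembler listing are mangled c++ names: _Z7TT2_subR9TestType2RKS_S2_, for instance, encodes TT2_sub taking one TestType2 & and two const TestType2 & arguments. if one wants readable names, they can be decoded programmatically. the following sketch assumes gcc or clang, whose runtime provides abi::__cxa_demangle in <cxxabi.h>; the wrapper function demangle() is a name made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// returns the demangled form of a mangled symbol name,
// or the input unchanged if demangling fails.
std::string demangle(const std::string & symbol)
{
    int status = 0;
    // with null buffer arguments, the result is allocated with malloc().
    char * text = abi::__cxa_demangle(symbol.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr)
        return symbol;
    std::string result(text);
    std::free(text);
    return result;
}
```

on the command line, piping the output of g++ -S through c++filt serves the same purpose.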
now let us re-consider our example from part one: the myvec<T> template. assume we have two myvec<TestType2> vectors v and w, and we write v += v + w; and v += w + v;. the first command should generate something like
for (unsigned i = 0; i < v.size(); ++i)
{
    add(v[i], v[i], v[i]);
    add(v[i], v[i], w[i]);
}

while the second command should generate something like
TestType2 t; // without calling constructors or destructors!
create(t);
for (unsigned i = 0; i < v.size(); ++i)
{
    add(t, w[i], v[i]);
    add(v[i], v[i], t);
}
destroy(t);

or, if the expression templates for myvec<T> are not that good, at least something like
for (unsigned i = 0; i < v.size(); ++i)
{
    TestType2 t; // without calling constructors or destructors!
    create(t);
    add(t, w[i], v[i]);
    add(v[i], v[i], t);
    destroy(t);
}
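why the second command needs a temporary while the first does not can be checked on plain numbers. the following sketch replaces TestType2 by long and the three-address calls by ordinary additions; the three function names are made up for this example.

```cpp
#include <cassert>

// v += v + w, evaluated in place: add(v, v, v); add(v, v, w);
long inPlaceVPlusW(long v, long w)
{
    v = v + v;
    v = v + w;
    return v;
}

// v += w + v, wrongly evaluated in place: add(v, v, w); add(v, v, v);
// the second addition already sees the modified v.
long inPlaceWPlusV(long v, long w)
{
    v = v + w;
    v = v + v;
    return v;
}

// v += w + v, evaluated with a temporary: add(t, w, v); add(v, v, t);
long withTemporary(long v, long w)
{
    long t = w + v;
    v = v + t;
    return v;
}
```

for v = 1 and w = 10, both commands should yield 1 + 1 + 10 = 12. inPlaceVPlusW() and withTemporary() do so, whereas inPlaceWPlusV() returns 22.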

first, consider v += v + w;. the assembler code generated is the following:
        movl    320(%rsp), %r9d
        testl   %r9d, %r9d
        je      .L128
        xorl    %ebp, %ebp
.L129:
        mov     %ebp, %r12d
        salq    $3, %r12
        movq    %r12, %rbx
        addq    328(%rsp), %rbx
        movq    %rbx, %rdx
        movq    %rbx, %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        movq    %r12, %rdx
        addq    312(%rsp), %rdx
        movq    %rbx, %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        addl    $1, %ebp
        cmpl    320(%rsp), %ebp
        jb      .L129
.L128:

clearly, first the code tests whether the loop has to be run through at least once; if not, one jumps to label .L128. then, per loop iteration, exactly two calls to TT2_add() are made. this shows that the code is essentially like
for (unsigned i = 0; i < v.size(); ++i)
{
    add(v[i], v[i], v[i]);
    add(v[i], v[i], w[i]);
}

as we were hoping. thus, the expression templates worked well in this case. now let us look at v += w + v;. this time, a temporary has to be created, since otherwise v[i] is modified before being used in the expression. the generated assembler code is the following:
        movl    320(%rsp), %eax
        cmpl    304(%rsp), %eax
        jne     .L283
        leaq    416(%rsp), %rbx
        movq    %rbx, %rdi
        call    _Z10TT2_createR9TestType2
        movl    320(%rsp), %r8d
        testl   %r8d, %r8d
        je      .L146
        xorl    %r12d, %r12d
        .p2align 4,,10
        .p2align 3
.L147:
        mov     %r12d, %ebp
        movq    %rbx, %rdi
        salq    $3, %rbp
        movq    %rbp, %rsi
        addq    312(%rsp), %rsi
        call    _Z8TT2_copyR9TestType2RKS_
        movq    %rbp, %rdx
        addq    328(%rsp), %rdx
        movq    %rbx, %rsi
        movq    %rbx, %rdi
        call    _Z7TT2_addR9TestType2RKS_S2_
        movq    %rbp, %rdi
        addq    328(%rsp), %rdi
        movq    %rbx, %rdx
        movq    %rdi, %rsi
        call    _Z7TT2_addR9TestType2RKS_S2_
        addl    $1, %r12d
        cmpl    320(%rsp), %r12d
        jb      .L147
.L146:
        movq    %rbx, %rdi
        call    _Z11TT2_destroyR9TestType2

this is as good as we were hoping: before the loop, a temporary is created, and destroyed after the loop. in the loop, we have three calls: TT2_copy(), TT2_add() and a second time TT2_add(). thus, the resulting code is
TestType2 t; // without calling constructors or destructors!
create(t);
for (unsigned i = 0; i < v.size(); ++i)
{
    copy(t, w[i]);
    add(t, t, v[i]);
    add(v[i], v[i], t);
}
destroy(t);

this is not quite optimal: the best solution would be to reduce
    copy(t, w[i]);
    add(t, t, v[i]);

to one call: add(t, w[i], v[i]);. but this is still much better than
for (unsigned i = 0; i < v.size(); ++i)
{
    TestType2 t; // without calling constructors or destructors!
    create(t);
    add(t, w[i], v[i]);
    add(v[i], v[i], t);
    destroy(t);
}

where the temporary is created and destroyed every iteration. when trying to add this optimization, it is easier to use TestType from part one to see what is happening. once the output looks fine, one switches back to TestType2 to make sure the generated assembler code is fine.
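to put a number on the difference between the two loop shapes, here is a small sketch that counts how many temporaries each one creates. it is only an illustration with long elements: the Temp type with its counter and the two functions addHoisted() and addPerIteration() are assumptions of this example and not part of the myvec<T> code. both functions assume that w has at least as many elements as v.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical temporary type: counts how many temporaries get created.
struct Temp
{
    static int created;
    long value;
    Temp() : value(0) { ++created; }
};
int Temp::created = 0;

// v += w + v with one temporary hoisted out of the loop.
// returns the number of temporaries created.
int addHoisted(std::vector<long> & v, const std::vector<long> & w)
{
    const int before = Temp::created;
    Temp t;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        t.value = w[i] + v[i];   // add(t, w[i], v[i]);
        v[i] = v[i] + t.value;   // add(v[i], v[i], t);
    }
    return Temp::created - before;
}

// the same computation with a fresh temporary in every iteration.
// returns the number of temporaries created.
int addPerIteration(std::vector<long> & v, const std::vector<long> & w)
{
    const int before = Temp::created;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        Temp t;
        t.value = w[i] + v[i];   // add(t, w[i], v[i]);
        v[i] = v[i] + t.value;   // add(v[i], v[i], t);
    }
    return Temp::created - before;
}
```

for vectors of length n, addHoisted() creates one temporary and addPerIteration() creates n, while both leave the same values in v. for a cheap type like long this hardly matters, but for a type whose create() allocates memory it can.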

the code.

you can download the source code of the TestType2 class here, and dummy implementations (to be put into another translation unit) of the TT2_* functions here.