
when adding a thread-specific allocator to a program of mine, to avoid terrible performance loss while using gmp and/or mpfr to do arbitrary precision integer or floating point arithmetic, respectively, i stumbled upon a problem which seems to be fixed in newer solaris versions. in case anyone experiences a similar problem and cannot just update to a new enough solaris version, here’s some information on a dirty’n’quick fix for the problem.
more precisely, i wanted to combine boost::thread_specific_ptr (a portable implementation of thread specific storage) with dlmalloc, to obtain an allocator which won’t block when used from different threads at once. if you use arbitrary precision arithmetic on a machine with many cores/cpus (say, 30 to 60), having a single blocking (via a mutex) allocator totally kills performance. for example, on our ultrasparc/solaris machine, running 29 threads (on 30 cpus) in parallel, only 20% of the system’s resources were used effectively. if the machine had only had 6 cpus, the program would have run at the same speed. quite a waste, isn’t it?
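a minimal sketch of the idea, not the actual allocator described here: it assumes c++11 thread_local in place of boost::thread_specific_ptr, and a toy size-class cache on top of std::malloc in place of dlmalloc. all names (tl_alloc, Cache, class_of, kClasses, kMinBlock) are made up for illustration. the three functions follow the signatures that gmp's mp_set_memory_functions expects, so each thread recycles its own blocks without touching a shared lock on the fast path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

namespace tl_alloc {  // hypothetical name, for illustration only

// size classes: 16, 32, 64, ... bytes; anything larger goes straight to malloc
constexpr std::size_t kClasses = 12;
constexpr std::size_t kMinBlock = 16;

// smallest class whose block size is >= n; kClasses means "too large to cache"
inline std::size_t class_of(std::size_t n) {
    std::size_t c = 0, sz = kMinBlock;
    while (sz < n && c < kClasses) { sz <<= 1; ++c; }
    return c;
}

// per-thread cache of free blocks, one list per size class
struct Cache {
    std::vector<void*> free_blocks[kClasses];
    ~Cache() {  // hand cached blocks back when the thread exits
        for (auto& list : free_blocks)
            for (void* p : list) std::free(p);
    }
};

inline Cache& cache() {
    thread_local Cache c;  // one instance per thread, no locking needed
    return c;
}

void* allocate(std::size_t n) {
    std::size_t c = class_of(n);
    if (c == kClasses) return std::malloc(n);
    auto& list = cache().free_blocks[c];
    if (!list.empty()) {  // fast path: reuse a block this thread freed earlier
        void* p = list.back();
        list.pop_back();
        return p;
    }
    return std::malloc(kMinBlock << c);  // slow path: global allocator
}

void deallocate(void* p, std::size_t n) {
    if (!p) return;
    std::size_t c = class_of(n);
    if (c == kClasses) { std::free(p); return; }
    cache().free_blocks[c].push_back(p);  // keep it for this thread
}

void* reallocate(void* p, std::size_t old_n, std::size_t new_n) {
    assert(p != nullptr);  // gmp only resizes blocks it got from allocate()
    std::size_t oc = class_of(old_n), nc = class_of(new_n);
    if (oc == nc && oc != kClasses) return p;  // still fits in the same block
    void* q = allocate(new_n);
    if (q) {
        std::memcpy(q, p, old_n < new_n ? old_n : new_n);
        deallocate(p, old_n);
    }
    return q;
}

}  // namespace tl_alloc
```

with gmp available, these could be registered once at startup via mp_set_memory_functions(tl_alloc::allocate, tl_alloc::reallocate, tl_alloc::deallocate). the sketch never shrinks a thread's cache before the thread exits and still falls back to the global allocator on a cache miss, which a real per-thread dlmalloc arena would avoid.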
anyway, combining thread local storage and a memory allocator solves this problem. in theory, at least. when i put the two things together and ran my program with 30 threads, still only 60% of the 30 cpus’ processing power was used; the other 40% of the cycles were still spent waiting. (solaris has some excellent profiling tools on board. that’s why i like to use our slow old outdated solaris machine to profile, instead of our blazing fast newer big linux machine. in case anyone cares.) interestingly, on our linux machine, with 64 threads (running on 64 cores), the problem wasn’t there: 100% of the cycles went into computing, and essentially none into waiting.
inspecting the problem more closely with the sun studio analyzer, it turns out that the 40% waiting cycles are caused by pthread_once, which is called by the internal boost method boost::detail::find_tss_data. that method is called every time a boost::thread_specific_ptr<> is dereferenced, which in my program happens every time the thread local allocator is fired up to allocate, reallocate or free a piece of memory. (more precisely, boost::detail::find_tss_data calls boost::detail::get_current_thread_data, which uses boost::call_once, which in turn uses pthread_once in the pthread implementation of boost::thread, which is the implementation used on unixoid systems, such as solaris and linux.)
in theory, pthread_once uses a double-checked locking mechanism to make sure that the function specified is run exactly once during the execution of the whole program. while searching online, i found the source of the pthread implementation of a newer opensolaris from 2008 here; it uses double-checked locking with a memory barrier, which should (at least in theory) turn it into a working solution (multi-threaded programming is far from being simple; both the compiler and the cpu can screw up your code by rearranging instructions in a deadly way).
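to make the mechanism concrete, here is a minimal sketch of double-checked locking with barriers, assuming c++11 atomics rather than the solaris membar_* primitives. OnceControl, run_once and hammer_once are hypothetical names, not part of pthreads, solaris or boost. the point is the fast path: once the flag is set, a call costs a single acquire load and never touches the mutex.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// hypothetical once-control block; not the solaris or boost type
struct OnceControl {
    std::atomic<bool> done{false};
    std::mutex lock;
};

// run init exactly once, no matter how many threads call this concurrently
inline void run_once(OnceControl& ctl, void (*init)()) {
    // first check, without the mutex: the acquire load pairs with the
    // release store below, so init's writes are visible once we see "true"
    if (ctl.done.load(std::memory_order_acquire))
        return;
    std::lock_guard<std::mutex> guard(ctl.lock);
    // second check, under the mutex: another thread may have won the race
    if (!ctl.done.load(std::memory_order_relaxed)) {
        init();
        ctl.done.store(true, std::memory_order_release);
    }
}

static OnceControl g_ctl;
static int g_calls = 0;
static void count_call() { ++g_calls; }  // only ever runs under g_ctl.lock

// hammer run_once from several threads and report how often init really ran
int hammer_once(int nthreads, int iterations) {
    assert(nthreads > 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([iterations] {
            for (int i = 0; i < iterations; ++i)
                run_once(g_ctl, count_call);
        });
    for (auto& th : pool) th.join();
    return g_calls;
}
```

an implementation that skips the first check and always takes the mutex is still correct, but every call then serializes on that one lock.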
anyway, it seems that the pthread_once implementation on the solaris installation on the machine i’m using just locks a mutex every time it is called. when you massively call the function from 30 threads at once, all running perfectly in parallel on a machine with enough cpus, this gives a natural bottleneck. to make sure it is pthread_once which causes the problem, i wrote the following test program:

#include <pthread.h>
#include <iostream>

static pthread_once_t onceControl = PTHREAD_ONCE_INIT;
static int nocalls = 0;

// counts how often pthread_once() actually invokes it
extern "C" void onceRoutine(void)
{
    std::cout << "onceRoutine()\n";
    nocalls++;
}

// every thread hammers pthread_once() ten million times
extern "C" void * thethread(void * x)
{
    for (unsigned i = 0; i < 10000000; ++i)
        pthread_once(&onceControl, onceRoutine);
    return NULL;
}

int main()
{
    const int nothreads = 30;
    pthread_t threads[nothreads];

    for (int i=0; i < nothreads; ++i)
        pthread_create(&threads[i], NULL, thethread, NULL);

    for (int i=0; i < nothreads; ++i)
    {
        void * status;
        pthread_join(threads[i], &status);
    }

    if (nocalls != 1)
        std::cout << "pthread_once() screwed up totally!\n";
    else
        std::cout << "pthread_once() seems to be doing what it promises\n";
    return 0;
}

i compiled the program with CC -m64 -fast -xarch=native64 -xchip=native -xcache=native -mt -lpthread oncetest.cpp -o oncetest and ran it with time. the result:
real    16m9.541s
user    201m1.476s
sys     0m18.499s

compiling the same program under linux and running it there (with enough cores in the machine) yielded
real    0m0.243s
user    0m1.640s
sys     0m0.060s

quite a difference, isn’t it? the solaris machine is slower, so a few seconds total time would be ok, but 16 minutes?! inspecting the running program on solaris with prstat -Lmp <pid> shows the amount of waiting involved…
to solve this problem, at least for me, with this old solaris version running, i took the code of pthread_once from the above link. namely, i added the includes
#include <atomic.h>
#include <thread.h>
#include <errno.h>

copied lines 38 to 46 and lines 157 to 179 from the link into boost_directory/libs/thread/src/pthread/once.cpp, renamed pthread_once to my_pthread_once both in the code i copied and in the boost source file i added the lines to, and re-compiled boost. then i re-ran my program, and suddenly there was no more waiting (at least, not for mutexes :-) ). and the oncetest from above, rewritten using boost::call_once, yielded:
real    0m0.928s
user    0m20.181s
sys     0m0.036s

perfect!
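the rewritten oncetest is not listed in this post. a rough sketch of what such a test can look like, assuming the c++11 standard counterparts std::call_once and std::thread as stand-ins for boost::call_once and pthread_create, with wall-clock timing built in instead of relying on time. OnceTestResult and run_oncetest are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct OnceTestResult {
    int init_calls;   // how many times the once-routine actually ran
    double seconds;   // wall-clock time until all threads finished
};

// every thread calls std::call_once on the same flag `iterations` times
OnceTestResult run_oncetest(int nthreads, unsigned iterations) {
    assert(nthreads > 0);
    std::once_flag flag;
    int init_calls = 0;

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&flag, &init_calls, iterations] {
            for (unsigned i = 0; i < iterations; ++i)
                std::call_once(flag, [&init_calls] { ++init_calls; });
        });
    for (auto& th : threads) th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return OnceTestResult{init_calls, elapsed.count()};
}
```

calling run_oncetest(30, 10000000) mirrors the thread and iteration counts of the pthread version above; the elapsed time depends entirely on how the underlying once implementation handles the already-initialized case.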

comments.

felix wrote on january 23, 2012 at 14:03:

ok, now i had the chance to test the same on another ultrasparc/solaris system with a newer version of solaris 10 (solaris 10 8/11 s10s_u10wos_17b sparc, assembled 23 august 2011) with 8 cpus. the behaviour is the same as with the other machine: the program with my replacement of pthread_once runs for one second, the unfixed original (as listed above) runs “forever”.

skwllsp wrote on october 10, 2012 at 10:49:

Thank you for your post. Using your idea I have written a small shared library that lets me get rid of this performance problem on Solaris. I tested it and it works OK. This is the code of the shared library:

#include <atomic.h>     /* membar_producer(), membar_consumer() */
#include <synch.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

#define once_flag       oflag.pad32_flag[1]

/* internal view of pthread_once_t: a mutex followed by the done flag */
typedef struct  __once {
  mutex_t       mlock;
  union {
    uint32_t    pad32_flag[2];
    uint64_t    pad64_flag;
  } oflag;
} __once_t;


/*
 * pthread_once: calls given function only once.
 * it synchronizes via mutex in pthread_once_t structure
 */
int
pthread_once(pthread_once_t *once_control, void (*init_routine)(void))
{
  __once_t *once = (__once_t *)once_control;

  if (once == NULL || init_routine == NULL)
    return (EINVAL);

  /* first check without the mutex: the common, already-done case */
  if (once->once_flag == PTHREAD_ONCE_NOTDONE) {
    (void) mutex_lock(&once->mlock);
    /* second check under the mutex: another thread may have finished */
    if (once->once_flag == PTHREAD_ONCE_NOTDONE) {
      pthread_cleanup_push(mutex_unlock, &once->mlock);
      (*init_routine)();
      pthread_cleanup_pop(0);
      membar_producer();
      once->once_flag = PTHREAD_ONCE_DONE;
    }
    (void) mutex_unlock(&once->mlock);
  }
  membar_consumer();

  return (0);
}

To build it:
gcc -O2 -g -m64 my_pthread_once.c -shared -pthread -fPIC -o libmy_pthread_once.so

After building the library I run my application either in this way:
LD_PRELOAD=./libmy_pthread_once.so ./my_application

Or I can rebuild my application:
g++ -g -m64 test_pthread_once.cpp -pthread -L. -lmy_pthread_once -lpthread -o my_application
and run it without LD_PRELOAD