spielwiese. (Posts about ultrasparc.)

<h3><a href="https://spielwiese.fontein.de/2012/01/22/a-problem-with-pthread_once-on-an-out-dated-solaris-installation/">a problem with pthread_once on an out-dated solaris installation.</a> (2012-01-22, felix)</h3>
<p>when adding a thread-specific allocator to a program of mine, to avoid a terrible performance loss while using gmp and/or mpfr to do arbitrary precision integer and floating point arithmetic, i stumbled upon a problem which seems to be fixed in newer solaris versions. in case anyone experiences a similar problem and cannot just update to a new enough solaris version, here’s some information on a dirty’n'quick fix for the problem.<br>
more precisely, i wanted to combine <a href="http://www.boost.org/doc/libs/1_35_0/doc/html/thread/thread_local_storage.html"><code>boost::thread_specific_ptr</code></a> (a portable implementation of <a href="https://en.wikipedia.org/wiki/Thread-local_storage">thread specific storage</a>) with <a href="http://g.oswego.edu/dl/html/malloc.html">dlmalloc</a> to obtain an allocator which won’t block when used from different threads at once. if you use arbitrary precision arithmetic on a machine with many cores/cpus (say, 30 to 60), having a single blocking (via a <a href="https://en.wikipedia.org/wiki/Mutex">mutex</a>) allocator totally kills performance. for example, on our ultrasparc/solaris machine, running 29 threads (on 30 cpus) in parallel, only 20% of the system’s resources were used effectively. if the machine had had only 6 cpus, the program would have run at the same speed. quite a waste, isn’t it?<br>
anyway, combining thread local storage and a memory allocator solves this problem. in theory, at least. when i put the two things together, and ran my program with 30 threads, still only 60% of the 30 cpus’ processing power was used – the other 40% of the cycles were still spent waiting. (solaris has some excellent profiling tools on board. that’s why i like to use our slow old outdated solaris machine to profile, instead of our blazing fast newer big linux machine. in case anyone cares.) interestingly, on our linux machine, with 64 threads (running on 64 cores), the problem wasn’t there: 100% of the cycles went into computing, and essentially none into waiting.<br>
inspecting the problem closer with the sun studio analyzer, it turns out that the 40% waiting cycles are caused by <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span>, which is called by the internal boost method <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::detail::find_tss_data</code></span>. that method is called every time a <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::thread_specific_ptr&lt;&gt;</code></span> is dereferenced, which in my program happens every time the thread local allocator is fired up to allocate, reallocate or free a piece of memory. (more precisely, <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::detail::find_tss_data</code></span> calls <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::detail::get_current_thread_data</code></span>, which uses <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::call_once</code></span>, which in turn uses <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span> in the <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread</code></span> implementation of <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::thread</code></span>, which is the implementation used on unixoid systems, such as solaris and linux.)<br>
in theory, <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span> uses a <a href="https://en.wikipedia.org/wiki/Double-checked_locking">double-checked locking</a> mechanism to make sure that the function specified is run exactly once during the execution of the whole program. while searching online, i found the source of the <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread</code></span> implementation of a newer opensolaris from 2008 <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/pthread.c#134">here</a>; it uses double-checked locking with a <a href="https://en.wikipedia.org/wiki/Memory_barrier">memory barrier</a>, which should (at least in theory) turn it into a working solution (multi-threaded programming is far from simple: both the compiler and the cpu can screw up your code by rearranging instructions in deadly ways).<br>
anyway, it seems that the <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span> implementation on the solaris installation on the machine i’m using just locks a mutex every time it is called. when you massively call the function from 30 threads at once, all running perfectly parallel on a machine with enough cpus, this creates a natural bottleneck. to make sure it is <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span> which causes the problem, i wrote the following test program:<br>
</p><div class="code-c++"><pre class="code literal-block">#include &lt;pthread.h&gt;
#include &lt;iostream&gt;

static pthread_once_t onceControl = PTHREAD_ONCE_INIT;
static int nocalls = 0;

extern "C" void onceRoutine(void)
{
    std::cout &lt;&lt; "onceRoutine()\n";
    nocalls++;
}

extern "C" void * thethread(void * x)
{
    for (unsigned i = 0; i &lt; 10000000; ++i)
        pthread_once(&amp;onceControl, onceRoutine);
    return NULL;
}

int main()
{
    const int nothreads = 30;
    pthread_t threads[nothreads];

    for (int i = 0; i &lt; nothreads; ++i)
        pthread_create(&amp;threads[i], NULL, thethread, NULL);

    for (int i = 0; i &lt; nothreads; ++i)
    {
        void * status;
        pthread_join(threads[i], &amp;status);
    }

    if (nocalls != 1)
        std::cout &lt;&lt; "pthread_once() screwed up totally!\n";
    else
        std::cout &lt;&lt; "pthread_once() seems to be doing what it promises\n";
    return 0;
}
</pre></div><br>
i compiled the program with <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>CC -m64 -fast -xarch=native64 -xchip=native -xcache=native -mt -lpthread oncetest.cpp -o oncetest</code></span> and ran it with <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>time</code></span>. the result:<br>
<div class="code-unformatted"><pre class="code literal-block">real 16m9.541s
user 201m1.476s
sys 0m18.499s
</pre></div><br>
compiling the same program under linux and running it there (with enough cores in the machine) yielded<br>
<div class="code-unformatted"><pre class="code literal-block">real 0m0.243s
user 0m1.640s
sys 0m0.060s
</pre></div><br>
quite a difference, isn’t it? the solaris machine is slower, so a few seconds total time would be ok, but 16 minutes?! inspecting the running program on solaris with <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>prstat -Lmp &lt;pid&gt;</code></span> shows the amount of waiting involved…<br>
to solve this problem, at least for me on this old solaris version, i took the code of <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span> from the above <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/pthread.c#134">link</a> – namely the includes<br>
<div class="code-c++"><pre class="code literal-block">#include &lt;atomic.h&gt;
#include &lt;thread.h&gt;
#include &lt;errno.h&gt;
</pre></div><br>
copied the lines 38 to 46 from the link, and the lines 157 to 179 from the link into <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost_directory/libs/thread/src/pthread/once.cpp</code></span>, renamed <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>pthread_once</code></span> to <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>my_pthread_once</code></span> in the code i copied and in the boost source file i added the lines to, and re-compiled boost. then, i re-ran my program, and suddenly, there was no more waiting (at least, not for mutexes :-) ). and the <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>oncetest</code></span> from above, rewritten using <span class="code-unformatted inline-code"><code class="code literal-block"><span></span>boost::call_once</code></span>, yielded:<br>
<div class="code-unformatted"><pre class="code literal-block">real 0m0.928s
user 0m20.181s
sys 0m0.036s
</pre></div><br>
perfect!
<h3><a href="https://spielwiese.fontein.de/2011/03/14/compiling-fun-under-solaris/">compiling fun under solaris...</a> (2011-03-14, felix)</h3>
<p>in the last weeks, i had to compile several libraries for <a href="http://www.math.uzh.ch/">our</a> ultrasparc machine running solaris (sunos 5.10). in particular, these libraries were <a href="https://en.wikipedia.org/wiki/GNU_Multi-Precision_Library">gmp</a>, <a href="https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software">atlas</a>, <a href="https://en.wikipedia.org/wiki/Integer_Matrix_Library">iml</a>, <a href="https://en.wikipedia.org/wiki/Number_Theory_Library">ntl</a> and <a href="https://en.wikipedia.org/wiki/Boost_C%2B%2B_Libraries">boost</a>. i wanted to use the <a href="https://en.wikipedia.org/wiki/Sun_Studio_%28software%29">sun studio</a> c/c++ compiler (<code>cc</code> has version 5.8, <code>CC</code> has version 5.9) instead of <a href="https://en.wikipedia.org/wiki/GNU_Compiler_Collection">gcc/g++</a>. moreover, i need 64 bit versions of everything, since my programs need a <i>lot</i> of memory. (the machine has around 140 gb of ram anyway, so it makes a lot of sense.)</p>
<p>since it was somewhat troublesome to get everything running (at least running enough so that i could use what i needed), i want to describe the process of compiling everything here. maybe this is useful for someone…</p>
<p>i compile everything into my home directory, <code>/home/felix</code>. i also use <code>stlport4</code> instead of the sun studio standard c++ <a href="https://en.wikipedia.org/wiki/Standard_Template_Library">stl</a>, since i couldn’t figure out how to compile boost with the usual stl. the code generated will not be portable, but should be fast.</p>
<h4>gmp.</h4>
<p>for configuration and compilation, i did the following:</p>
<blockquote><p><code>$ export CC=cc<br>
$ export CXX=CC<br>
$ export CFLAGS='-m64 -fast -xO3 -xarch=native64 -xchip=native -xcache=native'<br>
$ export CXXFLAGS='-m64 -fast -xO3 -xarch=native64 -xchip=native -xcache=native -library=stlport4'<br>
$ ./configure --prefix=/home/felix<br>
$ gmake<br>
$ gmake check<br>
$ gmake install<br>
$ gmake distclean</code></p></blockquote>
<p>i didn’t add the <code>--enable-cxx</code> switch for <code>configure</code>, since this didn’t work and i didn’t need it. note that i chose the optimization level <code>-xO3</code> instead of <code>-xO4</code> or <code>-xO5</code> since otherwise some of the checks failed. you can try a higher level, but i urge you to run <code>gmake check</code> and reduce the level when checks fail.</p>
<h4>atlas.</h4>
<p>to build atlas, i proceeded as follows. you can replace <code>mybuilddir</code> with any other sensible name; that directory will contain all build specific files for that machine. note that <i>atlas</i> does some profiling to determine which methods are fastest, so it is better to not have anything else running on the machine while building <i>atlas</i>. i didn’t build the fortran parts of the library (via <code>--nof77</code>), nor the fortran tests, since i couldn’t get them to link correctly. (one probably has to set <code>FFLAGS</code> or whatever the corresponding variable is called…)</p>
<blockquote><p><code>$ mkdir mybuilddir<br>
$ cd mybuilddir<br>
$ export CC=cc<br>
$ export CFLAGS='-m64 -fast -xarch=native64 -xchip=native -xcache=native'<br>
$ ../configure --nof77 --prefix=/home/felix --cc=cc --cflags='-m64 -fast -xarch=native64 -xchip=native -xcache=native'<br>
$ gmake<br>
$ gmake check<br>
$ gmake ptcheck<br>
$ gmake time<br>
$ gmake install<br>
$ cd ..</code></p></blockquote>
<h4>iml.</h4>
<p>building <i>iml</i> is rather easy. it needs both <i>gmp</i> and <i>atlas</i>.</p>
<blockquote><code>$ export CC=cc<br>
$ export CFLAGS='-m64 -fast -xarch=native64 -xchip=native -xcache=native'<br>
$ ./configure --prefix=/home/felix --with-gmp-include=/home/felix/include --with-atlas-include=/home/felix/include --with-gmp-lib=/home/felix/lib --with-atlas-lib=/home/felix/lib<br>
$ gmake<br>
$ gmake check<br>
$ gmake install</code></blockquote>
<h4>ntl.</h4>
<p>building <i>ntl</i> is a bit more complicated. it requires that <i>gmp</i> is already built. the whole process is further complicated since on our machine, a little tool called <code>MakeDesc</code>, called at the beginning of the build process, hangs. the problem lies in <code>src/MakeDesc.c</code>, when the main program calls <code>DoublePrecision1(one)</code> in order to find out the (internal) precision of <code>double</code> registers. if i replace the line</p>
<blockquote><code>dp1 = DoublePrecision1(one);</code></blockquote>
<p>by</p>
<blockquote><code>dp1 = dp;</code></blockquote>
<p>the whole process works perfectly – though maybe some things will not be 100% correct in the end. (but i’m willing to take that risk.)</p>
<blockquote><code>$ cd src<br>
$ export CC=cc<br>
$ export CXX=CC<br>
$ export CFLAGS='-m64 -fast -xarch=native64 -xchip=native -xcache=native'<br>
$ export CXXFLAGS='-m64 -fast -xarch=native64 -xchip=native -xcache=native -library=stlport4'<br>
$ export LDFLAGS='-R/home/felix/lib -library=stlport4'<br>
$ ./configure PREFIX=/home/felix CC=cc CXX=CC CFLAGS='-m64 -fast -xarch=native64 -xchip=native -xcache=native' CXXFLAGS='-m64 -fast -xarch=native64 -xchip=native -xcache=native -library=stlport4' LDFLAGS='-R:/home/felix/lib' NTL_GMP_LIP=on GMP_PREFIX=/home/felix<br>
$ gmake<br>
$ gmake check<br>
$ gmake install<br>
$ cd ..</code></blockquote>
<h4>boost.</h4>
<p>finally, i had to compile <i>boost</i>. after a lot of trying and fiddling, i found out that these calls seem to work:</p>
<blockquote><code>$ ./bootstrap.sh --prefix=/home/felix --show-libraries --with-toolset=sun --with-libraries=iostreams<br>
$ ./bjam --prefix=/home/felix toolset=sun --with-iostreams threading=multi address-model=64 link=static install</code></blockquote>
<p>note that i only build the <code>iostreams</code> library of boost. remove <code>--with-libraries=iostreams</code> to (try to) build all libraries.</p>
<h4>conclusion.</h4>
<p>yes, the whole process is pretty much a pain in the ass. just installing the packages with <code>apt-get</code> on some debian-based linux, or compiling them from scratch on a gcc/g++ based linux, is just <i>sooo</i> much easier. but then, if you have a solaris machine standing around, why not use it to crunch some numbers for you? :-) (especially since currently, i essentially have the machine for myself.)<br>
to compile my code, i use</p>
<blockquote><code>$ CC -I/home/felix/include -m64 -fast -xarch=native64 -xchip=native -xcache=native -c -library=stlport4 <i>object files and so on;</i></code></blockquote>
<p>to link, i do</p>
<blockquote><code>$ CC -m64 -fast -xarch=native64 -xchip=native -xcache=native -L/home/felix/lib -R/home/felix/lib -library=stlport4 -lntl -liml -lcblas -latlas -lgmp -lm -lboost_iostreams -lz <i>object files and so on.</i></code></blockquote>