
posts for june 2011.

last night, i decided to walk around a bit and take photos. i first walked through the irchelpark to the irchel campus, then continued up to the edge of the forest on the zürichberg. after that i walked back towards the campus, where i met a black cat, and finally continued into the irchelpark. i started sometime past 2 am and returned home after 6:30.

i recently presented a bash script which schedules computational tasks on multi-core machines. in the meantime, i fixed a bug in the display, made the program more flexible, and started to use local variables instead of only global ones. the new version is also more intelligent: it tries to balance the running times of the processes it controls, so that they do not drift far apart.
here is the newest version:

#!/bin/bash

initProfile() {
    PROFILEFN=bigprimerunner-$PROFILE.profile
    CORES=`grep "^CORES " $PROFILEFN`
    CORES=${CORES/CORES }
    STARTUP=`grep "^STARTUP " $PROFILEFN`
    STARTUP=${STARTUP/STARTUP }
    eval STARTUP=$STARTUP
}

# Startup
LOADMODIFIER=0
if [ "$1" != "" ]
then
    PROFILE=$1
else
    PROFILE=`hostname`
fi
if [ "$2" != "" ]
then
    LOADMODIFIER=$2
fi
initProfile
if [ "$CORES" == "" ]
then
    echo "Cannot load profile $PROFILEFN!"
    exit
fi
echo Cores: $CORES
echo Load modifier: $LOADMODIFIER

# The command to execute
COMMAND=primefinder

computeFreecores() {
    FREECORES=0
    local DAY=`date +%w`
    local LINE=`grep "^$DAY " $PROFILEFN`
    local LINE=${LINE/$DAY }
    local HOUR=`date +%k`
    for ((i=0;i<$HOUR;++i));
    do
        local LINE=${LINE#* }
    done
    local LINE=${LINE/ *}
    eval FREECORES=$LINE
    # Also determine how many jobs should be started
    STARTUP=`grep "^STARTUP " $PROFILEFN`
    STARTUP=${STARTUP/STARTUP }
    eval STARTUP=$STARTUP
}

killProcess() { # One argument: PID of process to kill
    local PID=$1
    local FILE=`lsof -p $PID -F n 2>/dev/null | grep primedatabase | grep -v "\.nfs"`
    kill $PID 2> /dev/null
    local A=${FILE#n*}
    local A=${A/ (nfs*}
    if [ "$A" != "" ]
    then
        rm $A
        echo Killed $PID with open file $A
    else
        echo Killed $PID with no open file
    fi
}

stopsignal() {
    local PIDS=`jobs -p`
    echo
    echo
    echo Terminating...
    echo Killing: $PIDS
    for PID in $PIDS;
    do
        killProcess $PID
    done
    echo done.
    exit
}

trap 'stopsignal' 2

computeFreecores

echo "Starting $STARTUP instances (in $BINDIR)"

filterRunning() { # Removes all PIDs from the arguments which are currently stopped
    ps -o pid= -o s= $* | grep R | sed -e "s/R//"
}

filterStopped() { # Removes all PIDs from the arguments which are currently running
    ps -o pid= -o s= $* | grep T | sed -e "s/T//"
}

determineToAdd() {
    computeFreecores
    local LOAD=`uptime`
    local LOAD=${LOAD#*average: }
    local LOAD=${LOAD/,*}
    local LOAD=${LOAD/.*}
    ADD=$[CORES-FREECORES-(LOAD+LOADMODIFIER)]
    local JOBS=`jobs -p`
    local JOBS=`filterRunning $JOBS`
    echo "Load: $[LOAD+LOADMODIFIER], Intended number of free cores: $FREECORES, Running: `echo $JOBS | wc -w`, Started: `jobs -p | wc -l` (should be $STARTUP)"
}

continueOne() {
    local JOBS=`jobs -p`
    local JOBS=`filterStopped $JOBS`
    if [ "$JOBS" != "" ]
    then
        local PID=`ps -o pid= --sort +time $JOBS | head -1`
        echo Continuing $PID...
        kill -SIGCONT $PID
    fi
}

stopOne() {
    local JOBS=`jobs -p`
    local JOBS=`filterRunning $JOBS`
    if [ "$JOBS" != "" ]
    then
        local PID=`ps -o pid= --sort -time $JOBS | head -1`
        echo Stopping $PID...
        kill -SIGSTOP $PID
    fi
}

killOne() {
    local JOBS=`jobs -p`
    if [ "$JOBS" != "" ]
    then
        local PID=`ps -o pid= --sort -time $JOBS | head -1`
        killProcess $PID
    fi
}

launchOne() {
    echo "Launching \"$COMMAND\"..."
    $COMMAND &
    sleep 1.5
}

computeTotaltimeInSecs() {
    # Input: $1
    # Output: $TOTALSECS
    local I=$1
    local SECS=${I##*:}
    local REST=${I%:*}
    local MINS=${REST##*:}
    local REST=${REST%:*}
    local HOURS=${REST##*-}
    local DAYS=`expr "$REST" : '\([0-9]*-\)'`
    local DAYS=${DAYS%-}
    if [ "$DAYS" == "" ]
    then
        local DAYS=0
    fi
    if [ "$HOURS" == "" ]
    then
        local HOURS=0
    fi
    if [ "$MINS" == "" ]
    then
        local MINS=0
    fi
    echo "((($DAYS * 24) + $HOURS) * 60 + $MINS) * 60 + $SECS" | bc
}

adjustProcesses() {
    local JOBS=`jobs -p`
    local JOBS=`filterRunning $JOBS`
    if [ "$JOBS" != "" ]
    then
        local STOPPID=`ps -o pid= --sort -time $JOBS | head -1`
        local JOBS=`jobs -p`
        local JOBS=`filterStopped $JOBS`
        if [ "$JOBS" != "" ]
        then
            local CONTPID=`ps -o pid= --sort +time $JOBS | head -1`
            # Compute times
            local I=`ps -o time= $STOPPID`
            local STOPSEC=`computeTotaltimeInSecs $I`
            local I=`ps -o time= $CONTPID`
            local CONTSEC=`computeTotaltimeInSecs $I`
            # Compare times
            local CT=`echo $CONTSEC+60*5 | bc`
            if [ $STOPSEC -gt $CT ]
            then
                echo Stopping $STOPPID and continuing $CONTPID
                kill -SIGSTOP $STOPPID
                kill -SIGCONT $CONTPID
            fi
        fi
    fi
}

# Start programs in the background
determineToAdd
for ((i=1;i<=STARTUP;++i));
do
    launchOne
    if [ $i -gt $ADD ]
    then
        sleep 1
        kill -SIGSTOP %$i
    fi
done

# Start mainloop
while [ 1 ]
do
    sleep 60

    # Determine how many processes should be added/removed
    determineToAdd

    # Stop/continue processes
    if [ $ADD -gt 0 ]
    then
        # Add processes
        echo ADD:$ADD
        for ((i=0;i<ADD;++i))
        do
            continueOne
        done
    fi
    if [ $ADD -lt 0 ]
    then
        REM=$[-ADD]
        # Remove processes
        echo REMOVE:$REM
        for ((i=0;i<REM;++i))
        do
            stopOne
        done;
    fi

    # Launch new processes or kill running ones
    CURRLAUNCHED=`jobs -p | wc -l`
    if [ $STARTUP != $CURRLAUNCHED ]
    then
        if [ $STARTUP -lt $CURRLAUNCHED ]
        then
            echo kill: $STARTUP $CURRLAUNCHED
            for ((i=STARTUP;i<CURRLAUNCHED;++i));
            do
                killOne
            done;
        else
            echo add: $CURRLAUNCHED $STARTUP
            for ((i=CURRLAUNCHED;i<STARTUP;++i));
            do
                launchOne
            done;
        fi
    fi
    sleep 2

    # Adjust
    adjustProcesses
done
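
the balancing step in adjustProcesses boils down to two things: converting the cpu time reported by ps into seconds, and swapping two processes when their times differ by more than five minutes. the following is a minimal sketch of just that logic in pure bash arithmetic (no expr or bc), assuming the [dd-]hh:mm:ss format that ps -o time= prints on linux. the function names cpuTimeToSecs and shouldSwap are made up for this illustration and are not part of the script above.

```bash
#!/bin/bash

# Convert a cpu time of the form [dd-]hh:mm:ss into seconds.
# (Hypothetical helper; assumes that hh, mm and ss are all present.)
cpuTimeToSecs() {
    local t=$1 days=0 hours mins secs
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the optional day count
    esac
    IFS=: read -r hours mins secs <<< "$t"
    # The 10# prefix forces base 10, so that "08" is not parsed as octal
    echo $(( ((10#$days * 24 + 10#$hours) * 60 + 10#$mins) * 60 + 10#$secs ))
}

# Succeed if the running process is more than 5 minutes of cpu time ahead
# of the stopped one, i.e. if the two should trade places.
shouldSwap() {
    local runningSecs stoppedSecs
    runningSecs=$(cpuTimeToSecs "$1")
    stoppedSecs=$(cpuTimeToSecs "$2")
    (( runningSecs > stoppedSecs + 5 * 60 ))
}
```

for example, cpuTimeToSecs 1-02:03:04 prints 93784, and shouldSwap 00:10:01 00:05:00 succeeds, since 601 seconds is more than 300 + 300.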
posted in: computer

lately i’ve been listening a lot to some finnish bands and artists, mainly tenhi (väre, maaäet, airut:aamujen), timo rautiainen ja trio niskalaukaus (kylmä tila, rajaportti) and eicca toppinen (black ice soundtrack). below you can listen to the fabulous song suoti by tenhi, a beautiful progressive folk/neofolk piece.

[[for legal reasons, i do not want to include youtube videos here anymore. please click on this link to watch the video at youtube.]]

yesterday, a total lunar eclipse (also known as a blood moon in german) occurred. the next total one will be visible here in 2015; the previous one was in 2008. in 2008 i saw nothing but clouds, and yesterday turned out the same. anyway, i spent the evening on the üetliberg and took some nice photos, even though none of the moon.

sunset.

the first photos are from the sunset. in that direction, there weren’t too many clouds. only in the other direction, where the moon was supposed to rise, there were a lot of clouds.

zoomed zürich.

at least, i got some nice zoom shots of zürich. the second photo in the second row shows the eth and university main buildings.

üetliberg.

finally, here are two fisheye shots. the first shows a bit of the cloud cover, and the second one shows the uto kulm. i took the sunset photos from the look-out tower, and the later ones from around the position where that second photo was taken. there was a lot of wind up on the tower, and for the zürich photos i needed longer exposure times, which resulted in totally blurred photos when i tried it from up there.

ever had the problem that you have access to a big machine (with many cores) and want to run many (tens of thousands of) small computations, while making sure that not too many cores are used?
i’ve had this problem, and since i now have a pretty nice (i think) solution, i thought that more people might be interested in it. so here’s my setup. i have a program, let’s call it primefinder, which, for a certain input n (where n is a natural number ≤ 21000), computes a prime of n bits with special properties. the program loops over all possible n, and checks for each n whether a file n.prime exists. if it does not, it creates it (with zero content), computes the prime (which can take between minutes and days), writes the prime into the file and continues with the next n. this simple task distribution technique allows me to run the program in parallel on different machines (since the files are in an nfs folder) with many instances on each machine. now at our institute, we have a big computation machine (64 cores) and four user machines (on which the users work, 32 cores each). since the user machines are often not used intensively (and if so, only during certain times of the day), i want to use these as well. but enough cores should stay free that the users won’t notice the computations going on in the background. other people also want to run things on the computation server, so some cores should stay free there, too. optimally, my program would somehow determine how many cores are used by others, and use the rest. or most of them, to leave some free, especially on the user machines.
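
the file-based task distribution described above can be sketched in a few lines of bash. this is only an illustration, not how primefinder itself is implemented: the function names claimTask and processAll are made up, and instead of a separate "check, then create" it uses the shell's noclobber option, so that creating the file fails if it already exists. whether that is fully race-free on an nfs folder depends on the nfs setup.

```bash
#!/bin/bash

# Try to claim a task by creating its (empty) result file.
# Succeeds only if the file did not exist before.
claimTask() {
    local file=$1
    # With noclobber (set -C), ">" refuses to overwrite an existing file,
    # so testing for the file and creating it is a single step.
    ( set -C; : > "$file" ) 2>/dev/null
}

# Walk over n = 1..max in directory dir and work on every unclaimed n.
processAll() {
    local dir=$1 max=$2 n
    for ((n = 1; n <= max; ++n)); do
        if claimTask "$dir/$n.prime"; then
            # Placeholder for the real computation, which may take days.
            echo "result for $n" > "$dir/$n.prime"
        fi
    done
}
```

several instances of processAll can then run on the same directory at once; each n is picked up by whichever instance manages to create n.prime first.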
after a suggestion by our it guys, i started writing a bash script which controls the instances of my program running on one machine. the first version used the time of day to determine the number of processes. everything was computed from the number of cores of the machine, the load (with a load modifier applied, since some machines have uninterruptible processes running which effectively do nothing and won’t go away until the next reboot) and the hour of the day. but it is not easy to find a scheme which yields good results on all machines: something which works well on the user machines wastes processor time on the computation server.
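
to make the quantities just listed concrete, here is a small sketch of how the load can be read from uptime and combined with the number of cores. it assumes the linux output format of uptime (containing "load average: "), and the function names loadFromUptime and processesToAdd are hypothetical, chosen only for this example.

```bash
#!/bin/bash

# Integer part of the 1-minute load average, taken from a line of uptime output.
loadFromUptime() {
    local load=$1
    load=${load#*average: }   # drop everything up to the first load value
    load=${load%%,*}          # keep only the 1-minute value
    load=${load%%.*}          # truncate to an integer
    echo "$load"
}

# Number of processes that could be added; a negative result means that
# this many processes should be stopped instead.
processesToAdd() {
    local cores=$1 freecores=$2 load=$3 modifier=$4
    echo $(( cores - freecores - (load + modifier) ))
}
```

a call could look like processesToAdd 32 16 "$(loadFromUptime "$(uptime)")" 0. on a machine with stuck uninterruptible processes, the modifier would be a negative number that cancels out their share of the load.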
so today i rewrote the program to use profiles. a profile contains the number of cores (this is necessary since the computation server has hyperthreading enabled, and thus reports twice the number of cores), the number of processes to be started, and the number of cores to be left free for each hour of each day of the week. so on weekends or at night, i choose lower numbers of free cores for the user machines, while for the computation server the number is always 1.
a profile can look like this (this is from a user machine, the file is called primefinderrunner-user.profile for later reference):

1 CORES 32
2 STARTUP $[CORES-CORES/8]
3 0 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8]
4 1 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
5 2 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
6 3 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
7 4 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
8 5 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
9 6 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16]

the line with prefix CORES gives the number of cores. the line prefixed by STARTUP gives the number of processes to run (at most); here, we use 7/8 of the number of cores. each line prefixed by a number between 0 (sunday) and 6 (saturday) is followed by 24 entries, one per hour of the day: every entry (separated by exactly one space, just as the prefix itself is separated from the entries by exactly one space!) says how many cores should be free during that hour. in this profile, at least 1/16 of the cores should be free during the night (hours 0 to 7), while on workdays half of the cores should be free during working hours (hours 8 to 19). of course, the numbers for the weekend (saturday and sunday) differ from those for the working days.
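
to illustrate how such a profile can be queried, here is a minimal sketch that looks up the entry for a given day and hour. it is not the lookup used in the script: the function name freeCoresFor is made up, and it splits the line into a bash array instead of stripping one entry after the other, so it also tolerates more than one space between entries. like the script, it relies on eval to expand entries such as $[CORES/16], so the profile must come from a trusted source.

```bash
#!/bin/bash

# Print the number of cores to leave free according to a profile file.
# Arguments: profile file, day (0 = Sunday .. 6 = Saturday), hour (0..23).
freeCoresFor() {
    local profile=$1 day=$2 hour=$3
    local CORES line          # the entries refer to CORES, hence this name
    local -a fields
    CORES=$(grep '^CORES ' "$profile")
    CORES=${CORES#CORES }
    line=$(grep "^$day " "$profile")
    # fields[0] is the day prefix, fields[1] .. fields[24] are the hours
    read -r -a fields <<< "$line"
    # The entry is an arithmetic expression in text form; eval expands it
    eval "echo ${fields[hour + 1]}"
}
```

with the profile shown above, freeCoresFor primefinderrunner-user.profile 1 9 would look up monday, 9 am, where the entry is $[CORES/2].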
now the script itself looks like this (for reference, the filename is primefinderrunner.sh):

#!/bin/bash

initProfile() {
    PROFILEFN=primefinderrunner-$PROFILE.profile
    CORES=`grep "^CORES " $PROFILEFN`
    CORES=${CORES/CORES }
    STARTUP=`grep "^STARTUP " $PROFILEFN`
    STARTUP=${STARTUP/STARTUP }
    eval STARTUP=$STARTUP
}

LOADMODIFIER=0
if [ "$1" != "" ]
then
    PROFILE=$1
else
    PROFILE=`hostname`
fi
if [ "$2" != "" ]
then
    LOADMODIFIER=$2
fi
initProfile
if [ "$CORES" == "" ]
then
    echo "Cannot load profile $PROFILEFN!"
    exit
fi
echo Cores: $CORES
echo Load modifier: $LOADMODIFIER

computeFreecores() {
    # uses the current day (0..6) and hour (0..23)
    FREECORES=0
    DAY=`date +%w`
    LINE=`grep "^$DAY " $PROFILEFN`
    LINE=${LINE/$DAY }
    HOUR=`date +%k`
    for ((i=0;i<$HOUR;++i));
    do
        LINE=${LINE#* }
    done
    LINE=${LINE/ *}
    eval FREECORES=$LINE
}

computeFreecores

stopsignal() {
    for PID in `jobs -p`;
    do
        FILE=`lsof -p $PID -F n 2>/dev/null | grep primedatabase | grep -v "\\.nfs"`
        A=${FILE#n*}
        A=${A/ (nfs*}
        echo killing $PID with open file $A
        rm $A
        kill $PID
    done
    exit
}

trap 'stopsignal' 2

echo "Starting $STARTUP instances"

determineToAdd() {
    computeFreecores
    LOAD=`uptime`
    LOAD=${LOAD#*average: }
    LOAD=${LOAD/,*}
    LOAD=${LOAD/.*}
    ADD=$[CORES-FREECORES-LOAD-LOADMODIFIER]
    echo Load: $[LOAD-LOADMODIFIER], Intended number of free cores: $FREECORES
}

# Start programs in the background
determineToAdd
for ((i=1;i<=STARTUP;++i));
do
    primefinder &
    sleep 2
done
sleep 20
if [ $ADD -lt 0 ]
then
    ADD=0
fi
for ((i=ADD+1;i<=STARTUP;++i));
do
    kill -SIGSTOP %$i
done

CURRRUNNING=$ADD
RUNNINGSTART=1 # The first one running
RUNNINGSTOP=$CURRRUNNING # The last one running

startOne() {
    # Assume that $CURRRUNNING < $STARTUP
    RUNNINGSTOP=$[(RUNNINGSTOP % STARTUP) + 1]
    kill -SIGCONT %$RUNNINGSTOP
    CURRRUNNING=$[CURRRUNNING+1]
}

stopOne() {
    # Assume that $CURRRUNNING > 0
    kill -SIGSTOP %$RUNNINGSTART
    RUNNINGSTART=$[(RUNNINGSTART % STARTUP) + 1]
    CURRRUNNING=$[CURRRUNNING-1]
}

# Start mainloop
while [ 1 ]
do
    sleep 60

    # Determine how many threads should be added/removed
    determineToAdd
    if [ $ADD -gt 0 ]
    then
        if [ $[ADD+CURRRUNNING] -gt $STARTUP ]
        then
            ADD=$[STARTUP-CURRRUNNING]
        fi
        # Add processes
        echo ADD:$ADD
        for ((i=0;i<ADD;++i))
        do
            startOne
        done
    fi
    if [ $ADD -lt 0 ]
    then
        REM=$[-ADD]
        # Clip
        if [ $REM -gt $CURRRUNNING ]
        then
            REM=$CURRRUNNING
        fi
        # Remove processes
        echo REMOVE:$REM
        for ((i=0;i<REM;++i))
        do
            stopOne
        done
    fi
    sleep 60
done

the script first starts all instances, then stops the ones which are too many, and then starts the main loop. in the main loop, it waits 60 seconds (for the average load to adjust to the new process count), and then decides how many cores should be left free, and what that means for the number of processes (add/remove some). note that the profile file is read every minute, so it can be changed any time without any need to re-run the whole thing.
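
the add/remove decision in the main loop includes a clipping step: never continue more processes than were started, and never stop more than are currently running. as a compact sketch of only that step (the function name clampAdjustment is hypothetical, and the three arguments stand for ADD, CURRRUNNING and STARTUP from the script):

```bash
#!/bin/bash

# Clip the desired change in running processes to what is possible.
# Arguments: desired change (may be negative), currently running, total started.
# Prints the number of processes to continue (> 0) or to stop (< 0).
clampAdjustment() {
    local add=$1 running=$2 startup=$3
    if (( add > startup - running )); then
        add=$(( startup - running ))   # cannot continue more than are stopped
    elif (( add < -running )); then
        add=$(( -running ))            # cannot stop more than are running
    fi
    echo "$add"
}
```

for instance, with 10 of 12 processes running, a desired change of 5 is clipped to 2, and a desired change of -20 is clipped to -10.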
in case the script is stopped (with control+c), all primefinder processes are killed and their open file is deleted. to determine the open file, i use lsof with some greps. you have to adjust and test that line before using this script!
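
since that lsof line is the part most likely to need adjusting, it can help to test the parsing separately from the killing. the sketch below is one way to do that, under some assumptions: lsof -F n prints each file name on its own line prefixed with n, nfs files may carry a trailing " (server:/path)" note, and the result files live in a path containing primedatabase. the function name openDatabaseFile is made up, and the sample input in the usage note is invented, not captured from a real system.

```bash
#!/bin/bash

# Read "lsof -p PID -F n" output on stdin and print the first open file
# in the prime database, ignoring temporary .nfs files.
openDatabaseFile() {
    sed -n 's/^n//p' |          # keep only name lines, without the n prefix
        grep primedatabase |    # only files in the database directory
        grep -v '\.nfs' |       # skip nfs placeholder files
        sed 's/ ([^)]*)$//' |   # drop a trailing " (server:/path)" note
        head -n 1
}
```

it can be fed with real output via lsof -p "$PID" -F n 2>/dev/null | openDatabaseFile, or with a hand-written sample such as printf 'p42\nf3\nn/data/primedatabase/17.prime (server:/export)\n' | openDatabaseFile to check what comes out before any rm is involved.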
note that this script is quite a hack, and far from perfect. it is also somewhat system dependent, or at least “setup dependent”, since it makes certain assumptions about the executables, about what the output of lsof looks like, … so better make sure it works before you use it, especially on bigger systems. also note that in the beginning, all instances are run (they are started with a two second delay between two instances), and then everything runs for 20 seconds before the first adjustment (i.e. stopping the processes which are too many) is made. if you share the system with other people, this might already annoy them when they try to measure timings of their programs (especially if hyperthreading is enabled).

posted in: computer

this long weekend i was staying at my parents’ place. here are some impressions, both from my parents’ garden and from a bike tour to olfen.

garden.

surroundings.

in these pictures, you can see, among other things, konik horses and heck cattle. i wish i had had my longer telephoto lens with me…