this night, i decided to walk around a bit and take photos. i first walked through the irchelpark to the irchel campus, then continued up to the edge of the forest on the zürichberg. after that i walked back towards the campus, where i met a black cat, and finally continued into the irchelpark. i started sometime past 2 am and returned home after 6:30.
posts for june 2011.
i recently presented a bash script which schedules computational tasks on multi-core machines. in the meantime, i fixed a bug in the display, made the program more flexible, and started to use local variables instead of only global ones. the new version is also more intelligent: it tries to balance the running times of the processes it controls so that they do not drift far apart.
here is the newest version:
    #!/bin/bash

    initProfile() {
        PROFILEFN=bigprimerunner-$PROFILE.profile
        CORES=`grep "^CORES " $PROFILEFN`
        CORES=${CORES/CORES }
        STARTUP=`grep "^STARTUP " $PROFILEFN`
        STARTUP=${STARTUP/STARTUP }
        eval STARTUP=$STARTUP
    }

    # Startup
    LOADMODIFIER=0
    if [ "$1" != "" ]
    then
        PROFILE=$1
    else
        PROFILE=`hostname`
    fi
    if [ "$2" != "" ]
    then
        LOADMODIFIER=$2
    fi
    initProfile
    if [ "$CORES" == "" ]
    then
        echo "Cannot load profile $PROFILEFN!"
        exit
    fi
    echo Cores: $CORES
    echo Load modifier: $LOADMODIFIER

    # The command to execute
    COMMAND=primefinder

    computeFreecores() {
        FREECORES=0
        local DAY=`date +%w`
        local LINE=`grep "^$DAY " $PROFILEFN`
        local LINE=${LINE/$DAY }
        local HOUR=`date +%k`
        for ((i=0;i<$HOUR;++i));
        do
            local LINE=${LINE#* }
        done
        local LINE=${LINE/ *}
        eval FREECORES=$LINE
        # Also determine how many jobs should be started
        STARTUP=`grep "^STARTUP " $PROFILEFN`
        STARTUP=${STARTUP/STARTUP }
        eval STARTUP=$STARTUP
    }

    killProcess() { # One argument: PID of process to kill
        local PID=$1
        local FILE=`lsof -p $PID -F n 2>/dev/null | grep primedatabase | grep -v "\.nfs"`
        kill $PID 2> /dev/null
        local A=${FILE#n*}
        local A=${A/ (nfs*}
        if [ "$A" != "" ]
        then
            rm $A
            echo Killed $PID with open file $A
        else
            echo Killed $PID with no open file
        fi
    }

    stopsignal() {
        local PIDS=`jobs -p`
        echo
        echo
        echo Terminating...
        echo Killing: $PIDS
        for PID in $PIDS;
        do
            killProcess $PID
        done
        echo done.
        exit
    }

    trap 'stopsignal' 2

    computeFreecores

    echo "Starting $STARTUP instances (in $BINDIR)"

    filterRunning() { # Removes all PIDs from the arguments which are not currently running
        ps -o pid= -o s= $* | grep R | sed -e "s/R//"
    }

    filterStopped() { # Removes all PIDs from the arguments which are not currently stopped
        ps -o pid= -o s= $* | grep T | sed -e "s/T//"
    }

    determineToAdd() {
        computeFreecores
        local LOAD=`uptime`
        local LOAD=${LOAD#*average: }
        local LOAD=${LOAD/,*}
        local LOAD=${LOAD/.*}
        ADD=$[CORES-FREECORES-(LOAD+LOADMODIFIER)]
        local JOBS=`jobs -p`
        local JOBS=`filterRunning $JOBS`
        echo "Load: $[LOAD+LOADMODIFIER], Intended number of free cores: $FREECORES, Running: `echo $JOBS | wc -w`, Started: `jobs -p | wc -l` (should be $STARTUP)"
    }

    continueOne() {
        local JOBS=`jobs -p`
        local JOBS=`filterStopped $JOBS`
        if [ "$JOBS" != "" ]
        then
            # Continue the stopped process with the least CPU time
            local PID=`ps -o pid= --sort +time $JOBS | head -1`
            echo Continuing $PID...
            kill -SIGCONT $PID
        fi
    }

    stopOne() {
        local JOBS=`jobs -p`
        local JOBS=`filterRunning $JOBS`
        if [ "$JOBS" != "" ]
        then
            # Stop the running process with the most CPU time
            local PID=`ps -o pid= --sort -time $JOBS | head -1`
            echo Stopping $PID...
            kill -SIGSTOP $PID
        fi
    }

    killOne() {
        local JOBS=`jobs -p`
        if [ "$JOBS" != "" ]
        then
            local PID=`ps -o pid= --sort -time $JOBS | head -1`
            killProcess $PID
        fi
    }

    launchOne() {
        echo "Launching \"$COMMAND\"..."
        $COMMAND &
        sleep 1.5
    }

    computeTotaltimeInSecs() {
        # Input: $1 (CPU time as printed by ps, [DD-]HH:MM:SS)
        # Output: total number of seconds, printed to stdout
        local I=$1
        local SECS=${I##*:}
        local REST=${I%:*}
        local MINS=${REST##*:}
        local REST=${REST%:*}
        local HOURS=${REST##*-}
        local DAYS=`expr "$REST" : '\([0-9]*-\)'`
        local DAYS=${DAYS%-}
        if [ "$DAYS" == "" ]
        then
            local DAYS=0
        fi
        if [ "$HOURS" == "" ]
        then
            local HOURS=0
        fi
        if [ "$MINS" == "" ]
        then
            local MINS=0
        fi
        echo "((($DAYS * 24) + $HOURS) * 60 + $MINS) * 60 + $SECS" | bc
    }

    adjustProcesses() {
        local JOBS=`jobs -p`
        local JOBS=`filterRunning $JOBS`
        if [ "$JOBS" != "" ]
        then
            local STOPPID=`ps -o pid= --sort -time $JOBS | head -1`
            local JOBS=`jobs -p`
            local JOBS=`filterStopped $JOBS`
            if [ "$JOBS" != "" ]
            then
                local CONTPID=`ps -o pid= --sort +time $JOBS | head -1`
                # Compute times
                local I=`ps -o time= $STOPPID`
                local STOPSEC=`computeTotaltimeInSecs $I`
                local I=`ps -o time= $CONTPID`
                local CONTSEC=`computeTotaltimeInSecs $I`
                # Compare times: swap only if the running process is more than 5 minutes ahead
                local CT=`echo $CONTSEC+60*5 | bc`
                if [ $STOPSEC -gt $CT ]
                then
                    echo Stopping $STOPPID and continuing $CONTPID
                    kill -SIGSTOP $STOPPID
                    kill -SIGCONT $CONTPID
                fi
            fi
        fi
    }

    # Start programs in the background
    determineToAdd
    for ((i=1;i<=STARTUP;++i));
    do
        launchOne
        if [ $i -gt $ADD ]
        then
            sleep 1
            kill -SIGSTOP %$i
        fi
    done

    # Start mainloop
    while [ 1 ]
    do
        sleep 60

        # Determine how many processes should be added/removed
        determineToAdd

        # Stop/continue processes
        if [ $ADD -gt 0 ]
        then
            # Add processes
            echo ADD:$ADD
            for ((i=0;i<ADD;++i))
            do
                continueOne
            done
        fi
        if [ $ADD -lt 0 ]
        then
            REM=$[-ADD]
            # Remove processes
            echo REMOVE:$REM
            for ((i=0;i<REM;++i))
            do
                stopOne
            done;
        fi

        # Launch new processes or kill running ones
        CURRLAUNCHED=`jobs -p | wc -l`
        if [ $STARTUP != $CURRLAUNCHED ]
        then
            if [ $STARTUP -lt $CURRLAUNCHED ]
            then
                echo kill: $STARTUP $CURRLAUNCHED
                for ((i=STARTUP;i<CURRLAUNCHED;++i));
                do
                    killOne
                done;
            else
                echo add: $CURRLAUNCHED $STARTUP
                for ((i=CURRLAUNCHED;i<STARTUP;++i));
                do
                    launchOne
                done;
            fi
        fi
        sleep 2

        # Adjust
        adjustProcesses
    done
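the balancing step compares accumulated cpu times, which ps prints as [[DD-]HH:]MM:SS. here is a rough sketch of that conversion and of the five-minute comparison without calling bc. the names strip0, time_to_secs and should_swap are made up for this illustration and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from the script above.

strip0() {  # drop one leading zero so that "08" is not parsed as octal
    local v=${1#0}
    echo "${v:-0}"
}

time_to_secs() {  # input: [[DD-]HH:]MM:SS; output: seconds on stdout
    local t=$1 days=0 hours=0 mins secs
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the day count
    esac
    secs=${t##*:}
    t=${t%:*}
    mins=${t##*:}
    case $t in
        *:*) hours=${t%:*} ;;              # an hour field is present
    esac
    days=$(strip0 "$days"); hours=$(strip0 "$hours")
    mins=$(strip0 "$mins"); secs=$(strip0 "$secs")
    echo $(( ((days * 24 + hours) * 60 + mins) * 60 + secs ))
}

should_swap() {  # $1: seconds of the running process, $2: seconds of the stopped one
    [ "$1" -gt $(( $2 + 300 )) ]
}
```

for example, time_to_secs 1-02:03:04 adds up one day, two hours, three minutes and four seconds, and should_swap only succeeds when the first process is more than 300 seconds ahead of the second.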
lately i’ve been listening a lot to some finnish bands and artists, mainly tenhi (väre, maaäet, airut:aamujen), timo rautiainen ja trio niskalaukaus (kylmä tila, rajaportti) and eicca toppinen (black ice soundtrack). below you can listen to the fabulous song suoti from tenhi, a beautiful progressive folk/neofolk piece.
yesterday, a total lunar eclipse (also known as a blood moon in german) occurred. the next total one will be visible here in 2015; the previous one was in 2008. in 2008 i saw nothing but clouds, and yesterday turned out the same. anyway, i spent the evening on the üetliberg and took some nice photos, even though not of the moon.
sunset.
the first photos are from the sunset. in that direction, there weren’t too many clouds. only in the other direction, where the moon was supposed to rise, there were a lot of clouds.
zoomed zürich.
at least, i got some nice zoom shots of zürich. the second photo in the second row shows the eth and university main buildings.
üetliberg.
finally, here are two fisheye shots. the first shows a bit of the cloud cover, and the second one shows the uto kulm. i took the sunset photos from the look-out tower, and the other ones from around the spot where that photo was taken. it was very windy up on the tower, and the zürich photos needed longer exposure times, which resulted in completely blurred photos when i tried taking them from up there.
ever had the problem that you have access to a big machine (with many cores), and you want to run many (tens of thousands) small computations, but you want to make sure that not too many cores are used?
i’ve had this problem, and since i now have a pretty nice (i think so) solution, i thought that maybe more people are interested in it. so here’s my setup. i have a program, let’s call it primefinder, which, for a certain input n (where n is a natural number ≤ 21000), computes a prime of n bits with special properties. the program loops over all possible n, and checks for each n if a file n.prime exists. if it does not, it creates it (with zero content), computes the prime (which can take between minutes and days), writes the prime into the file and continues with the next file. this simple task distribution technique allows me to run the program in parallel on different machines (since the files are in an nfs folder) with many instances on each machine.

now at our institute, we have a big computation machine (64 cores) and four user machines (on which the users work, each with 32 cores). since the user machines are often not used intensively (and if so, only during certain times of the day), i want to use these as well. but enough cores should stay free so that the users won’t notice that there are computations going on in the background. other people also want to run things on the computation server, so there should be some free cores there as well.

optimally, my program would somehow determine how many cores are used by others, and use the rest. or most of them, to leave some free, especially on the user machines.
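to make the claiming step concrete, here is a minimal sketch of how an instance could reserve the next free n. primefinder itself is not shown in this post, so the function name claim_next and its arguments are made up. instead of a separate "check, then create", the sketch uses the shell's noclobber option so that creating the file fails if it already exists. whether that is really atomic on an nfs folder depends on the nfs setup, so treat it as an assumption.

```shell
#!/bin/bash
# Hypothetical sketch: claim the next free work item by creating an empty n.prime file.
claim_next() {  # usage: claim_next DIR MAX; prints the claimed n, fails if none is left
    local dir=$1 max=$2 n=1
    while [ "$n" -le "$max" ]; do
        # With noclobber (set -C) the redirection fails if the file already
        # exists, so checking and creating happen in a single step.
        if ( set -C; : > "$dir/$n.prime" ) 2>/dev/null; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}
```

calling claim_next repeatedly on the same folder hands out 1, 2, 3, … and returns a non-zero status once every file up to MAX exists.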
after a suggestion by our it guys, i started writing a bash script which controls the instances of my program on the same machine. the first version used the time of the day to determine the number of processes. everything was computed in terms of the number of cores of the machine, the load (with a load modifier applied, since some machines have uninterruptible processes running which do not actually do anything, and which won’t go away until the next reboot) and the hour of the day. but it is not easy to find a good scheme which yields good results on all machines. something which works well on the user machines wastes processor time on the computation server.
so today i rewrote the program to use profiles. a profile contains information on the number of cores (this is necessary since the computation server has hyperthreading enabled, and thus returns twice the number of cores), the number of processes to be started, and the number of cores to be left free during each hour and day of a week. so on weekends or nights, i choose lower numbers for the free cores for the user machines, while for the computational server the number is always 1.
a profile can look like this (this one is from a user machine; for later reference, the file is called primefinderrunner-user.profile):
    CORES 32
    STARTUP $[CORES-CORES/8]
    0 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8]
    1 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
    2 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
    3 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
    4 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
    5 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
    6 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16]
the line with prefix CORES gives the number of cores. the line prefixed by STARTUP gives the number of processes to run (at most); here, we use 7/8 of the number of cores. the lines prefixed by a number between 0 (sunday) and 6 (saturday) have 24 entries following: every entry (separated by exactly one space, as the prefix itself is separated by exactly one space from the entries!) says how many cores should be free during that hour of the day. usually during the night (up to 7 am) at least 1/16 of the total number of cores should be free, while during the workday (8 am to 7 pm) half of the cores should be free. of course, the numbers for weekends (saturday and sunday) differ from those for working days.
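as a small illustration of how one entry of such a profile turns into a number, here is a sketch of a lookup for a given day and hour. the function name free_cores is made up, and unlike the script (which uses grep and eval) it picks the field with awk and evaluates the expression with $(( )) after stripping the $[ ] wrapper. it assumes the entries only contain simple arithmetic in terms of CORES.

```shell
#!/bin/bash
# Hypothetical sketch: look up the number of cores to leave free.
free_cores() {  # usage: free_cores PROFILE_FILE DAY HOUR
    local file=$1 day=$2 hour=$3 CORES entry
    CORES=$(awk '$1 == "CORES" { print $2 }' "$file")
    # Field 1 is the day prefix, so hour h is stored in field h + 2.
    entry=$(awk -v d="$day" -v h="$hour" '$1 == d { print $(h + 2) }' "$file")
    # Strip the $[ ... ] wrapper; what is left is an expression in CORES.
    entry=${entry#'$['}
    entry=${entry%']'}
    # Assumption: the entry is plain arithmetic, so expanding it here is safe.
    echo $(( $entry ))
}
```

with the profile above, free_cores primefinderrunner-user.profile 1 9 looks up monday, 9 am, which is the entry $[CORES/2], i.e. half of the 32 cores.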
now the script itself looks like this (for reference, the filename is primefinderrunner.sh):
    #!/bin/bash

    initProfile() {
        PROFILEFN=primefinderrunner-$PROFILE.profile
        CORES=`grep "^CORES " $PROFILEFN`
        CORES=${CORES/CORES }
        STARTUP=`grep "^STARTUP " $PROFILEFN`
        STARTUP=${STARTUP/STARTUP }
        eval STARTUP=$STARTUP
    }

    LOADMODIFIER=0
    if [ "$1" != "" ]
    then
        PROFILE=$1
    else
        PROFILE=`hostname`
    fi
    if [ "$2" != "" ]
    then
        LOADMODIFIER=$2
    fi
    initProfile
    if [ "$CORES" == "" ]
    then
        echo "Cannot load profile $PROFILEFN!"
        exit
    fi
    echo Cores: $CORES
    echo Load modifier: $LOADMODIFIER

    computeFreecores() {
        # uses the current day (0..6) and hour (0..23) as reported by date
        FREECORES=0
        DAY=`date +%w`
        LINE=`grep "^$DAY " $PROFILEFN`
        LINE=${LINE/$DAY }
        HOUR=`date +%k`
        for ((i=0;i<$HOUR;++i));
        do
            LINE=${LINE#* }
        done
        LINE=${LINE/ *}
        eval FREECORES=$LINE
    }

    computeFreecores

    stopsignal() {
        for PID in `jobs -p`;
        do
            FILE=`lsof -p $PID -F n 2>/dev/null | grep primedatabase | grep -v "\\.nfs"`
            A=${FILE#n*}
            A=${A/ (nfs*}
            echo killing $PID with open file $A
            rm $A
            kill $PID
        done
        exit
    }

    trap 'stopsignal' 2

    echo "Starting $STARTUP instances"

    determineToAdd() {
        computeFreecores
        LOAD=`uptime`
        LOAD=${LOAD#*average: }
        LOAD=${LOAD/,*}
        LOAD=${LOAD/.*}
        ADD=$[CORES-FREECORES-LOAD-LOADMODIFIER]
        echo Load: $[LOAD-LOADMODIFIER], Intended number of free cores: $FREECORES
    }

    # Start programs in the background
    determineToAdd
    for ((i=1;i<=STARTUP;++i));
    do
        primefinder &
        sleep 2
    done
    sleep 20
    if [ $ADD -lt 0 ]
    then
        ADD=0
    fi
    for ((i=ADD+1;i<=STARTUP;++i));
    do
        kill -SIGSTOP %$i
    done

    CURRRUNNING=$ADD
    RUNNINGSTART=1 # The first one running
    RUNNINGSTOP=$CURRRUNNING # The last one running

    startOne() {
        # Assume that $CURRRUNNING < $STARTUP
        RUNNINGSTOP=$[(RUNNINGSTOP % STARTUP) + 1]
        kill -SIGCONT %$RUNNINGSTOP
        CURRRUNNING=$[CURRRUNNING+1]
    }

    stopOne() {
        # Assume that $CURRRUNNING > 0
        kill -SIGSTOP %$RUNNINGSTART
        RUNNINGSTART=$[(RUNNINGSTART % STARTUP) + 1]
        CURRRUNNING=$[CURRRUNNING-1]
    }

    # Start mainloop
    while [ 1 ]
    do
        sleep 60

        # Determine how many threads should be added/removed
        determineToAdd
        if [ $ADD -gt 0 ]
        then
            if [ $[ADD+CURRRUNNING] -gt $STARTUP ]
            then
                ADD=$[STARTUP-CURRRUNNING]
            fi
            # Add processes
            echo ADD:$ADD
            for ((i=0;i<ADD;++i))
            do
                startOne
            done
        fi
        if [ $ADD -lt 0 ]
        then
            REM=$[-ADD]
            # Clip
            if [ $REM -gt $CURRRUNNING ]
            then
                REM=$CURRRUNNING
            fi
            # Remove processes
            echo REMOVE:$REM
            for ((i=0;i<REM;++i))
            do
                stopOne
            done
        fi
        sleep 60
    done
the script first starts all instances, then stops the surplus ones, and then starts the main loop. in the main loop, it waits 60 seconds (for the average load to adjust to the new process count), and then decides how many cores should be left free, and what that means for the number of processes (add/remove some). note that the day lines of the profile file are re-read on every pass through the main loop, so the free-core numbers can be changed at any time without any need to re-run the whole thing.
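as a worked example of that decision, here is a small sketch that takes an uptime line apart and computes the number of processes to add from it. the function names parse_load and to_add and the sample uptime line are made up for illustration. the formula is the one from determineToAdd: cores minus free cores minus load minus load modifier.

```shell
#!/bin/bash
# Hypothetical sketch of the add/remove decision.
parse_load() {  # integer part of the 1-minute load average from an uptime line
    local s=$1
    s=${s#*average: }   # assumes the "load average: " wording that uptime prints on linux
    s=${s%%,*}          # keep only the first of the three values
    echo "${s%%.*}"     # cut off the fractional part
}

to_add() {  # usage: to_add CORES FREECORES LOAD LOADMODIFIER
    echo $(( $1 - $2 - $3 - $4 ))
}

# made-up sample line for a 32-core user machine during working hours
sample=' 14:02:11 up 12 days,  3:04,  5 users,  load average: 17.52, 16.90, 15.01'
load=$(parse_load "$sample")
echo "load $load, processes to add: $(to_add 32 16 "$load" 0)"
```

a negative result means that this many processes should be stopped, a positive one that this many stopped processes can be continued.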
in case the script is stopped (with control+c), all primefinder processes are killed and their open files are deleted. to determine the open file, i use lsof with some greps. you have to adjust and test that line before using this script!
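to test that filtering without a live process, the text processing can be separated from the lsof call. the following sketch reads output in the format of lsof -F n from stdin and prints the matching path. the function name open_data_file and the sample input are made up, and the " (nfs...)" suffix is only assumed because the script strips it. check what lsof really prints on your system.

```shell
#!/bin/bash
# Hypothetical sketch: extract the open data file from "lsof -p PID -F n" output.
open_data_file() {  # stdin: lsof -F n output; $1: substring identifying the data folder
    local pattern=$1 line
    while IFS= read -r line; do
        case $line in
            n*) line=${line#n} ;;          # name fields start with "n"
            *)  continue ;;                # skip pid and file descriptor fields
        esac
        line=${line%' (nfs'*}              # drop a trailing " (nfs...)" note, if present
        case $line in
            *.nfs*) ;;                     # ignore nfs placeholder files
            *"$pattern"*) echo "$line" ;;
        esac
    done
}
```

it could then be used as lsof -p $PID -F n 2>/dev/null | open_data_file primedatabase, and tried beforehand on a saved copy of the lsof output.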
note that this script is quite a hack, and far from perfect. and it is somewhat system dependent, or at least “setup dependent”, since it makes certain assumptions about the executables, about what the output of lsof looks like, … so better make sure it works before you use it, especially on bigger systems. also note that in the beginning, all instances are run (they are started with a two second delay between two instances), and then everything runs for 20 seconds before the first adjustment (i.e. stopping the surplus processes) is made. if you share the system with other people, this might already annoy others when they try to measure timings of their programs (especially if hyperthreading is enabled).
this long weekend i was staying at my parents’ place. here are some impressions, both from my parents’ garden and from a bike tour to olfen.
garden.
surroundings.
in these pictures, you can see, among other things, konik horses and heck cattle. i wish i had had my longer telephoto lens with me…