ever had the problem that you have access to a big machine (with many cores), and you want to run many (tens of thousands) small computations, but you want to make sure that not too many cores are used?
i’ve had this problem, and since i now have a pretty nice (i think so) solution, i thought that maybe more people are interested in it. so here’s my setup. i have a program, let’s call it primefinder, which, for a certain input n (where n is a natural number ≤ 21000), computes a prime of n bits with special properties. the program loops over all possible n, and checks for each n whether a file n.prime exists. if it does not, it creates it (with zero content), computes the prime (which can take between minutes and days), writes the prime into the file and continues with the next file. this simple task distribution technique allows me to run the program in parallel on different machines (since the files are in an nfs folder) with many instances on each machine.
now at our institute, we have a big computation machine (64 cores) and four user machines (on which the users work, each 32 cores). since the user machines are often not used intensively (and if so, only during certain times of the day), i want to use these as well. but there should be enough cores free, so the users won’t notice that there are computations going on in the background. on the computation server, other people also want to run something, so there should be some free cores there as well. optimally, my program would somehow determine how many cores are used by others, and use the rest. or most of them, to leave some free, especially on the user machines.
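the file-claiming step described above (check for n.prime, create it empty, then compute) can be sketched in bash as follows. this is only an illustration, not the actual primefinder code: the function names claim and next_unclaimed and the directory argument are made up here, and using noclobber to make "check and create" a single step is an assumption on my side. on an nfs folder, exclusive creation may be less reliable than on a local disk, so treat it as best effort.

```bash
# claim DIR N: try to create DIR/N.prime (empty).
# succeeds only if the file did not exist before.
claim() {
    local file="$1/$2.prime"
    # with noclobber set, ">" refuses to overwrite an existing file.
    # the subshell keeps the option from leaking into the caller.
    ( set -o noclobber; : > "$file" ) 2>/dev/null
}

# next_unclaimed DIR MAX: claim and print the first free n in 1..MAX.
# returns 1 if every n is already taken.
next_unclaimed() {
    local dir="$1" max="$2" n
    for ((n = 1; n <= max; ++n)); do
        if claim "$dir" "$n"; then
            echo "$n"
            return 0
        fi
    done
    return 1
}
```

a worker would call next_unclaimed in a loop, compute the prime for the returned n, and write it into the claimed file.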
after a suggestion by our it guys, i started writing a bash script which controls the instances of my program on the same machine. the first version used the time of the day to determine the number of processes. everything was computed in terms of the number of cores of the machine, the load (with a load modifier applied, since some machines have uninterruptible processes running which do not actually do anything, and which won’t go away until the next reboot) and the hour of the day. but it is not easy to find a good scheme which yields good results on all machines. something which works well on the user machines wastes processor time on the computation server.
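the load figure mentioned above can be taken from the output of uptime, which is what the script further down does. here is a minimal sketch, assuming the usual "load average: a, b, c" output; the function name effective_load is made up for this illustration. judging from the script, the modifier is added to the load, so a machine with two stuck processes would presumably get a modifier of -2.

```bash
# effective_load UPTIME_TEXT [MODIFIER]
# prints the integer part of the 1-minute load average plus MODIFIER.
effective_load() {
    local text="$1" modifier="${2:-0}" load
    load=${text#*average: }     # keep e.g. "3.42, 2.10, 1.05"
    load=${load%%[.,]*}         # integer part of the first number
    echo $(( load + modifier ))
}
```

typical use would be effective_load "$(uptime)" -2.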
so today i rewrote the program to use profiles. a profile contains information on the number of cores (this is necessary since the computation server has hyperthreading enabled, and thus reports twice the number of cores), the number of processes to be started, and the number of cores to be left free during each hour of each day of the week. so on weekends or nights, i choose lower numbers of free cores for the user machines, while for the computation server the number is always 1.
a profile can look like this (this is from a user machine, the file is called primefinderrunner-user.profile for later reference):

1CORES 32
2STARTUP $[CORES-CORES/8]
30 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8]
41 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
52 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
63 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
74 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
85 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/2] $[CORES/4] $[CORES/4] $[CORES/4] $[CORES/8]
96 $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/8] $[CORES/16] $[CORES/16] $[CORES/16] $[CORES/16]

the line with prefix CORES gives the number of cores. the line prefixed by STARTUP gives the number of processes to run (at most); here, we use 7/8 of the number of cores. the lines prefixed by a number between 0 (sunday) and 6 (saturday) are followed by 24 entries, one per hour of the day: every entry (separated by exactly one space, just as the prefix itself is separated from the entries by exactly one space!) says how many cores should be free during that hour. usually during the night (hours 0 to 7) at least 1/16 of the total number of cores should be free, while during the working day (hours 8 to 19) half of the cores should be free. of course, the numbers for the weekend (saturday and sunday) differ from those for the working days.
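to make the profile format concrete, here is a small sketch of how the entry for a given day and hour can be looked up. it is not the lookup used in the script (which shifts the line word by word and uses eval); the function name free_cores and the array-based approach are assumptions for this illustration. it strips the $[ ] wrapper and lets bash arithmetic resolve CORES.

```bash
# free_cores PROFILE DAY HOUR
# DAY is 0 (sunday) .. 6 (saturday), HOUR is 0 .. 23 without leading zero.
free_cores() {
    local profile="$1" day="$2" hour="$3" line entry
    local -a fields
    local CORES
    # "CORES 32" -> 32
    CORES=$(awk '$1 == "CORES" { print $2 }' "$profile")
    # day line: first field is the day number, then 24 hourly entries
    line=$(grep "^$day " "$profile") || return 1
    read -r -a fields <<< "$line"
    entry=${fields[hour + 1]}
    # turn '$[CORES/16]' into 'CORES/16' and evaluate it without eval
    entry=${entry#'$['}
    entry=${entry%']'}
    echo $(( entry ))
}
```

with the profile above, free_cores primefinderrunner-user.profile 1 3 asks for monday, hour 3, where the entry is $[CORES/16], i.e. 32/16 = 2 free cores.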
now the script itself looks like this (for reference, the filename is primefinderrunner.sh):

  1#!/bin/bash
  2 
  3initProfile() {
  4    PROFILEFN=primefinderrunner-$PROFILE.profile
  5    CORES=`grep "^CORES " $PROFILEFN`
  6    CORES=${CORES/CORES }
  7    STARTUP=`grep "^STARTUP " $PROFILEFN`
  8    STARTUP=${STARTUP/STARTUP }
  9    eval STARTUP=$STARTUP
 10}
 11 
 12LOADMODIFIER=0
 13if [ "$1" != "" ]
 14then
 15    PROFILE=$1
 16else
 17    PROFILE=`hostname`
 18fi
 19if [ "$2" != "" ]
 20then
 21    LOADMODIFIER=$2
 22fi
 23initProfile
 24if [ "$CORES" == "" ]
 25then
 26    echo "Cannot load profile $PROFILEFN!"
 27    exit
 28fi
 29echo Cores: $CORES
 30echo Load modifier: $LOADMODIFIER
 31 
 32computeFreecores() { 
 33    # takes no arguments: reads the current day (0..6) and hour (0..23) via date
 34    FREECORES=0
 35    DAY=`date +%w`
 36    LINE=`grep "^$DAY " $PROFILEFN`
 37    LINE=${LINE/$DAY }
 38    HOUR=`date +%k`
 39    for ((i=0;i<$HOUR;++i));
 40    do
 41        LINE=${LINE#* }
 42    done
 43    LINE=${LINE/ *}
 44    eval FREECORES=$LINE
 45}
 46 
 47computeFreecores
 48 
 49stopsignal() {
 50    for PID in `jobs -p`;
 51    do
 52        FILE=`lsof -p $PID -F n 2>/dev/null | grep primedatabase | grep -v "\\.nfs"`
 53        A=${FILE#n*}
 54        A=${A/ (nfs*}
 55        echo killing $PID with open file $A
 56        rm $A
 57        kill $PID
 58    done
 59    exit
 60}
 61 
 62trap 'stopsignal' 2
 63 
 64echo "Starting $STARTUP instances"
 65 
 66determineToAdd() {
 67    computeFreecores
 68    LOAD=`uptime`
 69    LOAD=${LOAD#*average: }
 70    LOAD=${LOAD/,*}
 71    LOAD=${LOAD/.*}
 72    ADD=$[CORES-FREECORES-LOAD-LOADMODIFIER]
 73    echo Load: $[LOAD-LOADMODIFIER], Intended number of free cores: $FREECORES
 74}
 75 
 76# Start programs in the background
 77determineToAdd
 78for ((i=1;i<=STARTUP;++i));
 79do
 80    primefinder &
 81    sleep 2
 82done
 83sleep 20
 84if [ $ADD -lt 0 ]
 85then
 86    ADD=0
 87fi
 88for ((i=ADD+1;i<=STARTUP;++i));
 89do
 90    kill -SIGSTOP %$i
 91done
 92 
 93CURRRUNNING=$ADD
 94RUNNINGSTART=1 # The first one running
 95RUNNINGSTOP=$CURRRUNNING # The last one running
 96 
 97startOne() {
 98    # Assume that $CURRRUNNING < $STARTUP
 99    RUNNINGSTOP=$[(RUNNINGSTOP % STARTUP) + 1]
100    kill -SIGCONT %$RUNNINGSTOP
101    CURRRUNNING=$[CURRRUNNING+1]
102}
103 
104stopOne() {
105    # Assume that $CURRRUNNING > 0
106    kill -SIGSTOP %$RUNNINGSTART
107    RUNNINGSTART=$[(RUNNINGSTART % STARTUP) + 1]
108    CURRRUNNING=$[CURRRUNNING-1]
109}
110 
111# Start mainloop
112while [ 1 ]
113do
114    sleep 60
115 
116    # Determine how many threads should be added/removed
117    determineToAdd
118    if [ $ADD -gt 0 ]
119    then
120        if [ $[ADD+CURRRUNNING] -gt $STARTUP ]
121        then
122            ADD=$[STARTUP-CURRRUNNING]
123        fi
124        # Add processes
125        echo ADD:$ADD
126        for ((i=0;i<ADD;++i))
127        do
128            startOne
129        done
130    fi
131    if [ $ADD -lt 0 ]
132    then
133        REM=$[-ADD]
134        # Clip
135        if [ $REM -gt $CURRRUNNING ]
136        then
137            REM=$CURRRUNNING
138        fi
139        # Remove processes
140        echo REMOVE:$REM
141        for ((i=0;i<REM;++i))
142        do
143            stopOne
144        done
145    fi
146    sleep 60
147done

the script first starts all instances, then stops the surplus ones, and then enters the main loop. in each pass of the main loop, it waits (60 seconds at the start of the pass and another 60 at the end, so that the load average can adjust to the new process count), then determines how many cores should be left free, and what that means for the number of processes (resume or stop some). note that the day lines of the profile file are re-read on every pass, so the free-core numbers can be changed at any time without re-running the whole thing; the CORES and STARTUP lines are only read once at startup.
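the decision made in each pass can be condensed into one small function. this is a sketch that mirrors the arithmetic and the clipping of the script above; the function name adjustment and its argument order are made up for this illustration.

```bash
# adjustment CORES FREECORES LOAD LOADMODIFIER STARTUP RUNNING
# prints how many processes to resume (positive) or stop (negative).
adjustment() {
    local cores=$1 free=$2 load=$3 modifier=$4 startup=$5 running=$6
    local add=$(( cores - free - load - modifier ))
    if (( add > 0 && add + running > startup )); then
        add=$(( startup - running ))    # never run more than STARTUP
    elif (( add < 0 && -add > running )); then
        add=$(( -running ))             # cannot stop more than are running
    fi
    echo "$add"
}
```

for example, on a 32 core machine with 16 cores to be left free, a load of 10, no modifier, 28 started instances and 4 of them running, the result is 32 - 16 - 10 - 0 = 6, so six more instances are resumed. with a load of 30 the raw value is -14, which is clipped to -4 since only 4 instances are running.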
in case the script is stopped (with control+c), all primefinder processes are killed and their open file is deleted. to determine the open file, i use lsof with some greps. you have to adjust and test that line before using this script!
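since that lsof line is the most fragile part, here is a sketch of the same filtering as a function that can be tested on canned input before it is trusted with rm. it assumes the field output of lsof -F n (one field per line, file names prefixed with n) and the " (nfs…" suffix that the script above strips; the function name open_names is made up for this illustration.

```bash
# open_names PATTERN: reads `lsof -F n` output on stdin and prints the
# names of open files containing PATTERN, skipping .nfs temp files.
open_names() {
    local pattern="$1" field
    while IFS= read -r field; do
        case "$field" in
            n*"$pattern"*)
                field=${field#n}          # drop the field identifier
                field=${field% (nfs*}     # drop a trailing " (nfs...)" note
                case "$field" in
                    *.nfs*) ;;            # skip .nfs files, like grep -v
                    *) echo "$field" ;;
                esac
                ;;
        esac
    done
}
```

in the script this would be used as lsof -p $PID -F n 2>/dev/null | open_names primedatabase, after checking against real lsof output on the machine in question.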
note that this script is quite a hack, and far from perfect. and it is somehow system dependent, or at least “setup dependent” since it makes certain assumptions about the executables, about what the output of lsof looks like, … so better make sure it works before you use it, especially on bigger systems. also note that in the beginning, all instances are run (they are started with a two second delay between two instances), and then everything runs for 20 seconds before the first adjustment (i.e. stopping the surplus processes) is made. if you share the system with other people, this might already annoy others when they try to measure timings of their programs (especially if hyperthreading is enabled).

posted in: computer

comments.

felix wrote on june 24, 2011 at 12:24:

i published a newer version of the script here. also note that this version has a small bug in line 73: the load is given incorrectly, the output must be $[LOAD+LOADMODIFIER]. this is fixed in the newer version.