CERN HTCondor GPUs Users Guide


In this tutorial I will show how to set up your environment, write the scripts, and submit jobs that use the GPUs in the HTCondor batch system at CERN. The objective is to train machine learning models, with or without Jupyter notebooks. This guide takes only the parts of the official tutorial that are required for GPU processing.

The most basic test

The objective is to get an interactive session on a GPU machine quickly. To do that, follow these steps.

  1) Connect to lxplus.

  2) Write a submit file with the following content:

universe                = vanilla
arguments               = 
log                     = test.log
output                  = outfile.$(Cluster).$(Process).out
error                   = errors.$(Cluster).$(Process).err
# required to get a node with a GPU
request_gpus            = 1
# required to get priority in the queue (only for short tests)
+testJob                = True

  3) Submit the job with:
condor_submit -interactive

When you get the prompt, you can load the CVMFS machine learning stack, which is already compiled with CUDA:

source /cvmfs/ 

Let's check that the GPU works in TensorFlow. Open ipython and run:

import tensorflow as tf 

The result should be True.
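
The check itself is not shown above. A minimal sketch, assuming TensorFlow 2.x from the sourced stack, is to ask TensorFlow which GPU devices it can see. The helper name `gpu_visible` is hypothetical, and it returns None when TensorFlow cannot be imported, so it can be run on any machine.

```python
def gpu_visible():
    """Return True/False if TensorFlow sees a GPU, or None without TensorFlow."""
    try:
        import tensorflow as tf  # assumed to come from the sourced environment
    except ImportError:
        return None
    # list_physical_devices returns one entry per GPU visible to TensorFlow
    return len(tf.config.list_physical_devices("GPU")) > 0


if __name__ == "__main__":
    print(gpu_visible())
```

On a GPU node with a working stack this should print True; False means TensorFlow was imported but found no GPU.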

  4) To check the GPU stats, run
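
The command for this step is not shown above. On NVIDIA nodes the usual tool is `nvidia-smi`. The sketch below is an assumption about how one might call it from Python and parse its CSV output; the helper names `parse_gpu_stats` and `gpu_stats` are hypothetical.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in output order.
QUERY = "name,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(text):
    """Turn nvidia-smi CSV output (no header) into one dict per GPU."""
    keys = QUERY.split(",")
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(keys, values)))
    return rows


def gpu_stats():
    """Return per-GPU stats, or None if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    print(gpu_stats())
```

Running plain `nvidia-smi` in the interactive session gives the same information as a formatted table.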

Training with python script

In this section the objective is to train a basic model in TensorFlow without an interactive session. For that purpose we need to create two scripts: one with the TensorFlow code and a second one in bash to load the environment. The model, based on an existing tutorial, is just an illustrative example where Keras is taken from TensorFlow.

Write the three files below (the Python script, the submit file, and the bash script), then submit the job.

from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
dataset = loadtxt('./', delimiter=',')
# split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10)
# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))

universe                = vanilla
executable              =
arguments               = 
log                     = test.log
output                  = outfile.$(Cluster).$(Process).out
error                   = errors.$(Cluster).$(Process).err
should_transfer_files   = NO
+JobFlavour = "espresso"
#espresso     = 20 minutes
#microcentury = 1 hour
#longlunch    = 2 hours
#workday      = 8 hours
#tomorrow     = 1 day
#testmatch    = 3 days
#nextweek     = 1 week
request_gpus            = 1
# queue one instance of the job
queue
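
The flavour comments in the submit file map each name to a maximum run time. As an illustration, the hypothetical helper below picks the shortest flavour that covers an estimated run time, using only the durations listed in those comments.

```python
# Maximum run time per flavour in minutes, copied from the submit-file comments.
FLAVOURS = [
    ("espresso", 20),
    ("microcentury", 60),
    ("longlunch", 2 * 60),
    ("workday", 8 * 60),
    ("tomorrow", 24 * 60),
    ("testmatch", 3 * 24 * 60),
    ("nextweek", 7 * 24 * 60),
]


def pick_flavour(minutes):
    """Return the shortest flavour whose limit covers `minutes`."""
    for name, limit in FLAVOURS:
        if minutes <= limit:
            return name
    raise ValueError("run time exceeds the longest flavour (nextweek)")


def job_flavour_line(minutes):
    """Format the matching submit-file line."""
    return f'+JobFlavour = "{pick_flavour(minutes)}"'


if __name__ == "__main__":
    # A training run estimated at 90 minutes
    print(job_flavour_line(90))
```

Choosing the shortest flavour that fits is a reasonable default, since the flavour sets the upper limit on how long the job may run.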

source /cvmfs/

To submit the job, run
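
The submit command itself is not shown above. HTCondor jobs are submitted with `condor_submit`, passing the submit file as its argument. The sketch below wraps that call in Python; the file name `train.sub` and the helper names are hypothetical placeholders, not names from this guide.

```python
import shutil
import subprocess


def submit_command(submit_file, interactive=False):
    """Build the condor_submit command line for a submit file."""
    cmd = ["condor_submit"]
    if interactive:
        cmd.append("-interactive")  # same flag used in the basic test
    cmd.append(submit_file)
    return cmd


def submit(submit_file, interactive=False):
    """Run condor_submit; return its exit code, or None if it is not installed."""
    cmd = submit_command(submit_file, interactive)
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # "train.sub" is a hypothetical name for the submit file shown earlier.
    print(" ".join(submit_command("train.sub")))
```

Once the job is submitted, `condor_q` lists it in the queue, and the log, output, and error files named in the submit file record its progress.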