Using the course HPC servers
Those of you who are officially enrolled in the course will have access to dedicated high-performance computing servers (rishon1-4) provisioned by the Computer Science faculty IT department. Running on the faculty servers will give you access to more computing power and fast GPUs (which will greatly accelerate your deep-learning tasks). This should significantly speed up your workflow when performing the course homework assignments and when implementing your final project.
These servers are mainly suited for running batch jobs which you can submit to dedicated job queues and be notified upon completion. We therefore recommend you install and work on the assignments locally (on your own machine), and only use the faculty servers when you need to run a long model training task (we will specify in the assignment).
Logging in
Logging in is performed with your Technion Single Sign-On (SSO) credentials. Usually this means the username and password of your @campus or @technion email address.
If your username is e.g. user, log in like so:
ssh user@rishon.technion.ac.il
or, directly using the server's IP:
ssh user@132.68.39.36
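To save some typing, you can also add a host entry to your local SSH configuration. This is a minimal sketch, assuming your Technion username is user and that you connect from inside the Technion network; adjust the values to your own setup:
# ~/.ssh/config (on your own machine)
Host rishon
    HostName rishon.technion.ac.il
    User user
With this in place, ssh rishon is equivalent to the first command above.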
Notes:
- Your credentials will only work after we pass a final list of registered students to the faculty IT department. This will happen during the first 2-3 weeks of the semester.
- These servers are only directly accessible from within the Technion networks. If connecting over WiFi, do not use the TechPublic network, as it won't allow you to connect. The TechSec network will work, as well as other non-open faculty networks (e.g. CS-WIFI).
- rishon is a gateway server that you connect to in order to run jobs on the actual compute nodes (rishon1-4), as explained below. You should not run any computations on rishon itself, as it does not have a GPU and is limited in CPU resources.
- In some internal Technion networks the DNS lookup seems to not find the rishon hostname. If you get a could not resolve hostname error, use the second option (connecting directly with the IP).
Connecting from home
The easiest way is to configure a VPN connection to the Technion. See the instructions on the Technion CIS website regarding how to set this up. After you connect through the VPN, you can connect to the server as normal. Note that we cannot provide you with technical support regarding how to set up or use the VPN. You can contact CIS for support.
Another way is to first SSH into a Technion server that's accessible from the outside (e.g. CSM, CSL) and from there SSH into rishon. You can do this in one command like so:
ssh -J user@csm.cs.technion.ac.il user@rishon.cs.technion.ac.il
This example will connect through the CSM server in the CS faculty. You should be able to use other Technion servers that you have SSH access to.
This method (-J) has the useful advantage that the SSH key from your local machine (if available) is used to authenticate to the target machine through the intermediate machine.
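If you connect from home frequently, you can make this jump automatic via your local SSH configuration instead of typing the -J option each time. This is a sketch under the same assumptions as above (CSM as the intermediate server, user as your username on both servers); the rishon-home alias is just an example name:
# ~/.ssh/config (on your own machine)
Host rishon-home
    HostName rishon.cs.technion.ac.il
    User user
    ProxyJump user@csm.cs.technion.ac.il
After adding this, ssh rishon-home performs the same two-hop connection as the command above.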
Notes:
- Unfortunately the t2/lux student server cannot be used to access the Technion network from outside due to CIS policy.
- If you use CSM, note that the credentials for the CSM server and rishon are not the same: CSM uses the CS-faculty credentials while the rishon server uses the Technion SSO credentials.
- We cannot provide you with credentials to any such server (CSM/CSL/other Technion servers).
Server Usage
General
The faculty HPC server cluster is composed of a gateway server, rishon, into which you log in with SSH, and four compute nodes, rishon1-4, which run the actual computations. The gateway server is relatively weak and has no attached GPUs, so it should not be used for running computations.
Your home directory on the gateway server (e.g. /home/user) is automatically mounted on all the compute nodes. This ensures that any programs you install locally under your home folder (for example a conda environment) will be available for jobs running on these nodes. In fact, the first thing you should do after connecting for the first time is to install conda and the course conda environment for your user account.
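As a rough sketch, installing miniconda into your home directory and creating the course environment might look like the following. The installer URL is the generic Linux x86_64 one from the miniconda site, and environment.yml is assumed to be the environment file from the homework repo; check the assignment instructions for the exact, up-to-date steps:
# Run on the gateway server, from your home directory
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/etc/profile.d/conda.sh
# Create and activate the course environment from the homework repo's environment file
conda env create -f environment.yml
conda activate cs236781-hw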
The computation tasks are managed by a job scheduling system called slurm. The system manages the compute nodes and resources and allocates them to jobs submitted by users into a queue (“partition”).
If you wish, you can read the slurm quick start guide to get a better understanding of the system and the available commands.
The most useful slurm commands for our needs are:
- srun, which submits an interactive job;
- sbatch, which submits a batch job;
- squeue, which shows the status of jobs in the queue;
- scancel, which cancels a submitted job.
Running interactive jobs
An interactive job allows you to view its output and interact with it in real time, as if it were running on the machine you're logged in to.
Submitting an interactive job is performed with the srun command. Required resources can be specified, and if they're available the job starts running immediately.
Example
Let's see how to run an ipython console session as an interactive job with an allocated GPU.
(cs236781-hw) avivr@rishon:~/cs236781-hw1$ srun -c 2 --gres=gpu:1 --pty ipython
cpu-bind=MASK - rishon1, task 0 0 [15995]: mask 0x100000001 set
cpu-bind=MASK - rishon1, task 0 0 [15995]: mask 0x100000001 set
Python 3.7.0 (default, Oct 9 2018, 10:31:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: True
In [3]: t = torch.tensor([1,2,3], dtype=torch.float).cuda()
In [4]: t.dot(t)
Out[4]: tensor(14., device='cuda:0')
Here the -c 2 and --gres=gpu:1 options specify that we want to allocate 2 CPU cores and one GPU to the job, the --pty option is required for the session to be interactive, and the last argument, ipython, is the command to run. You can specify any command and also add command arguments after it.
Notes:
- You should use interactive jobs for debugging or running short one-off tasks. If you need to run something long, submit a batch job instead.
- When you submit an interactive job, your shell is blocked (by srun) until it completes. If you terminate srun, it will cancel your job. Crucially, this means that if you log out of the machine while running an interactive job, the job will terminate (as with regular processes you invoke from the shell). You can get around this by either:
  - Using a terminal multiplexer, e.g. screen or tmux;
  - Running with nohup;
  - Running a batch job instead (preferred). See below.
  The last method is preferred because interactive jobs run with srun may be terminated after running for a few hours due to policy.
- You should activate your conda env before running an interactive job if you need to run python. The shell environment variables will be passed to the process that will run your job on the compute node, so the conda env will effectively also be active there.
- You can specify bash as the command to run in an interactive job to get a shell on one of the compute nodes (see the example below).
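For example, this is a minimal way to get an interactive shell with one GPU on a compute node, using the same resource flags as in the ipython example above (adjust them as needed):
srun -c 2 --gres=gpu:1 --pty bash
Remember to exit the shell when you're done, so that the allocated resources are released.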
Running batch jobs
A batch job is submitted to the queue with the sbatch command. It runs non-interactively when resources are available and sends its output to files that you can specify. Additionally, it can notify you by email when the job starts and finishes.
Running jobs with sbatch is useful for long-running processes such as training models. While the job is running, it's not connected to any specific shell session and thus it keeps running if you log out of the machine. To view output from a batch job, you'll need to read it from the file it writes to.
To use sbatch, you need to create a script for it to run. It can be any script with a valid shebang line (#!) at the top, e.g. a bash script or a python script.
Example
Let's create a file ~/myscript.sh on the server with the following contents:
#!/bin/bash
# Setup env
source $HOME/miniconda3/etc/profile.d/conda.sh
conda activate cs236781-hw
echo "hello from $(python --version) in $(which python)"
# Run some arbitrary python
python -c 'import torch; print(f"i can haz gpu? {torch.cuda.is_available()}")'
Then we can run the script as a slurm batch job as follows:
avivr@rishon:~$ sbatch -c 2 --gres=gpu:1 -o slurm-test.out -J my_job myscript.sh
Submitted batch job 114425
avivr@rishon:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
114425 236781 my_job avivr R 0:01 1 rishon3
avivr@rishon:~$ tail -f slurm-test.out
cpu-bind=MASK - rishon3, task 0 0 [20442]: mask 0x100000001 set
cpu-bind=MASK - rishon3, task 0 0 [20442]: mask 0x100000001 set
hello from Python 3.7.0 in /home/avivr/miniconda3/envs/cs236781-hw/bin/python
i can haz gpu? True
Here the -c 2 and --gres=gpu:1 options specify that we want to allocate 2 CPU cores and one GPU to the job, the -o slurm-test.out option specifies where to write the output from the process, and -J my_job is an arbitrary name we can assign to the job.
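To get the email notifications mentioned above (assuming mail delivery is configured on the cluster), you can also pass slurm's standard mail options to sbatch; the address below is just a placeholder:
sbatch -c 2 --gres=gpu:1 -o slurm-test.out -J my_job --mail-type=BEGIN,END,FAIL --mail-user=user@campus.technion.ac.il myscript.sh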
Viewing status
After submitting a batch job, you can use squeue to view its status in the queue, as shown in the example above. You can see the job name and its id there.
Viewing output
Each job you submit causes a text file to be created in your current directory, named slurm-<jobid>.out (unless you specify a different name with -o, as in the example above).
To view the output from a job in real time, you can use tail -f or less -r +F on the output file for the relevant job. less also allows you to scroll back.
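For example, assuming a job whose id is 114425 and which uses the default output file name:
tail -f slurm-114425.out
or, if you also want to be able to scroll back:
less -r +F slurm-114425.out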
Canceling
To cancel a batch job you've submitted (whether it's running or waiting in the queue), run scancel <job-id>, where <job-id> is the id you received when starting the batch job.
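For example, to cancel the batch job submitted in the sbatch example above:
scancel 114425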
Course helper script
To slightly simplify your workflow on the server, we provide you with a simple script to run python code from the course conda env as a slurm batch job.
The homework assignment repos contain a script called py-sbatch.sh. You can use this script as if it were the python command, and it will activate the conda env for you and execute your provided python code with sbatch.
For example, let's say we want to run all our notebooks with the main.py script. Instead of
conda activate cs236781-hw
python main.py run-nb *.ipynb
which will run on the gateway server, do this:
./py-sbatch.sh main.py run-nb *.ipynb
This will take care of activating the conda env and running the script on the more powerful compute nodes as a batch job. The script has some declared variables which you can edit to configure the sbatch parameters such as computational resources, notification email address and others.
Note that for the above example it may have been more straightforward to use an interactive job (srun). However, this script may be useful when you need to create a batch job running a python script, for example to run long training tasks.
In any case, this script is completely optional since you can always use sbatch directly as shown in the previous section.
Running jupyter
You can run jupyter on a compute node by creating a script that exposes the IP of the compute node as the jupyter server URL.
For example, if you create a script jupyter-lab.sh like so:
#!/bin/bash
unset XDG_RUNTIME_DIR
jupyter lab --no-browser --ip=$(hostname -I) --port-retries=100
then you can start the jupyter lab server with srun, e.g.
srun -c 2 --gres=gpu:1 --pty jupyter-lab.sh
The connection URL in the console will show the IP of the compute node that the server is actually running on.
We’ll provide you with a similar script in the assignment repos.
Note: As mentioned previously, interactive jobs are not meant to be long-running. Please be considerate of other students and use the computing resources only as needed. For long-running jobs use sbatch.
Accessing jupyter from home
Although the rishon servers are only accessible from within the Technion networks, it's possible to connect from home to a jupyter instance running on them by using a combination of SSH port forwarding and an intermediate server.
- Follow the instructions above to start jupyter on one of the compute nodes.
- Observe the IP and port of the jupyter server specified on the command line. Let's assume you got this line after jupyter started:
  [I 21:39:07.830 LabApp] http://132.68.39.38:8888/?token=abcdef0123...
- If connecting from home using a VPN, simply point your browser to the above URL.
- If not using a VPN, let's assume you have SSH access from home to another Technion server, such as CSM as in the previous examples. Then you can run the following from your machine (from a different terminal):
  ssh -L 9999:132.68.39.38:8888 -J user@csm.cs.technion.ac.il user@rishon
  This creates a local port forwarding from port 9999 on your localhost to 132.68.39.38:8888 through an SSH tunnel via the CSM machine, and also gives you a new SSH session on rishon which you can work from.
- To connect to the jupyter lab server from home, you can now point your local browser to localhost:9999.
Tips
Public-key based authentication
You can use public-key based authentication to avoid having to type your password when connecting to remote servers over SSH.
1. Generate an SSH key pair using the ssh-keygen tool. More detailed instructions for all platforms can be found here.
2. Copy the public key. By default it's in ~/.ssh/id_rsa.pub. Make sure you copy it exactly, without any extra spaces or newlines.
3. Connect to your user on the machine and paste the public key contents into a new line in ~/.ssh/authorized_keys.
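For example, generating an RSA key pair (matching the default ~/.ssh/id_rsa.pub path mentioned above) can be done like so; accept the defaults when prompted:
ssh-keygen -t rsa -b 4096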
Notes:
- On macOS and Linux, there's a utility you can use to automate steps 2-3.
After generating the key pair, copy the public key to the server like so:
ssh-copy-id user@rishon.cs.technion.ac.il
- If you use an intermediate server to connect from home, make sure to first also copy your public key to that server.
After generating your key pair, you should also add it to your github account. After that, you can use SSH remote-URLs (instead of HTTPS) to clone repos and avoid having to specify your username and password when pushing, pulling and fetching.
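For example, cloning with an SSH remote-URL, or switching an existing clone from HTTPS to SSH, looks like this (the repository path is just a placeholder; use the SSH URL shown on your repo's github page):
git clone git@github.com:your-user/your-repo.git
git remote set-url origin git@github.com:your-user/your-repo.git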
Transferring files to and from the server
The rsync tool can be your friend. It can automatically sync between local and remote folders, only uploading/downloading modified files.
For example, to send files or a directory you can do:
rsync -Cavz path/to/local/file_or_dir user@rishon:/home/user/path/to/remote/file_or_dir
To send files from home via an intermediate server (in this example CSM):
rsync -Cavz -e 'ssh -A -J user@csm.cs.technion.ac.il' path/to/local/file_or_dir user@rishon:/home/user/path/to/remote/file_or_dir
To download files from the server to your computer, simply change the order of the last two arguments in the above examples.
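For example, downloading a file or directory from the server to your machine (the reverse of the first example above):
rsync -Cavz user@rishon:/home/user/path/to/remote/file_or_dir path/to/local/file_or_dir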
GUI option for macOS users
Cyberduck is a free remote file browser that you can use to copy files to/from the server using a GUI.
GUI option for Windows users
Many people recommend MobaXterm as a good graphical SSH client for Windows. Here's a useful guide for using it to connect to the server.