RSH 14: How to tame the cluster
You've got code, you've got the cluster. Now we connect them.
Video
Collaborative notes taken during the session
Icebreaker question
Something you know/use now on a supercomputer that you wish somebody had told/shown you years ago:
- documentation
- safe settings for bash scripts to stop at undefined variables (see the sketch after this list)
- submit scripts don't have to be in bash
- how to use scp (example after this list)
- minimalism (minimum environment, minimum options/specification, no cargo culting)
- containers, venv and other means of isolation
- messy setup -> non-reproducible mess
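The "safe settings" point can be made concrete with the `set` builtin. A minimal sketch of a submit-script header using these options (the resource values and the OUTDIR variable are only placeholders):

```bash
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1G

# Fail fast: -e exits on any command error, -u treats unset variables
# as errors, -o pipefail makes a pipeline fail if any stage fails.
set -euo pipefail

# With -u, using an unset variable aborts the script here instead of
# silently expanding to an empty string.
echo "Output directory: ${OUTDIR}"
```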
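For copying files to and from the cluster, `scp` follows the usual `scp source destination` pattern. A couple of illustrative commands (the user name, host name, and paths are made up):

```bash
# Copy a local script to a project directory on the cluster
scp submit.sh username@cluster.example.org:~/project/

# Copy a results directory from the cluster back to the current
# local directory (-r copies recursively)
scp -r username@cluster.example.org:~/project/results ./
```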
Questions
- Maybe I missed it, but do you sometimes check the different nodes to see whether each of them is being used?
    - The SLURM scheduler should only be giving you an open node to run your tasks.
- Does this cluster have hyperthreading enabled?
    - This is really good to check (one way to check is sketched after this list).
- Is it always like that, #cpus = #threads?
    - For purely shared-memory jobs like this one, yes.
    - For hybrid OpenMP+MPI calculations the number of MPI tasks and the number of threads per task are requested separately, so the two counts no longer coincide (sketch after this list).
- --mem vs --mem-per-cpu
    - The former allocates memory per node, the latter per allocated CPU core (example after this list).
- How do you determine whether you should use OpenMP/MPI or array jobs for your task?
    - If the tasks are independent and don't need to communicate, array jobs are simpler and don't require changing the code. MPI might be good if the tasks need to communicate. Hybrid OpenMP/MPI might be good if we want to use shared memory but at the same time go beyond one node. (A minimal array-job sketch follows this list.)
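On the hyperthreading question: one way to check is to run `lscpu` on a compute node and look at the threads-per-core figure; 1 means no hyperthreading, 2 means it is enabled. The partition name and time limit below are placeholders for whatever your cluster uses:

```bash
# Run lscpu on a compute node through the scheduler and inspect the
# threads-per-core and cores-per-socket lines.
srun --partition=batch --time=00:02:00 lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket'
```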
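For the #cpus vs. #threads question: in a hybrid OpenMP+MPI job the MPI ranks and the threads per rank are requested with separate options, so the total CPU count is ranks × threads. A sketch of the relevant Slurm directives (the program name and the sizes are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=2                # spread the job over two nodes
#SBATCH --ntasks-per-node=2      # 2 MPI ranks per node
#SBATCH --cpus-per-task=4        # 4 OpenMP threads per MPI rank

# Tell OpenMP how many threads each rank may use.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Launch the MPI program; total CPUs = 2 nodes x 2 ranks x 4 threads = 16.
srun ./hybrid_program
```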
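For --mem vs --mem-per-cpu, the difference shows up directly in the batch script. Both requests below ask for the same total on 4 CPUs; the numbers and the program name are only illustrative:

```bash
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G             # per-node: 8 GB for the whole allocation on this node

# The equivalent per-core request would be (use one or the other, not both):
##SBATCH --mem-per-cpu=2G    # per-core: 2 GB x 4 CPUs = 8 GB total

srun ./my_program            # placeholder program
```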
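For the array-jobs question, a minimal sketch of an array job, assuming an existing serial program that takes the task index as an argument (the script name, range, and resources are placeholders):

```bash
#!/bin/bash
#SBATCH --array=0-9          # ten independent tasks, indices 0..9
#SBATCH --time=00:30:00
#SBATCH --mem=2G

# Each array task gets its own value of SLURM_ARRAY_TASK_ID, so the
# same submit script runs ten independent copies of the program.
srun python analyze.py --index "${SLURM_ARRAY_TASK_ID}"
```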
References
- https://github.com/AaltoSciComp/hpc-examples
- MPI example can be found here: https://aaltoscicomp.github.io/python-for-scicomp/parallel/#mpi