...

The same principle applies to GPU resources: the GPU-hour is the billing unit, with proportional amounts of memory and CPUs per GPU defined (consult the table above).

Note that Helios uses a job-exclusive policy: the minimum resource allocation for a job is one full node, and the job will be billed for at least a whole node (192 CPUs on the CPU partition, 4 GPUs on the GPU partition)!
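As an illustration of this policy, a one-hour job that requests only a single GPU is still billed for all 4 GPUs of the node (a sketch of the arithmetic; the job length is hypothetical):

```shell
# hypothetical one-hour job that requests only 1 of the node's 4 GPUs
walltime_hours=1
gpus_per_node=4
# under the job-exclusive policy the whole node is billed
billed_gpu_hours=$((walltime_hours * gpus_per_node))
echo "$billed_gpu_hours"
```

That is, 4 GPU-hours are charged even though only one GPU was requested.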

The cost can be expressed as a simple algorithm:

...

Module names on Helios are case-sensitive.
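For example, the ML bundle module described later on this page must be loaded with its exact spelling; a lowercase variant will not be found (a sketch, to be run on a Helios login node):

```shell
module load ML-bundle/24.06a   # correct case: the module loads
module load ml-bundle/24.06a   # wrong case: the module is not found
```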

Sample job scripts

Please note that using the bash option -l is crucial for running jobs on Helios, especially on the plgrid-gpu-gh200 partition. Please use the following shebang:

#!/bin/bash -l

in your scripts. Example job scripts (without the -l option, for Ares compatibility) are available on this page: Sample scripts


Python, external libraries, and machine learning on Helios

As noted in the previous sections, nodes with GH-200 GPU superchips have CPUs with the arm64 architecture, and thus require modules built specifically for this architecture.

When working with Python, anaconda should NOT be used for virtual environment management. This is because conda environments ship with separate Python interpreter installations, which may experience compatibility issues with the ARM architecture.
To create virtual environments, please use Python's standard venv module.
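A minimal venv workflow might look like this (my_venv is a placeholder name; on GH-200 nodes, remember to load the ML bundle module first, as described in the next paragraph):

```shell
# create a virtual environment using the standard venv module
python3 -m venv my_venv
# activate it; pip will now install into the environment
. my_venv/bin/activate
# confirm the interpreter now comes from the environment
python -c 'import sys; print(sys.prefix)'
deactivate
```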

For deep learning applications, we provide a special module with software often used by AI libraries, called `ML-bundle/24.06a`. Make sure to always load this module before installing/building any packages or running Python programs that rely on GPUs on GH-200 nodes.
IMPORTANT: Remember that this module should always be loaded as the first step of a given job, before activating any virtual environments.

We also provide a custom pip repository with popular machine learning packages pre-built, in several versions, with GPU support for the ARM architecture.
The packages from this repo can be installed directly via pip, simply by specifying the correct name and version tag of a package. To see the available libraries with their tags, list the contents of the repo:
ls /net/software/aarch64/el8/wheels/ML-bundle/24.06a/simple/

An example script that creates a virtual environment, installs packages from the Helios custom wheel repository and a requirements.txt file, and then executes a Python program:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00
#SBATCH --account=<your-grant-account>
#SBATCH --partition=plgrid-gpu-gh200
# NOTE: the out_files/ directory must exist before the job is submitted
#SBATCH --output=out_files/out.out
#SBATCH --error=out_files/err.err
#SBATCH --gres=gpu:1

# IMPORTANT: load the modules for machine learning tasks and libraries
ml ML-bundle/24.06a

cd $SCRATCH

# create and activate the virtual environment
python -m venv my_venv_name/
source my_venv_name/bin/activate

# install one of torch versions available at Helios wheel repo
pip install --no-cache torch==2.3.1+cu124.post2

# install the rest of requirements, for example via requirements file
pip install --no-cache -r requirements.txt

# run the program
python my_script_name.py 
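Assuming the script above is saved as, e.g., my_ml_job.sh (a hypothetical filename), it can be submitted and monitored with the standard Slurm commands:

```shell
sbatch my_ml_job.sh      # submit the batch script
squeue -u $USER          # check the state of your jobs in the queue
```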


Python multiprocessing - potential problems

In some libraries that modify the behavior of multiprocessing in Python, such as the PyTorch library, problems can be observed when spawning new processes.


In the case of PyTorch, the number of threads in newly spawned processes may be determined from environment variables that do not always accurately reflect the resources allocated to a given job.
The solution to this problem is to set the OMP_NUM_THREADS environment variable manually.
For example, to limit the number of threads per spawned process to 1:

# limit number of threads to prevent excessive thread creation
export OMP_NUM_THREADS=1

It is always advisable to profile the script execution on first use, especially when using multiprocessing. This can be done with simple tools such as htop.
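Besides interactive tools like htop, a quick non-interactive check of how many threads a process has actually spawned can be done with ps (a portable sketch; $$ here stands for the PID of interest):

```shell
# print the number of threads (nlwp) of the current shell process
ps -o nlwp= -p $$
```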

More information

Helios follows Prometheus' configuration and usage patterns. The Prometheus documentation can be found here: https://kdm.cyfronet.pl/portal/Prometheus:Basics