Preliminary access essentials
Disclaimer
Helios is still under development and, despite our best efforts, may experience unscheduled outages or even data loss.
Support
Please contact the PLGrid Helpdesk (https://helpdesk.plgrid.pl/) regarding any difficulties in using the cluster.
For important information and announcements, please follow this page and the message of the day displayed at login.
Access to Helios
We strongly suggest using SSH keys to access the machine! SSH key management can be done through the PLGrid portal. Password access will be disabled in the near future.
Computing resources on Helios are assigned based on PLGrid computing grants. To perform computations on Helios, you must obtain a computing grant through the PLGrid Portal (https://portal.plgrid.pl/) and apply for Helios access.
If your grant is active and you have applied for access to the service, the request should be accepted within about half an hour. Please report any issues through the Helpdesk.
Machine description
Available login nodes:
- ssh <login>@login01.helios.cyfronet.pl
Note that Helios uses PLGrid accounts and grants. Make sure to request the "Helios access" service in the PLGrid portal.
Helios uses a node job-exclusive policy. This means that each node is allocated to a single, dedicated job that uses its resources. This also affects accounting: the minimum amount of resources billed for a job equals one node.
Helios is a hybrid cluster. CPU nodes use x86_64 CPUs, while the GPU partition is based on GH200 superchips, which combine an NVIDIA Grace ARM CPU with an NVIDIA Hopper GPU. HPE Slingshot is used as the interconnect. The login01 node uses an x86_64 CPU and RHEL 8. Please keep this in mind when compiling software: knowing the target CPU architecture and operating system is important for selecting the proper modules and software. Each architecture has its own set of modules; to see the complete list, you need to run module avail on a node of the chosen type (see the sketch after the table below). Node specification can be found below:
Partition | Number of nodes | Operating system | CPU | RAM | Proportional RAM for one CPU | Proportional RAM for one GPU | Proportional CPU for one GPU | Accelerator |
---|---|---|---|---|---|---|---|---|
plgrid (includes plgrid-long) | 272 | RHEL 8 | 192 cores, x86_64, 2x AMD EPYC 9654 96-Core Processor @ 2.4 GHz | 384GB | 2000MB | n/a | n/a | |
plgrid-bigmem | 120 | RHEL 8 | 192 cores, x86_64, 2x AMD EPYC 9654 96-Core Processor @ 2.4 GHz | 768GB | 4000MB | n/a | n/a | |
plgrid-gpu-gh200 | 110 | CrayOS (SLES 15sp5) | 288 cores, aarch64, 4x NVIDIA Grace CPU 72-Core @ 3.1 GHz | 480GB | n/a | 120GB | 72 | 4x NVIDIA GH200 96GB |
Note that Helios will soon be upgraded to RHEL 9. This change will be applied to all CPU and GPU nodes.
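Since the login node is x86_64 while the GPU nodes are aarch64, it can help to verify the architecture and module list directly on a node of the target partition before compiling or installing anything. A minimal sketch, assuming interactive jobs are permitted; the grant name is a placeholder and the exact resource options (e.g. GPU requests) may need adjusting:
```bash
# On the login node: x86_64 environment
uname -m      # prints: x86_64

# Start an interactive shell on a GPU node (placeholder account name)
srun -p plgrid-gpu-gh200 -A <grantname>-gpu-gh200 -N 1 -t 0:30:00 --pty /bin/bash -l

# On the GPU node: ARM environment with its own module set
uname -m      # prints: aarch64
module avail  # lists modules built for aarch64
```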
Job submission
Helios uses the Slurm resource manager. Jobs should be submitted to the following partitions:
Name | Timelimit | Resource type (account suffix) | Access requirements | Description |
---|---|---|---|---|
plgrid | 72h | -cpu | Generally available. | Standard partition. |
plgrid-long | 168h | -cpu | Requires a grant with a maximum job runtime of 168h. | Used for jobs with extended runtime. |
plgrid-gpu-gh200 | 48h | -gpu-gh200 | Requires a grant with GPGPU resources. | GPU partition. |
If you are unsure how to properly configure your job on Helios, please consult this guide: Job configuration
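As an illustration, a minimal MPI-style CPU batch script might look as follows (a sketch only; the grant name, module, and program are placeholders):
```bash
#!/bin/bash -l
#SBATCH --job-name=cpu-example
#SBATCH --partition=plgrid            # standard CPU partition, 72h limit
#SBATCH --account=<grantname>-cpu     # note the -cpu suffix (see the next section)
#SBATCH --nodes=1                     # nodes are allocated exclusively
#SBATCH --ntasks-per-node=192         # all cores of one CPU node
#SBATCH --time=01:00:00

module purge
module add openmpi/4.1.1-gcc-11.2.0   # example module mentioned on this page

srun ./my_program                     # placeholder executable
```
The script is submitted with sbatch, e.g. sbatch job.sh.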
Accounts and computing grants
Helios uses a new naming scheme for CPU and GPU computing accounts, which are supplied via the -A parameter of the sbatch command. Currently, accounts are named in the following manner:
Resource | Account name |
---|---|
CPU | grantname-cpu |
GPU | grantname-gpu-gh200 |
Please mind that `sbatch -A grantname` won't work on its own. You need to add the -cpu, -cpu-bigmem, or -gpu-gh200 suffix! Available computing grants, with their respective account names (allocations), can be viewed using the `hpc-grants` command.
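For example, submitting the same script to CPU and GPU allocations might look like this (mygrant and job.sh are placeholders):
```bash
hpc-grants                                                 # list your grants and allocation (account) names
sbatch -A mygrant-cpu -p plgrid job.sh                     # run on the CPU allocation
sbatch -A mygrant-gpu-gh200 -p plgrid-gpu-gh200 job.sh     # run on the GPU allocation
```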
Resources allocated on Helios are not normalized, unlike on Prometheus and previous clusters. 1 hour of CPU time equals 1 hour spent on a computing core with a proportional amount of memory (consult the table above). The billing system accounts for jobs that use more memory than the proportional amount: if a job uses more memory per allocated CPU than the proportional amount, it is billed as if it had used more CPUs. The billed amount can be calculated by dividing the used memory by the proportional memory per core and rounding the result up to the nearest integer. Jobs on CPU partitions are always billed in CPU-hours.
The same principle applies to GPU resources, where the GPU-hour is the billing unit and a proportional memory per GPU and proportional CPUs per GPU are defined (consult the table above).
Note that Helios uses a job-exclusive policy, and currently the minimum resources assigned to a job equal one node. A job will therefore be billed for at least a whole node (192 CPUs on the CPU partitions and 4 GPUs on the GPU partition)!
The cost can be expressed as a simple algorithm:
cost_cpu = job_cpus_used * job_duration
cost_memory = ceil(job_memory_used / memory_per_cpu) * job_duration
final_cost = max(cost_cpu, cost_memory)
and for GPU partitions, where the proportional memory per GPU and proportional CPUs per GPU are defined (consult the table above):
cost_gpu = job_gpus_used * job_duration
cost_cpu = ceil(job_cpus_used / cpus_per_gpu) * job_duration
cost_memory = ceil(job_memory_used / memory_per_gpu) * job_duration
final_cost = max(cost_gpu, cost_cpu, cost_memory)
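A worked example of the CPU formula, as a minimal sketch (job sizes are illustrative; on Helios the whole-node minimum of 192 CPUs still applies):
```bash
#!/bin/bash
# Illustrative calculation of the CPU billing formula described above.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }   # ceil(a / b) for positive integers

job_cpus=48          # CPUs actually used (illustrative)
job_mem_mb=192000    # 192 GB of memory used (illustrative)
mem_per_cpu_mb=2000  # proportional memory per CPU on the plgrid partition
duration_h=10        # job duration in hours

cost_cpu=$(( job_cpus * duration_h ))                                    # 48 * 10 = 480
cost_memory=$(( $(ceil_div $job_mem_mb $mem_per_cpu_mb) * duration_h ))  # 96 * 10 = 960
final_cost=$(( cost_cpu > cost_memory ? cost_cpu : cost_memory ))
echo "billed CPU-hours: $final_cost"   # 960; whole-node billing raises this to at least 192 * 10 = 1920
```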
Storage
Available storage spaces are described in the following table:
Environment variable | Location in the filesystem | Purpose |
---|---|---|
$HOME | /net/home/plgrid/<login> | Storing your own applications and configuration files. Limited to 10GB. |
$SCRATCH | /net/scratch/hscra/plgrid/<login> | High-speed storage for short-lived data used in computations. Data older than 30 days can be deleted without notice. It is best to rely on the $SCRATCH environment variable. |
$PLG_GROUPS_STORAGE/<group name> | /net/storage/pr3/plgrid/<group name> | Long-term storage for data kept for the duration of the computing grant. Should be used for storing significant amounts of data. |
Current usage, capacity, and other storage attributes can be checked by issuing the `hpc-fs` command.
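A typical pattern, sketched below with placeholder paths and programs, is to stage input data in $SCRATCH for the duration of a job and move results to the group storage afterwards:
```bash
# Stage input data on the fast scratch filesystem (placeholder paths)
mkdir -p "$SCRATCH/myjob"
cp "$PLG_GROUPS_STORAGE/<group name>/input.dat" "$SCRATCH/myjob/"

# Run the computation in scratch (placeholder executable)
cd "$SCRATCH/myjob"
./my_program input.dat > results.out

# Copy results back to long-term group storage; scratch data older than 30 days may be deleted
cp results.out "$PLG_GROUPS_STORAGE/<group name>/"
```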
System Utilities
Please use the following commands to interact with the account and storage management system:
- `hpc-grants` - shows available grants, resource allocations, and consumed resources
- `hpc-fs` - shows available storage
- `hpc-jobs` - shows currently pending/running jobs
- `hpc-jobs-history` - shows information about past jobs
Software
Applications and libraries are available through the modules system. Modules for ARM and x86 CPUs are not interchangeable, and selecting the right module for the target architecture is critical for getting software to work! Please load the proper modules on the node, inside the job script! The list of available modules can be obtained by issuing the command:
module avail
This command should be run on a compute node to get the full list of modules available for the given architecture (node type)! The list is searchable using the '/' key. A specific module can be loaded with the add command:
module add openmpi/4.1.1-gcc-11.2.0
and the environment can be purged by:
module purge
Module names on Helios are case-sensitive.
Sample job scripts
Please note that using the bash option `-l` is crucial for running jobs on Helios, especially on the `plgrid-gpu-gh200` partition. Please use the following shebang in your scripts:
#!/bin/bash -l
Example job scripts (without the `-l` option, for Ares compatibility) are available on this page: Sample scripts
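For the GPU partition, a minimal job script might look as follows (a sketch only; the grant name, GPU request options, module, and program are placeholders to adapt):
```bash
#!/bin/bash -l
#SBATCH --job-name=gpu-example
#SBATCH --partition=plgrid-gpu-gh200      # GH200 (aarch64) nodes, 48h limit
#SBATCH --account=<grantname>-gpu-gh200   # placeholder grant name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4                      # whole node: 4x GH200 (adjust to local policy)
#SBATCH --time=01:00:00

module purge
module add ML-bundle/24.06a               # example aarch64 module (see the Python section below)

srun python my_training_script.py         # placeholder program
```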
Python, external libraries, and machine learning on Helios
As noted in the previous sections, nodes with GH200 GPU superchips have CPUs with the arm64 (aarch64) architecture, and thus require modules built specifically for this architecture.
When working with Python, anaconda should NOT be used for virtual environment management. This is because `conda` environments ship with their own separate Python interpreter installations, which may experience compatibility issues with the ARM architecture.
To create virtual environments, please use Python's standard `venv` module.
For deep learning applications, we provide a special module with software often used by AI libraries, called `ML-bundle/24.06a`. Make sure to always load this module before installing/building any packages or running Python programs relying on GPUs on GH200 nodes.
IMPORTANT: Remember that this module should always be loaded as the first step in a given job, before activating any virtual environments.
We also provide a custom pip repository with popular machine learning packages pre-built in multiple versions with GPU support for the ARM architecture.
The packages from this repo can be installed directly via pip, simply by specifying the correct name and version tag of a package. To see the available libraries with their tags, check the contents of the repo via the command:
ls /net/software/aarch64/el8/wheels/ML-bundle/24.06a/simple/
Example script for creating a venv, installing packages, and running a job:
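The following is a minimal sketch of the steps above, meant to be placed inside a job script running on a GH200 node (such as the GPU script in the Sample job scripts section); the package name, venv path, and training script are illustrative:
```bash
# 1. Load the ML software stack FIRST, before touching any virtual environment
module purge
module add ML-bundle/24.06a

# 2. Create and activate a virtual environment using Python's standard venv module
python -m venv "$SCRATCH/my-venv"          # placeholder location
source "$SCRATCH/my-venv/bin/activate"

# 3. Install packages; pick names/version tags listed in the custom repo, e.g.:
#    ls /net/software/aarch64/el8/wheels/ML-bundle/24.06a/simple/
pip install torch                          # package name illustrative; add ==<version> as needed

# 4. Run the GPU program (placeholder script)
srun python my_training_script.py
```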
Python multiprocessing - a word of caution
More information
Helios follows Prometheus' configuration and usage patterns. Prometheus documentation can be found here: https://kdm.cyfronet.pl/portal/Prometheus:Basics