Submitting Simulations to a Computing Cluster#
This tutorial will teach you:
How to submit a job to a computing cluster running SLURM.
How to automate restarting simulations for long runs.
All examples here were tested on the levante computing cluster at the German Climate Computing Center (DKRZ).
Prerequisites#
To follow this tutorial, create a new empty directory and navigate into it:
mkdir cluster_tutorial
cd cluster_tutorial
Setting Up the Simulation Script#
In this directory, create a Python script named main.py with the following content:
"""An example script to run a shallow water model setup on a computing cluster"""
import numpy as np
import fridom.shallowwater as sw
# Set the log level to INFO for more information about the simulation
sw.log.setLevel("INFO")
# Create the grid and model settings
grid = sw.grid.cartesian.Grid(N=(2048,2048), L=(1,1), periodic_bounds=(True, True))
mset = sw.ModelSettings(grid=grid, f0=1, csqr=1)
mset.time_stepper.dt = 0.7e-4
# Create the netCDF writer and add it to the diagnostics
writer = sw.modules.NetCDFWriter(
    write_trigger=sw.ClockTrigger(time_interval=0.5),
    filename="output.cdf")
mset.diagnostics.add_module(writer)
# Add a restart module for automatic restarts
restart = sw.modules.RestartModule(
    realtime_interval=np.timedelta64(1, "m"),
)
mset.restart_module = restart
# Setup the model settings
mset.setup()
# Create the initial condition
z = sw.initial_conditions.Jet(mset, width=0.1, wavenum=2, waveamp=0.05)
# Create the model and run it
model = sw.Model(mset)
model.z = z # set the initial condition
model.run(runlen=2.5)
Understanding the Setup#
Before submitting the job, let’s review the setup and compare it to the example in the quickstart tutorial.
The resolution is increased to ensure the model runs longer than just a few seconds.
The log level is set to INFO for more detailed progress updates. (The default log level is SILENT; see the logging tutorial for details.)
A NetCDFWriter is added to the diagnostics to save simulation output.
A RestartModule is included to handle automatic restarts, which will be discussed later in this tutorial.
Creating a Job Submission Script#
To submit the job to the cluster, create a shell script named run_fridom.sh with the following content:
#!/bin/bash
#SBATCH --job-name=sw_jet
#SBATCH --output=log/%x_%j.out
#SBATCH --error=log/%x_%j.err
#SBATCH --partition=<partition>
#SBATCH --account=<account>
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
source activate fridom # assuming you have a conda environment named fridom
srun -l python3 main.py
Replace <partition> and <account> with the appropriate values for your cluster. For the levante cluster at DKRZ, use gpu for the partition.
Adjust the --time parameter to reflect the expected runtime of your simulation.
This example runs the model on a single node with a single GPU. For information on running the model on multiple devices, see the parallelization tutorial.
Submitting the Job#
Make the script executable (only needed once):
chmod +x run_fridom.sh
Submit the job to the cluster:
sbatch run_fridom.sh
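If you would rather submit from a Python driver script, the following is a minimal sketch. It assumes that sbatch is on the PATH and prints its usual "Submitted batch job <id>" confirmation line. The helper names parse_job_id and submit are hypothetical and not part of fridom.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def submit(script: str = "run_fridom.sh") -> int:
    """Submit the script with sbatch and return the job ID (requires SLURM)."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

The returned job ID is the same number that SLURM substitutes for %j in the log file names, so it can be used to locate the matching files in the log directory.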
Reviewing the Output#
After the jobs finish, inspect the directory structure:
ls -l
drwxr-sr-x 2 u301533 uo0780 4096 Jan 15 18:57 log
-rw-r--r-- 1 u301533 uo0780 1063 Jan 15 18:47 main.py
drwxr-sr-x 2 u301533 uo0780 4096 Jan 15 18:57 restart
-rwxr-xr-x 1 u301533 uo0780 276 Jan 15 18:54 run_fridom.sh
drwxr-sr-x 2 u301533 uo0780 4096 Jan 15 18:58 snapshots
Three directories are created:
`log`: Contains the job output.
`snapshots`: Contains netCDF files written by the netCDF writer.
`restart`: Contains restart files written by the restart module.
We will now discuss the contents of these directories in more detail.
Restart Files#
In the main script, we added a restart module with a parameter realtime_interval set to 1 minute. This means that after one minute of runtime, the model will save its state to a restart file, stop the current execution, and submit a new job to the cluster by running sbatch run_fridom.sh again. This way, the same main script is executed in a new job.
You may wonder how the model knows to restart from the restart file. This is handled by the restart module. When the method model.run(…) is called, the restart module checks the restart directory for available restart files. If it finds any, it loads the latest restart file and continues the simulation from the saved state. In this example, the model restarted twice: once after iteration 12668 and once after iteration 24556.
If you submit the job again with sbatch run_fridom.sh, the model will continue from the last restart file. Make sure to delete the restart files if you wish to start a new simulation from scratch.
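Clearing the restart directory before a fresh run can be scripted. The sketch below assumes the restart files live in the restart directory shown in the listing above. It removes everything inside that directory without making any assumption about the file format. The function name clear_restart_files is hypothetical and not part of fridom.

```python
import shutil
from pathlib import Path


def clear_restart_files(restart_dir: str = "restart") -> int:
    """Delete every entry inside restart_dir and return how many were removed."""
    path = Path(restart_dir)
    if not path.is_dir():
        # Nothing to clean up: the model has not written any restart files yet
        return 0
    removed = 0
    for entry in path.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

Calling clear_restart_files() from the tutorial directory before sbatch run_fridom.sh should then make the next job start from the initial condition.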
Note
The RestartModule is optional and should be used for long simulations that might be interrupted by the cluster scheduler.
Log Files#
Let’s examine the log directory:
ls -l log/
-rw-r--r-- 1 u301533 uo0780 0 Jan 15 18:54 sw_jet_14792351.err
-rw-r--r-- 1 u301533 uo0780 1460931 Jan 15 18:56 sw_jet_14792351.out
-rw-r--r-- 1 u301533 uo0780 0 Jan 15 18:56 sw_jet_14792358.err
-rw-r--r-- 1 u301533 uo0780 1442561 Jan 15 18:57 sw_jet_14792358.out
-rw-r--r-- 1 u301533 uo0780 0 Jan 15 18:57 sw_jet_14792373.err
-rw-r--r-- 1 u301533 uo0780 1387574 Jan 15 18:58 sw_jet_14792373.out
The .out files contain the job output, while the .err files capture error messages and should be empty if everything went smoothly.
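With many restarts, checking every .err file by hand becomes tedious. The following sketch lists the error logs that are not empty. It assumes the log directory and the .err suffix from the job script above. The function name nonempty_error_logs is hypothetical.

```python
from pathlib import Path


def nonempty_error_logs(log_dir: str = "log") -> list[str]:
    """Return the names of all .err files in log_dir that contain output."""
    return sorted(
        p.name for p in Path(log_dir).glob("*.err") if p.stat().st_size > 0
    )
```

An empty list means that none of the jobs wrote anything to its error log.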
Snapshots#
In the main.py script, we added a NetCDFWriter to the diagnostics. This writer saves snapshots of the model state to a netCDF file every 0.5 time units. These files are stored in the snapshots directory:
ls -l snapshots/
-rw-r--r-- 1 u301533 uo0780 201385500 Jan 15 18:57 output_01s_0.20ms.cdf
-rw-r--r-- 1 u301533 uo0780 201385500 Jan 15 18:58 output_02s_0.40ms.cdf
-rw-r--r-- 1 u301533 uo0780 201385500 Jan 15 18:56 output_0s.cdf
For each restart, a new snapshot file is created. This behavior can be inconvenient, for example if you want a single netCDF file per simulation month: if a restart occurs mid-month, you end up with two files for the same month. Ideally, the new data would be appended to the existing file, but the netCDF writer does not yet support this.
As a workaround, you can configure the restart module to trigger restarts based on simulation time rather than real-time runtime.
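For post-processing, the per-restart files can also be collected in the order in which they were written. The sketch below sorts them by modification time. It assumes the snapshots directory and the output*.cdf naming seen in the listing above, and it does not try to interpret the time stamps in the file names. The function name snapshot_files is hypothetical.

```python
from pathlib import Path


def snapshot_files(
    snapshot_dir: str = "snapshots", pattern: str = "output*.cdf"
) -> list[Path]:
    """Return the snapshot files ordered from oldest to newest modification time."""
    return sorted(
        Path(snapshot_dir).glob(pattern), key=lambda p: p.stat().st_mtime
    )
```

The resulting list could be handed to a tool that opens several netCDF files at once. Whether the files concatenate cleanly depends on their internal layout, which this tutorial does not specify.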
More Advanced Restart Strategies#
Trigger Restart by Simulation Time#
In the previous example, the realtime_interval parameter of the restart module was used to trigger restarts based on the model’s runtime. Alternatively, restarts can be triggered based on simulation time by using the clock_trigger parameter and assigning it a ClockTrigger object. For detailed information on the ClockTrigger class, refer to the output writer tutorial. Below is an example of setting a restart to occur every model year:
restart = sw.modules.RestartModule(
    clock_trigger=sw.ClockTrigger(time_interval=np.timedelta64(1, "Y")),
)
Ensure that the time required to simulate one year is less than the time limit of the job to avoid interruptions.
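A rough estimate helps to check this before submitting. The sketch below divides the number of time steps in one restart segment by a measured throughput. The function names and the safety factor of 0.8 are assumptions made for illustration. The throughput in the worked example is derived from the first job above, which reached iteration 12668 after roughly one minute. That minute includes startup time, so the figure is only approximate.

```python
import math


def estimate_wall_seconds(
    segment_length: float, dt: float, steps_per_second: float
) -> float:
    """Estimate the wall-clock seconds needed to simulate one restart segment.

    segment_length and dt must be given in the same units of model time.
    """
    n_steps = math.ceil(segment_length / dt)
    return n_steps / steps_per_second


def fits_time_limit(
    wall_seconds: float, limit_seconds: float, safety: float = 0.8
) -> bool:
    """Return True if the estimate uses at most `safety` of the job time limit."""
    return wall_seconds <= safety * limit_seconds


# Worked example with the numbers from this tutorial (throughput is approximate)
throughput = 12668 / 60  # iterations per second of wall-clock time
wall = estimate_wall_seconds(segment_length=2.5, dt=0.7e-4, steps_per_second=throughput)
print(f"estimated wall time: {wall:.0f} s")
print("fits into a 10 minute job:", fits_time_limit(wall, limit_seconds=600))
```

For the full run of 2.5 time units, the estimate comes out at a little under three minutes, which is consistent with the three one-minute jobs seen in the log directory.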
Custom Restart Command#
By default, the restart command is determined automatically using scontrol to retrieve the job submission command. However, this approach may fail in certain scenarios, such as when the submission script requires arguments. For example, consider the following script:
#!/bin/bash
# <insert appropriate sbatch options here>
source activate fridom
srun -l python3 "$1"
In this case, the restart command should be sbatch run_fridom.sh main.py. However, the restart module might submit only sbatch run_fridom.sh, causing the restart to fail because the script name is missing.
To address this issue, you can manually specify the restart_command parameter of the restart module as follows:
restart = sw.modules.RestartModule(
    realtime_interval=np.timedelta64(1, "m"),
    restart_command="sbatch run_fridom.sh main.py",
)
Alternatively, the restart_command parameter can be set to a custom function that handles the restart process. For example, using the subprocess module:
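import subprocess
import sys

def restart_job():
    # Submit the follow-up job; check=True raises if sbatch reports an error
    subprocess.run(["sbatch", "run_fridom.sh", "main.py"], check=True)
    # End the current job once the new one has been queued
    sys.exit(0)

restart = sw.modules.RestartModule(
    realtime_interval=np.timedelta64(1, "m"),
    restart_command=restart_job,
)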
This approach allows for full customization of the restart command, enabling the use of other tools, such as pyslurm, to submit jobs efficiently.