Overview

The batch runner classes are designed to make it simple to write Python scripts that take advantage of multi-node jobs on an HPC system.

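For example, if we write a Python script for a Slurm-based system and invoke it with the command srun -n 6 python myscript.py, Slurm will launch the script 6 times in parallel on 6 different nodes/cores of the HPC. The Dask Runner class then uses the Slurm process ID environment variable to decide which role each process should play, and uses the shared filesystem to bootstrap communication through a scheduler file.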

# myscript.py
from dask.distributed import Client
from dask_jobqueue.slurm import SLURMRunner

# When entering the SLURMRunner context manager, processes will decide if they should be
# the client, scheduler or a worker.
# Only process ID 1 executes the contents of the context manager.
# All other processes start the Dask components and then block here forever.
with SLURMRunner(scheduler_file="/path/to/shared/filesystem/scheduler-{job_id}.json") as runner:

    # The runner object contains the scheduler address info and can be used to construct a client.
    with Client(runner) as client:

        # Wait for all the workers to be ready before continuing.
        client.wait_for_workers(runner.n_workers)

        # Then we can submit some work to the Dask scheduler.
        assert client.submit(lambda x: x + 1, 10).result() == 11
        assert client.submit(lambda x: x + 1, 20, workers=2).result() == 21

# When process ID 1 exits the SLURMRunner context manager it sends a graceful shutdown to the Dask processes.
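
The role-selection step described above can be pictured with a small sketch. This is not the SLURMRunner implementation, only an illustration under stated assumptions: the helper name choose_role is hypothetical, and the choice of process ID 0 for the scheduler is an assumption (the text above only states that process ID 1 runs the client code). SLURM_PROCID is the per-task variable that srun sets for each launched process.

```python
import os


def choose_role(proc_id: int) -> str:
    """Hypothetical helper: map a Slurm process ID to a Dask role."""
    if proc_id == 0:
        # Assumption: the first process hosts the scheduler.
        return "scheduler"
    if proc_id == 1:
        # Per the example above, process ID 1 runs the client code.
        return "client"
    # Every remaining process becomes a worker.
    return "worker"


# srun sets SLURM_PROCID for each task; fall back to 0 when run outside Slurm.
proc_id = int(os.environ.get("SLURM_PROCID", "0"))
print(f"process {proc_id} would act as the {choose_role(proc_id)}")

# Roles for the six processes started by `srun -n 6`, under the assumptions above.
roles = [choose_role(i) for i in range(6)]
```

Under this assumed layout, the six processes from srun -n 6 would split into one scheduler, one client, and four workers, which is why the client code waits for the workers before submitting tasks.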