Configuration Examples
We include configuration files for known supercomputers. Hopefully these are useful to other users of those machines, and to new users who want to see examples for similar clusters.
Additional examples from other clusters are welcome here.
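These YAML snippets belong in Dask's configuration directory (typically ~/.config/dask/, e.g. in a file named jobqueue.yaml), where Dask loads them automatically. As a quick sanity check, the values actually in effect can be inspected from Python; the snippet below is a minimal sketch, and the "jobqueue.pbs" key is only an example.

# Minimal sketch: inspect the jobqueue settings Dask has picked up from
# ~/.config/dask/*.yaml. "jobqueue.pbs" is an example key; use the section
# matching your scheduler (pbs, slurm, ...).
import dask.config
import dask_jobqueue  # noqa: F401  (importing registers the jobqueue defaults)

print(dask.config.get("jobqueue.pbs"))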
Cheyenne
NCAR's Cheyenne supercomputer uses both PBS (for Cheyenne itself) and Slurm (for the attached DAV clusters Geyser/Caldera).
distributed:
  scheduler:
    bandwidth: 1000000000       # 1 GB/s estimated worker-worker bandwidth
  worker:
    memory:
      target: 0.90              # Avoid spilling to disk
      spill: False              # Avoid spilling to disk
      pause: 0.80               # fraction at which we pause worker threads
      terminate: 0.95           # fraction at which we terminate the worker
  comm:
    compression: null
jobqueue:
  pbs:
    name: dask-worker
    cores: 36                   # Total number of cores per job
    memory: '109 GB'            # Total amount of memory per job
    processes: 9                # Number of Python processes per job
    interface: ib0              # Network interface to use like eth0 or ib0
    queue: regular
    walltime: '00:30:00'
    resource-spec: select=1:ncpus=36:mem=109GB

  slurm:
    name: dask-worker
    # Dask worker options
    cores: 1                    # Total number of cores per job
    memory: '25 GB'             # Total amount of memory per job
    processes: 1                # Number of Python processes per job
    interface: ib0
    account: PXYZ123
    walltime: '00:30:00'
    job-extra: ['-C geyser']
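With these values in place, a cluster on Cheyenne can be started from Python in the usual dask-jobqueue way. The snippet below is a minimal sketch rather than an NCAR-endorsed recipe; the number of jobs is illustrative.

# Minimal sketch: launch PBS-managed workers using the jobqueue.pbs settings
# above. The scale value is illustrative, not a site recommendation.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster()        # picks up the jobqueue.pbs configuration
cluster.scale(jobs=2)         # two PBS jobs, 9 worker processes each
client = Client(cluster)      # connect a client to the scheduler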
NERSC Cori
It should be noted that the following configuration file assumes you are running the scheduler on a worker node. Currently the login nodes appear unable to talk to the worker nodes bidirectionally. As such, you need to request an interactive node with the following:
$ salloc -N 1 -C haswell --qos=interactive -t 04:00:00
Then you will run dask-jobqueue directly on that interactive node. Note the distributed section, which is set up to avoid having Dask write to disk. This was due to some odd behavior of the local filesystem.
Alternatively, you may use the NERSC jupyterhub, which will launch a notebook server on a reserved large-memory node of Cori. In this case, no special interactive session is needed and dask-jobqueue will perform as expected. You can also access the Dask dashboard directly. See the example notebook.
distributed:
  worker:
    memory:
      target: False             # Avoid spilling to disk
      spill: False              # Avoid spilling to disk
      pause: 0.80               # fraction at which we pause worker threads
      terminate: 0.95           # fraction at which we terminate the worker

jobqueue:
  slurm:
    cores: 64
    memory: 115GB
    processes: 4
    queue: debug
    walltime: '00:10:00'
    job-extra: ['-C haswell', '-L project, SCRATCH, cscratch1']
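From the interactive node (or from a NERSC jupyterhub notebook), the workflow is then the standard dask-jobqueue pattern. The snippet below is a minimal sketch; the number of jobs is illustrative.

# Minimal sketch: start SLURM-managed workers with the jobqueue.slurm settings
# above and print the dashboard address. The scale value is illustrative.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster()      # picks up the jobqueue.slurm configuration
cluster.scale(jobs=4)         # four Slurm jobs, 4 worker processes each
client = Client(cluster)
print(client.dashboard_link)  # address of the Dask dashboard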
ARM Stratus
The U.S. Department of Energy Atmospheric Radiation Measurement (DOE-ARM) Stratus supercomputer.
jobqueue:
  pbs:
    name: dask-worker
    cores: 36
    memory: 270GB
    processes: 6
    interface: ib0
    local-directory: $localscratch
    queue: high_mem             # Can also select batch or gpu_ssd
    account: arm
    walltime: '00:30:00'        # Adjust this to job size
    job-extra: ['-W group_list=cades-arm']
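This configuration can also be combined with adaptive scaling, so that PBS jobs are submitted and released as the workload changes. The snippet below is a minimal sketch; the worker bounds are illustrative values, not ARM recommendations.

# Minimal sketch: adaptive scaling on top of the jobqueue.pbs settings above.
# The worker bounds are illustrative, not ARM Stratus recommendations.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster()
cluster.adapt(minimum=0, maximum=12)  # scale between 0 and 12 workers on demand
client = Client(cluster)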
SDSC Comet
The Comet cluster at the San Diego Supercomputer Center (SDSC), available to US scientists via XSEDE. Note also that port 8787 is open on both login and compute nodes, so you can directly access the Dask dashboard.
jobqueue:
  slurm:
    name: dask-worker
    # Dask worker options
    cores: 24                   # Total number of cores per job
    memory: 120GB               # Total amount of memory per job (total 128GB per node)
    processes: 1                # Number of Python processes per job
    interface: ib0              # Network interface to use like eth0 or ib0
    death-timeout: 60           # Number of seconds to wait if a worker can not find a scheduler
    local-directory: /scratch/$USER/$SLURM_JOB_ID       # local SSD
    # SLURM resource manager options
    queue: compute
    # account: xxxxxxx          # choose account other than default
    walltime: '00:30:00'
    job-mem: 120GB              # Max memory that can be requested to SLURM
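Before scaling up, the generated batch script can be inspected, and the dashboard (reachable on port 8787, as noted above) can be opened from the address the cluster reports. The snippet below is a minimal sketch; the number of jobs is illustrative.

# Minimal sketch: inspect the generated sbatch script and the dashboard address
# for the jobqueue.slurm settings above. The scale value is illustrative.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster()
print(cluster.job_script())    # show the sbatch script that will be submitted
cluster.scale(jobs=2)
client = Client(cluster)
print(cluster.dashboard_link)  # dashboard, served on port 8787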
Ifremer DATARMOR
More details about the Ifremer DATARMOR cluster can be found here (in French) or here (English through Google Translate).
More details about this dask-jobqueue configuration can be found here.
jobqueue:
  pbs:
    name: dask-worker
    # Dask worker options
    # The number of processes and cores have to be equal to avoid using multiple
    # threads in a single dask worker. Using threads can generate netcdf file
    # access errors.
    cores: 28
    processes: 28
    # This uses all the memory of a single node and corresponds to about
    # 4 GB per dask worker. If you need more memory than this, you have to
    # decrease cores and processes above.
    memory: 120GB
    interface: ib0
    # This should be a local disk attached to your worker node and not a
    # network-mounted disk. See
    # https://jobqueue.dask.org/en/latest/configuration-setup.html#local-storage
    # for more details.
    local-directory: $TMPDIR
    # PBS resource manager options
    queue: mpi_1
    account: myAccount
    walltime: '48:00:00'
    resource-spec: select=1:ncpus=28:mem=120GB
    # disable email
    job-extra: ['-m n']
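Because cores and processes are equal, every worker runs a single thread, which is what avoids the netCDF access errors mentioned in the comments. The snippet below is a minimal sketch to verify this; the single job and the worker count of 28 simply mirror the configuration above.

# Minimal sketch: start one PBS job with the jobqueue.pbs settings above and
# confirm that each of the 28 workers runs a single thread.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster()
cluster.scale(jobs=1)          # one PBS job -> 28 single-threaded workers
client = Client(cluster)
client.wait_for_workers(28)    # block until all workers have connected
print(client.nthreads())       # maps each worker address to its thread count (1)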