Configuration Examples

We provide configuration files for known supercomputers. Hopefully these help other users of those machines, as well as new users who want to see examples for similar clusters.

Additional examples from other clusters are welcome here.
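
As a quick orientation (not part of the original configurations): dask-jobqueue reads these YAML files from the standard Dask configuration locations. The sketch below assumes you saved one of the files from this page to ~/.config/dask/jobqueue.yaml and simply checks that the values are picked up.

# Sketch: verify that a saved jobqueue YAML file is visible to Dask.
# Assumes the YAML lives in ~/.config/dask/jobqueue.yaml; adjust the
# "pbs"/"slurm" keys to match the file you saved.
import dask
import dask.config

print(dask.config.get("jobqueue.pbs", default={}))
print(dask.config.get("jobqueue.slurm.walltime", default=None))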

Cheyenne

NCAR's Cheyenne supercomputer uses both PBS (for Cheyenne itself) and Slurm (for the attached DAV clusters Geyser/Caldera).

distributed:
  scheduler:
    bandwidth: 1000000000     # 1 GB/s estimated worker-worker bandwidth
  worker:
    memory:
      target: 0.90  # Avoid spilling to disk
      spill: False  # Avoid spilling to disk
      pause: 0.80  # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
  comm:
    compression: null

jobqueue:
  pbs:
    name: dask-worker
    cores: 36                   # Total number of cores per job
    memory: '109 GB'            # Total amount of memory per job
    processes: 9                # Number of Python processes per job
    interface: ib0              # Network interface to use like eth0 or ib0

    queue: regular
    walltime: '00:30:00'
    resource-spec: select=1:ncpus=36:mem=109GB

  slurm:
    name: dask-worker

    # Dask worker options
    cores: 1                    # Total number of cores per job
    memory: '25 GB'             # Total amount of memory per job
    processes: 1                # Number of Python processes per job

    interface: ib0

    account: PXYZ123
    walltime: '00:30:00'
    job-extra: ['-C geyser']
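
With the file above in place, launching workers reduces to instantiating the cluster classes, which pick up the jobqueue.pbs and jobqueue.slurm sections automatically. A minimal sketch; the number of jobs is an arbitrary example, not from the original configuration.

# Sketch: start Dask workers on Cheyenne (PBS) or the DAV clusters (Slurm),
# relying on the jobqueue.pbs / jobqueue.slurm sections shown above.
from dask.distributed import Client
from dask_jobqueue import PBSCluster, SLURMCluster

cluster = PBSCluster()          # regular queue, 36 cores, 109 GB per job
cluster.scale(jobs=2)           # submit two PBS jobs (example value)
client = Client(cluster)

# For Geyser/Caldera, use the Slurm section instead:
# cluster = SLURMCluster()      # 1 core, 25 GB per job, -C geyser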

NERSC Cori

The NERSC Cori supercomputer.

Note that the configuration file below assumes you are running the scheduler on a worker node. Currently the login nodes do not appear to be able to talk to the worker nodes bidirectionally, so you need to request an interactive node with the following:

$ salloc -N 1 -C haswell --qos=interactive -t 04:00:00

Then you will run dask jobqueue directly on that interactive node. Note the distributed section, which is set up to avoid having dask write to disk; this is due to some quirky behavior of the local filesystem.

Alternatively, you may want to use the NERSC jupyterhub, which will launch a notebook server on a reserved large-memory node of Cori. In this case no special interactive session is needed and dask jobqueue will perform as expected. You can also access the Dask dashboard directly. See the example notebook.

distributed:
  worker:
    memory:
      target: False  # Avoid spilling to disk
      spill: False  # Avoid spilling to disk
      pause: 0.80  # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker

jobqueue:
    slurm:
        cores: 64
        memory: 115GB
        processes: 4
        queue: debug
        walltime: '00:10:00'
        job-extra: ['-C haswell', '-L project, SCRATCH, cscratch1']
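
Inside the interactive salloc session described above, the workflow is the usual one. A sketch under the assumption that the YAML above is in place; the number of jobs is an arbitrary example.

# Sketch: run this from the interactive node obtained via salloc, so that
# the scheduler can reach the workers. Values come from jobqueue.slurm above.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster()        # debug queue, 64 cores, 115 GB per job
cluster.scale(jobs=4)           # example: four Slurm jobs (4 x 4 = 16 workers)
client = Client(cluster)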

ARM Stratus

The US Department of Energy Atmospheric Radiation Measurement (DOE-ARM) Stratus supercomputer.

jobqueue:
  pbs:
    name: dask-worker
    cores: 36
    memory: 270GB
    processes: 6
    interface: ib0
    local-directory: $localscratch
    queue: high_mem # Can also select batch or gpu_ssd
    account: arm
    walltime: '00:30:00'        # Adjust this to job size
    job-extra: ['-W group_list=cades-arm']
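
One way to confirm that the PBS-specific options above (the group_list directive, queue, and walltime) end up in the submitted job is to inspect the generated job script before submitting anything. A short sketch, assuming the YAML above is in place:

# Sketch: print the PBS job script generated from the jobqueue.pbs section.
from dask_jobqueue import PBSCluster

cluster = PBSCluster()
print(cluster.job_script())     # shows the #PBS directives, including
                                # -W group_list=cades-arm and the walltime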

SDSC Comet

The Comet cluster at the San Diego Supercomputer Center (SDSC), available to US-based researchers through XSEDE. Note also that port 8787 is open on both the login and compute nodes, so you can access Dask's dashboard directly.

jobqueue:
  slurm:
    name: dask-worker

    # Dask worker options
    cores: 24                   # Total number of cores per job
    memory: 120GB               # Total amount of memory per job (total 128GB per node)
    processes: 1                # Number of Python processes per job

    interface: ib0              # Network interface to use like eth0 or ib0
    death-timeout: 60           # Number of seconds to wait if a worker can not find a scheduler
    local-directory: /scratch/$USER/$SLURM_JOB_ID # local SSD

    # SLURM resource manager options
    queue: compute
    # account: xxxxxxx # choose account other than default
    walltime: '00:30:00'
    job-mem: 120GB              # Max memory that can be requested to SLURM
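
Because port 8787 is open on both login and compute nodes, the dashboard served by the scheduler can be reached directly. A minimal sketch, assuming the YAML above is in place:

# Sketch: start a cluster on Comet and print the dashboard address, which is
# reachable directly since port 8787 is open on login and compute nodes.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster()        # compute queue, 24 cores, 120 GB per job
client = Client(cluster)
print(cluster.dashboard_link)   # e.g. http://<ip-of-node>:8787/status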

Ifremer DATARMOR

More details about the Ifremer DATARMOR cluster are available here (in French) or here (English via Google Translate).

More details about this dask-jobqueue configuration are available here.

jobqueue:
  pbs:
    name: dask-worker

    # Dask worker options
    # processes and cores have to be equal to avoid using multiple threads
    # in a single dask worker. Using threads can generate netCDF file
    # access errors.
    cores: 28
    processes: 28
    # this is using all the memory of a single node and corresponds to about
    # 4GB / dask worker. If you need more memory than this you have to decrease
    # cores and processes above
    memory: 120GB
    interface: ib0
    # This should be a local disk attach to your worker node and not a network
    # mounted disk. See
    # https://jobqueue.dask.org/en/latest/configuration-setup.html#local-storage
    # for more details.
    local-directory: $TMPDIR

    # PBS resource manager options
    queue: mpi_1
    account: myAccount
    walltime: '48:00:00'
    resource-spec: select=1:ncpus=28:mem=120GB
    # disable email
    job-extra: ['-m n']
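
With the file above in place, PBSCluster() alone picks up these values, and individual settings can still be overridden per instance at construction time. A minimal sketch; the overridden walltime and the number of jobs are arbitrary examples, not DATARMOR recommendations.

# Sketch: the jobqueue.pbs section above supplies the defaults; keyword
# arguments override them for a single cluster instance. Each job is one
# full node, i.e. 28 single-threaded workers of about 4 GB each.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(walltime='04:00:00')   # example override of '48:00:00'
cluster.scale(jobs=2)                       # two nodes = 56 workers
client = Client(cluster)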