GASNet ucx-conduit documentation
Boris I. Karasev
Artem Y. Polyakov

@ TOC: @
@ Section: Overview @
@ Section: Where this conduit runs @
@ Section: Build-time Configuration @
@ Section: Job Spawning @
@ Section: Runtime Configuration @
@ Section: Multi-rail Support @
@ Section: On-Demand Paging (ODP) Support @
@ Section: HCA Configuration @
@ Section: Known Problems @
@ Section: Design Overview @
@ Section: Graceful exits @
@ Section: Future work @

@ Section: Overview @

Ucx-conduit implements GASNet over the Unified Communication X (UCX)
framework (see http://www.openucx.org/ for general information on UCX).
This first version of the conduit is feature complete and supports Active
Messages and hardware-offloaded one-sided and atomic operations.
Performance assessment and fine-tuning are planned for the next release.

As this is an initial version of the conduit, it is disabled by default.
It can be enabled at GASNet configure time using the `--enable-ucx` option.

@ Section: Where this conduit runs @

The conduit is based on the Unified Communication X (UCX) communication
library (see http://www.openucx.org/ for general information on UCX).
UCX is an open-source project developed in collaboration between industry,
laboratories, and academia to create a production-grade communication
framework for data-centric and high-performance applications.

The UCX library can be downloaded from repositories (e.g., Fedora/RedHat
yum repositories). The UCX library is also part of the Mellanox OFED and
Mellanox HPC-X binary distributions.

The conduit is tested and known to work with
  - Mellanox InfiniBand devices starting from ConnectX-5. UCX also supports
    Mellanox RoCE devices, but these have not yet been experimentally
    confirmed to work with GASNet.
  - UCX library version 1.6 and above
  - Linux platform.

For Mellanox adapters, the UCX conduit supports all transports: RC, UD and
DC. For large-scale applications, the DC transport is recommended for
scalability reasons.
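The UCX version requirement listed above can be checked mechanically. The
following is a minimal sketch and not part of GASNet or UCX:
`ucx_version_ok` is a hypothetical helper that takes a version string (for
example, one read from the output of `ucx_info -v`) and succeeds if it is
1.6 or newer. Extracting the string from the `ucx_info -v` output is left
out on purpose, since that output may be formatted differently across UCX
releases.

```shell
#!/bin/sh
# Hypothetical helper (not part of GASNet or UCX): succeed if the given
# UCX version string is at least 1.6, the minimum listed above.
ucx_version_ok() {
    major=${1%%.*}        # text before the first dot
    rest=${1#*.}          # text after the first dot
    minor=${rest%%.*}     # text up to the next dot
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]; }
}

# Example with a hard-coded version string (an assumed value).
if ucx_version_ok "1.6.0"; then
    echo "UCX version is new enough for ucx-conduit"
else
    echo "UCX version is too old for ucx-conduit"
fi
```

The comparison is numeric on each component, so a version such as 1.10.1
is correctly treated as newer than 1.6.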
@ Section: Build-time Configuration @

In order to enable the conduit, the '--with-ucx[=]' parameter needs to be
specified. By default, ucx-conduit will not be built.

Currently, the conduit does not provide any other configuration-time
parameters. See the extended-ref README for the GASNet general-purpose
configuration parameters.

@ Section: Job Spawning @

If using UPC, Titanium, etc. the language-specific commands should be used
to launch applications. Otherwise, applications can be launched using
the gasnetrun_ucx utility:

  + usage summary:
    gasnetrun_ucx -n [options] [--] prog [program args]
    options:
      -n                      number of processes to run (required)
      -N                      number of nodes to run on (not supported by all MPIs)
      -E                      list of environment vars to propagate
      -v                      be verbose about what is happening
      -t                      test only, don't execute anything (implies -v)
      -k                      keep any temporary files created (implies -v)
      -spawner=(ssh|mpi|pmi)  force use of a specific spawner (if available)

There are as many as three possible methods (ssh, mpi and pmi) by which one
can launch a ucx-conduit application. Ssh-based spawning is always
available, and mpi- and pmi-based spawning are available if the respective
support was located at configure time. The default is established at
configure time (see section "Build-time Configuration").

To select a non-default spawner, one may either use the "-spawner="
command-line argument or set the environment variable GASNET_UCX_SPAWNER
to "ssh", "mpi" or "pmi". If both are used, the command-line argument
takes precedence.

@ Section: Runtime Configuration @

UCX parameters (e.g. device or transport selection) are controlled through
environment variables. The most commonly used variables are described on
the UCX Wiki page
  https://github.com/openucx/ucx/wiki/UCX-environment-parameters
For the full list of tuning knobs supported by a particular UCX version,
see the output of `ucx_info -f`.
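As a rough illustration of the spawner selection and environment-based
configuration described above, the sketch below composes a gasnetrun_ucx
command line. The program name `./my_gasnet_app`, the process count, and
the transport list are assumed example values, and `launch_cmd` is a
hypothetical variable name. The script only prints the command unless both
the launcher and the program are actually present.

```shell
#!/bin/sh
# Select the spawner through the environment; a "-spawner=" argument on
# the command line would take precedence over this setting.
export GASNET_UCX_SPAWNER=ssh

# UCX transport selection (assumed example value).
export UCX_TLS=ib,sm,self

# -n is required; -E names an environment variable to propagate.
# "./my_gasnet_app" is a placeholder program name.
launch_cmd="gasnetrun_ucx -n 4 -E UCX_TLS ./my_gasnet_app"
echo "would run: $launch_cmd"

# Run only when the launcher and the program are both available.
if command -v gasnetrun_ucx >/dev/null 2>&1 && [ -x ./my_gasnet_app ]; then
    $launch_cmd
fi
```

Adding the `-t` option to the command line is another way to inspect what
the launcher would do without executing anything.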

For software stacks that allow concurrent use of the UCX library in
different components or layers, the UCX conduit supports the personalized
prefix "UCX_GASNET" (e.g. "UCX_GASNET_TLS", "UCX_GASNET_NET_DEVICES").
Parameters set with this prefix apply only to ucx-conduit and do not affect
or conflict with global parameters (e.g. "UCX_TLS", "UCX_NET_DEVICES")
and/or other personalized parameters.

The "UCX_TLS" environment variable controls transport selection and is one
of the most commonly used UCX parameters. Available options are:

    $ ucx_info -f | grep UCX_TLS -B 23 | head -n 20
    #
    # Comma-separated list of transports to use. The order is not meaningful.
    #  - all     : use all the available transports.
    #  - sm/shm  : all shared memory transports (mm, cma, knem).
    #  - mm      : shared memory transports - only memory mappers.
    #  - ugni    : ugni_smsg and ugni_rdma (uses ugni_udt for bootstrap).
    #  - ib      : all infiniband transports (rc/rc_mlx5, ud/ud_mlx5, dc_mlx5).
    #  - rc_v    : rc verbs (uses ud for bootstrap).
    #  - rc_x    : rc with accelerated verbs (uses ud_mlx5 for bootstrap).
    #  - rc      : rc_v and rc_x (preferably if available).
    #  - ud_v    : ud verbs.
    #  - ud_x    : ud with accelerated verbs.
    #  - ud      : ud_v and ud_x (preferably if available).
    #  - dc/dc_x : dc with accelerated verbs.
    #  - tcp     : sockets over TCP/IP.
    #  - cuda    : CUDA (NVIDIA GPU) memory support.
    #  - rocm    : ROCm (AMD GPU) memory support.
    #  Using a \ prefix before a transport name treats it as an explicit
    #  transport name and disables aliasing.
    #

To make sure that ucx-conduit is being used with Mellanox devices, the
following transport set is recommended: "UCX_GASNET_TLS=ib,sm,self".

Ucx-conduit supports all of the standard and the optional GASNET_EXITTIMEOUT
GASNet environment variables. See GASNet's top-level README for
documentation.

@ Section: Multi-rail Support @

While UCX supports multi-rail configurations, the conduit has not been
tested in this mode and is currently considered not to support this
feature.

@ Section: On-Demand Paging (ODP) Support @

The conduit supports ODP through UCX; see the "UCX_IB_REG_METHODS" UCX
environment variable.

@ Section: HCA Configuration @

Consult the corresponding section in the ibv-conduit README file.
TODO: Check if UCX provides any additional/conflicting advice.

@ Section: Known Problems @

+ See the GASNet Bugzilla server for details on known bugs:
  https://gasnet-bugs.lbl.gov/

+ This version is known to have issues with Mellanox ConnectX-4 adapters.

+ Support for segment-everything mode is experimental and is known to
  suffer from unbounded memory use in AMLong (and amref-based Put/Get).

@ Section: Design Overview @

The UCX conduit implements the GASNet core and extended APIs for
UCX-compatible network adapters, where:
  (a) the Core API includes conduit resource initialization and cleanup
      functionality along with Active Message communication;
  (b) the Extended API includes support for one-sided and atomic
      operations.

References:
+ UCX API documentation:
  http://openucx.github.io/ucx/api/v1.6/html/index.html

Resource allocation and initialization:
During GASNet initialization, every process creates a UCX worker,
participates in the exchange of worker addresses (with an Allgather
communication pattern), and creates UCP endpoints representing connections
to all other processes in the GASNet job.
In addition, a set of service buffers for Active Messages is allocated and
all required UCP memory registrations and memory key exchanges are
performed.

Active Message:
The UCX conduit implementation uses a pool of pre-allocated AM header
buffers that serve all types of AMs.
For Medium-size messages, for both sends and receives, the payload is
transferred through bounce buffers to reduce latency in the absence of
local completion options ("lc_opt") support. This behavior will be
reconsidered in the future.
For Long AMs, the payload transfer is implemented using the RDMA Put
operation.

RDMA/AMO:
The implementation of RDMA Put and Atomic operations is a thin layer on top
of UCP primitives.
In addition, "ucp_ep_flush_nb" is used to implement remote completion
tracking for single-rail configurations. Currently, the UCX conduit doesn't
support multi-rail configurations.
UCX directly supports a fixed set of atomic operations; unsupported
operations are handled by the generic GASNet implementation.
The status of atomic operation support is described in the table below:

                I32/U32/I64/U64
    ---         ---------------
    SET:               Y
    GET:               Y
    SWAP:              Y
    (F)CAS:            Y
    (F)ADD:            Y
    (F)SUB:            Y
    (F)INC:            Y
    (F)DEC:            Y
    (F)MULT:           N
    (F)MIN:            N
    (F)MAX:            N
    (F)AND:            Y
    (F)OR:             Y
    (F)XOR:            Y

    Y = offload is supported
    N = offload is not supported

@ Section: Graceful exits @

Graceful exits are supported by the conduit. The design is based on the
ibv-conduit description (see the ibv-conduit README).
The main difference from the ibv-conduit design is the "election" procedure
used when a global exit times out. In ibv-conduit, rank 0 arbitrates
between the other ranks competing for the "leader" role. In ucx-conduit,
rank 0 is always considered to be the "leader". It takes this role when it
receives a request to perform the exit from other ranks.
The non-global exit protocol might be further improved by sequentially
trying ranks 1, 2, ..., if rank 0 fails to confirm that it has taken the
"leader" role (noted in the Future work section).

@ Section: Future work @

+ Implement support for local completion options ("lc_opt").
+ Additional round of performance analysis and tuning.
+ Investigate older UCX versions and hardware compatibility and update
  README.
+ Support Graceful exits if rank 0 has failed.
+ Fix robustness problems with AMLong in `--enable-segment-everything` mode.
+ Apply ODP to optimize `--enable-segment-everything` mode.